7 Replies Latest reply: Sep 21, 2006 11:52 PM by 807569

    read doc or text file

    807569
      Hi all,
      I have an application where I'm reading and parsing text files, then saving the results. I figured out how to make it work with BufferedReader for the text file, but I can't seem to figure out how to use InputStreams so I can use .doc files as well. What I'd like to do, in pseudocode, is:
      String theString = "the name of the file"
      
      if( the file has a .doc extension) {
      read it with an inputstream
      } else {
      read it with a bufferedreader
      }
      
      while( the line in the file isn't null ) {
      // parse the file and save the results
      }
      I don't know whether you can use the same while ( line != null ) loop with InputStreams; it doesn't look like it.

      I hope I was clear enough. If anyone can help me out, that would be great!

      Thanks!
      Jezzica85
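On the loop question: a raw InputStream never hands back null. Its read() method returns -1 at end of stream. The ( line != null ) loop still works if the stream is wrapped in an InputStreamReader and a BufferedReader. Below is a minimal sketch of both loops, assuming UTF-8 plain text; the class and method names (LineSource, readLines, countBytes) are made up for illustration and are not from the post. This only covers reading text, it does not decode a Word file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original application.
class LineSource {

    // Wraps any InputStream so it can be read line by line with the
    // familiar (line != null) loop. Assumes the bytes are UTF-8 text.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Reads the raw stream byte by byte: read() signals end of stream
    // with -1, so that is the value the loop tests for.
    static int countBytes(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readLines(new ByteArrayInputStream(data)));
        System.out.println(countBytes(new ByteArrayInputStream(data)));
    }
}
```

For a file on disk, a FileInputStream could be passed in place of the in-memory stream used in main.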
        • 1. Re: read doc or text file
          807569
          How exactly do you plan on "parsing" the doc file? I fear it's a little more complicated than you realize.
          • 2. Re: read doc or text file
            807569
            Hi Cotton, nice to hear from you again!
            I'm parsing the file with a StringTokenizer, then saving the tokens in a few different lists.
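For readers following along, the text-file approach described here might look roughly like the sketch below. It assumes whitespace-delimited tokens (StringTokenizer's default) and collects them into a single list; the class and method names are hypothetical, and the rules for sorting tokens into several lists are not given in the thread, so they are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration of tokenizing one line of plain text.
class TokenDemo {

    // Splits a line on StringTokenizer's default delimiters
    // (space, tab, newline, carriage return, form feed).
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("the quick\tbrown fox"));
    }
}
```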
            • 3. Re: read doc or text file
              807569
              Hi Cotton, nice to hear from you again!
              Nice to see you again too.
              I'm parsing the file with a StringTokenizer, then
              saving the tokens in a few different lists.
              That's how you're parsing the text file, right? Parsing a doc will be more involved. By doc you mean a Word document, right?
              • 4. Re: read doc or text file
                807569
                Yes, I do mean a Word document, and I'm parsing the text file with the StringTokenizer. I basically got this idea because I have a lot of files to parse, they're all .doc files and it's a real pain to convert them to text every time I want to check one. My Batch Conversion wizard in Word doesn't seem to work, or I would do that...stupid Microsoft.
                • 5. Re: read doc or text file
                  camickr
                  Yes, I do mean a Word document,
                  You can't just parse a Word document using StringTokenizer (or any other tokenizer). It's far more complicated than that.
                  • 6. Re: read doc or text file
                    807569
                    Yes, I do mean a Word document, and I'm parsing the
                    text file with the StringTokenizer. I basically got
                    this idea because I have a lot of files to parse,
                    they're all .doc files and it's a real pain to
                    convert them to text every time I want to check one.
                    My Batch Conversion wizard in Word doesn't seem to
                    work, or I would do that...stupid Microsoft.
                    Well, I have some bad news for you: it may be easier to convert each one.

                    Word documents use a proprietary format. This means the data isn't stored in nice little text bits but is full of non-human-readable bytes. There are APIs for reading docs from Java, but it's going to be more involved than you think.
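To make the "non-human-readable bytes" point concrete: a legacy .doc file is stored in an OLE2 compound file container, which starts with the fixed 8-byte signature D0 CF 11 E0 A1 B1 1A E1 rather than with text. The sketch below checks for that signature using only the standard library. The class and method names are hypothetical. Note that other Office formats such as .xls use the same container, so a match only means "probably not plain text", not "definitely a Word document".

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper: detects the OLE2 container used by legacy .doc files.
class DocSniffer {

    // Signature found at the very start of every OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns true if the stream begins with the OLE2 signature.
    static boolean looksLikeOle2(InputStream in) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        try {
            new DataInputStream(in).readFully(header);
        } catch (EOFException e) {
            return false; // stream is shorter than the signature
        }
        return Arrays.equals(header, OLE2_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        byte[] text = "just some plain text".getBytes();
        System.out.println(looksLikeOle2(new ByteArrayInputStream(text)));
        System.out.println(looksLikeOle2(new ByteArrayInputStream(OLE2_MAGIC)));
    }
}
```

A check like this is more reliable than testing the file extension, since it looks at what the file actually contains.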

                    On the one hand, it would make for a good learning experience in using different APIs, so I don't want to discourage you from going that route. On the other hand, I suspect that unless the volume of docs to convert is really quite high, it will take you more effort to do it all in Java than to convert them manually.

                    Here is something to look at: http://jakarta.apache.org/poi/

                    Another route would be to write a Word macro that does the conversion for you. This might be easier. I dunno.
                    • 7. Re: read doc or text file
                      807569
                      Ah, that's what I was afraid of. No matter then, I guess--I can still do it with text files and deal with the manual conversions. Thanks for letting me know.