This discussion is archived
15 Replies. Latest reply: Sep 19, 2006 11:35 AM by 807598

Extracting selected part of text file

807598 Newbie
Hi there,

New to this forum and to Java.

My problem is this:
I need to extract a series of selected paragraphs from a text file. I have been able to write code that reads from and writes to files, but with my code I can only read the entire file through the BufferedReader (code below):
import java.io.*;

class DtoS
{
     public static void main(String args[])
     throws IOException
     {
          FileReader fr = new FileReader("test.txt");
          BufferedReader br = new BufferedReader(fr);
          String s;
               
          while((s = br.readLine()) != null) {
          System.out.println(s);
          }
          fr.close();
     }
}


What I need to do is find specific points in the text file and extract the text between two points (e.g. <abstract> text </abstract>) to a new file. It is a big file with 162,000 abstracts. I have a version of the file using ASCII and one using XML (1.5 GB). I have never used XML before, but am wondering if it would be smart to use it here.

I was thinking about using String.compareTo to find the points, but since they are in plain text I don't know how. Would it be a good idea to use a StringTokenizer?
I also considered using indexOf(String), but I would have multiple identical strings. Is it possible to delete passages in the original file as you go along? If so, how?

I sincerely hope that someone can help me.
Thank you in advance
  • 1. Re: Extracting selected part of text file
    807598 Newbie
    I can only read the entire file through the BufferedReader (code below):
    No, you can do whatever you want. All you've done so far is read the entire file in. Now attempt to do what you WANT to do and post issues (if you have them) here.
  • 2. Re: Extracting selected part of text file
    807598 Newbie
    What I need to do is find specific points in the
    text file and extract the text between two points (e.g.
    <abstract> text </abstract>) to a new file. It is a
    big file with 162,000 abstracts. I have a version of
    the file using ASCII and one using XML (1.5 GB). I
    have never used XML before, but am wondering if it
    would be smart to use it here.
    Probably.

    How does the non-XML version of the file indicate the points in the file that delineate the parts of text that you're looking for?
    I was thinking about using String.compareTo to find
    the points, but since they are in plain text I don't
    know how. Would it be a good idea to use
    a StringTokenizer?
    I also considered using indexOf(String), but I would
    have multiple identical strings. Is it possible to
    delete passages in the original file as you go along?
    If so, how?
    You can do all these things. You can also use regular expressions. Basically you're parsing the file. There might be characteristics of the file that allow you to simplify the parsing.

    But then ultimately you may just be reinventing the wheel. XML is a way to mark up text, and there are existing libraries to operate on XML. It should be pretty simple to use SAX, say, to set a flag "doOutput" to true when you get an <abstract> start tag, set it back to false when you get the </abstract> end tag, and print all character data while the flag is true.
  • 3. Re: Extracting selected part of text file
    807598 Newbie
    Thank you both very much. I will try that now.
  • 4. Re: Extracting selected part of text file
    807598 Newbie
    Besides StringTokenizer, the Pattern class (java.util.regex) can also help you do this.
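
    As a rough illustration of the Pattern suggestion, the sketch below pulls the text between <abstract> and </abstract> out of a string. The class name AbstractRegex and the method findAbstracts are made up for this example. It assumes the tags are lower case and not nested, and that the text passed in fits in memory, so for a 1.5 GB file you would hand it one record at a time rather than the whole file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not code from the thread.
class AbstractRegex {

    // DOTALL lets '.' match line breaks; the reluctant '.*?' stops at the
    // first closing tag instead of running on to the last one.
    private static final Pattern ABSTRACT =
            Pattern.compile("<abstract>(.*?)</abstract>", Pattern.DOTALL);

    // Returns the trimmed text of every <abstract>...</abstract> pair.
    static List<String> findAbstracts(String text) {
        List<String> result = new ArrayList<String>();
        Matcher m = ABSTRACT.matcher(text);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "This is a test\n"
                + "<abstract> Abstract1 </abstract>\n"
                + "<abstract> Abstract2 </abstract>\n";
        for (String s : findAbstracts(sample)) {
            System.out.println(s);
        }
    }
}
```

    Because find() resumes after the previous match, repeated identical tags are handled without any extra bookkeeping.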
  • 5. Re: Extracting selected part of text file
    807598 Newbie
    Hey,
    I don't know how helpful my suggestion will be, but I would have done this with either of the following two approaches:

    1. Positive approach:
    Use String methods, e.g. indexOf() to get the indexes of <abstract> and </abstract> and substring() to copy the text between them, or StringTokenizer to separate the tokens.

    2. Negative approach:
    You are already reading the file line by line. Try replacing <abstract> and </abstract> with " " (use replace() of the String class); the resulting text will be the desired output.


    Try it out.
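
    A minimal sketch of the positive approach, assuming the text to search is already in a String. The class IndexOfExtract and the method between are hypothetical names. The two-argument indexOf(String, int) is what copes with many identical markers: each search starts where the previous match ended.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not code from the thread.
class IndexOfExtract {

    // Collects the trimmed text found between each open/close marker pair.
    static List<String> between(String text, String open, String close) {
        List<String> result = new ArrayList<String>();
        int from = 0;
        while (true) {
            int start = text.indexOf(open, from);
            if (start < 0) {
                break;                      // no more opening markers
            }
            int bodyStart = start + open.length();
            int end = text.indexOf(close, bodyStart);
            if (end < 0) {
                break;                      // opening marker without a close
            }
            result.add(text.substring(bodyStart, end).trim());
            from = end + close.length();    // continue after this pair
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "<abstract> Abstract1 </abstract>\n"
                + "<abstract> Abstract2 </abstract>\n";
        for (String s : between(sample, "<abstract>", "</abstract>")) {
            System.out.println(s);
        }
    }
}
```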
  • 6. Re: Extracting selected part of text file
    800308 Newbie
    I'd forget hand-crafted parsing and go straight to an industrial-strength SAX parser like Xerces. You could even skip the coding altogether and just write an XSLT stylesheet for Xalan (but that's DOM-parsed, isn't it? So you'd need 600 MB of free RAM for a 150 MB document).
  • 7. Re: Extracting selected part of text file
    800308 Newbie
    I just checked: the simpleDomParser example for XalanC uses the DOM parser, so you'd need 6 GB. The no-code option is probably not an option.
  • 8. Re: Extracting selected part of text file
    807598 Newbie
    Thank you all for the input.

    I have been trying to use SAX as suggested by paulcw, without much success so far, but I have not given up on that approach (the problem is my limited CS skills, not a failing of SAX). However, if I understand corlettk correctly, the SAX option may be problematic due to space limitations.

    paulcw: The non-XML file is formatted as:

    PMID- 14723818
    OWN - NLM
    ...
    TI - [Relationship between single nucleotide polymorphisms in thiopurine
         methyltransferase gene and tolerance to thiopurines in acute leukemia]
    PG - 929-33
    AB - OBJECTIVE: For the purpose of clarifying the influence of thiopurine
    methyltransferase (TPMT) ...
    ....
    AD - Hematology center Beijing Children's Hospital, Capital University of
    Medical sciences, Beijing 100045, China.
    FAU - Ma, Xiao-li
    ......
    PST - ppublish
    SO - Zhonghua Er Ke Za Zhi 2003 Dec;41(12):929-33.

    where AB - ... denotes the equivalent of <abstract> text </abstract> in the XML format.

    Ganapathy.S: Could you elaborate? Do you know of any good tutorials?

    d.suhas: I am trying your positive approach, but I am already running into a problem when using indexOf().

    import java.io.*;

    class DtoS
    {
         public static void main(String args[])
              throws IOException
              {
                   FileReader fr = new FileReader("test.txt");
                   BufferedReader br = new BufferedReader(fr);
                   String str;
                   int index;
                   
                   
                   while((str = br.readLine()) != null)
              
                   {
                        System.out.println(str);          
                   }
                   index = str.indexOf("<abstract>");
                             System.out.println("Index is" +index);
                             
                   fr.close();
              }
    }

    I get a NullPointerException:

    "This is a test
    <abstract> Abstract1 <\abstract>
    <abstract> Abstract2 <\abstract>
    <abstract> Abstract3 <\abstract>

    Exception in thread "main" java.lang.NullPointerException
    at DtoS.main(input_output.java:20)
    Press any key to continue..."

    I think that means that there is no value in index, but there should be (I think).
    Without having tested it, it seems that your negative approach would help me "delete" already copied passages. Thanks.


    I would truly appreciate any further help/suggestions from all of you.
  • 9. Re: Extracting selected part of text file
    807598 Newbie
    You are trying to access "str" after the while loop. The loop only ends once readLine() has returned null, so str is null at that point and str.indexOf() throws a NullPointerException.

    If you want to find the index of the <abstract> string, put the code below inside the loop:
    index = str.indexOf("<abstract>");
    System.out.println("Index is " + index);
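
    Put together, the corrected loop might look like the sketch below. The class name DtoSFixed and the helper reportIndexes are made up for this example; "test.txt" is the file name used earlier in the thread.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical rework of the posted DtoS class, for illustration only.
class DtoSFixed {

    // Prints the index of "<abstract>" for each line (-1 when absent)
    // and returns the number of lines that contain the tag.
    static int reportIndexes(BufferedReader br) throws IOException {
        int hits = 0;
        String str;
        while ((str = br.readLine()) != null) {
            // str is guaranteed non-null here, so indexOf is safe.
            int index = str.indexOf("<abstract>");
            System.out.println("Index is " + index);
            if (index >= 0) {
                hits++;
            }
        }
        // Once the loop has ended, str is null and must not be used.
        return hits;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader("test.txt"));
        try {
            reportIndexes(br);
        } finally {
            br.close();   // closing the BufferedReader closes the FileReader too
        }
    }
}
```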
  • 10. Re: Extracting selected part of text file
    807598 Newbie
    The biggest problem I'd see with the non-XML format is that it may be hard to tell when a string of text is a tag and when it's part of the text. For example, the text "prescription of FDA-approved medication" might confuse the parser if there's a newline after "of".

    Otherwise it looks like a reasonably consistent format and relatively easy to parse. You could basically read the file line by line, check to see if it starts with a known tag, and if it does close up the previous block of text and open a new one for the new tag.
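
    The line-by-line idea could be sketched roughly as follows. Everything here is an assumption based on the sample record posted above: a tag is taken to be two to four capital letters, optional padding, a hyphen and a space, and any other non-blank line is treated as a continuation of the current field. The names TaggedAbstracts and abstracts are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not code from the thread.
class TaggedAbstracts {

    // Assumed tag shape, e.g. "PMID- ", "AB - ", "FAU - ".
    private static final Pattern TAG = Pattern.compile("^([A-Z]{2,4}) *- (.*)$");

    // Returns the text of every AB field, with continuation lines joined.
    static List<String> abstracts(Iterable<String> lines) {
        List<String> result = new ArrayList<String>();
        StringBuilder current = null;   // non-null while inside an AB field
        for (String raw : lines) {
            String line = raw.trim();
            Matcher m = TAG.matcher(line);
            if (m.matches()) {
                // A known-looking tag closes the block that was open.
                if (current != null) {
                    result.add(current.toString());
                    current = null;
                }
                if ("AB".equals(m.group(1))) {
                    current = new StringBuilder(m.group(2));
                }
            } else if (current != null && line.length() > 0) {
                current.append(' ').append(line);   // continuation line
            }
        }
        if (current != null) {
            result.add(current.toString());         // file ended inside AB
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "PMID- 14723818",
                "AB - OBJECTIVE: For the purpose of clarifying the influence of thiopurine",
                "methyltransferase (TPMT) ...",
                "FAU - Ma, Xiao-li");
        for (String s : abstracts(sample)) {
            System.out.println(s);
        }
    }
}
```

    Requiring a space after the hyphen keeps a wrapped "FDA-approved" from being read as a tag, but it is only a heuristic. For the full file you would feed lines from a BufferedReader and write each abstract out as soon as its block closes instead of collecting them in a list.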

    Still, the XML-based solution would probably be better long-term. SAX isn't much of a memory hog. It also comes with the standard library these days. What problems are you having?
  • 11. Re: Extracting selected part of text file
    807598 Newbie
    Here's a simple SAX-based version:
    import java.io.File;
    
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;
    
    
    public class SaxTest {
    
      public static void main(String[] argv) {
        try {
          SAXParserFactory fac = SAXParserFactory.newInstance();
          SAXParser parser = fac.newSAXParser();
    
          SimpleHandler sh = new SimpleHandler();
          parser.parse(new File(argv[0]), sh);
        } catch(Exception e) {
          e.printStackTrace();
        }
      }
    
      private static class SimpleHandler extends DefaultHandler {
    
        private boolean doOutput = false;
    
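        // Start copying text when an <abstract> element opens; elements
        // nested inside it leave the flag untouched.
        public void startElement(String uri,
                          String localName,
                          String qName,
                          Attributes atts)
                            throws SAXException {
          if ("abstract".equalsIgnoreCase(qName)) {
            doOutput = true;
          }
        }
    
        // Stop copying when the <abstract> element closes, and end the line.
        public void endElement(String uri,
                        String localName,
                        String qName)
                          throws SAXException {
          if ("abstract".equalsIgnoreCase(qName)) {
            doOutput = false;
            System.out.println();
          }
        }
    
        // The parser may deliver one run of text in several chunks, so use
        // print rather than println to avoid breaking it mid-sentence.
        public void characters(char[] ch, int start, int length) throws SAXException {
          if (doOutput)
            System.out.print(new String(ch, start, length));
        }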
      }
    }
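
    Since the goal is to get the abstracts into a new file, a variant of the handler that writes to a Writer instead of System.out may be useful. This is only a sketch: the class name AbstractWriter and the helper extract are hypothetical, and the element name "abstract" is taken from the question rather than from the real file. A depth counter replaces the boolean flag so that text inside nested elements is kept.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variant of the handler above; not code from the thread.
class AbstractWriter extends DefaultHandler {

    private final Writer out;
    private int depth = 0;   // greater than 0 while inside an <abstract>

    AbstractWriter(Writer out) {
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0 || "abstract".equalsIgnoreCase(qName)) {
            depth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 0 && --depth == 0) {
            try {
                out.write('\n');             // one abstract per line
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth > 0) {
            try {
                out.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }

    // Convenience for trying the handler on a small in-memory document.
    static String extract(String xml) {
        StringWriter sink = new StringWriter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new AbstractWriter(sink));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.print(extract(
                "<doc><abstract>One</abstract><title>skip</title><abstract>Two</abstract></doc>"));
    }
}
```

    For the real file you would pass a BufferedWriter wrapped around a FileWriter to the constructor, hand the File to parser.parse() as in SaxTest, and close the writer afterwards. Memory use stays small because nothing is kept between abstracts.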
  • 12. Re: Extracting selected part of text file
    807598 Newbie
    This problem cries out for the KMP algorithm using Readers and Writers that I posted here a couple of months back. I can't find it but I'm sure someone will.

    Found a copy in my CVS repository -
    The main program
    import com.edison.library.kmp.*;
    import java.io.*;
    import java.util.regex.*;
    
    public class Fred704
    {
        private static final Pattern encodingExtractionRegex = Pattern.compile("encoding *?= *?\"([^\"]+)\"");
        
        public static String getXMLFileEncoding(File xmlFile) throws IOException
        {
            // By default an XML document uses UTF-8 character encoding
            // so assume that for the moment.
            String encoding = "UTF-8";
            BufferedReader reader = null;
            try
            {
                // The first line of an XML file has to be able to be read as ASCII
                // so read the first line as ASCII
                reader = new BufferedReader(new InputStreamReader(new FileInputStream(xmlFile), "ASCII"));
                final String firstLine = reader.readLine();
                
                // Now look for the encoding
                // and extract it if found.
                // (firstLine is null for an empty file; the reader is closed in 'finally'.)
                if (firstLine != null)
                {
                    final Matcher matcher = encodingExtractionRegex.matcher(firstLine);
                    if (matcher.find())
                    {
                        encoding = matcher.group(1).trim();
                    }
                }
            }
            finally
            {
                if (reader != null)
                    reader.close();
            }
            return encoding;
        }
        
        public static void main(final String[] args) throws Exception
        {
            final File source = new File(System.getProperty("user.home") + "/work/dev/stow-longa/church/graves.xml");
            final File destination = new File(System.getProperty("user.home") + "/xxxx.txt");
            
            final String encoding = getXMLFileEncoding(source);
            
            final String beginPattern = "<line>";
            final String endPattern = "</line>";
            final ReaderToWriterKMP beginKMP = new ReaderToWriterKMP(beginPattern);
            final ReaderToWriterKMP endKMP = new ReaderToWriterKMP(endPattern);
            
            final Reader reader = new BufferedReader(new InputStreamReader(new FileInputStream(source), encoding));
            final Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(destination), encoding));
            
            for (boolean eof = false; !eof;)
            {
                eof = beginKMP.search(reader);
                if (!eof)
                {
                    writer.write(beginPattern);
                    eof = endKMP.search(reader, writer);
                    writer.write("\n");
                }
            }
            
            writer.close();
            reader.close();
        }
    }
    and the KMP class
    import java.io.*;
    
    /**
     * A KMP implementation that copies chars from a Reader to a Writer until
     * a match is found or EOF is found.
     */
    public class ReaderToWriterKMP
    {
        private final char[] pattern_;
        private final int[] next_;
        
        /**
         * Constructs a ReaderToWriterKMP for a given pattern to match.
         *
         * @param pattern the pattern to match in the Reader.
         */
        public ReaderToWriterKMP(String pattern)
        {
            pattern_ = pattern.toCharArray();
            next_ = new int[pattern_.length];
            next_[0] = -1;
            for (int len = next_.length-1, i = 0, j = -1; i < len; next_[++i] = ++j)
            {
                while((j >= 0) && (pattern_[i] != pattern_[j]))
                {
                    j = next_[j];
                }
            }
        }
        
        /**
         * Copies from the reader until a match is obtained
         * or EOF is found.
         *
         * @param reader the reader from which to read the characters.
         * @return 'true' if EOF found before a match.
         * @throws IOException if there is a read or write error
         */
        public boolean search(final Reader reader) throws IOException
        {
            return search(reader, null);
        }
        
        /**
         * Copies from the reader to the writer until a match is obtained
         * or EOF is found.
         *
         * @param reader the reader from which to read the characters.
         * @param writer the writer to which to write the characters.
         * @return 'true' if EOF found before a match.
         * @throws IOException if there is a read or write error
         */
        public boolean search(final Reader reader, final Writer writer) throws IOException
        {
            // j is the number of pattern characters matched so far.
            for (int j = 0; j < next_.length; j++)
            {
                final int ch = reader.read();
                if (ch == -1)
                    return true;
                if (writer != null)
                    writer.write(ch);
                while ((j >= 0) && (ch != pattern_[j]))
                {
                    j = next_[j];
                }
            }
            return false;
        }
    }


    You should be able to use this main program with minimal changes for the XML file and with a few more changes for the text file.

    Message was edited by:
    sabre150
  • 13. Re: Extracting selected part of text file
    807598 Newbie
    Thank you both. This kind of code is extremely helpful.

    Sabre150: I get a compilation error since I don't have com.edison.library.kmp.*. I tried (unsuccessfully) to Google for it but could not find it. Do you know where to find it? Or how to create it?

    paulcw: I tried compiling your code and got following error:
    javac: invalid flag: C:\Documents and Settings\maa7012\Desktop\xml_tamtam.xml
    Usage: javac <options> <source files>
    where possible options include:
    .....

    I think this may be because I don't have the files that are imported. Do you know where to find them and where to store them? I am using JCreator Pro version 3.50.010.

    I do apologize for continuing to bother you helpful people, but I just can't seem to get it right.
  • 14. Re: Extracting selected part of text file
    807598 Newbie
    Sabre150: I get a compilation error since I don't
    have com.edison.library.kmp.*. I tried (unsuccessfully) to
    Google for it but could not find it. Do you know
    where to find it? Or how to create it?
    That is because that is the package the ReaderToWriterKMP class is in on my system. You can put it in any package you like!

    If you put all the code in the same package then you won't need that particular import or even one like it.

    Message was edited by:
    sabre150