
Processing large number of text files - efficiency and memory errors

807603 Newbie
OK, first I'll describe at a high level what I'm trying to do:

Read in up to 15,000 text files, none of which are more than 10kb (mostly <5kb).
Remove some unwanted characters and regular expression patterns.
Count the number of words in each file.
Create a document-term matrix http://en.wikipedia.org/wiki/Document-term_matrix (the headers aren't needed).
Output matrix to a single text file.
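
(As a rough illustration of steps 2 and 3 for a single document; the helper name, the regex and the whitespace tokenization below are placeholder assumptions, not the code under discussion:)

    import java.util.HashMap;
    import java.util.Map;

    class CountSketch {
        // Count word frequencies in one document's text after a placeholder clean-up step.
        static Map<String, Integer> countWords(String text) {
            // Placeholder "unwanted characters" rule: keep letters, digits and whitespace only.
            String cleaned = text.replaceAll("[^A-Za-z0-9\\s]", " ");
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (String word : cleaned.trim().split("\\s+")) {
                if (word.length() == 0) continue;   // an all-whitespace document yields one empty token
                Integer freq = counts.get(word);
                counts.put(word, freq == null ? 1 : freq + 1);
            }
            return counts;
        }
    }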

I've written code which does this. Thing is, I've never been a strong programmer and it's probably inefficient. With 100 files it does it in less than a second, with 1,000 in 14s and with 2,000 in 90s. My first question is: does this sound like the kind of time you would expect for such processing? If you think this can be done a lot quicker, let me know; I'll come back with my algorithms and hopefully we can improve them.

Secondly, when I try it with 3000+ I get the following error: java.lang.OutOfMemoryError: Java heap space

I did a quick Google and learnt that you can increase the size of the memory Java uses. I use Netbeans 6 and I've modified the \NetBeans 6.0\etc\netbeans.conf file to make the max 512mb. The default options line now looks like this:
netbeans_default_options="-J-Xms128m -J-Xmx512m -J-client -J-Xss2m -J-Xms32m -J-XX:PermSize=32m -J-XX:MaxPermSize=200m -J-Xverify:none -J-Dapple.laf.useScreenMenuBar=true"

Unfortunately I still get the same error. Is there another way around this, or is this just an unavoidable problem when dealing with a large number of files, or is it a result of sloppy coding?

Cheers in advance!

Edited by: snappyfool on Feb 6, 2008 6:41 PM
  • 1. Re: Processing large number of text files - efficiency and memory errors
    DrClap Expert
    I think the hacking you did there increases the amount of memory available to NetBeans itself. But NetBeans runs your code in a separate JVM, so you haven't affected that at all. I expect NetBeans has a way of letting you specify how much memory to allocate to a JVM it creates.
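
    (For what it's worth, in NetBeans the heap for the program being run is normally set per project, typically under Project Properties > Run > VM Options, with something like -Xmx512m; the exact location may differ between versions.)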
  • 2. Re: Processing large number of text files - efficiency and memory errors
    807603 Newbie
    Unfortunately I still get the same error. Is there another way around this, or is this just an unavoidable problem when dealing with a large number of files, or is it a result of sloppy coding?
    Raw data size is an estimated 15,000 * 5 KB = 75 MB, so it would seem you're using an excessive amount of memory.
  • 3. Re: Processing large number of text files - efficiency and memory errors
    807603 Newbie
    Thanks! I managed to find an option in Netbeans for that application. It can now process larger datasets.

    On second thoughts I think my programming is pretty tight, the only thing I didn't consider is how big the output matrix would be. Across 3000 documents around 40,000 separate words were found, so that's a 40k x 3k matrix... with commas, that's over 200mb.

    Perhaps a dataset of 15k is a bit excessive :D
  • 5. Re: Processing large number of text files - efficiency and memory errors
    807603 Newbie
    snappyfool wrote:
    Thanks! I managed to find an option in Netbeans for that application. It can now process larger datasets.

    On second thoughts I think my programming is pretty tight, the only thing I didn't consider is how big the output matrix would be. Across 3000 documents around 40,000 separate words were found, so that's a 40k x 3k matrix... with commas, that's over 200mb.

    Perhaps a dataset of 15k is a bit excessive :D
    That's going to be a pretty sparse matrix, which should give you ideas on how to reduce memory usage.
  • 6. Re: Processing large number of text files - efficiency and memory errors
    807603 Newbie
    The thing is, I'm outputting this to a text file so that it can be manipulated in MATLAB. I'm not familiar with any Java methods for efficiently storing sparse matrices, though I'll give it a look, but in the end won't I still need to output the whole lot, 0s and all?

    In fact it's never really stored as a matrix in Java, the matrix comes together in the text file using data stored in TreeMaps. Now I'm beginning to doubt my methods again... I will come back later today to explain how I'm approaching this task in more detail.
  • 7. Re: Processing large number of text files - efficiency and memory errors
    807603 Newbie
    snappyfool wrote:
    The thing is, I'm outputting this to a text file so that it can be manipulated in MATLAB. I'm not familiar with any Java methods for efficiently storing sparse matrices, though I'll give it a look, but in the end won't I still need to output the whole lot, 0s and all?
    There aren't really any Java-specific ways, just general techniques. It's not a problem that you have to output a lot of zeros, so long as you don't actually have to store them in memory.
  • 8. Re: Processing large number of text files - efficiency and memory errors
    791266 Explorer
    @Op. You can probably read a file with one read, close the file, process the data and then repeat. So I don't see why your execution times aren't linear.
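
    Something along those lines, assuming the files really are only a few KB each (an illustrative helper, not the OP's actual code):

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    class Slurp {
        // Read a small file in a single read, close it, and hand the text back for processing.
        static String readAll(File f) throws IOException {
            byte[] buf = new byte[(int) f.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(f));
            try {
                in.readFully(buf);
            } finally {
                in.close();
            }
            return new String(buf);   // platform default charset; pass an explicit charset if needed
        }
    }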
  • 9. Re: Processing large number of text files - efficiency and memory errors
    807603 Newbie
    kajbj wrote:
    @Op. You can probably read a file with one read, close the file, process the data and then repeat. So I don't see why your execution times aren't linear.
    The data (matrix) is growing in two directions: Rows are added when files are read, and columns are added when new words are found. So I think some non-linearity is expected. Indexing the columns would help reduce the cost of searching the columns, but resizing is still a periodic cost.
  • 10. Re: Processing large number of text files - efficiency and memory errors
    791266 Explorer
    paul.miner wrote:
    kajbj wrote:
    @Op. You can probably read a file with one read, close the file, process the data and then repeat. So I don't see why your execution times aren't linear.
    The data (matrix) is growing in two directions: Rows are added when files are read, and columns are added when new words are found. So I think some non-linearity is expected. Indexing the columns would help reduce the cost of searching the columns, but resizing is still a periodic cost.
    But you never need to keep a matrix in memory? I would keep everything in a dynamic structure with fast lookup times. E.g. Map<String, Map<String, Integer>>

    Where the first string is the word, and the second string is the filename. The integer is a counter.
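
    A rough sketch of that structure (illustrative names; only cells that actually occur are ever stored):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    class WordFileCounts {
        // word -> (filename -> count); zero cells are simply never stored.
        private final Map<String, Map<String, Integer>> counts = new TreeMap<String, Map<String, Integer>>();

        void add(String word, String filename) {
            Map<String, Integer> perFile = counts.get(word);
            if (perFile == null) {
                perFile = new HashMap<String, Integer>();
                counts.put(word, perFile);
            }
            Integer freq = perFile.get(filename);
            perFile.put(filename, freq == null ? 1 : freq + 1);
        }
    }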
  • 11. Re: Processing large number of text files - efficiency and memory errors
    807603 Newbie
    kajbj wrote:
    paul.miner wrote:
    kajbj wrote:
    @Op. You can probably read a file with one read, close the file, process the data and then repeat. So I don't see why your execution times aren't linear.
    The data (matrix) is growing in two directions: Rows are added when files are read, and columns are added when new words are found. So I think some non-linearity is expected. Indexing the columns would help reduce the cost of searching the columns, but resizing is still a periodic cost.
    But you never need to keep a matrix in memory? I would keep everything in a dynamic structure with fast lookup times. E.g. Map<String, Map<String, Integer>>

    Where the first string is the word, and the second string is the filename. The integer is a counter.
    That would be one way of implementing the sparse matrix suggestion. Even then, that's not strictly linear.
  • 12. Re: Processing large number of text files - efficiency and memory errors
    791266 Explorer
    That would be one way of implementing the sparse matrix suggestion. Even then, that's not strictly linear.
    No, not strictly linear, but I wouldn't expect execution times to degrade as fast as they did for the OP.
  • 13. Re: Processing large number of text files - efficiency and memory errors
    807603 Newbie
    kajbj wrote:
    That would be one way of implementing the sparse matrix suggestion. Even then, that's not strictly linear.
    No, not strictly linear, but I wouldn't expect execution times to degrade as fast as they did for the OP.
    Definitely agreed :-)

    EDIT: To the OP: kajbj's excellent suggestion of Map<String, Map<String, Integer>> is about as simple as it gets for an optimized solution; further optimizations will get more complex. If you're looking for a quick fix, that's probably it; a TreeMap would be ideal for the outer map, and maybe a HashMap for the inner Map.

    Edited by: paul.miner on Feb 7, 2008 8:54 AM
  • 14. Re: Processing large number of text files - efficiency and memory errors
    807603 Newbie
    OK

    A TreeMap and an array of TreeMaps are created...
            TreeMap<Object, Integer> all = new TreeMap<Object, Integer>();                    // word counts across the whole corpus
            TreeMap<Object, Integer>[] each = (TreeMap<Object, Integer>[]) new TreeMap[sn];   // one TreeMap per file; each[i] needs to be initialised with new TreeMap before use
    sn = the number of files to be processed.

    For each i from 0 to sn-1, the corresponding file is read. Whilst there are still lines in the file, some patterns are removed and a StringTokenizer 'st' is used, and whilst the tokenizer has more tokens...
    while (st.hasMoreTokens()) {
        Object a = st.nextToken();
        Integer freq = all.get(a);
        all.put(a, (freq == null) ? 1 : freq + 1);       // corpus-wide count
        freq = each[i].get(a);
        each[i].put(a, (freq == null) ? 1 : freq + 1);   // per-file count
    }
    As you can see, the TreeMap 'all' holds the word counts across the whole corpus and the array of TreeMaps 'each' holds the counts for each file.
    
    Then to output the matrix...
    BufferedWriter tdm = new BufferedWriter(new FileWriter("tdm.txt"));
    BufferedWriter terms = new BufferedWriter(new FileWriter("terms.txt"));
    Iterator it = all.keySet().iterator();
    while (it.hasNext()) {
        Object a = it.next();
        terms.write(a + "\r\n");
        for (int i = 0; i < sn; i++) {
            if (each[i].get(a) != null)
                tdm.write(each[i].get(a) + ",");
            else
                tdm.write(0 + ",");
        }
        tdm.write("\r\n");
    }
    tdm.close();     // flush and close, otherwise the tail of the output can be lost
    terms.close();
    First a list of all the words in 'all' is output.
    
    Then, for each of the 'each' TreeMaps, it looks up these words and writes either the number of times the word occurred in that song, or 0 if it didn't occur.
    
    In this matrix, words are on the y axis and songs on the x axis.
    
    Now... I realise the problem is probably with the TreeMaps, and the fact that there are sn+1 of them and it has to cycle through all of them. But I can't see another way of outputting a sparse matrix.
    
    Your suggestions seem like a better approach but I can't quite get my head around them. If you could explain them a bit better, and perhaps compare the way it works with my current method then that would be great. I'm going to mark this as unanswered again :)
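
    (For comparison, a rough sketch of how the suggested word -> (filename -> count) map could write out the same tdm.txt and terms.txt, generating the zeros only at output time; the names here are illustrative, not tested code:)

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.List;
    import java.util.Map;

    class MatrixWriter {
        // counts: word -> (filename -> frequency); fileNames: the sn file names in column order.
        static void writeMatrix(Map<String, Map<String, Integer>> counts, List<String> fileNames) throws IOException {
            BufferedWriter tdm = new BufferedWriter(new FileWriter("tdm.txt"));
            BufferedWriter terms = new BufferedWriter(new FileWriter("terms.txt"));
            try {
                for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
                    terms.write(e.getKey() + "\r\n");                // sorted word order if counts is a TreeMap
                    Map<String, Integer> perFile = e.getValue();
                    for (String file : fileNames) {
                        Integer freq = perFile.get(file);
                        tdm.write((freq == null ? 0 : freq) + ","); // zeros are generated here, never stored
                    }
                    tdm.write("\r\n");
                }
            } finally {
                tdm.close();
                terms.close();
            }
        }
    }

    The output loop is essentially the same as the current one; the difference is on the building side, where every count goes into a single structure keyed by word rather than into sn separate per-file TreeMaps.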
    
    Thanks again!
    
    
    Edited by: snappyfool on Feb 7, 2008 8:00 AM
    
    Edited by: snappyfool on Feb 7, 2008 8:01 AM