
Reading Large text files into ArrayLists

843793 Newbie
Hello

I am sorry if this issue has been addressed on this forum before.

I have an application in which I am trying to load a relatively large data file using a BufferedReader, but during execution I get this error:

Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
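
The code in question is essentially this pattern (reconstructed for illustration; the file name is a placeholder):

    import java.io.*;
    import java.util.*;

    public class LoadWholeFile {
        public static void main(String[] args) throws IOException {
            List<String> records = new ArrayList<String>();
            BufferedReader in = new BufferedReader(new FileReader("data.txt"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    records.add(line); // every line is retained, so a large file exhausts the heap
                }
            } finally {
                in.close();
            }
            System.out.println(records.size() + " records loaded");
        }
    }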

Could someone please give me some pointers on how to get out of this situation?
  • 1. Re: Reading Large text files into ArrayLists
    843793 Newbie
    So why is it necessary to hold the whole file in memory?
  • 2. Re: Reading Large text files into ArrayLists
    843793 Newbie
    I am dealing with a text file that has a hierarchical set of identification items (these are geographical units, all the way down to the household level). Ideally, I am supposed to search through each enumeration area (which is higher than the household) and find cases where there are duplicate household numbers, then edit them.

    For me to do this, I need the file in memory (or so I think).
  • 3. Re: Reading Large text files into ArrayLists
    843793 Newbie
    Holding a whole file in memory at one time is normally a bad idea; as you are finding, it does not scale well. Since you say you need to "search through each enumeration area", can this be done while going through the file sequentially, something like the sketch below? If not, then maybe you should consider creating a database to hold the data. This would allow you to build indexes that could make searching very, very fast.
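
    Here is a minimal sketch of the sequential approach, assuming (purely for illustration) that records for one enumeration area are contiguous in the file and that the layout is comma-separated with the area id in the first field and the household number in the second:

        import java.io.*;
        import java.util.*;

        public class DuplicateHouseholdCheck {
            public static void main(String[] args) throws IOException {
                BufferedReader in = new BufferedReader(new FileReader("census.txt")); // placeholder name
                try {
                    String currentArea = null;
                    Set<String> households = new HashSet<String>();
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] fields = line.split(",");  // assumed layout: area, household, ...
                        if (fields.length < 2) continue;
                        if (!fields[0].equals(currentArea)) {
                            currentArea = fields[0];        // new enumeration area: forget the old one
                            households.clear();
                        }
                        if (!households.add(fields[1])) {   // add() returns false on a duplicate
                            System.out.println("Duplicate household " + fields[1] + " in area " + currentArea);
                        }
                    }
                } finally {
                    in.close();
                }
            }
        }

    Only one enumeration area's household numbers are ever in memory at a time.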
  • 4. Re: Reading Large text files into ArrayLists
    843793 Newbie
    What have you set the maximum memory size to? Try -Xmx1g on the command line and see what happens.

    How big is the data file? You can expect it to use about three times that much in memory.
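
    For example, assuming your main class is called MyApp (a placeholder name):

        java -Xmx1g MyApp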
  • 5. Re: Reading Large text files into ArrayLists
    800387 Newbie
    I agree with Sabre. There are nightmare scenarios where you must read everything into memory to perform a given piece of work, but these are rare. You normally end up doing one or more of the following:
    - Reading data from an input source, transforming it, and writing it out
    - Calculating statistics as records are processed (e.g., sum, count, average)
    For both of these very common use cases, you rarely need to have all the data in memory. In the first instance, you simply transform each record as you read it. In the second, you simply declare a few instance variables that are updated as records are read. It is very common to have to do both at once, e.g., read a file record by record, keep track of some statistics, transform the output, and then store the results of the calculations (see the sketch below). However, none of these use cases normally requires the entire file to be in memory. When that does in fact occur, I agree: put it in a database, write out a temporary file to the filesystem, or find some other solution.
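
    To make that concrete, here is a minimal sketch of both patterns at once: read a file line by line, keep running statistics, and write a transformed record out. The file names, the one-number-per-line record format, and the doubling transformation are all placeholder assumptions:

        import java.io.*;

        public class StreamingTransform {
            public static void main(String[] args) throws IOException {
                BufferedReader in = new BufferedReader(new FileReader("input.txt"));
                PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("output.txt")));
                long count = 0;
                double sum = 0.0;
                try {
                    String line;
                    while ((line = in.readLine()) != null) {
                        double value = Double.parseDouble(line.trim()); // assumed record format
                        sum += value;                                   // running statistics
                        count++;
                        out.println(value * 2);                         // placeholder transformation
                    }
                } finally {
                    in.close();
                    out.close();
                }
                System.out.println("count=" + count + ", average=" + (count == 0 ? 0 : sum / count));
            }
        }

    Only one line is ever held in memory, no matter how large the file is.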

    - Saish
  • 6. Re: Reading Large text files into ArrayLists
    800387 Newbie
    I guess I'll add a couple of real-life examples. We routinely receive files from third-party vendors that are hundreds of megabytes in size and that are transformed to a common format before being sent on to another third-party system for processing. Although complex calculations and summaries are being generated, and each record is being transformed to another format, each source record can be read in one at a time. There is no need to keep everything in memory.

    A simpler example. Someone wants to download a PDF from your filesystem to their browser. Say a given PDF is 50 or so megabytes, and multiple users are concurrently accessing your site. There is no reason to read 50 megabytes into memory and then write 50 megabytes back to the browser. It is far more efficient (in terms of memory) to read, say, 2K at a time and stream it to the browser, as sketched below. You save orders of magnitude of memory.
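
    A minimal sketch of that chunked copy (the 2K buffer size is the one mentioned above; in a web application, out would be the response's output stream):

        import java.io.*;

        public class StreamCopy {
            // Copies in to out in small chunks; memory use never exceeds the buffer size.
            public static void copy(InputStream in, OutputStream out) throws IOException {
                byte[] buffer = new byte[2048];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            }
        }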

    Finally, the bad example. We have to generate Excel summaries that cannot be output as CSV or anything else that can be streamed (due to formatting requirements from the client). We use POI. That API requires (or at least did as of a few versions ago) that the entire sheet be built before you can flush it to an output stream. Here, there is no really satisfying solution. The API itself prevents us from streaming, and the content may grow very large in memory. For this reason, we limit exports to, say, 10,000 rows. After that, a business exception is thrown and the user is prompted to add more filtering criteria to make the sheet manageable (see the snippet below).
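
    The guard itself is trivial; something like this sketch (the limit and the exception type are placeholders; the real code would throw an application-specific business exception):

        public class ExportGuard {
            static final int MAX_EXPORT_ROWS = 10000;

            static void checkRowCount(int rowCount) {
                if (rowCount > MAX_EXPORT_ROWS) {
                    throw new IllegalStateException("Export too large; please add more filter criteria");
                }
            }
        }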

    - Saish
  • 7. Re: Reading Large text files into ArrayLists
    843793 Newbie
    I appreciate the insights espoused here, and I have to admit that after reading the responses and doing some googling, I am persuaded that I might have to go the database way. I am expecting relatively large data sets from each of the 8 regions, some of them up to about 2 GB (I was testing the app with relatively smaller files), and frankly the kind of calculations I have to do on the files would need the file loaded in memory. Given that this isn't possible, I have to explore other alternatives, one of which is the database way.

    Thanks people, you're the greatest.
  • 8. Re: Reading Large text files into ArrayLists
    843793 Newbie
    I am expecting relatively large data sets from each of the 8 regions, some of them up to about 2 GB (I was testing the app with relatively smaller files), and frankly the kind of calculations I have to do on the files would need the file loaded in memory. Given that this isn't possible...
    Why? You either need to do this or you don't. If your data needs to be in memory and you place it on disk instead, it will be many, many times slower (10x to 1000x slower). Also, don't expect a database to be more compact at placing the data in memory; it is often far more expensive.

    You can get a server with two quad-core processors and 24 GB of memory for around £2,500. If you are careful with your memory use, you should be able to fit it all into memory.

    Note: you can get
    - a 48 GB server for about £5,000
    - a 96 GB server for about £9,000
    - a 192 GB server for about £14,000

    What is your budget for this project? Can it not afford a server which will do the job?
  • 9. Re: Reading Large text files into ArrayLists
    800387 Newbie
    @OP: You have not convinced me that you need to load everything into memory. What are your actual requirements? 98% of the time when someone tells me they "must" load everything into memory, it is a poorly conceived design.

    @Peter: Seriously? Just throw more hardware at the problem?

    - Saish
  • 10. Re: Reading Large text files into ArrayLists
    843793 Newbie
    Saish wrote:
    @Peter: Seriously? Just throw more hardware at the problem?
    I worked for some time for a very large Japanese bank in London, and after receiving an estimate of the likely cost of improving the performance of some dealing software, they just bought all the dealers the very top-of-the-range multi-processor Sun workstations. Much, much cheaper than trying to fix some very, very poor software.
  • 11. Re: Reading Large text files into ArrayLists
    800387 Newbie
    sabre150 wrote:
    Saish wrote:
    @Peter: Seriously? Just throw more hardware at the problem?
    I worked for some time for a very large Japanese bank in London, and after receiving an estimate of the likely cost of improving the performance of some dealing software, they just bought all the dealers the very top-of-the-range multi-processor Sun workstations. Much, much cheaper than trying to fix some very, very poor software.
    I can imagine cases like that. It would not be my default position though. Seems a pity, doesn't it?

    - Saish
  • 12. Re: Reading Large text files into ArrayLists
    843793 Newbie
    Saish wrote:
    @Peter: Seriously? Just throw more hardware at the problem?
    For the system I work on, not even 1% of the data is in memory at a given moment and it performs very well, if I say so myself. :) BTW I don't use a database.

    However, it is not possible to do this in all cases and a 24 GB system can be leased for less than £5/day. You have to balance the cost of development time against the cost of the hardware it runs on. At the end of the day, we have to make a sensible commercial decision not just the nicest technical one.
  • 13. Re: Reading Large text files into ArrayLists
    800387 Newbie
    Peter__Lawrey wrote:
    Saish wrote:
    @Peter: Seriously? Just throw more hardware at the problem?
    For the system I work on, not even 1% of the data is in memory at a given moment and it performs very well, if I say so myself. :) BTW I don't use a database.

    However, it is not possible to do this in all cases and a 24 GB system can be leased for less than £5/day. You have to balance the cost of development time against the cost of the hardware it runs on. At the end of the day, we have to make a sensible commercial decision not just the nicest technical one.
    I agree that there are sometimes better non-technical (or at least non-development) solutions. For example, an XML structure that is heterogeneous and very large and not amenable to I/O streaming or per-record processing. Or my favorite example, from The Pragmatic Programmer: walking a backup tape across the street to a facility instead of transmitting the data over a WAN. However, my strong hunch is that the OP is not in one of these categories. It is one thing to have to refactor an existing process that is no longer performing under load; it is another, IMO, to simply opt at the design stage to process in memory without at least attempting to have a low footprint. That's all.

    - Saish
  • 14. Re: Reading Large text files into ArrayLists
    843793 Newbie
    It is one thing to have to refactor an existing process that is no longer performing under load; it is another, IMO, to simply opt at the design stage to process in memory without at least attempting to have a low footprint. That's all.
    I agree. The OP needs to determine whether he really needs to load all of the data into memory or not. If he does need to keep it all in memory, that is an option, possibly with new hardware. If he doesn't, he might find he needs to keep only a small portion in memory at any given moment, e.g. if he can use efficient indexing, as in the sketch below.
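
    One sketch of that idea: index the file by enumeration area in a first pass, keeping only a map of area ids to file offsets, then seek straight to an area when it is needed. The file name, comma-separated layout, and area id are placeholder assumptions:

        import java.io.*;
        import java.util.*;

        public class OffsetIndex {
            public static void main(String[] args) throws IOException {
                Map<String, Long> areaOffsets = new HashMap<String, Long>();
                RandomAccessFile file = new RandomAccessFile("census.txt", "r");
                try {
                    // Pass 1: remember where each area starts, not the records themselves.
                    long offset = file.getFilePointer();
                    String line;
                    while ((line = file.readLine()) != null) {
                        String area = line.split(",")[0];          // assumed layout: area id first
                        if (!areaOffsets.containsKey(area)) {
                            areaOffsets.put(area, offset);
                        }
                        offset = file.getFilePointer();
                    }
                    // Later: jump straight to one area instead of re-reading the whole file.
                    Long start = areaOffsets.get("AREA-42");       // placeholder area id
                    if (start != null) {
                        file.seek(start);
                        System.out.println(file.readLine());
                    }
                } finally {
                    file.close();
                }
            }
        }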