
    Problem reading big file. No, bigger than that. Bigger.

    844979
      I am trying to read a file roughly 340 GB in size. Yes, that's "Three hundred forty". Yes, gigabytes. (I've been doing searches on "big file java reading" and I keep finding things like "I have this huge file, it's 600 megabytes!".)

      "Why don't you split it, you moron?" you ask. Well, I'm trying to.

      Specifically, I need a slice "x" rows in. It's nicely delimited, so, in theory:

      (pseudocode)

      BufferedReader fr = new BufferedReader(new FileReader(new File(myhugefile)));
      int startLine = 70000000;
      String line;
      int linesRead = 0;
      while (((line = fr.readLine()) != null) && (linesRead < startLine))
      {
          linesRead++; // we don't care about these lines
      }
      // ok, we're where we want to be, start caring
      int linesWeWant = 100;
      linesRead = 0;
      while (((line = fr.readLine()) != null) && (linesRead < linesWeWant))
      {
          doSomethingWith(line);
          linesRead++;
      }

      (Please assume the real code is better written and has been proven to work with hundreds of "small" files (under a gigabyte or two). I'm happy with my file read/file slice logic, overall.)

      Here's the problem. No matter how I try reading the file, whether I start with a specific line or not, whether I am saving out a line to a string or not, it always dies with an OOM (OutOfMemoryError) at around row 793,000,000. The OOM is thrown from BufferedReader.readLine(). Please note I'm not trying to read the whole file into a buffer, just one line at a time. Further, it dies at the same point no matter how high or low (within reason) I set my heap size, and watching the memory allocation shows it's not coming close to filling memory. I suspect the problem is occurring once I've read more than int bytes (around 2GB) from the file.

      Now -- the problem is that it's not just this one file -- the program needs to handle a general class of comma- or tab-delimited files which may have any number of characters per row and any number of rows, and it needs to do so in a moderately sane timeframe. So this isn't a one-off where we can hand-tweak an algorithm because we know the file structure. I am trying it now using RandomAccessFile.readLine(), since that's not buffered (I think...), but, my god, is it slow... my old code read 79 million lines and crashed in under three minutes; the RandomAccessFile code has taken about 45 minutes and has only read 2 million lines.

      Likewise, we might start at line 1 and want a million lines, or start at line 50 million and want 2 lines. Nothing can be assumed about where we start caring about data or how much we care about; the only assumption is that it's a delimited (tab or comma, or possibly some other delimiter) file with one record per line.

      And if I'm missing something brain-dead obvious...well, fine, I'm a moron. I'm a moron who needs to get files of this size read and sliced on a regular basis, so I'm happy to be told I'm a moron if I'm also told the answer. Thank you.
        • 1. Re: Problem reading big file. No, bigger than that. Bigger.
          EJP
          I agree it shouldn't happen but I would question the entire design. Data this large should be in a database, not a sequential file.
          • 2. Re: Problem reading big file. No, bigger than that. Bigger.
            Kayaman
            LizardSF wrote:
            I suspect the problem is occurring when I've read more than int bytes into a file.
            Why? I think if going over 2GB caused BufferedReader to choke, it would be more widely known by now.

            Of course it could be that a buggy BufferedReader chokes on larger files. Maybe try using a different mechanism, like FileChannel.
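
            Something like this, perhaps (just an untested sketch on my part, class name and all; it counts lines through a FileChannel and a fixed-size ByteBuffer, so nothing unbounded ever gets buffered):

            import java.io.FileInputStream;
            import java.nio.ByteBuffer;
            import java.nio.channels.FileChannel;

            public class LineCounter {
                public static void main(String[] args) throws Exception {
                    FileChannel ch = new FileInputStream(args[0]).getChannel();
                    ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024); // 1MB buffer
                    long lines = 0;
                    while (ch.read(buf) != -1) {
                        buf.flip();
                        while (buf.hasRemaining()) {
                            if (buf.get() == '\n') lines++; // count terminators only
                        }
                        buf.clear();
                    }
                    System.out.println(lines + " lines");
                    ch.close();
                }
            }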
            • 3. Re: Problem reading big file. No, bigger than that. Bigger.
              YoungWinston
              LizardSF wrote:
              And if I'm missing something brain-dead obvious...well, fine, I'm a moron. I'm a moron who needs to get files of this size read and sliced on a regular basis, so I'm happy to be told I'm a moron if I'm also told the answer.
              Like EJP, I'd start looking for the moron that handed you this "file" with a pair of rusty secateurs.

              In 30 years, I've worked on precisely two databases that were bigger than that; one of them being the fleet maintenance database for an airline group that included Northwestern, Delta, Canadian and several others.

              Winston

              Edited by: YoungWinston on Mar 5, 2011 1:47 PM

              Are these log files by any chance?
              • 4. Re: Problem reading big file. No, bigger than that. Bigger.
                YoungWinston
                LizardSF wrote:
                Here's the problem. No matter how I try reading the file, whether I start with a specific line or not, whether I am saving out a line to a string or not, it always dies with an OOM (OutOfMemoryError) at around row 793,000,000. The OOM is thrown from BufferedReader.readLine(). Please note I'm not trying to read the whole file into a buffer, just one line at a time. Further, it dies at the same point no matter how high or low (within reason) I set my heap size, and watching the memory allocation shows it's not coming close to filling memory. I suspect the problem is occurring once I've read more than int bytes (around 2GB) from the file.
                That would suggest (if your more-than-int-bytes theory were right, with the limit at around 2GB) that each "line" is only about 3 bytes long, which seems unlikely.

                Here's a possibility: Have you tried skipping, rather than reading?

                You could try a simple test: read up to line 750,000,000, keeping track of the character offset for each line, and store/print the last valid value when you finish; then try another program which skips that number of characters (skip() takes a long, not an int) and then attempts to read lines from that point. If you still have problems when you get to line 43,000,000 (i.e., 793,000,000 in total), then clearly you will need to take another tack; but if not, you can repeat that test until you run out of room again. If it allows you a similar number of lines, just repeat the process and you may have your first cut at "chopping up" the file. Even a 340Gb file can't have that many 750,000,000-line chunks.
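
                For the second program, something like this (untested sketch; note that skip() counts chars and may skip fewer than asked, so it needs a loop):

                import java.io.BufferedReader;
                import java.io.FileReader;

                public class SkipTest {
                    public static void main(String[] args) throws Exception {
                        long charsToSkip = Long.parseLong(args[1]); // offset recorded by the first program
                        BufferedReader in = new BufferedReader(new FileReader(args[0]));
                        long skipped = 0;
                        while (skipped < charsToSkip) {
                            long n = in.skip(charsToSkip - skipped);
                            if (n <= 0) break; // EOF (or nothing skipped)
                            skipped += n;
                        }
                        long lines = 0;
                        while (in.readLine() != null) {
                            lines++; // how far past the skip point do we get?
                        }
                        System.out.println("Read " + lines + " lines after the skip point");
                        in.close();
                    }
                }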

                Winston
                • 5. Re: Problem reading big file. No, bigger than that. Bigger.
                  844979
                  The files I need to work with are, for the most part, reports OUT of a database... that for reasons too complex to go into here, we can't access directly. (And while that makes it sound like I'm trying to parse Secret Hacked Data, the truth is a thousand times more mundane and boring... sigh.)

                  I'll look at some of the other suggestions posted here and see if I can make them work. If anyone has had experience with files of this size, please, chime in.
                  • 6. Re: Problem reading big file. No, bigger than that. Bigger.
                    844979
                    YoungWinston... I mistyped something rather important, which is what happens when I post late at night after hammering on the same problem since early in the morning... the choke point is around 79,300,000, so I was off by an order of magnitude. Probably doesn't change your main point, but I wanted to provide accurate data.

                    Edited by: LizardSF on Mar 5, 2011 5:29 AM
                    • 7. Re: Problem reading big file. No, bigger than that. Bigger.
                      YoungWinston
                      LizardSF wrote:
                      YoungWinston... I mistyped something rather important, which is what happens when I post late at night after hammering on the same problem since early in the morning... the choke point is around 79,300,000, so I was off by an order of magnitude. Probably doesn't change your main point, but I wanted to provide accurate data.
                      Ah, OK. 30 bytes per line does seem reasonable. You could still try my suggestion (with decimal points appropriately adjusted). I suspect you'll find out on the first iteration whether it's an absolute limit or not.

                      Winston

                      Edited by: YoungWinston on Mar 5, 2011 2:34 PM

                      PS: Don't forget to include the line terminator in your offset calculations. You may also need to discover whether it's a newline (1 byte) or good old Windows CRLF (2).
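
                      For instance (my sketch only, assuming single-byte chars so char offsets equal byte offsets; pass the terminator length in once you've checked which one your file uses):

                      import java.io.BufferedReader;
                      import java.io.FileReader;

                      // Prints the char offset at which every millionth line ends, terminator included.
                      public class OffsetTracker {
                          public static void main(String[] args) throws Exception {
                              int termLen = Integer.parseInt(args[1]); // 1 for LF, 2 for CRLF - check your file
                              BufferedReader in = new BufferedReader(new FileReader(args[0]));
                              long offset = 0, lineNo = 0;
                              String line;
                              while ((line = in.readLine()) != null) {
                                  lineNo++;
                                  offset += line.length() + termLen; // readLine() strips the terminator
                                  if (lineNo % 1000000 == 0) {
                                      System.out.println("line " + lineNo + " ends at char " + offset);
                                  }
                              }
                              in.close();
                          }
                      }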
                      • 8. Re: Problem reading big file. No, bigger than that. Bigger.
                        jschellSomeoneStoleMyAlias
                        LizardSF wrote:
                        And if I'm missing something brain-dead obvious...well, fine, I'm a moron. I'm a moron who needs to get files of this size read and sliced on a regular basis, so I'm happy to be told I'm a moron if I'm also told the answer. Thank you.
                        Streaming reads of files that don't fit into available memory are not new to Java; they go back way before it.
                        So it is really unlikely that Java has a problem with that.

                        So what you have is an application that has a leak.

                        I suggest you prove it by writing a new application which does nothing but read lines.

                        It should be short enough that, if it does still have a memory problem, you can post the code.

                        You should also use the command line options to reduce the maximum memory to a value that is much lower. Say 64 meg or 128 meg. That will cause the problem to happen much faster.

                        As one other thought: you are assuming that each line is 'small'. What specifically prevents a single line in the file from being, say, 1 gig in size? (1 gig of ASCII chars takes 2 gigs of String space, since Java chars are two bytes.)
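
                        E.g., a test app along these lines (my sketch only, names hypothetical), run with a small heap such as java -Xmx64m LineReadTest bigfile; tracking the longest line also covers the 'monster line' possibility:

                        import java.io.BufferedReader;
                        import java.io.FileReader;

                        // Does nothing but read lines and report progress.
                        public class LineReadTest {
                            public static void main(String[] args) throws Exception {
                                BufferedReader in = new BufferedReader(new FileReader(args[0]));
                                long count = 0;
                                int maxLen = 0;
                                String line;
                                while ((line = in.readLine()) != null) {
                                    count++;
                                    if (line.length() > maxLen) maxLen = line.length(); // watch for monster lines
                                    if (count % 10000000 == 0) {
                                        System.out.println(count + " lines, longest so far: " + maxLen);
                                    }
                                }
                                System.out.println("Done: " + count + " lines, longest: " + maxLen);
                                in.close();
                            }
                        }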
                        • 9. Re: Problem reading big file. No, bigger than that. Bigger.
                          844979
                          FWIW, here's the exact error message. I tried this one with RandomAccessFile instead of BufferedReader because, hey, maybe the problem was the buffering. So it took about 14 hours and crashed at the same point anyway.

                          Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
                               at java.util.Arrays.copyOf(Unknown Source)
                               at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source)
                               at java.lang.AbstractStringBuilder.append(Unknown Source)
                               at java.lang.StringBuffer.append(Unknown Source)
                               at java.io.RandomAccessFile.readLine(Unknown Source)
                               at utility.FileSlicer.slice(FileSlicer.java:65)

                          Still haven't tried the other suggestions, wanted to let this run.
                          • 10. Re: Problem reading big file. No, bigger than that. Bigger.
                            YoungWinston
                            LizardSF wrote:
                            FWIW, here's the exact error message. I tried this one with RandomAccessFile instead of BufferedReader because, hey, maybe the problem was the buffering. So it took about 14 hours and crashed at the same point anyway.

                            Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
                                 at java.util.Arrays.copyOf(Unknown Source)
                                 at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source)
                                 at java.lang.AbstractStringBuilder.append(Unknown Source)
                                 at java.lang.StringBuffer.append(Unknown Source)
                                 at java.io.RandomAccessFile.readLine(Unknown Source)
                                 at utility.FileSlicer.slice(FileSlicer.java:65)

                            Still haven't tried the other suggestions, wanted to let this run.
                            Rule 1: When you're testing, especially when you don't know what the problem is, change ONE thing at a time.
                            Now that you've introduced RandomAccessFile into the equation, you still have no idea what's causing the problem, and neither do we (unless there's someone here who's been through this before).

                            Unless you can see any better posts (and there may well be; some of these guys are Gods to me too), try what I suggested with your original class (or at least a modified copy). If it fails, chances are that there IS some absolute limit that you can't cross; in which case, try Kayaman's suggestion of a FileChannel.

                            But at least give yourself the chance of KNOWING what the problem is, or where it's happening.

                            Winston
                            • 11. Re: Problem reading big file. No, bigger than that. Bigger.
                              YoungWinston
                              LizardSF wrote:
                              FWIW, here's the exact error message. I tried this one with RandomAccessFile instead of BufferedReader because, hey, maybe the problem was the buffering. So it took about 14 hours and crashed at the same point anyway.
                              Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
                                   at java.util.Arrays.copyOf(Unknown Source)
                                   at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source)
                                   at java.lang.AbstractStringBuilder.append(Unknown Source)
                                   at java.lang.StringBuffer.append(Unknown Source)
                                   at java.io.RandomAccessFile.readLine(Unknown Source)
                                   at utility.FileSlicer.slice(FileSlicer.java:65)
                              Next time, it would be a good idea to furnish this straight away. Looking at it, the error is now, unfortunately, coming from RandomAccessFile, not BufferedReader (although the two may be similar). I can understand it (maybe) from the point of view of a RandomAccessFile, but NOT from a BufferedReader.

                              In this case, the AbstractStringBuilder.expandCapacity() seems to be the problem, probably because it's trying to allocate an array that is too big for your current settings.

                              Try skip(). PLEASE.

                              Winston
                              • 12. Re: Problem reading big file. No, bigger than that. Bigger.
                                EJP
                                I would suspect something wrong with the file: it has a seriously long line in there, without newline characters.
                                • 13. Re: Problem reading big file. No, bigger than that. Bigger.
                                  YoungWinston
                                  EJP wrote:
                                  I would suspect something wrong with the file: it has a seriously long line in there, without newline characters.
                                  Aha! Someone who knows the underlying source, I suspect. I guess the test for that would be to use read().
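
                                  Something along these lines, maybe (untested; skip to the last known-good offset, then read() char by char until a terminator or EOF):

                                  import java.io.BufferedReader;
                                  import java.io.FileReader;

                                  public class TerminatorHunt {
                                      public static void main(String[] args) throws Exception {
                                          BufferedReader in = new BufferedReader(new FileReader(args[0]));
                                          long toSkip = Long.parseLong(args[1]); // offset of the last known-good line end
                                          long skipped = 0;
                                          while (skipped < toSkip) {
                                              long n = in.skip(toSkip - skipped);
                                              if (n <= 0) break;
                                              skipped += n;
                                          }
                                          long chars = 0;
                                          int c;
                                          while ((c = in.read()) != -1) {
                                              chars++;
                                              if (c == '\n' || c == '\r') break; // found a terminator
                                          }
                                          if (c == -1) {
                                              System.out.println("EOF after " + chars + " chars - no terminator left");
                                          } else {
                                              System.out.println("Terminator after " + chars + " chars");
                                          }
                                          in.close();
                                      }
                                  }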

                                  Winston
                                  • 14. Re: Problem reading big file. No, bigger than that. Bigger.
                                    844979
                                    @Winston

                                    I did the skip(), and then read char-by-char, and it was, as someone else suggested, a data error -- no line terminator after that point in the file, so the obvious occurred. It's killing me that this was the first thing I thought of, well before I posted here, and I thought I'd written a test that verified it wasn't, but clearly, my original test was poorly constructed.

                                     Thanks to everyone for their help. The simple act of articulating the problem clearly enough to get assistance is often useful in itself, and your questions and comments helped me focus on the most likely causes. Now to go bounce this back up the food chain to the original source of the file...