11 Replies Latest reply on Oct 31, 2011 10:18 PM by jschellSomeoneStoleMyAlias

    File I/O - Performance design

    888728
      I would like a design solution on File I/O

      I have a nightly batch process which reads records from a flat file, validates the records, creates errors for records if any, and updates the file status in the database.
      The text file can have 500,000 or more records.

      Design I have thought of:
      1) Iterate through records using BufferedReader, create a POJO, pass the POJO to the validation framework, update the POJO with error details.
      Drawback: I think it will take a performance/memory hit if we create a POJO for each record read.

      2) Iterate through records using BufferedReader, create an ArrayList of Strings (maybe 1000 records at a time), pass the ArrayList to the validation framework, create another ArrayList with error details and return it.

      Can you please suggest any other design? I would appreciate it if you can be detailed in your replies.
        • 1. Re: File I/O - Performance design
          TPD-Opitz
          If you're interested in errors only, you should collect only them.
          So my suggestion is to use a single parser object that is able to validate a single line of the file and have it collect the errors in a Collection.

          bye
          TPD
          • 2. Re: File I/O - Performance design
            796440
            885725 wrote:
            I would like a design solution on File I/O

             I have a nightly batch process which reads records from a flat file, validates the records, creates errors for records if any, and updates the file status in the database.
            The text file can have 500,000 or more records.

            Design I have thought of:
             1) Iterate through records using BufferedReader, create a POJO, pass the POJO to the validation framework, update the POJO with error details.
             Drawback: I think it will take a performance/memory hit if we create a POJO for each record read.
             Creating an object will be far less of a time overhead than reading that object's contents from disk. Also, don't assume it will be too slow. Write a test program.

            As for memory, it really depends on how much stuff you're storing with each one. 500,000 objects is not a lot if the objects don't have lots of fields. However, you have to ask yourself whether you need to keep them all in memory at once, or whether you can process them independently one at a time.
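             A rough way to test it yourself (completely unscientific; the file name and the stand-in per-record object below are just placeholders, not your actual classes):

             import java.io.BufferedReader;
             import java.io.FileReader;
             import java.io.IOException;

             public class CreationCostTest {
                 public static void main(String[] args) throws IOException {
                     long readOnly = timeRun(false);
                     long readAndCreate = timeRun(true);
                     System.out.println("read only:          " + readOnly + " ms");
                     System.out.println("read + object each: " + readAndCreate + " ms");
                 }

                 // Reads the whole file; optionally creates one object per line.
                 static long timeRun(boolean createObjects) throws IOException {
                     long start = System.currentTimeMillis();
                     BufferedReader in = new BufferedReader(new FileReader("records.txt"));
                     try {
                         String line;
                         while ((line = in.readLine()) != null) {
                             if (createObjects) {
                                 new StringBuilder(line); // stand-in for the per-record POJO
                             }
                         }
                     } finally {
                         in.close();
                     }
                     return System.currentTimeMillis() - start;
                 }
             }

             If the two timings come out close, the object creation is lost in the noise of the disk I/O.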
            • 3. Re: File I/O - Performance design
              888728
              Thank you for your response.

              Yes, I would definitely have a test program but just wanted to know my design options.

               My design was not to keep everything in memory but one record - one object. We are saying creating an object for every record
               should NOT be an overhead, but do you have any other design solution that you may have implemented for flat file processing?


              With POJO my code would look something like this:

               import java.io.BufferedReader;
               import java.io.File;
               import java.io.FileReader;
               import java.io.IOException;
               import java.util.List;

               class ErrorObject {
                    String errorCode;
                    String errorDesc;
               }

               class RecordObject {
                    String recordData;
                    List<ErrorObject> errorObjList;

                    RecordObject(String recordData) {
                         this.recordData = recordData;
                    }
               }

               class Batch {
                    void process(File file) throws IOException {
                         BufferedReader bReader = new BufferedReader(new FileReader(file));
                         String recordStr;
                         while ((recordStr = bReader.readLine()) != null) {
                              new RecordObject(recordStr); // one short-lived object per record
                         }
                         bReader.close();
                    }
               }
                             
               While I was reading through many articles on performance with file I/O, they did mention that creating an object for every record will degrade performance,
               so that created doubts for me.
              • 4. Re: File I/O - Performance design
                796440
                885725 wrote:
                 My design was not to keep everything in memory but one record - one object. We are saying creating an object for every record
                So why would you think holding one object at a time would be a memory problem?
                 should NOT be an overhead,
                But in your original post you said you felt it would be.
                 but do you have any other design solution that you may have implemented for flat file processing?
                My two basic designs for reading a row-based text file are 1) Read a line, process it, discard it, and 2) Read all lines, then process them all. I don't know enough about your requirements (nor do I care to know) to provide any refinements that might be relevant to your particular case.
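                 In code, the two shapes look roughly like this (the file name and the empty process() method are placeholders, not your actual logic):

                 import java.io.BufferedReader;
                 import java.io.FileReader;
                 import java.io.IOException;
                 import java.util.ArrayList;
                 import java.util.List;

                 public class TwoDesigns {
                     // 1) Streaming: read a line, process it, discard it.
                     static void streaming(String fileName) throws IOException {
                         BufferedReader in = new BufferedReader(new FileReader(fileName));
                         try {
                             String line;
                             while ((line = in.readLine()) != null) {
                                 process(line); // only the current record is in memory
                             }
                         } finally {
                             in.close();
                         }
                     }

                     // 2) Read all lines first, then process them all.
                     static void loadThenProcess(String fileName) throws IOException {
                         List<String> lines = new ArrayList<String>();
                         BufferedReader in = new BufferedReader(new FileReader(fileName));
                         try {
                             String line;
                             while ((line = in.readLine()) != null) {
                                 lines.add(line);
                             }
                         } finally {
                             in.close();
                         }
                         for (String line : lines) {
                             process(line); // the whole file is held in memory
                         }
                     }

                     static void process(String line) {
                         // placeholder for validation / database work
                     }
                 }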
                 While I was reading through many articles on performance with file I/O, they did mention that creating an object for every record will degrade performance
                Well, yes, creating objects costs finite CPU, which is more than the zero CPU cost for not creating them. But in general, transferring X bytes from disk to memory will be orders of magnitude more expensive than turning those X bytes into an object. So basically the cost goes from 1 unit per file row to 1.01 units. (These numbers are, of course, made up. Just trying to get the point across.)
                • 5. Re: File I/O - Performance design
                  TPD-Opitz
                  How about this:
                   import java.io.BufferedReader;
                   import java.io.File;
                   import java.io.FileReader;
                   import java.io.IOException;
                   import java.util.ArrayList;
                   import java.util.List;

                   class ErrorObject {
                     String errorCode;
                     String errorDesc;
                   }

                   class ObjectException extends Exception {
                     List<ErrorObject> errorObjList = new ArrayList<ErrorObject>();
                     void addError(ErrorObject pErrorObject) { errorObjList.add(pErrorObject); }
                   }

                   class RecordObject {
                     void validate(String recordData) throws ObjectException { ... }
                   }

                   class Batch {
                     void process(File file) throws IOException {
                       List<ErrorObject> errorObjList = new ArrayList<ErrorObject>();
                       BufferedReader bReader = new BufferedReader(new FileReader(file));
                       RecordObject validator = new RecordObject();
                       String recordStr;
                       while ((recordStr = bReader.readLine()) != null) {
                         try {
                           validator.validate(recordStr);
                         } catch (ObjectException ex) {
                           errorObjList.addAll(ex.errorObjList);
                         }
                       }
                       bReader.close();
                     }
                   }
                   There is only one RecordObject. Also you only have to handle errors when they occur.

                  bye
                  TPD
                  • 6. Re: File I/O - Performance design
                    walterln
                     Since you're validating and expecting errors I would not use exceptions for your control flow. Just return an empty list if there is no error.
                    Record parse(String line);
                    List<ValidationResult> validate(Record record);
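                     Fleshed out, it could look something like this (Record, ValidationResult and the check inside validate() are only placeholders for whatever your domain needs):

                     import java.util.Collections;
                     import java.util.List;

                     class Record {
                         final String raw;
                         Record(String raw) { this.raw = raw; }
                     }

                     class ValidationResult {
                         final String errorCode;
                         final String errorDesc;
                         ValidationResult(String errorCode, String errorDesc) {
                             this.errorCode = errorCode;
                             this.errorDesc = errorDesc;
                         }
                     }

                     class RecordValidator {
                         Record parse(String line) {
                             return new Record(line);
                         }

                         // Empty list means the record is valid; no exceptions for control flow.
                         List<ValidationResult> validate(Record record) {
                             if (record.raw.trim().isEmpty()) {
                                 return Collections.singletonList(new ValidationResult("E001", "empty record"));
                             }
                             return Collections.emptyList();
                         }
                     }

                     The batch loop then just collects whatever validate() returns for each parsed line.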
                    • 7. Re: File I/O - Performance design
                      TPD-Opitz
                      Walter Laan wrote:
                       Since you're validating and expecting errors I would not use exceptions for your control flow. Just return an empty list if there is no error.
                      Good point!
                      so the while would reduce to:
                        while ((recordStr = bReader.readLine()) != null) {
                            errorObjList.addAll(validator.validate(recordStr));
                        }
                      which looks a lot better.

                      bye
                      TPD
                      • 8. Re: File I/O - Performance design
                        jschellSomeoneStoleMyAlias
                        885725 wrote:
                        I would like a design solution on File I/O

                         I have a nightly batch process which reads records from a flat file, validates the records, creates errors for records if any, and updates the file status in the database.
                        The text file can have 500,000 or more records.
                        So you get to record 495,223 and you determine that the file is corrupt.
                        Exactly what are you going to do then?
                        Can you please suggest any other design ?
                         If this is a one-shot it doesn't matter much, but if it isn't then a more robust solution is needed.

                         Parse the file. And create another file. The second file has clean data because in the process of creating it you ensured that it meets some minimal set of criteria. Note that it doesn't have to be maximal.

                         The above process can fail completely or partially. For example you might get to record 27 and quit without doing any real work except to indicate that record 27 is corrupt. To a certain extent this depends on business rules around what the data actually is.

                         Or you might complete the entire process and determine that records 27 (and 103,377, etc.) have bad data. In that case you should write the data to an error file because it is valid in terms of record form but invalid in some sort of data format way. For example maybe a name has an invalid character.

                        The differentiation allows for different manual processes to handle errors.
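                         A rough sketch of that split (the file names and the "minimal criteria" check below are invented purely for illustration):

                         import java.io.BufferedReader;
                         import java.io.FileReader;
                         import java.io.FileWriter;
                         import java.io.IOException;
                         import java.io.PrintWriter;

                         public class FileSplitter {
                             public static void main(String[] args) throws IOException {
                                 BufferedReader in = new BufferedReader(new FileReader("input.txt"));
                                 PrintWriter clean = new PrintWriter(new FileWriter("clean.txt"));
                                 PrintWriter errors = new PrintWriter(new FileWriter("errors.txt"));
                                 try {
                                     String line;
                                     int lineNo = 0;
                                     while ((line = in.readLine()) != null) {
                                         lineNo++;
                                         if (meetsMinimalCriteria(line)) {
                                             clean.println(line);                  // good enough for the work table
                                         } else {
                                             errors.println(lineNo + ": " + line); // left for a manual process
                                         }
                                     }
                                 } finally {
                                     in.close();
                                     clean.close();
                                     errors.close();
                                 }
                             }

                             // Placeholder for whatever "minimal set of criteria" means for your data.
                             static boolean meetsMinimalCriteria(String line) {
                                 return !line.trim().isEmpty();
                             }
                         }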

                         Now that you have a clean file your options are to insert it directly into the database or to further process it (maximal cleanliness). This file might even be of a form that allows you to use a database import tool to import it directly into a work table. That would allow you to do final checks on it via SQL rather than Java. This is ideal if the final checks require additional information from the database for validation. It is almost required if extensive database validation is required.

                        Records that are successfully processed in the work table are moved to their final destination. Records that do not pass the final validation are marked as invalid.

                        For all successes and failures provide a record in the database to track it.
                        Provide a notification system that reports on failures and reports successes as well.
                         Tracking successes is necessary so that when someone claims that July 12th wasn't processed you can look up whether it was processed (or wasn't) and how many records were processed.
                        • 9. Re: File I/O - Performance design
                          452196
                           AFAICS the only reason why you'd use a temporary "clean" file would be if you didn't want to do any database updates until you'd verified the whole batch, because otherwise why waste time writing, then reading, the second file?

                          I'd probably "pipeline" the operation, having one thread reading and validating rows which would then either be passed to error recording or to database update threads, probably by blocking queues in either case. This will, hopefully, effectively throw more cores at the problem.

                          It might be marginally faster to "recycle" the data objects via a pool, rather than always creating new ones, but new is pretty snappy these days.

                           You might want to try to avoid the file reading and the database access using the same disc controller, and certainly try to do it with different physical disc drives. Otherwise the processing time could be dominated by seek time.
                          • 10. Re: File I/O - Performance design
                            dadams07
                            This may be overkill, since you've already gotten a lot of good replies, but I thought I'd throw in my two cents worth if you run out of things to consider.

                             Reading & processing text files is a common task that the majority of programmers face some time in their career. In my job, I found myself re-inventing the wheel so often (without making it rounder) that I eventually bit the bullet & wrote a TextFile class to handle the tedium of reading, buffering, error handling, line number counts & other stats, etc. I instantiate it and pass it an object that implements a LineProcessor interface, and the TextFile object calls the LineProcessor once for each line. There are a lot more details than this simple description, but as a design it works pretty well, separates responsibilities & lends itself to clean, easily understood code.
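                             The names below are mine rather than the actual class, but the shape of the idea might be something like:

                             import java.io.BufferedReader;
                             import java.io.FileReader;
                             import java.io.IOException;

                             // Callback invoked once per line; callers never touch the reader directly.
                             interface LineProcessor {
                                 void processLine(int lineNumber, String line);
                             }

                             class TextFile {
                                 private final String fileName;

                                 TextFile(String fileName) {
                                     this.fileName = fileName;
                                 }

                                 // Owns the reading, buffering, line counting and resource handling.
                                 void forEachLine(LineProcessor processor) throws IOException {
                                     BufferedReader in = new BufferedReader(new FileReader(fileName));
                                     try {
                                         String line;
                                         int lineNumber = 0;
                                         while ((line = in.readLine()) != null) {
                                             processor.processLine(++lineNumber, line);
                                         }
                                     } finally {
                                         in.close();
                                     }
                                 }
                             }

                             The batch job then only supplies a LineProcessor that does the validation; the file handling never gets rewritten.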

                            Something to consider if you find yourself writing file processing code time after time.
                            • 11. Re: File I/O - Performance design
                              jschellSomeoneStoleMyAlias
                              malcolmmc wrote:
                               AFAICS the only reason why you'd use a temporary "clean" file would be if you didn't want to do any database updates until you'd verified the whole batch, because otherwise why waste time writing, then reading, the second file?
                              1. If one uses a database import tool then a stream is unlikely to be supported as input.

                               2. There are 500k records, and if there is a failure during processing, cleaning up based on clean data is probably going to be easier than starting with the original source.

                               3. I doubt there is a significant impact in reading the file versus processing and database inserts, while streaming increases the complexity of dealing with the error scenarios.