16 Replies Latest reply on Dec 30, 2008 8:27 PM by 807589

    Text file parsing and performance

    807589
      I currently parse csv files using a split on the comma. I load each row of the file into mysql.
      A file of about 200,000 rows takes about 8 minutes. This file contains stock prices:

      example
      ibm,$150.00,2008-12-24

      I have another table with client data:
      John Doe,ibm,20 shares,current price=$120.00,2008-12-23

      I was told that it would be more efficient to pull from the file only the stock info I need.
      Wouldn't pulling the file apart, scanning it, and matching it against what is in the db be more overhead?
      I currently load the entire stock price file, then follow up with a matching program
      that goes over each customer's record, acquires the latest price, and calculates the latest value.

      I think I should just load the whole file instead of picking it apart. Can someone lend some advice?

      Edited by: iketurna on Dec 24, 2008 7:27 AM
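
[For reference, the split-on-comma parsing described above can be sketched like this. The StockPrice class, the field layout, and the dollar-sign stripping are assumptions for illustration, not the poster's actual code:]

```java
import java.math.BigDecimal;

public class StockPriceParser {
    // Simple holder for one parsed line of the price file.
    static class StockPrice {
        final String symbol;
        final BigDecimal price;
        final String date;
        StockPrice(String symbol, BigDecimal price, String date) {
            this.symbol = symbol;
            this.price = price;
            this.date = date;
        }
    }

    // Split on the comma, as the poster describes, and strip the '$'.
    static StockPrice parseLine(String line) {
        String[] fields = line.split(",");
        BigDecimal price = new BigDecimal(fields[1].replace("$", ""));
        return new StockPrice(fields[0], price, fields[2]);
    }

    public static void main(String[] args) {
        StockPrice p = parseLine("ibm,$150.00,2008-12-24");
        System.out.println(p.symbol + " " + p.price + " " + p.date);
        // prints: ibm 150.00 2008-12-24
    }
}
```

[BigDecimal rather than double keeps the prices exact on the way into the database.]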
        • 1. Re: Text file parsing and performance
          807589
          I agree with your guesses about what would take more time and what would not, but the only way to find out is to do it both ways and measure.
          • 2. Re: Text file parsing and performance
            807589
            What data structures are you using?
            • 3. Re: Text file parsing and performance
              807589
              I read the lines into a HashMap; they come in a funky format, so I remove the header and footer after parsing them for some info about the file and data.
              Then I use a Hibernate-based saveOrUpdate.
              • 4. Re: Text file parsing and performance
                807589
                Doesn't Hibernate have some special stuff to deal with batching? Would that apply here?
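
[Hibernate does have batching support: set `hibernate.jdbc.batch_size` in the configuration and flush/clear the session periodically so it does not hold every entity in memory. A sketch of the wiring, assuming a StockPrice entity and an already-built SessionFactory:]

```java
// In hibernate.cfg.xml or hibernate.properties:
//   hibernate.jdbc.batch_size=50

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
int count = 0;
for (StockPrice price : prices) {
    session.saveOrUpdate(price);
    if (++count % 50 == 0) {   // match the JDBC batch size
        session.flush();       // push the current batch to the database
        session.clear();       // drop the entities from the session cache
    }
}
tx.commit();
session.close();
```

[This is the flush/clear pattern from Hibernate's batch-processing documentation; without the periodic clear, the first-level cache grows with every row inserted.]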
                • 5. Re: Text file parsing and performance
                  807589
                  Well, the argument I am facing is: why load all the rows from the text file, why not just scan the file for data that matches
                  clients in the database?

                  I am currently batching and such with Hibernate, but I'm trying to argue that I should just load the file, not scan it.
                  I'm trying to get good guesses/opinions as to which approach is better.

                  Edited by: iketurna on Dec 24, 2008 8:48 AM
                  • 6. Re: Text file parsing and performance
                    807589
                    How hard is it to time both?
                    • 7. Re: Text file parsing and performance
                      807589
                      iketurna wrote:
                      A file of about 200,000 rows takes about 8 minutes. This file contains stock prices:
                      Ignoring database operations, that time seems completely wrong to me.
                      You could probably load every line into an object, save the objects in a list,
                      and search the list in under a second or two.

                      If I understand you correctly, you are loading the file info into a database and
                      then doing the searches through the database? Why not just search for the
                      info as you load the file? In other words, I don't see what the database has
                      to do with anything.
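
[The in-memory approach suggested here can be sketched as follows: parse each line once and keep the latest price per symbol in a HashMap, so each client lookup is constant time. Class and method names are made up for the example:]

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class PriceLookup {
    // Build a symbol -> price map from the CSV lines; later lines for
    // the same symbol overwrite earlier ones, leaving the latest price.
    static Map<String, String> loadPrices(Iterable<String> lines) {
        Map<String, String> latest = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            latest.put(f[0], f[1]);
        }
        return latest;
    }

    public static void main(String[] args) {
        Map<String, String> prices = loadPrices(Arrays.asList(
                "ibm,$150.00,2008-12-24",
                "sun,$4.50,2008-12-24"));
        System.out.println(prices.get("ibm"));  // prints: $150.00
    }
}
```
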
                      • 8. Re: Text file parsing and performance
                        807589
                        OK, so I have the run time down to:
                        244,600 rows in 5 minutes and 23 seconds.

                        Is this fast enough, or do I need to improve this?
                        • 9. Re: Text file parsing and performance
                          807589
                          iketurna wrote:
                          Ok, so I have the run time down to :
                          244,600 rows in 5 minutes and 23 seconds.

                          Is this fast enough, or do I need to improve this?
                          Why are you asking us?
                          • 10. Re: Text file parsing and performance
                            807589
                            OK, I take it back.
                            • 11. Re: Text file parsing and performance
                              807589
                              iketurna wrote:
                              OK, I take it back.
                              Too late!
                              • 12. Re: Text file parsing and performance
                                DrClap
                                iketurna wrote:
                                I'm trying to get good guesses/opinions as to which approach is better.
                                My opinion is that the database updating is going to take far longer than the file reading. So trying to optimize the file reading isn't going to improve things much.

                                Of course it should be possible to put some timing code into the program to see whether my opinion is worth anything in this case.
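
[The timing code suggested above can be as small as this harness: time each phase separately so you know whether parsing or the database is the bottleneck. The two lambdas are stand-ins for the real parsing and Hibernate code:]

```java
public class PhaseTimer {
    // Run one phase and report how long it took, in milliseconds.
    static long timeMillis(Runnable phase) {
        long start = System.nanoTime();
        phase.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long readMs = timeMillis(() -> { /* parse the CSV file here */ });
        long dbMs = timeMillis(() -> { /* run the Hibernate updates here */ });
        System.out.println("read: " + readMs + " ms, db: " + dbMs + " ms");
    }
}
```
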
                                • 13. Re: Text file parsing and performance
                                  jschellSomeoneStoleMyAlias
                                  iketurna wrote:
                                  I currently parse csv files using a split on the comma. I load each row of the file into mysql.
                                  A file of about 200,000 rows takes about 8 minutes. This file contains stock prices:

                                  example
                                  ibm,$150.00,2008-12-24

                                  I have another table with client data:
                                  John Doe,ibm,20 shares,current price=$120.00,2008-12-23

                                  I was told that it would be more efficient to pull from the file only the stock info I need.
                                  Wouldn't pulling the file apart, scanning it, and matching it against what is in the db be more overhead?
                                  I currently load the entire stock price file, then follow up with a matching program
                                  that goes over each customer's record, acquires the latest price, and calculates the latest value.
                                  What does that mean exactly?

                                  For example, are you proposing that, for a given list of items, you would parse the entire file looking for each individual item?
                                  If yes, then how many items? Where does that list come from?

                                  Or perhaps you use every line in the file?

                                  And for the "update", do you do nothing more than extract the value from the line and update a table? If that isn't what you do, then what is involved in that process? (In particular, does a single update require more than one line from the file?)

                                  And what are the error scenarios? In other words, what happens on Friday when someone claims that a price was changed on Tuesday and that your program didn't update it correctly? What process would you (or do you) undertake to find the problem and correct it if needed?
                                  • 14. Re: Text file parsing and performance
                                    807589
                                    Two tips for you.

                                    First, don't load everything into the database unless you need the data there for some other reason; doing everything with Java object collections will be much faster. If you are really worried about performance (which I presume you are not, but still a thought for you), then carefully choose which collection you use. Each collection stores objects in a different way, so search response time will vary. You could use a tree if that suits you. Going further: if you are going to run many searches over that collection of 200K rows, try a sorted collection or something similar to improve overall performance. You can find the algorithms and response times in the Java documentation, or use trial and error to choose.
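
[The sorted-collection idea above can be sketched like this: sort the data once, then answer each of the many lookups with a binary search instead of a linear scan. The symbols are made up for the example:]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedSearch {
    // After a one-time O(n log n) sort, each lookup is O(log n).
    static boolean contains(List<String> sorted, String symbol) {
        return Collections.binarySearch(sorted, symbol) >= 0;
    }

    public static void main(String[] args) {
        List<String> symbols = new ArrayList<>(
                Arrays.asList("sun", "ibm", "orcl", "msft"));
        Collections.sort(symbols);  // ibm, msft, orcl, sun
        System.out.println(contains(symbols, "ibm"));   // prints: true
        System.out.println(contains(symbols, "goog"));  // prints: false
    }
}
```

[A HashMap keyed by symbol would give O(1) lookups instead; the sorted list is worth it mainly when you also need range queries or ordered iteration.]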

                                    Second, if you decide to take it to the database, build a collection and use Hibernate to load it in a single transaction. Please seek help in the Hibernate forums if you need more on this.

                                    Edited by: sri83 on Dec 30, 2008 12:04 PM