8 Replies Latest reply on Apr 7, 2008 1:35 PM by alwu-Oracle

    Slow bulk add rates for Oracle-Jena

    618501
      I am experiencing what I perceive to be slow add() rates into my Oracle 11g knowledge store. I am running a Java application using the Oracle-Jena API. Basically, I am using the LUBM50 data set, and for each generated RDF/XML file, I am calling the addInBulk() method found in the class BulkUpdateHandler (I am following what was done in http://www.oracle.com/technology/tech/semantic_technologies/documentation/jenadrv_readme.txt). After adding the nearly 6.9 million triples, I have an overall data rate of around 270 triples/second. Doing the same for LUBM1, I get data rates of roughly 3600 triples/second. Also, my inference maintenance mode is set to NO_UPDATE. Any ideas on how I can improve this performance?
        • 1. Re: Slow bulk add rates for Oracle-Jena
          618501
          I want to add some more information regarding my slow bulk-add performance. I ran my process again, and this time I broke it into chunks to get a better idea of what is happening. Basically, as the number of triples increases, the insertion rate DRASTICALLY decreases. For instance, here are some of my results (I am roughly adding 67,000 triples each time):

          NumTriplesInStore..........AddRate (triples/sec)
          61,339...........................3202.4
          125,559.........................2622.9
          917,060.........................969.9
          984,012.........................843.2
          1,457,552......................683.5
          1,521,956......................602.2
          2,191,529......................338.6
          3,309,224......................210.7
          5,441,194......................192.4
          6,654,859......................132.7

          As you can see, the add-in-bulk performance is not so great, and I hate to imagine what it would be as the RDF store reaches 1 billion triples.

          Oracle claims that by using the Bulk-loader (which is a combination of SQL loader and the bulk_load API) one can load LUBM 50 in about 13.5 minutes. Since this is more of a “command-line” approach, I assume that this represents close to a max bulk-load rate. For my company’s particular project, we need more of a programmatic approach, which is why we are using the Oracle-Jena API. I have not seen any published rates for this API, but since it is an Oracle implemented interface, I would have thought that the rates would be reasonably close. Can someone please help me understand what is going on? Is there some kind of duplicate triple checking happening in the background? If so, why does it not affect the SQL bulk-load approach? Are there some settings that I can tweak to get better add performance? Is there a better approach that I should be using?

          Thanks,
          Joe
          • 2. Re: Slow bulk add rates for Oracle-Jena
            alwu-Oracle
            That is a bit too slow. The addInBulk API has been used with satisfactory results.

            Could you please try the following two things?
            1) use SQL*loader followed by a PL/SQL API sem_apis.bulk_load_from_staging_table
            Please refer to user's manual for details. After you get the results, please post them
            on this thread. Please start with an empty model.

            This way, we can know whether the performance problem is in the Java code
            or is on the database side.

            2) could you please post the code snippet that you used to addInBulk?

            As a separate thing, please post your database setting for SGA/PGA. What kind of
            machine do you have? Memory, hard disk, etc.

            Thanks,
            • 3. Re: Slow bulk add rates for Oracle-Jena
              618501
              1) I ran the bulk load as described in the documentation. It took me 9 minutes 34 seconds to load the staging table and 3 min 8 seconds to load the application table from the staging table.

              I also ran this in other ways to verify that subsequent bulk adds did not suffer in performance, and they did not.


              2) Here is some highly generic code. This code happens throughout different methods in several classes, but I consolidated it here to give you a general idea of what we are doing. Also, I started with a blank model. I did not call GraphOracleSem.dropApplicationTableIndex() since it does not even exist when I start.


              Attachment attachment =
              Attachment.createInstance( Attachment.NO_ADDITIONAL_MODELS,
              new String[0],
              InferenceMaintenanceMode.NO_UPDATE,
              QueryOptions.ALLOW_QUERY_INVALID);

              this.graph = new GraphOracleSem(this.oracle, "LUBM_50", attachment);

              for( File f : fileList ) // 999 LUBM50 files
              {

              Model model = ModelFactory.createDefaultModel();

              model.read( new FileInputStream( f ) );

              GraphOracleSem graph = this.getGraphOracleSem();

              OracleBulkUpdateHandler obuh =
              (OracleBulkUpdateHandler) graph.getBulkUpdateHandler();

              ExtendedIterator eiter = GraphUtil.findAll(model.getGraph());

              obuh.addInBulk(eiter, this.tableNamespace);

              }


              3) Machine:
              Dual Quad Core Intel Xeon 3.0 GHz
              16 GB RAM (not sure of the speed, but I'm sure it's the latest and greatest)
              1 TB Hard Drives (x2) (not sure of the speed, but I'm sure it's the latest and greatest)
              OS is 64-bit Redhat Enterprise Linux Version 5
              Java 6, update 4

              I did not do any tweaking of the SGA and PGA settings. If there is something not listed here, please let me know.

              Automatic Memory Management is Enabled
              Total memory size is 6464 MB
              Max memory size is 6464 MB
              SGA = 0 (from show parameter sga_target)
              Aggregate PGA Target is 0


              I appreciate you and your group looking into this.

              Thank you,
              Joe
              • 4. Re: Slow bulk add rates for Oracle-Jena
                alwu-Oracle
                Thanks for trying out the SQL version. It looks like the database side is running ok. Your machine appears to be very powerful.

                Two things in my mind.
                1) what do you see if you have GraphOracleSem.dropApplicationTableIndex() in your code? You can put it in a try { ...dropApplicationTableIndex() ... } catch(Exception e) { // ignore } block.

                2) in the code you posted, you do addInBulk almost 1000 times. That is actually quite different from the way you run bulk loader in SQL directly.

                Could you please use another in-memory Jena model, as a holder, and keep adding
                those 1000 small RDF graphs into this in-memory model?

                Once all those small RDF graphs are added, run addInBulk just once to load that in-memory Jena model. Choose an empty Oracle model for that please.

                You probably need to set java -Xmx to a decent value so that LUBM50 can fit in memory.

                Please add two lines of code before and after that addInBulk API to print out time stamps so that we know exactly how much time is spent on addInBulk.
                • 5. Re: Slow bulk add rates for Oracle-Jena
                  618501
                  1) I added the dropApplicationTableIndex() method call to the code, but I did not see any change in the performance.

                  2) I tried this out, and the insertion time was roughly 22 minutes! However (and I greatly apologize for not mentioning this earlier), my company’s project is not interested in one mega bulk load but rather “small” bulk adds over time. The idea is to have a system that can handle continuous insertions of several thousand triples every second over some length of time (ideally, the time would be “forever”). This is why I was doing 1000 bulk adds – I needed to simulate a continuous stream of data in order to “see” what our sustained insertion rates would be. This is how I discovered the issue at hand. We would love to get to a point by mid-year where we can be using LUBM 8000 (i.e., 1 billion triples), but the degrading insertion rate will cause a problem. I hope this makes sense, and, again, I apologize for not stating this earlier. Do you have any ideas on what we can do or even what might be causing this? Has your group ever seen this before?

                  Thank you for your assistance,
                  Joe
                  • 6. Re: Slow bulk add rates for Oracle-Jena
                    alwu-Oracle
                    The problem with incremental appending is the overhead of incremental index maintenance. To see this, just create a simple table with an index, and then keep
                    inserting data into the table, and you will see the overhead as more data are added.

                    I would like you to try sem_perf.gather_stats (check manual for details) and then
                    re-run your original tests. Does that make a difference?

                    Once that is done, you can try using addInBatch instead of addInBulk for your original
                    tests. Please use dropApplicationTableIndex.

                    From what you have described, it seems possible to accumulate data over a period of
                    time and bulk load it. I guess it depends on how your application plans to use (query) the data.
                    • 7. Re: Slow bulk add rates for Oracle-Jena
                      618501
                      If I read the first part correctly, does this mean that there is indexing happening in the background when I use the Oracle-Jena interface? I am relatively new to RDF stores, and maybe indexing is something that all RDF stores do??

                      I tried each of the things you suggested, and I saw no performance increase regarding the sustained add rates. I did get a hard performance number for my particular project – we need a system that can handle roughly a sustained rate of 15,000 triples/second. Whoa! Yes, this is quite high and may not be obtainable this year, but I think we would be pleased to see any sustained rate above a couple thousand triples/second. However, based on my timing data, before I even cross 1 million triples my rate has plummeted from above 3000 triples/second to below 700 triples/second. Others in my group have run their own independent tests and have verified my results. Their data indicates that Oracle 11g does reach a sustained rate, but it is in the neighborhood of 130 triples/second. We would love to know if there’s any hope at all for getting some reasonable sustained rate from Oracle 11g.

                      Thank you,
                      Joe
                      • 8. Re: Slow bulk add rates for Oracle-Jena
                        alwu-Oracle
                        Index is happening when you load data into Oracle store. It does not matter whether Oracle Jena API is used or not.

                        If you can batch up the triples into an in-memory Jena model, and then periodically bulk load a large model into Oracle store, you should be able to achieve satisfactory results.

                        You can query the in-memory Jena model so it won't be a disruption in service for your application.

                        Does this sound feasible?

                        Thanks,