7 Replies Latest reply: Oct 10, 2012 11:46 AM by 757652

    CONTEXT index creation - performance!

    757652
      Hi,
      I have a table with about 5 million rows. The content that needs to be indexed is of RAW datatype. The average size (length) of this field is about 50 characters (it could be more).

      I am trying to index this column to perform a keyword search. Details are furnished below.

      table:
      SQL> desc kwtai
      Name                                      Null?    Type
      ----------------------------------------- -------- ----------------------------
      TSD_HH24                                           DATE
      COUNTRY_CODE_ALPHA_2                               VARCHAR2(2)
      ONETWORK                                           NUMBER(6)
      OADDRESS                                           VARCHAR2(25)
      DNETWORK                                           NUMBER(6)
      DADDRESS                                           VARCHAR2(25)
      MESSAGE_LENGTH                                     NUMBER
      MESSAGE_CONTENT                                    RAW(2000)

      Preferences:-
      begin

      Ctx_Ddl.Create_Preference('mc_storage', 'BASIC_STORAGE');

      ctx_ddl.set_attribute('mc_storage','I_TABLE_CLAUSE',
      'tablespace large_index storage (initial 10M next 10M)');

      ctx_ddl.set_attribute('mc_storage', 'K_TABLE_CLAUSE',
      'tablespace large_index storage (initial 10M next 10M)');

      ctx_ddl.set_attribute('mc_storage', 'R_TABLE_CLAUSE',
      'tablespace large_index storage (initial 1M) lob (data) store as (cache)');

      ctx_ddl.set_attribute('mc_storage', 'N_TABLE_CLAUSE',
      'tablespace large_index storage (initial 1M)');

      ctx_ddl.set_attribute('mc_storage', 'I_INDEX_CLAUSE',
      'tablespace large_index storage (initial 1M) compress 2');

      ctx_ddl.create_preference('mc_lex', 'BASIC_LEXER');
      ctx_ddl.set_attribute('mc_lex', 'skipjoins', '_-"''`~!@#$%^&*()+=|}{[]\:;<>?/.,');
      ctx_ddl.set_attribute('mc_lex', 'INDEX_STEMS','NONE');
      end;


      create index kwtaidx on kwtai (message_content) indextype is ctxsys.context
      parameters (' lexer mc_lex storage mc_storage memory 500M ')
      parallel 16;

      This CREATE INDEX takes about 4 hours to complete on an 8-CPU dual-core machine.
      This is on Oracle 10g (10.2.0.4)

      The reason I am creating the index, as opposed to syncing it, is that the data gets loaded into this table only once a day and is cleared once my keyword analysis is done.

      Any pointers to speed up the index creation will be really appreciated! Thanks in advance!
        • 1. Re: CONTEXT index creation - performance!
          Roger Ford-Oracle
          Why a RAW column if it contains character data?

          I think that a RAW column will be treated like a BLOB and AUTO_FILTER will be invoked. Try adding "filter ctxsys.null_filter" to the parameters clause to bypass it.
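
          Applied to your original statement, that would look something like this (a sketch; everything except the filter clause is unchanged from your post):

          -- same statement as before, with the null filter added to bypass AUTO_FILTER
          create index kwtaidx on kwtai (message_content) indextype is ctxsys.context
          parameters ('filter ctxsys.null_filter lexer mc_lex storage mc_storage memory 500M')
          parallel 16;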
          • 2. Re: CONTEXT index creation - performance!
            757652
            RAW was used because the message_content has hex data. Is there any way to tokenize hex data? From what I notice, the hex is being converted to ASCII when the index is constructed; please correct me if I'm mistaken.

            Let me try the null filter though.
            • 3. Re: CONTEXT index creation - performance!
              757652
              Hello Roger-
              I added the NULL_FILTER clause and it seems to have made the index creation much faster. So was it the filter that was causing this?
              • 4. Re: CONTEXT index creation - performance!
                Roger Ford-Oracle
                Yes. AUTO_FILTER is used for binary columns, and this involves creating an OS file and invoking an external executable on that file, for each row. The overheads of doing that are significant when the amount of data in each row is small.

                When you say it contains "hex data", do you mean hexadecimal numbers in the form of text, such as A9FF or 0xA9FF? Or do you mean binary data which doesn't represent any particular characters? If it's the first of these, you probably shouldn't be using RAW at all. If it's the second, it would be worth checking what's going into the $I table (select token_text from dr$your-index-name$i) as it might be creating garbage tokens.
                • 5. Re: CONTEXT index creation - performance!
                  757652
                  My base table has the text that needs to be indexed stored in the "MESSAGE_CONTENT" column, which for now is of RAW data type. The data stored in this table are in hex representation.

                  Some examples -


                  MESSAGE_CONTENT
                  --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  616C70686120626574612067616D6D612064656C746120657073696C6F6E207A657461206E69F16F
                  616C70686120626574612067616D6D612064656C746120657073696C6F6E207A657461
                  616C70686120626574612067616D6D612064656C746120657073696C6F6E207A657461206E69C3B16F
                  6865792E2C2C77686174277320676F696E67206F6E2E2E2E7066206368616E67277320697320736F6D652072657374617572616E742E2074686579206172652070736564756F2D636F6F6C
                  54686520477265656B20616C7068616265742069732074686520736372697074207468617420686173206265656E
                  54686520477265656B20616C7068616265742069732074686520736372697074207468617420686173206265656E20706F73742D64617461

                  Now, with your suggestion, I tried to bypass this. So what I did was add a format column to my base table and update it to "TEXT". My database is in UTF8.
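
                  The change to the base table was along these lines (a sketch; fmt is the format column named in the parameters clause further down):

                  alter table kwtai add (fmt varchar2(10));
                  update kwtai set fmt = 'TEXT';
                  commit;
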
                  Now when I create the index with the following preferences, it takes less than a minute.


                  begin
                  Ctx_Ddl.Create_Preference('kwta_storage', 'BASIC_STORAGE');
                  ctx_ddl.set_attribute('kwta_storage','I_TABLE_CLAUSE',
                  'tablespace TEXT_INDEX storage (initial 10M next 10M)');

                  ctx_ddl.set_attribute('kwta_storage', 'K_TABLE_CLAUSE',
                  'tablespace TEXT_INDEX storage (initial 10M next 10M)');

                  ctx_ddl.set_attribute('kwta_storage', 'R_TABLE_CLAUSE',
                  'tablespace TEXT_INDEX storage (initial 1M) lob (data) store as (cache)');

                  ctx_ddl.set_attribute('kwta_storage', 'N_TABLE_CLAUSE',
                  'tablespace TEXT_INDEX storage (initial 1M)');

                  ctx_ddl.set_attribute('kwta_storage', 'I_INDEX_CLAUSE',
                  'tablespace TEXT_INDEX storage (initial 1M) compress 2');

                  ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
                  ctx_ddl.set_attribute('mylex', 'skipjoins', '_-"''`~!@#$%^&*()+=|}{[]\:;<>?/,');
                  ctx_ddl.set_attribute('mylex','punctuations','.?!');
                  ctx_ddl.set_attribute('mylex', 'INDEX_STEMS','NONE');
                  ctx_ddl.set_attribute('mylex', 'continuation','\-');


                  Ctx_Ddl.Create_Stoplist ( 'mystop' );
                  Ctx_Ddl.Add_Stopword ( 'mystop', 'is' );
                  Ctx_Ddl.Add_Stopword ( 'mystop', 'has' );
                  Ctx_Ddl.Add_Stopword ( 'mystop', 'the' );
                  Ctx_Ddl.Add_Stopword ( 'mystop', 'that' );
                  end;


                  create index kwtaidx on kwtai (message_content) indextype is ctxsys.context
                  parameters ('filter ctxsys.auto_filter format column fmt stoplist mystop lexer mylex storage kwta_storage memory 500M')
                  parallel 16;

                  When I select distinct tokens from the $I table, I get the following.
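
                  (The query used was roughly the one below; dr$kwtaidx$i is the token table that Oracle Text creates for the kwtaidx index.)

                  select distinct token_text from dr$kwtaidx$i;
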
                  TOKEN_TEXT
                  ----------------------------------------------------------------
                  RESTAURANT
                  GREEK
                  WHATS
                  ARE
                  DELTA
                  ZETA
                  ALPHA
                  ALPHABET
                  EPSILON
                  PF
                  PSEDUOCOOL
                  SOME
                  CHANGS
                  NIÑO
                  ON
                  POSTDATA
                  SCRIPT
                  BEEN
                  GAMMA
                  GOING
                  HEY
                  NI
                  O
                  THEY
                  BETA

                  Now what I am also wondering is whether the text (the MESSAGE_CONTENT column) is being converted to UTF8 (the database character set) by AUTO_FILTER. Is my assumption correct? I am not sure how to validate this.

                  And would you kindly explain why RAW should not be used in this case?

                  Thanks for all your pointers!
                  • 6. Re: CONTEXT index creation - performance!
                    Roger Ford-Oracle
                    You say: "The data stored in this table are in hex representation". I don't think it is. It's just binary data, and the database uses hex representation to display raw-column binary data in a human-readable format.

                    If you've used a format column set to TEXT I would expect it to act exactly the same as using NULL_FILTER, but there might be some differences in character set conversion.
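
                    If you want to see exactly what ended up in the index, one rough check is to dump the token bytes and compare them with your source hex (a sketch; dr$kwtaidx$i is the token table for your index, and the NI% pattern just picks out the tokens from your sample rows):

                    -- 1016 = show the character set name plus the bytes in hex
                    select token_text, dump(token_text, 1016) as token_bytes
                    from dr$kwtaidx$i
                    where token_text like 'NI%';

                    -- and confirm the database character set
                    select value from nls_database_parameters
                    where parameter = 'NLS_CHARACTERSET';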

                    Why do I think RAW is the wrong datatype? Because you're storing textual data in it. There are no zero bytes or anything else that would require RAW for storage. If the database character set is AL32UTF8, why not just store UTF8 text in a VARCHAR2 column? The only reason I can think of for using RAW is that you might sometimes use the same code on a database with an 8-bit character set, and you want to maintain full control over any character set conversions that might happen.
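
                    If you do switch, a minimal sketch of the conversion, assuming the stored bytes really are UTF-8 and using a hypothetical msg_text column, would be:

                    alter table kwtai add (msg_text varchar2(2000 char));
                    -- cast_to_varchar2 reinterprets the raw bytes in the database character set
                    update kwtai set msg_text = utl_raw.cast_to_varchar2(message_content);
                    commit;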
                    • 7. Re: CONTEXT index creation - performance!
                      757652
                      Thanks!