This discussion is archived
7 Replies Latest reply: Oct 10, 2012 9:46 AM by 757652 RSS

CONTEXT index creation - performance!

757652 Newbie
Currently Being Moderated
Hi,
I have a table with about 5Million rows. The content that needs to be indexed is of RAW datatype. The average size (length) of this field is about 50 characters (it could be more).

I am trying to index this column to perform a keyword search. DEtails are furnished below.

table:
SQL> desc kwtai
Name Null? Type
----------------------------------------- -------- ----------------------------
TSD_HH24 DATE
COUNTRY_CODE_ALPHA_2 VARCHAR2(2)
ONETWORK NUMBER(6)
OADDRESS VARCHAR2(25)
DNETWORK NUMBER(6)
DADDRESS VARCHAR2(25)
MESSAGE_LENGTH NUMBER
MESSAGE_CONTENT RAW(2000)

Preferences:-
begin

Ctx_Ddl.Create_Preference('mc_storage', 'BASIC_STORAGE');

ctx_ddl.set_attribute('mc_storage','I_TABLE_CLAUSE',
'tablespace large_index storage (initial 10M next 10M)');

ctx_ddl.set_attribute('mc_storage', 'K_TABLE_CLAUSE',
'tablespace large_index storage (initial 10M next 10M)');

ctx_ddl.set_attribute('mc_storage', 'R_TABLE_CLAUSE',
'tablespace large_index storage (initial 1M) lob (data) store as (cache)');

ctx_ddl.set_attribute('mc_storage', 'N_TABLE_CLAUSE',
'tablespace large_index storage (initial 1M)');

ctx_ddl.set_attribute('mc_storage', 'I_INDEX_CLAUSE',
'tablespace large_index storage (initial 1M) compress 2');

ctx_ddl.create_preference('mc_lex', 'BASIC_LEXER');
ctx_ddl.set_attribute('mc_lex', 'skipjoins', '_-"''`~!@#$%^&*()+=|}{[]\:;<>?/.,');
ctx_ddl.set_attribute('mc_lex', 'INDEX_STEMS','NONE');
end;


create index kwtaidx on kwtai (message_content) indextype is ctxsys.context
parameters (' lexer mc_lex storage mc_storage memory 500M ')
parallel 16;

This create index takes about 4 hours to complete on a 8CPU dual core machine.
This is on Oracle 10g (10.2.0.4)

The reason i am creating the index as opposed to syncing it is because the data gets loaded into this table only once a day and it gets cleared once my keyword analysis is done.

Any pointers to speed up the index creation will be really appreciated! Thanks in advance!
  • 1. Re: CONTEXT index creation - performance!
    Roger Ford Expert
    Currently Being Moderated
    Why a RAW column if it contains character data?

    I think that a RAW column will be treated like a BLOB and AUTO_FILTER will be invoked. Try adding "filter ctxsys.null_filter" to the parameters clause to bypass it.
  • 2. Re: CONTEXT index creation - performance!
    757652 Newbie
    Currently Being Moderated
    RAW was used because the message_content has hex data. Is there any way to tokenize hex data? From what I notice hex is being converted to ascii when the index is constructed, please correct me if I'm mistaken.

    Let me try the null filter though.
  • 3. Re: CONTEXT index creation - performance!
    757652 Newbie
    Currently Being Moderated
    Hello Roger-
    I added the NULL_FILTER clause and it seem to have made the index creation much faster. So was it the filters that was causing this?
  • 4. Re: CONTEXT index creation - performance!
    Roger Ford Expert
    Currently Being Moderated
    Yes. AUTO_FILTER is used for binary columns, and this involves creating an OS file and invoking an external executable on that file, for each row. The overheads of doing that are significant when the amount of data in each row is small.

    When you say it contains "hex data", do you mean hexadecimal numbers in the form of text, such as A9FF or 0xA9FF? Or do you mean binary data which doesn't represent any particular characters? If it's the first of these, you probably shouldn't be using RAW at all. If it's the second, it would be worth checking what's going into the $I table (select token_text from dr$your-index-name$i) as it might be creating garbage tokens.
  • 5. Re: CONTEXT index creation - performance!
    757652 Newbie
    Currently Being Moderated
    My base table has the text that needs to be indexed stored in the "MESSAGE_CONTENT" column which for now is RAW data type. The data stored in this table are in hex representation.

    Some examples -


    MESSAGE_CONTENT
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    616C70686120626574612067616D6D612064656C746120657073696C6F6E207A657461206E69F16F
    616C70686120626574612067616D6D612064656C746120657073696C6F6E207A657461
    616C70686120626574612067616D6D612064656C746120657073696C6F6E207A657461206E69C3B16F
    6865792E2C2C77686174277320676F696E67206F6E2E2E2E7066206368616E67277320697320736F6D652072657374617572616E742E2074686579206172652070736564756F2D636F6F6C
    54686520477265656B20616C7068616265742069732074686520736372697074207468617420686173206265656E
    54686520477265656B20616C7068616265742069732074686520736372697074207468617420686173206265656E20706F73742D64617461

    Now with your suggestion i tried to bypass this. So what i did was added a format column to my base table and updated it to "TEXT". My database is in UTF8.
    Now when i create the index with the following preferences it takes less than a minute.


    begin
    Ctx_Ddl.Create_Preference('kwta_storage', 'BASIC_STORAGE');
    ctx_ddl.set_attribute('kwta_storage','I_TABLE_CLAUSE',
    'tablespace TEXT_INDEX storage (initial 10M next 10M)');

    ctx_ddl.set_attribute('kwta_storage', 'K_TABLE_CLAUSE',
    'tablespace TEXT_INDEX storage (initial 10M next 10M)');

    ctx_ddl.set_attribute('kwta_storage', 'R_TABLE_CLAUSE',
    'tablespace TEXT_INDEX storage (initial 1M) lob (data) store as (cache)');

    ctx_ddl.set_attribute('kwta_storage', 'N_TABLE_CLAUSE',
    'tablespace TEXT_INDEX storage (initial 1M)');

    ctx_ddl.set_attribute('kwta_storage', 'I_INDEX_CLAUSE',
    'tablespace TEXT_INDEX storage (initial 1M) compress 2');

    ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
    ctx_ddl.set_attribute('mylex', 'skipjoins', '_-"''`~!@#$%^&*()+=|}{[]\:;<>?/,');
    ctx_ddl.set_attribute('mylex','punctuations','.?!');
    ctx_ddl.set_attribute('mylex', 'INDEX_STEMS','NONE');
    ctx_ddl.set_attribute('mylex', 'continuation','\-');


    Ctx_Ddl.Create_Stoplist ( 'mystop' );
    Ctx_Ddl.Add_Stopword ( 'mystop', 'is' );
    Ctx_Ddl.Add_Stopword ( 'mystop', 'has' );
    Ctx_Ddl.Add_Stopword ( 'mystop', 'the' );
    Ctx_Ddl.Add_Stopword ( 'mystop', 'that' );
    end;


    create index kwtaidx on kwtai (message_content) indextype is ctxsys.context
    parameters ('filter ctxsys.auto_filter format column fmt stoplist mystop lexer mylex storage kwta_storage memory 500M')
    parallel 16;

    When i select distinct tokens from the $I table i get the following
    TOKEN_TEXT
    ----------------------------------------------------------------
    RESTAURANT
    GREEK
    WHATS
    ARE
    DELTA
    ZETA
    ALPHA
    ALPHABET
    EPSILON
    PF
    PSEDUOCOOL
    SOME
    CHANGS
    NIÃO
    ON
    POSTDATA
    SCRIPT
    BEEN
    GAMMA
    GOING
    HEY
    NI
    O
    THEY
    BETA

    Now what i am also wondering is if the text (message_content column) is being converted to UTF8 (database characterset) by using AUTO_FILTER. Is my assumption correct? Not sure how to validate this?

    And, would you kindly share of why RAW must not be used in this case?

    Thanks for all your pointers!
  • 6. Re: CONTEXT index creation - performance!
    Roger Ford Expert
    Currently Being Moderated
    You say: "The data stored in this table are in hex representation". I don't think it is. It's just binary data, and the database uses hex representation to display raw-column binary data in a human-readable format.

    If you've used a format column set to TEXT I would expect it to act exactly the same as using NULL_FILTER, but there might be some differences in character set conversion.

    Why do I think RAW is the wrong datatype? Because you're storing textual data in it. There are no zero bytes, or anything else that would require RAW to store. If the database character set is AL32UTF8, why not just store UTF8 text in a VARCHAR2 column? The only reason I can think of for using RAW is that you might sometimes use the same code on a database with an 8-bit character set, and you want to maintain full control over any character set conversions that might happen.
  • 7. Re: CONTEXT index creation - performance!
    757652 Newbie
    Currently Being Moderated
    Thanks!

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points