This discussion is archived
8 Replies Latest reply: Sep 17, 2012 12:00 PM by 20020 RSS

Seeing an inconsistency in results from NDATA

20020 Pro
Currently Being Moderated
Hello all,

I did the steps in this very helpful post: Achieving functionality of many preferences using one context index and I'm seeing one strange inconsistency.

DB Version is 11.2.0.3.0.

My table & data looks like this:
CREATE TABLE DEMO_TEXT
(
  RECORD_ID  NUMBER(9)                          NOT NULL,
  TEXT       VARCHAR2(200 BYTE)                 NOT NULL
)

INSERT INTO demo_text(record_id, text) VALUES (1, 'INTERNATIONAL BUSINESS MACHINES');
INSERT INTO demo_text(record_id, text) VALUES (2, 'FORD MOTOR COMPANY');
INSERT INTO demo_text(record_id, text) VALUES (3, 'GENERAL MOTORS');
INSERT INTO demo_text(record_id, text) VALUES (4, 'INTERNATIONAL HARVESTER');
INSERT INTO demo_text(record_id, text) VALUES (5, 'GENERAL ELECTRIC');
INSERT INTO demo_text(record_id, text) VALUES (6, 'CROSS ROADS ENTERTAINMENT');
INSERT INTO demo_text(record_id, text) VALUES (7, 'CROSSROADS ENTERTAINMENT');
INSERT INTO demo_text(record_id, text) VALUES (8, 'GOT HAM COMPANY');
INSERT INTO demo_text(record_id, text) VALUES (9, 'GOTHAM COMPANY');
INSERT INTO demo_text(record_id, text) VALUES (10, 'JK HOLDING');
INSERT INTO demo_text(record_id, text) VALUES (11, 'JK HOLDINGS');
INSERT INTO demo_text(record_id, text) VALUES (12, 'GOT THE GOODS');
INSERT INTO demo_text(record_id, text) VALUES (13, 'GET THE GOODS');
INSERT INTO demo_text(record_id, text) VALUES (14, 'YOU KNOW IT');
INSERT INTO demo_text(record_id, text) VALUES (15, 'YOU KNEW IT');
INSERT INTO demo_text(record_id, text) VALUES (16, 'MOVE AWAY');
INSERT INTO demo_text(record_id, text) VALUES (17, 'MOVED AWAY');
INSERT INTO demo_text(record_id, text) VALUES (18, 'MOVING AWAY');
INSERT INTO demo_text(record_id, text) VALUES (19, 'GO GREEN');
INSERT INTO demo_text(record_id, text) VALUES (20, 'GOING GREEN');
I ran the following (taken from the above post):
BEGIN
   ctx_ddl.create_preference('nd_mcds', 'multi_column_datastore');
   ctx_ddl.set_attribute('nd_mcds', 'columns', 'text nd, text text');
   ctx_ddl.create_section_group('nd_sg', 'basic_section_group');
   ctx_ddl.add_ndata_section('nd_sg', 'nd', 'nd');
   ctx_ddl.create_preference('test_lex', 'basic_lexer');
   ctx_ddl.set_attribute('test_lex', 'whitespace', '/\|-_+');
END;

CREATE INDEX ix_demo_text
   ON demo_text(text)
   INDEXTYPE IS ctxsys.context
   PARAMETERS ( 'datastore  nd_mcds
                 section    group nd_sg
                 lexer      test_lex' );
I'm running the following query (taken from the same post but not using synonyms):
SELECT *
  FROM demo_text dte
 WHERE contains(
          dte.text,
          'ndata (nd,' || :search_string || ') or
        ?!$' || REPLACE(:search_string, ' ', ';?!$')
       ) > 0;
If I run it with a search_string of either "crossroads" or "cross roads", I get both "CROSS ROADS ENTERTAINMENT" and "CROSSROADS ENTERTAINMENT", perfect.

If I run it with a search_string of "got ham", I get "GOT HAM COMPANY", "GOTHAM COMPANY" and "GOT THE GOODS" - again just what I'd expect.

But if I run it for "gotham", I get only "GOTHAM COMPANY". I wouldn't necessarily expect it to find "GOT THE GOODS", but I'm puzzled why it wouldn't return "GOT HAM COMPANY". It seems to me to be no different than searching for "crossroads" and getting "CROSS ROADS ENTERTAINMENT" back.

Any thoughts on this?

Thanks much!
John
  • 1. Re: Seeing an inconsistency in results from NDATA
    20020 Pro
    Currently Being Moderated
    Bump. Any thoughts on this issue?

    I'm new to Oracle Text - evaluating it as a solution for determining name uniqueness, and one of our requirements is that joining words in a name does not make a name unique. For example, "CROSSROADS" would be seen as equivalent to "CROSS ROADS" and "GOTHAM" would be seen as equivalent to "GOT HAM". It seems like the NDATA feature of Text is just the thing, but I can't imagine why the two cases above are being treated differently.

    I tried other searches with/without joined terms against my small test set, and all appear to return what I'd expect except for the "GOTHAM"/"GOT HAM" example. Then I thought perhaps it was because "GOTHAM" isn't really a word? But a search on both "jk holdings" and "jkholdings" returns "JK HOLDING" and "JK HOLDINGS" which is correct. I also added two new rows, "ACMEWIDGETS" and "ACME WIDGETS", which I would think should totally replicate my "GOTHAM" example, and a search on both "acme widgets" and "acmewidgets" returns both, which is correct.

    I'm really at a loss here. I have to think that if the way I'm implementing NDATA won't handle "GOTHAM"/"GOT HAM", there must be other cases it won't handle. Could someone please look over/try out what I've done and let me know what I might be missing? Maybe some switch or preference? Or some way to augment NDATA to give consistent results?

    I really appreciate it - thanks!
    John
  • 2. Re: Seeing an inconsistency in results from NDATA
    Roger Ford Expert
    Currently Being Moderated
    There are two complicating factors here.

    Firstly, your query has "or ?!$' || REPLACE(:search_string, ' ', ';?!$')" which means that the results aren't necessarily from the NDATA operator at all.

    Secondly, "GOT" is a stopword by default, so will not be indexed. This would mean that "GOT HAM COMPANY" would be indexed as "X HAM COMPANY" where "X" represents a "missing" word - the index knows there's a word in that position but doesn't know what it is.

    However, if I add "stoplist ctxsys.empty_stoplist" to your parameters clause, and simplify the query to
    SELECT *
      FROM demo_text dte
     WHERE contains(
              dte.text,
              'ndata (nd,' || :search_string || ')'
           ) > 0
    Then I still get the same results. "GOT HAM" matches three rows, "GOTHAM" only matches one.

    I can't really explain why this is. I suspect it's something to do with word length - the three letters in GOT aren't considered good enough to be a match for GOTHAM, and it therefore doesn't get as far as trying to match "HAM". Indeed, if we change the example to "GOTT HAM" and "GOTTHAM" then we get the hits we expect.
  • 3. Re: Seeing an inconsistency in results from NDATA
    20020 Pro
    Currently Being Moderated
    Roger - thank you so much for looking into this. I was completely stumped.

    That was a great catch on "GOT" - I would've been hunting for hours before realizing "got" could be one of the default stop words!

    I added more variations to my test set and what you're suspecting re: word length sure seems to be consistent with what I'm seeing. Interesting thing is searches on "jk holding" and "jkholdings" return what I expect. But maybe it's hitting on the longer "holding" word and the "jk" is irrelevant?

    I'll have to look into whether there is some setting or parameter than can control what Text views as a significant-enough match - perhaps four characters is a default minimum? If not I'll have to think about some sort of workaround. Since I'm dealing with business names, two or three character terms are very common due to use of acronyms/initials so I'll have to accommodate it somehow.

    Thanks again Roger - this was very helpful.

    John
  • 4. Re: Seeing an inconsistency in results from NDATA
    20020 Pro
    Currently Being Moderated
    Hi Roger,

    I've continued to investigate this issue, and I haven't been able to find any setting/parameter that would tell Text how many characters in a term should match to consider it a good enough match when using NDATA. I also don't see anything in the documentation mentioning a lower limit on the number of characters Text will consider a match.

    My data consists of business names so there is a heavy frequency of one-, two- or three-character terms ("X Y Z", "XYZ", "XY & Z", etc.). The user requirement is that a search on any of those should return all three. I came across NDATA and figured it would be the perfect solution for this.

    I have been getting excellent results as long as there is at least one term of more than three characters in the name. Is there anything that you can think of to try to get consistent results if the terms are shorter? Some parameter/setting/technique I'm overlooking? I've looked into other options - going as far as storing every spacing variation of a name (e.g. "X Y Z" to "X Y Z", "XY Z", "X YZ", "XYZ") and not using NDATA - but that seems like such a waste when NDATA has me almost there. If this truly is limitation, can you (or anyone else) suggest a solid workaround?

    Thanks so much,
    John
  • 5. Re: Seeing an inconsistency in results from NDATA
    Barbara Boehmer Oracle ACE
    Currently Being Moderated
    Removing all spacing from the ndata like so:

    ctx_ddl.set_attribute('nd_mcds', 'columns', 'replace (text, '' '', '''') nd, text text');

    seems to solve your problems, as demonstrated below.
    SCOTT@orcl_11gR2> CREATE TABLE DEMO_TEXT
      2    (RECORD_ID  NUMBER(9)            NOT NULL,
      3       TEXT        VARCHAR2(200 BYTE)  NOT NULL)
      4  /
    
    Table created.
    
    SCOTT@orcl_11gR2> SET DEFINE OFF
    SCOTT@orcl_11gR2> INSERT ALL
      2  INTO demo_text(record_id, text) VALUES (1, 'INTERNATIONAL BUSINESS MACHINES')
      3  INTO demo_text(record_id, text) VALUES (2, 'FORD MOTOR COMPANY')
      4  INTO demo_text(record_id, text) VALUES (3, 'GENERAL MOTORS')
      5  INTO demo_text(record_id, text) VALUES (4, 'INTERNATIONAL HARVESTER')
      6  INTO demo_text(record_id, text) VALUES (5, 'GENERAL ELECTRIC')
      7  INTO demo_text(record_id, text) VALUES (6, 'CROSS ROADS ENTERTAINMENT')
      8  INTO demo_text(record_id, text) VALUES (7, 'CROSSROADS ENTERTAINMENT')
      9  INTO demo_text(record_id, text) VALUES (8, 'GOT HAM COMPANY')
     10  INTO demo_text(record_id, text) VALUES (9, 'GOTHAM COMPANY')
     11  INTO demo_text(record_id, text) VALUES (10, 'JK HOLDING')
     12  INTO demo_text(record_id, text) VALUES (11, 'JK HOLDINGS')
     13  INTO demo_text(record_id, text) VALUES (12, 'GOT THE GOODS')
     14  INTO demo_text(record_id, text) VALUES (13, 'GET THE GOODS')
     15  INTO demo_text(record_id, text) VALUES (14, 'YOU KNOW IT')
     16  INTO demo_text(record_id, text) VALUES (15, 'YOU KNEW IT')
     17  INTO demo_text(record_id, text) VALUES (16, 'MOVE AWAY')
     18  INTO demo_text(record_id, text) VALUES (17, 'MOVED AWAY')
     19  INTO demo_text(record_id, text) VALUES (18, 'MOVING AWAY')
     20  INTO demo_text(record_id, text) VALUES (19, 'GO GREEN')
     21  INTO demo_text(record_id, text) VALUES (20, 'GOING GREEN')
     22  INTO demo_text(record_id, text) VALUES (21, 'X Y Z')
     23  INTO demo_text(record_id, text) VALUES (22, 'XYZ')
     24  INTO demo_text(record_id, text) VALUES (23, 'XY & Z')
     25  SELECT * FROM DUAL
     26  /
    
    23 rows created.
    
    SCOTT@orcl_11gR2> BEGIN
      2       ctx_ddl.create_preference('nd_mcds', 'multi_column_datastore');
      3       ctx_ddl.set_attribute('nd_mcds', 'columns', 'replace (text, '' '', '''') nd, text text');
      4       ctx_ddl.create_section_group('nd_sg', 'basic_section_group');
      5       ctx_ddl.add_ndata_section('nd_sg', 'nd', 'nd');
      6       ctx_ddl.create_preference('test_lex', 'basic_lexer');
      7       ctx_ddl.set_attribute('test_lex', 'whitespace', '/\|-_+');
      8  END;
      9  /
    
    PL/SQL procedure successfully completed.
    
    SCOTT@orcl_11gR2> CREATE INDEX ix_demo_text
      2       ON demo_text(text)
      3       INDEXTYPE IS ctxsys.context
      4       PARAMETERS ( 'datastore  nd_mcds
      5                  section      group nd_sg
      6                  lexer      test_lex
      7                  stoplist      ctxsys.empty_stoplist' );
    
    Index created.
    
    SCOTT@orcl_11gR2> 
    SCOTT@orcl_11gR2> VARIABLE search_string VARCHAR2(30)
    SCOTT@orcl_11gR2> EXEC :search_string := 'gotham'
    
    PL/SQL procedure successfully completed.
    
    SCOTT@orcl_11gR2> COLUMN text FORMAT A30
    SCOTT@orcl_11gR2> SELECT *
      2    FROM demo_text dte
      3   WHERE contains(
      4              dte.text,
      5              'ndata (nd,' || :search_string || ') or
      6            ?!$' || REPLACE(:search_string, ' ', ';?!$')
      7           ) > 0
      8  /
    
     RECORD_ID TEXT
    ---------- ------------------------------
             8 GOT HAM COMPANY
             9 GOTHAM COMPANY
    
    2 rows selected.
    
    SCOTT@orcl_11gR2> EXEC :search_string := 'XYZ'
    
    PL/SQL procedure successfully completed.
    
    SCOTT@orcl_11gR2> /
    
     RECORD_ID TEXT
    ---------- ------------------------------
            21 X Y Z
            22 XYZ
            23 XY & Z
    
    3 rows selected.
  • 6. Re: Seeing an inconsistency in results from NDATA
    20020 Pro
    Currently Being Moderated
    Hi Barbara,

    Thank you so much for this information - I recreated my preferences and index based on your example and it has corrected my original issue. It appears I've lost something though:

    - if I search on either "XYZ", "X YZ" or "XY Z", I get all three variations back (correct), however if I search on "X Y Z" I only get "X Y Z" back

    - if I search on "GOTHAM" I get "GOT HAM COMPANY" and "GOTHAM COMPANY" (correct), however if I search on "GOT HAM" I only get "GOT HAM COMPANY" back

    Do you see that on your side as well? Here's the output of a call to ctx_report.describe_index('ix_demo_text'), in case I still don't have this set-up correctly on my side:
    ===========================================================================
                                 INDEX DESCRIPTION
    ===========================================================================
    index name:                    "CORP"."IX_DEMO_TEXT"
    index id:                      1185
    index type:                    context
    
    base table:                    "CORP"."DEMO_TEXT"
    primary key column:            RECORD_ID
    text column:                   TEXT
    text column type:              VARCHAR2(200)
    language column:               
    format column:                 
    charset column:                
    configuration column:          
    Query Stats Enabled:           NO
    
    
    status:                        INDEXED
    full optimize token:           
    full optimize count:           
    docid count:                   24
    nextid:                        25
    
    ===========================================================================
                                   INDEX OBJECTS
    ===========================================================================
    datastore:                     MULTI_COLUMN_DATASTORE
       columns:                       replace (text, ' ', '') nd, text text
    
    filter:                        NULL_FILTER
    
    section group:                 BASIC_SECTION_GROUP
       ndata section:                 ND
          section tag:                   ND
    
    lexer:                         BASIC_LEXER
       whitespace:                    /\|-_+
    
    wordlist:                      BASIC_WORDLIST
       stemmer:                       ENGLISH
       fuzzy_match:                   GENERIC
    
    stoplist:                      BASIC_STOPLIST
    
    storage:                       BASIC_STORAGE
       r_table_clause:                lob (data) store as (cache)
       i_index_clause:                compress 2
    Thanks again - it's so close!

    John
  • 7. Re: Seeing an inconsistency in results from NDATA
    Barbara Boehmer Oracle ACE
    Currently Being Moderated
    Please try also removing all spacing from the search string in the contains clause in the query:

    REPLACE (:search_string, ' ', '')

    as shown below.
    SCOTT@orcl_11gR2> CREATE TABLE DEMO_TEXT
      2    (RECORD_ID  NUMBER(9)            NOT NULL,
      3       TEXT        VARCHAR2(200 BYTE)  NOT NULL)
      4  /
    
    Table created.
    
    SCOTT@orcl_11gR2> SET DEFINE OFF
    SCOTT@orcl_11gR2> INSERT ALL
      2  INTO demo_text(record_id, text) VALUES (1, 'INTERNATIONAL BUSINESS MACHINES')
      3  INTO demo_text(record_id, text) VALUES (2, 'FORD MOTOR COMPANY')
      4  INTO demo_text(record_id, text) VALUES (3, 'GENERAL MOTORS')
      5  INTO demo_text(record_id, text) VALUES (4, 'INTERNATIONAL HARVESTER')
      6  INTO demo_text(record_id, text) VALUES (5, 'GENERAL ELECTRIC')
      7  INTO demo_text(record_id, text) VALUES (6, 'CROSS ROADS ENTERTAINMENT')
      8  INTO demo_text(record_id, text) VALUES (7, 'CROSSROADS ENTERTAINMENT')
      9  INTO demo_text(record_id, text) VALUES (8, 'GOT HAM COMPANY')
     10  INTO demo_text(record_id, text) VALUES (9, 'GOTHAM COMPANY')
     11  INTO demo_text(record_id, text) VALUES (10, 'JK HOLDING')
     12  INTO demo_text(record_id, text) VALUES (11, 'JK HOLDINGS')
     13  INTO demo_text(record_id, text) VALUES (12, 'GOT THE GOODS')
     14  INTO demo_text(record_id, text) VALUES (13, 'GET THE GOODS')
     15  INTO demo_text(record_id, text) VALUES (14, 'YOU KNOW IT')
     16  INTO demo_text(record_id, text) VALUES (15, 'YOU KNEW IT')
     17  INTO demo_text(record_id, text) VALUES (16, 'MOVE AWAY')
     18  INTO demo_text(record_id, text) VALUES (17, 'MOVED AWAY')
     19  INTO demo_text(record_id, text) VALUES (18, 'MOVING AWAY')
     20  INTO demo_text(record_id, text) VALUES (19, 'GO GREEN')
     21  INTO demo_text(record_id, text) VALUES (20, 'GOING GREEN')
     22  INTO demo_text(record_id, text) VALUES (21, 'X Y Z')
     23  INTO demo_text(record_id, text) VALUES (22, 'XYZ')
     24  INTO demo_text(record_id, text) VALUES (23, 'XY & Z')
     25  SELECT * FROM DUAL
     26  /
    
    23 rows created.
    
    SCOTT@orcl_11gR2> BEGIN
      2       ctx_ddl.create_preference('nd_mcds', 'multi_column_datastore');
      3       ctx_ddl.set_attribute('nd_mcds', 'columns', 'replace (text, '' '', '''') nd, text text');
      4       ctx_ddl.create_section_group('nd_sg', 'basic_section_group');
      5       ctx_ddl.add_ndata_section('nd_sg', 'nd', 'nd');
      6       ctx_ddl.create_preference('test_lex', 'basic_lexer');
      7       ctx_ddl.set_attribute('test_lex', 'whitespace', '/\|-_+');
      8  END;
      9  /
    
    PL/SQL procedure successfully completed.
    
    SCOTT@orcl_11gR2> CREATE INDEX ix_demo_text
      2       ON demo_text(text)
      3       INDEXTYPE IS ctxsys.context
      4       PARAMETERS ( 'datastore  nd_mcds
      5                  section      group nd_sg
      6                  lexer      test_lex
      7                  stoplist      ctxsys.empty_stoplist' );
    
    Index created.
    
    SCOTT@orcl_11gR2> 
    SCOTT@orcl_11gR2> VARIABLE search_string VARCHAR2(30)
    SCOTT@orcl_11gR2> EXEC :search_string := 'gotham'
    
    PL/SQL procedure successfully completed.
    
    SCOTT@orcl_11gR2> COLUMN text FORMAT A30
    SCOTT@orcl_11gR2> SELECT *
      2    FROM demo_text dte
      3   WHERE contains(
      4              dte.text,
      5              'ndata (nd,' || REPLACE (:search_string, ' ', '') || ') or
      6            ?!$' || REPLACE(:search_string, ' ', ';?!$')
      7           ) > 0
      8  /
    
     RECORD_ID TEXT
    ---------- ------------------------------
             8 GOT HAM COMPANY
             9 GOTHAM COMPANY
    
    2 rows selected.
    
    SCOTT@orcl_11gR2> EXEC :search_string := 'got ham'
    
    PL/SQL procedure successfully completed.
    
    SCOTT@orcl_11gR2> /
    
     RECORD_ID TEXT
    ---------- ------------------------------
             8 GOT HAM COMPANY
             9 GOTHAM COMPANY
    
    2 rows selected.
    
    SCOTT@orcl_11gR2> EXEC :search_string := 'XYZ'
    
    PL/SQL procedure successfully completed.
    
    SCOTT@orcl_11gR2> /
    
     RECORD_ID TEXT
    ---------- ------------------------------
            21 X Y Z
            22 XYZ
            23 XY & Z
    
    3 rows selected.
    
    SCOTT@orcl_11gR2> EXEC :search_string := 'X Y Z'
    
    PL/SQL procedure successfully completed.
    
    SCOTT@orcl_11gR2> /
    
     RECORD_ID TEXT
    ---------- ------------------------------
            21 X Y Z
            22 XYZ
            23 XY & Z
    
    3 rows selected.
  • 8. Re: Seeing an inconsistency in results from NDATA
    20020 Pro
    Currently Being Moderated
    That did it Barbara - thanks so much again for your help!

    John

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points