1 Reply Latest reply: May 1, 2012 6:44 AM by Roger Ford-Oracle

    Arabic PDF text stored in reverse order in the DR$ index tables

    Omar M Sawalhah
      Hi All,
      I hope somebody can help with this case.

      I need to build a domain index to search within PDF files (Arabic, English, or mixed) stored in BLOB columns in the database.
      Here is some of the content of my file:
      استعمل مصطلح التنمية في اللغتين الإنجليزية والفرنسية، للدلالة على
      زيادة الشيء وتوسعه عبر مراحل مختلفة، و في مجالات معرفية متعددة،

      Here is what I have done; I copied this from your messages.

      1. CREATE table RPT_SOURCE
      (SOURCE_ID NUMBER (10,0) NOT NULL ENABLE,
      BODY_FORMAT VARCHAR2 (10) NOT NULL ENABLE,
      BODY     BLOB     default empty_blob(),
      CONSTRAINT pk_rpt_source_id PRIMARY KEY (source_id));
      2. begin load_file(1,'8187.pdf'); end;

      create or replace PROCEDURE load_file (
      p_id number,
      pfname VARCHAR2) IS

      src_file BFILE;
      dst_file BLOB;
      lgh_file BINARY_INTEGER;
      BEGIN
      src_file := bfilename('BLOBDIR',pfname);

      -- insert a row with an empty BLOB and return its locator
      INSERT INTO rpt_source (source_id, body_format, body) VALUES
      (p_id, 'BINARY', EMPTY_BLOB())
      RETURNING body INTO dst_file;

      -- lock record
      SELECT body
      INTO dst_file
      FROM rpt_source
      WHERE source_id = p_id
      FOR UPDATE;

      -- open the file
      dbms_lob.fileopen(src_file, dbms_lob.file_readonly);

      -- determine length
      lgh_file := dbms_lob.getlength(src_file);

      -- read the file
      dbms_lob.loadfromfile(dst_file, src_file, lgh_file);

      -- update the blob field
      UPDATE rpt_source
      SET body = dst_file
      WHERE source_id = p_id;

      -- close file
      dbms_lob.fileclose(src_file);
      END load_file;

      3. CREATE INDEX idx_txt_body ON RPT_SOURCE (BODY)
      INDEXTYPE IS CTXSYS.CONTEXT
      PARAMETERS
      ('FILTER CTXSYS.AUTO_FILTER format column BODY_FORMAT');
      4. select * from ctx_user_index_errors;
      no rows selected.

      5. select * from rpt_source
      where contains(body, 'مصطلح') > 0;

      no rows selected.

      Searching for one of the terms from the PDF in the correct direction (above) returns no rows. But when I search for the same term reversed, that is, with the Arabic characters written left-to-right, I get a hit:

      6. select source_id from rpt_source
      where contains(body, 'حلطصم') > 0;

      source_id
      1

      As you can see, I get a result. I then checked the index token table:

      7. select * from dr$idx_txt_body$i
      where token_text = 'مصطلح';
      no rows selected.

      8. select * from dr$idx_txt_body$i
      where token_text = 'حلطصم';

      حلطصم     0     1     1     1     (BLOB)
      When I checked dr$idx_txt_body$i again, all the Arabic words appear in reverse order. Sorry, but I have to include Arabic words here; I am not sure whether you can tell that they are reversed. Note that English words are indexed correctly.
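
      One check that might narrow this down is to look at the text the filter itself produces, before any tokens are written to the index. The block below is only a sketch and not one of the steps I ran above: it assumes the in-memory form of CTX_DOC.FILTER, the default primary-key textkey ('1' for the row loaded in step 2), and a client that can display Arabic.

```sql
SET SERVEROUTPUT ON

DECLARE
  -- left NULL on purpose: CTX_DOC.FILTER allocates a temporary CLOB for the result
  filtered CLOB;
BEGIN
  -- run the index's filter on the document with source_id = 1,
  -- asking for plain text rather than HTML output
  ctx_doc.filter('idx_txt_body', '1', filtered, plaintext => TRUE);

  -- print the first 500 characters of the filtered text
  dbms_output.put_line(dbms_lob.substr(filtered, 500, 1));

  -- the temporary CLOB has to be freed by the caller
  dbms_lob.freetemporary(filtered);
END;
/
```

      If the Arabic words already come out reversed in this output, that would suggest the reversal happens while the PDF is being filtered rather than in the lexer or in the query.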

      Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
      With the Partitioning, OLAP, Data Mining and Real Application Testing options

      PDF version 1.4 (Acrobat 5.x)

      Thanks in advance.
      Omar