This discussion is archived
11 Replies Latest reply: Jan 17, 2011 7:58 AM by 4999 RSS

Strange problem with REGEXP_LIKE

4999 Newbie
Currently Being Moderated
Dear all,

I am facing a strange problem with a query using the function REGEXP_LIKE.
I want to filter a set of strings which can contain some special characters.

See this:
SELECT * FROM (
SELECT 1 line_no, '01!' char_string FROM dual
UNION
SELECT 2 line_no, '02"' char_string FROM dual
UNION
SELECT 3 line_no, '03/' char_string FROM dual
UNION
SELECT 4 line_no, '04(' char_string FROM dual
UNION
SELECT 5 line_no, '05)' char_string FROM dual
UNION
SELECT 6 line_no, '06=' char_string FROM dual
UNION
SELECT 7 line_no, '07''' char_string FROM dual
UNION
SELECT 8 line_no, '08#' char_string FROM dual
UNION
SELECT 9 line_no, '09+' char_string FROM dual
UNION
SELECT 10 line_no, '10-' char_string FROM dual
UNION
SELECT 11 line_no, '11,' char_string FROM dual
UNION
SELECT 12 line_no, '12_' char_string FROM dual
UNION
SELECT 13 line_no, '13&' char_string FROM dual
UNION
SELECT 14 line_no, '14 ' char_string FROM dual
UNION
SELECT 15 line_no, '15.' char_string FROM dual
UNION
SELECT 16 line_no, '16,' char_string FROM dual
UNION
SELECT 17 line_no, '17@' char_string FROM dual
UNION
SELECT 18 line_no, '18:' char_string FROM dual
UNION
SELECT 19 line_no, '19;' char_string FROM dual
UNION
SELECT 20 line_no, '20[' char_string FROM dual
UNION
SELECT 21 line_no, '21]' char_string FROM dual
UNION
SELECT 22 line_no, '22\' char_string FROM dual
UNION
SELECT 23 line_no, '23<' char_string FROM dual
UNION
SELECT 24 line_no, '24>' char_string FROM dual
UNION
SELECT 25 line_no, '25?' char_string FROM dual
UNION
SELECT 26 line_no, '26*' char_string FROM dual
UNION
SELECT 27 line_no, '27$' char_string FROM dual
)
WHERE REGEXP_LIKE(char_string, '[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz[:digit:][:blank:][\!\"\/\(\)\=\''\#\+\-\,\_\&\%\ \.\,\@\:\;\[\]\\\<\>\?]]+')
;
This query should return all lines up to no 25. Lines 26 and 27 should be filtered out.
But this query doesn't return any lines.

Some remarks :
I do list all valid special characters explicitly because [A-Za-z] or [:alpha:] does not work properly for NLS-specific characters (which should be filtered out).
I do not use char class [:punct:] because characters like "*" or "$" (which are part of class [:punct:]) should be filtered out as well (see lines 26 and 27).

First I doubted the char "]" to end the last char class, but as this is escaped with char "\" this should work in the expected way.
Query does not work either if I leave out the square brackets for the char class of special characters.
I also tried to replace "\[\]" with the corresponding hex codes ("\x5B\x5D") but this returns all lines, including lines 26 and 27.
Same result when I leave out "\[\]" .

This seems kind of odd to me.

Does any body have some ideas on this?

Regards
Kai
  • 1. Re: Strange problem with REGEXP_LIKE
    MichaelS Guru
    Currently Being Moderated
    First I doubted the char "]" to end the last char class, but as this is escaped with char "\" this should work in the expected way.
    check Re: REGEXP_REPLACE with ^ character !
  • 2. Re: Strange problem with REGEXP_LIKE
    BluShadow Guru Moderator
    Currently Being Moderated
    Square brackets at the start and minus sign at the end... oh, and you don't need to escape all the punctuation characters, just the "\" and obviously the single quote...
    SQL> ed
    Wrote file afiedt.buf
    
      1  SELECT * FROM (
      2  SELECT 1 line_no, '01!' char_string FROM dual UNION ALL
      3  SELECT 2 line_no, '02"' char_string FROM dual UNION ALL
      4  SELECT 3 line_no, '03/' char_string FROM dual UNION ALL
      5  SELECT 4 line_no, '04(' char_string FROM dual UNION ALL
      6  SELECT 5 line_no, '05)' char_string FROM dual UNION ALL
      7  SELECT 6 line_no, '06=' char_string FROM dual UNION ALL
      8  SELECT 7 line_no, '07''' char_string FROM dual UNION ALL
      9  SELECT 8 line_no, '08#' char_string FROM dual UNION ALL
     10  SELECT 9 line_no, '09+' char_string FROM dual UNION ALL
     11  SELECT 10 line_no, '10-' char_string FROM dual UNION ALL
     12  SELECT 11 line_no, '11,' char_string FROM dual UNION ALL
     13  SELECT 12 line_no, '12_' char_string FROM dual UNION ALL
     14  SELECT 13 line_no, '13&' char_string FROM dual UNION ALL
     15  SELECT 14 line_no, '14 ' char_string FROM dual UNION ALL
     16  SELECT 15 line_no, '15.' char_string FROM dual UNION ALL
     17  SELECT 16 line_no, '16,' char_string FROM dual UNION ALL
     18  SELECT 17 line_no, '17@' char_string FROM dual UNION ALL
     19  SELECT 18 line_no, '18:' char_string FROM dual UNION ALL
     20  SELECT 19 line_no, '19;' char_string FROM dual UNION ALL
     21  SELECT 20 line_no, '20[' char_string FROM dual UNION ALL
     22  SELECT 21 line_no, '21]' char_string FROM dual UNION ALL
     23  SELECT 22 line_no, '22\' char_string FROM dual UNION ALL
     24  SELECT 23 line_no, '23&lt;' char_string from dual UNION ALL
     25  SELECT 23 line_no, '24&gt;' char_string FROM dual UNION ALL
     26  SELECT 25 line_no, '25?' char_string FROM dual UNION ALL
     27  SELECT 26 line_no, '26*' char_string FROM dual UNION ALL
     28  SELECT 27 line_no, '27$' char_string FROM dual)
     29* WHERE REGEXP_LIKE(char_string, '[][ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz[:blank:]!"/()=''#+,_&.@:;\\&lt;&gt;?-]+')
    SQL> /
    
       LINE_NO CHA
    ---------- ---
             1 01!
             2 02"
             3 03/
             4 04(
             5 05)
             6 06=
             7 07'
             8 08#
             9 09+
            10 10-
            11 11,
            12 12_
            13 13&
            14 14
            15 15.
            16 16,
            17 17@
            18 18:
            19 19;
            20 20[
            21 21]
            22 22\
            23 23<
            23 24>
            25 25?
    
    25 rows selected.
    {code}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
  • 3. Re: Strange problem with REGEXP_LIKE
    4999 Newbie
    Currently Being Moderated
    Thanks for your input, Michael.
    So experimented with using the special characters square brackets as the first characters and the minus sign as the last.

    But ...
    ... changing my WHERE-clause to
    WHERE REGEXP_LIKE(char_string, '[][ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz[:digit:][:blank:][\!\"\/\(\)\=\''\#\+\,\_\&\%\ \.\,\@\:\;\\\<\>\?\-]]+')
    {code}
    results in line 21 ("21]") only (which is not correct).
    
    ... changing it to
    {code}
    WHERE REGEXP_LIKE(char_string, '[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz[:digit:][:blank:][[]\!\"\/\(\)\=\''\#\+\,\_\&\%\ \.\,\@\:\;\\\<\>\?\-]]+')
    shows no results at all.

    ... changing it to
    WHERE REGEXP_LIKE(char_string, '[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz[:digit:][:blank:][\[\]\!\"\/\(\)\=\''\#\+\,\_\&\%\ \.\,\@\:\;\\\<\>\?\-]]+')
    {code}
    
    shows no results at all, too.
    
    So this is no improvement.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
  • 4. Re: Strange problem with REGEXP_LIKE
    MichaelS Guru
    Currently Being Moderated
    Probably you need the $ (-end) sign as well:
    SQL> select * from (
        select 1 line_no, '01!' char_string from dual union
        select 2 line_no, '02"' char_string from dual union
        select 3 line_no, '03/' char_string from dual union
        select 4 line_no, '04(' char_string from dual union
        select 5 line_no, '05)' char_string from dual union
        select 6 line_no, '06=' char_string from dual union
        select 7 line_no, '07''' char_string from dual union
        select 8 line_no, '08#' char_string from dual union
        select 9 line_no, '09+' char_string from dual union
        select 10 line_no, '10-' char_string from dual union
        select 11 line_no, '11,' char_string from dual union
        select 12 line_no, '12_' char_string from dual union
        select 13 line_no, '13&' char_string from dual union
        select 14 line_no, '14 ' char_string from dual union
        select 15 line_no, '15.' char_string from dual union
        select 16 line_no, '16,' char_string from dual union
        select 17 line_no, '17@' char_string from dual union
        select 18 line_no, '18:' char_string from dual union
        select 19 line_no, '19;' char_string from dual union
        select 20 line_no, '20[' char_string from dual union
        select 21 line_no, '21]' char_string from dual union
        select 22 line_no, '22' char_string from dual union
        select 23 line_no, '23<''24>' char_string from dual union
        select 25 line_no, '25?' char_string from dual union
        select 26 line_no, '26*' char_string from dual union
        select 27 line_no, '27$' char_string from dual
    )
    --
    --
    where regexp_like(char_string, '[][ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz[:digit:][:blank:]!"/()=''#+,_&% .,@:;<>?-]+$')
    /
       LINE_NO CHAR_STRING
    ---------- -----------
             1 01!        
             2 02"        
             3 03/        
             4 04(        
             5 05)        
             6 06=        
             7 07'        
             8 08#        
             9 09+        
            10 10-        
            11 11,        
            12 12_        
            13 13&        
            14 14         
            15 15.        
            16 16,        
            17 17@        
            18 18:        
            19 19;        
            20 20[        
            21 21]        
            22 22         
            23 23<'24>    
            25 25?        
    
    24 rows selected.
    {code}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
  • 5. Re: Strange problem with REGEXP_LIKE
    4999 Newbie
    Currently Being Moderated
    Thanks, Michael.

    I did not update this thread during my "experiments", so I did not notice your last reply.

    But now it works like I expected it.

    Made my day
  • 6. Re: Strange problem with REGEXP_LIKE
    4999 Newbie
    Currently Being Moderated
    Update

    Tried your tip in the "Real" world.
    I want to get a list of job titles which contain invalid characters.

    ... like this
    SELECT
    to_char(rownum,'009') rownumber,
    JOB_TITLE 
    FROM PERSON
    WHERE REGEXP_LIKE(JOB_TITLE, '[^[]ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz[:digit:][:blank:]!"/()=''#+,_&.@:;\\<>?-]+')
    AND ROWNUM  <= 25
    {code}
    
    ... and it returned no lines at all.
    
    Please note that I here use a negation of the allowed characters.
    
    To give you an impression of the expected output I removed the square brackets (which produces wrong output, too)
    means
    {code}
    WHERE REGEXP_LIKE(JOB_TITLE, '[^ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz[:digit:][:blank:]!"/()=''#+,_&.@:;\\<>?-]+')
    results in
     001     DIV. MGR.
     002     OFFICE MGR.
     003     DRAY MGR.
     004     DRAY MGR.
     005     TERM. MGR.
     006     TERM. MGR.
     007     DIV. MGR.
     008     LEITER ÜBERSEE
     009     MANAGER LEAF LOG.
     010     HANDLUNGSBEVOLLM.
     011     ING.
     012     BRANCH OFFICE MGR.
     013     PROD. MGR SEAFREIGHT
     014     ASSISTANT CHART. MGR
     015     LTG. ZENTR. VERKAUF
     016     LEITER DER GESCHÄFTS
     017     STELLV. ABTEILUNGSLE
     018     USA I.V.
     019     EDI COMPETENCE CTR.
     020     PURCHASING PROC.MANA
     021     ING.
     022     DR.
     023     DIPL.-KAUFFRAU
     024     IMPORT&EXPORT REP.
     025     OPERATION MER.
    Only the row numbers 008 and 016 are expected in the output.

    So, not only that the square brackets as the first characters won't work even job titles with "." and "-" are listed which should not.

    Any ideas what went wrong here?

    Kai
  • 7. Re: Strange problem with REGEXP_LIKE
    MichaelS Guru
    Currently Being Moderated
    note that I here use a negation of the allowed characters.
    what if you simply use
     where NOT regexp_like (.....)
    ?
  • 8. Re: Strange problem with REGEXP_LIKE
    4999 Newbie
    Currently Being Moderated
    This ...
    WHERE NOT REGEXP_LIKE(JOB_TITLE, '[[]ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz[:digit:][:blank:]!"/()=''#+,_&.@:;\\<>?-]+')
    {code}
    ... returns all rows where JOB_TITLE is not null
    
    This ... (without square brackets)
    {code}
    WHERE NOT REGEXP_LIKE(JOB_TITLE, '[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz[:digit:][:blank:]!"/()=''#+,_&.@:;\\<>?-]+')
    returns no rows.
  • 9. Re: Strange problem with REGEXP_LIKE
    Sven W. Guru
    Currently Being Moderated
    As a new approach you can use the translate function. It should be faster than any regexp_ expression and seems perfectly suited for your task.

    untested
    select
       to_char(rownum,'000') rownumber,
       JOB_TITLE 
    from PERSON
    where trim(translate (job_title, 
                      'Ä[]ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 !"/()=''#+,_&..@:;\\?-]+',
                      'Ä')) is not null
    and rownum <= 25;
    {code}
    
    Translate will replace the specified chars. In this case it will remove all the chars from the list. Ä will be replaced with Ä (translate needs a not null third parameter). Any remaining chars are the special characters that you are looking for.
    
    Edited by: Sven W. on Jan 17, 2011 4:22 PM                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
  • 10. Re: Strange problem with REGEXP_LIKE
    MichaelS Guru
    Currently Being Moderated
    SQL> with t as (
     select 001 id, 'DIV. MGR.' job_title from dual union all
     select 002, 'OFFICE MGR.' from dual union all
     select 003, 'DRAY MGR.' from dual union all
     select 004, 'DRAY MGR.' from dual union all
     select 005, 'TERM. MGR.' from dual union all
     select 006, 'TERM. MGR.' from dual union all
     select 007, 'DIV. MGR.' from dual union all
     select 008, 'LEITER ÜBERSEE' from dual union all
     select 009, 'MANAGER LEAF LOG.' from dual union all
     select 010, 'HANDLUNGSBEVOLLM.' from dual union all
     select 011, 'ING.' from dual union all
     select 012, 'BRANCH OFFICE MGR.' from dual union all
     select 013, 'PROD. MGR SEAFREIGHT' from dual union all
     select 014, 'ASSISTANT CHART. MGR' from dual union all
     select 015, 'LTG. ZENTR. VERKAUF' from dual union all
     select 016, 'LEITER DER GESCHÄFTS' from dual union all
     select 017, 'STELLV. ABTEILUNGSLE' from dual union all
     select 018, 'USA I.V.' from dual union all
     select 019, 'EDI COMPETENCE CTR.' from dual union all
     select 020, 'PURCHASING PROC.MANA' from dual union all
     select 021, 'ING.' from dual union all
     select 022, 'DR.' from dual union all
     select 023, 'DIPL.-KAUFFRAU' from dual union all
     select 024, 'IMPORT&EXPORT REP.' from dual union all
     select 025, 'OPERATION MER.' from dual
    )
    --
    --
    select * from t
     where not regexp_like(job_title, '^[][ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz[:digit:][:blank:]!\"/()=''#+,_&.@:;?-]+$')
    /
            ID JOB_TITLE           
    ---------- --------------------
             8 LEITER ÜBERSEE      
            16 LEITER DER GESCHÄFTS
    
    2 rows selected.
    {code}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
  • 11. Re: Strange problem with REGEXP_LIKE
    4999 Newbie
    Currently Being Moderated
    Nice idea, really.

    But, there seems to be a problem with the point char "."

    I used this query with the translated_job_title as the second column to make it visible here.
    SELECT
       to_char(ROWNUM,'000') rownumber,
       trim (translate (job_title, 
                      'Ä[]ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 !"/()=''#+,_&.@:;\\?-',
                      'Ä')) translated_job_title
       ,JOB_TITLE
    FROM PERSON
    WHERE trim (translate (job_title, 
                      'Ä[]ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 !"/()=''#+,_&.@:;\\?-',
                      'Ä')) IS NOT NULL
    AND ROWNUM <= 25
    ... which resulted in ...
     001     ..     DIV. MGR.
     002     .     OFFICE MGR.
     003     .     DRAY MGR.
     004     .     DRAY MGR.
     005     ..     TERM. MGR.
     006     ..     TERM. MGR.
     007     ..     DIV. MGR.
     008     Ü     LEITER ÜBERSEE
     009     .     MANAGER LEAF LOG.
     010     .     HANDLUNGSBEVOLLM.
     011     .     ING.
     012     .     BRANCH OFFICE MGR.
     013     .     PROD. MGR SEAFREIGHT
     014     .     ASSISTANT CHART. MGR
     015     ..     LTG. ZENTR. VERKAUF
     016     Ä     LEITER DER GESCHÄFTS
     017     .     STELLV. ABTEILUNGSLE
     018     ..     USA I.V.
     019     .     EDI COMPETENCE CTR.
     020     .     PURCHASING PROC.MANA
     021     .     ING.
     022     .     DR.
     023     .     DIPL.-KAUFFRAU
     024     .     IMPORT&EXPORT REP.
     025     .     OPERATION MER.
    So, besides the records with a point in the job_title column it works perfect.

    I now tried to escape the point character like I would do in a regular expression ....
    SELECT
       to_char(ROWNUM,'000') rownumber,
       trim (translate (job_title, 
                      'Ä[]ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 !"/()=''#+,_&\.@:;\\?-',
                      'Ä')) translated_job_title
       ,JOB_TITLE
    FROM PERSON
    WHERE trim (translate (job_title, 
                      'Ä[]ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 !"/()=''#+,_&\.@:;\\?-',
                      'Ä')) IS NOT NULL
    AND ROWNUM <= 25
    ... and it worked!

    Result:
     001     Ü     LEITER ÜBERSEE
     002     Ä     LEITER DER GESCHÄFTS
     003     ¿     SALES REPRESENTAT¿VE
     004     Ç     ADMINISTRAÇAO VENDAS
     005     ÇÕ     GERENTE OPERAÇÕES
     006     Ñ     DUEÑA DE LAMAR
     007     ÇÕ     GERENTE DE OPERAÇÕES
     008     Í     COMPRAS E LOGÍSTICA
     009     Í     LOGÍSTICA
     010     Á     PROPRIETÁRIO
     011     Á     ENGENHEIRO RESPONSÁV
     012     Á     PROPRIETÁRIO
     013     Á     PROPRIETÁRIO
     014     Ó     SÓCIO
     015     Çà    EXPEDIÇÃO
     016     Ó     SÓCIO
     017     ÇÕ     OPERAÇÕES
     018     Ý     OPERATÝON MANAGER
     019     Ó     SÓCIO
     020     Á     PROPRIETÁRIO
     021     Í     COMPRAS E LOGÍSTICA
     022     Í     LOGÍSTICA
     023     Á     PROPRIETÁRIO DUNORTE
     024     ÄÜ     GESCHÄFTSFÜHRERIN
     025     Ä     STELLV. GESCHÄFTSF.
     026     Í     LOGÍSTICA
     027     ÝÝ     MARKETÝNG EXECUTÝVE
     028     Ý     EXPORT DÝRECTOR
     029     Ý     MARKETÝNG & ACC MNGR
     030     Ý     CHAÝRPERSON
     031     ÝÝ     MANAGÝNG DÝRECTOR
     032     Ý     NATÝONAL S/F MNGR
     033     ÝÝ     VÝCE PRESÝDENT
     034     Ñ     GER ING Y DISEÑO
    which shows all records with invalid characters in column job_title.

    Thank you Sven.
    You put me on the right track.

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points