1 2 Previous Next 21 Replies Latest reply on Nov 28, 2012 9:17 AM by 972199

    Retrieving columns with non-english characters?

    972199
      Hello,

      I'm running a query using SQL Developer on a table which contains several companies names from many different countries, and one of the checks I need to make to ensure data consistency is to search for all rows which the name of company contains special or non-english characters (like ç, ã, ä as example).

      I don't know what can I use to do this. I tried to collate using NLS_SORT but it didn't work.

      Is there someway to select only the rows that contain these special or non-english characters, excluding from the results the rows that only have english characters? Please have in mind that we have many languages in this table.

      The field I would like to make the conditions on is VARCHAR2.

      Please let me know if there is any extra information I should provide you so that you can help me.

      Thank you in advance for the help.

      Regards,
      Luís
        • 1. Re: Retrieving columns with non-english characters?
          ranit B
          Try using a REGEX.. where you check for all characters outside the ASCII range.

          Something like ...
          WITH t AS (
              SELECT 'TEMNÇON' str FROM DUAL UNION ALL
              SELECT 'TEMNCON' FROM DUAL UNION ALL
              SELECT 'Heãllço' FROM DUAL
          )
          SELECT str "Non English Chars"
              FROM t
          WHERE REGEXP_LIKE (str, '[^ -~]', 'i');
          gives
          TEMNÇON
          Heãllço
          HTH
          Ranit B.

          Edited by: ranit B on Nov 26, 2012 12:29 PM
          -- o/p added
          1 person found this helpful
          • 2. Re: Retrieving columns with non-english characters?
            Manik
            Check for CONVERT in oracle documentation.

            e.g.
            WITH t AS
                    (SELECT 'TEMNÇON' str FROM DUAL
                     UNION
                     SELECT 'TEMNCON' FROM DUAL)
            SELECT str,
                   CASE
                      WHEN str != CONVERT ('TEMNÇON', 'US7ASCII') THEN
                         'Non English Character'
                   END
                      chk
              FROM t;
            Cheers,
            Manik.

            Edited by: BluShadow on 26-Nov-2012 07:59
            replace <> with != as the forum has difficulty displaying <>
            • 3. Re: Retrieving columns with non-english characters?
              ranit B
              Also useful... to learn...
              WITH t AS (
                  SELECT 'TEMNÇON' str FROM DUAL UNION ALL
                  SELECT 'TEMNCON' FROM DUAL UNION ALL
                  SELECT 'Heãllço' FROM DUAL
              )
              select REGEXP_REPLACE (str, '[^ -~]', '') from t;
              gives
              TEMNON
              TEMNCON
              Hello
              This removes any special character i.e. any char outside the range of a Space(Ascii = 32) to Tilde(Ascii = 126).


              HTH
              Ranit B.

              Edited by: ranit B on Nov 26, 2012 2:40 PM
              -- o/p added
              • 4. Re: Retrieving columns with non-english characters?
                972199
                Hello Ranit,

                Thank you for you reply.

                I'm currently testing the version with WHERE REGEXP_LIKE (str, '[^ -~]', 'i').
                This looks to work well, I'm currently extrating the query results to verify how it works with some specific languages.

                The select REGEXP_REPLACE (str, '[^ -~]', '') from t; expression might not work for me. Because I don't want to replace the characters, I wan to be able to view which are these letters.

                I'll let you know as soon as possible the result.

                Regards,
                Luis
                • 5. Re: Retrieving columns with non-english characters?
                  Manik
                  If you are fond of REGEXP (which is debatable)
                  Please check this:
                  WITH t AS
                          (SELECT 'TEMNÇON' str FROM DUAL
                           UNION ALL
                           SELECT 'TEMNCON' FROM DUAL
                           UNION ALL
                           SELECT 'Heãllço' FROM DUAL)
                  SELECT str, case when REGEXP_INSTR (str, '[^ -~]', 1)>0 then 'Non English Letter' end
                    FROM t;
                  Cheers,
                  Manik.
                  • 6. Re: Retrieving columns with non-english characters?
                    ranit B
                    Glad that it helped...

                    Ok. REGEXP_REPLACE was just to let you know. So know you yourself have found out REGEXP_LIKE . Coool.

                    Try using this ...
                    /* Formatted on 11-27-2012 3:15:58 PM (QP5 v5.163.1008.3004) */
                    WITH t
                         AS (SELECT 'TEMNÇON' str FROM DUAL
                             UNION ALL
                             SELECT 'TEMNCON' FROM DUAL
                             UNION ALL
                             SELECT 'Heãllço' FROM DUAL)
                    SELECT str
                      FROM t
                     WHERE REGEXP_LIKE (str, '[^ -~]', 'i');
                    HTH
                    Ranit B.

                    Edited by: ranit B on Nov 27, 2012 3:29 PM
                    • 7. Re: Retrieving columns with non-english characters?
                      ranit B
                      If you are fond of REGEXP (which is debatable)
                      Can you please explain this?
                      SELECT str, case when REGEXP_INSTR (str, '[^ -~]', 1)>0 then 'Non English Letter' end
                      Why to go for REGEXP_INSTR when Oracle has provided REGEXP_LIKE which exactly suits our requirement.

                      Ranit B.
                      • 8. Re: Retrieving columns with non-english characters?
                        Manik
                        Hi Ranit my Frnd, I am just trying to explain that REGULAR EXPRESSIONS are CPU intensive (according to many other gurus including TOM).
                        You may find this info over internet and our own oracle forums..

                        Again it all depends on number of rows what OP is processing and many other factros.. so I used the term DEBATABLE :) hope u understand buddy!

                        Cheers,
                        Manik.
                        • 9. Re: Retrieving columns with non-english characters?
                          ranit B
                          Manik wrote:
                          Hi Ranit my Frnd, I am just trying to explain that REGULAR EXPRESSIONS are CPU intensive (according to many other gurus including TOM).
                          You may find this info over internet and our own oracle forums..

                          Again it all depends on number of rows what OP is processing and many other factros.. so I used the term DEBATABLE :) hope u understand buddy!

                          Cheers,
                          Manik.
                          Yes Manik. Point accepted. :-)

                          But, the factors you mentioned come into picture, only when we have 2 functions serving their primary purpose properly.
                          This doesn't seem like the prime purpose of CONVERT, but REGEXP_LIKE.

                          I guess the Character Encoding you used 'US7ASCII'... wiil it behave properly for other Encoding or the database NLS settings?
                          I'm a bit dubious in this part.
                          WHEN str != CONVERT ('TEMNÇON', 'US7ASCII') THEN
                          Btw, it's always good brainstorming the ideas & choose the best out of it. B-)
                          • 10. Re: Retrieving columns with non-english characters?
                            Manik
                            Yes it should: check this and try a test case yourself where it fails..

                            http://docs.oracle.com/cd/B28359_01/server.111/b28286/functions027.htm

                            •US7ASCII: US 7-bit ASCII character set
                            •WE8ISO8859P1: ISO 8859-1 West European 8-bit character set
                            •EE8MSWIN1250: Microsoft Windows East European Code Page 1250
                            •WE8MSWIN1252: Microsoft Windows West European Code Page 1252
                            •WE8EBCDIC1047: IBM West European EBCDIC Code Page 1047
                            •JA16SJISTILDE: Japanese Shift-JIS Character Set, compatible with MS Code Page 932
                            •ZHT16MSWIN950: Microsoft Windows Traditional Chinese Code Page 950
                            •UTF8: Unicode 3.0 Universal character set CESU-8 encoding form
                            •AL32UTF8: Unicode 5.0 Universal character set UTF-8 encoding form


                            Cheers,
                            Manik.
                            • 11. Re: Retrieving columns with non-english characters?
                              ranit B
                              Manik wrote:
                              Yes it should: check this and try a test case yourself where it fails..

                              http://docs.oracle.com/cd/B28359_01/server.111/b28286/functions027.htm
                              A snippet from the same page...
                              Note:
                              Oracle discourages the use of the CONVERT function in the current Oracle Database release. 
                              The return value of CONVERT has a character datatype, so it should be either in the database 
                              character set or in the national character set, depending on the datatype.
                              Did you check this?
                              • 12. Re: Retrieving columns with non-english characters?
                                972199
                                Hello Ranit,

                                I bwelieve that this identifies also the comma, dot and parentesis.

                                Is there the possibility of excluding some characters like these if I need? I should just change the group of characters inside [^ ]?

                                Thank you very much for your help.

                                Regards,
                                Luis
                                • 13. Re: Retrieving columns with non-english characters?
                                  ranit B
                                  969196 wrote:
                                  Hello Ranit,

                                  I bwelieve that this identifies also the comma, dot and parentesis.

                                  Is there the possibility of excluding some characters like these if I need? I should just change the group of characters inside [^ ]?

                                  Thank you very much for your help.

                                  Regards,
                                  Luis
                                  I didn't get your question exactly? Do you want the comma/dot/paranthesis to be included or not?
                                  Please post some sample data


                                  Let me explain you the REGEX, then you can mould it as per your requirement...
                                  '[^ -~]'
                                  Means anything outside the range of a Space character(Ascii = 32) to a Tilde character (Ascii = 126)
                                  This is range from Ascii = 32 to 126 is actually the range of printable characters (i.e. Keyboard characters) and anything outside this range falls under Non-Keyboard characters.

                                  The caret symbol '^' inside the square braces '[]' mean Negation of the data set present inside.

                                  Edited by: ranit B on Nov 27, 2012 6:19 PM
                                  • 14. Re: Retrieving columns with non-english characters?
                                    972199
                                    Hello,

                                    I'm sorry if my question is not clear.

                                    I don't want comma to be included.
                                    If the column value is like:

                                    Behringer Harvard Eldridge Venture, LLC
                                    Behringer Harvard Redwood, LLC
                                    Behringer Harvard Lovers Lane Venture I, LLC
                                    Atlas Pipeline Mid-Continent KansOK, LLC

                                    These values should not be displayed, as for the purpose they do not contain any special character.

                                    But these values should be displayed:
                                    AB JÄRNBJÖRNEN
                                    FRONTBILAR I GÖTEBORG AB
                                    ARANDA HERMANOS MUEBLERÍA, S.A. DE C.V.

                                    How can I change the range of characters as an example from [^ -~] into something like this [^ -% ] plus [^*-~], so that exclude some characters from the results?

                                    Maybe my question is very basic, but I think that my difficulty now is how to manipulate this range of characters. :)

                                    Once again, thank you for your help and patience.

                                    Regards,
                                    Luis
                                    1 2 Previous Next