9 Replies Latest reply: Mar 13, 2012 6:32 AM by 581144 RSS

    Get rows not containing UTF-8 characters.

    581144
      Hi guys,

      The first part is just general information. If you want to go straight to my question go to the bottom.

      Our business has recently loaded a lot of data from a legacy system to our database.
      I have found out that there are some problems with the data added to our database.
      Some characters which is not UTF-8(?) characters has been added to a column (LONG) in the table fnd_documents_long_text.

      I noticed the data error because XML Publisher gave me the following warning:
      -----
      java.io.UTFDataFormatException: Invalid UTF8 encoding.


      I then copied the output xml from the server to my own desktop. When I tries to open the XML file in my editor it replaces a character with a Unicode substitution character:
      -----
      Some bytes have been replace with the Unicode substitution character while..


      Now. I found the line causing the error in XML Publisher and my editor.

      The line is:
      <LONG_TEXT>COPPER WIRE TO BS 4516 PT.1 [xCF].25 PVA GR.2 (LEWMEX) MAX REEL SIZE 25KG</LONG_TEXT>
      The [xCF] are what raises the whole problem. When I try to copy the character into the editor in the forum it shows ϱ

      QUESTION
      I want to find all the attachments (rows in fnd_documents_long_text), which will cause XML Publisher to fail executing a report request.

      I have created the following PL/SQL Block to identify the rows that needs correction:
      DECLARE
        CURSOR c1 IS
          SELECT  media_id,
                  long_text
            FROM  fnd_documents_long_text
          ;
          
        v_media_id              fnd_documents_long_text.media_id%TYPE;
        v_long_text             fnd_documents_long_text.long_text%TYPE;
        
        l_test                  varchar2(2000);
      BEGIN
        dbms_output.put_line('START');
        
        IF (c1%ISOPEN)
        THEN
          CLOSE c1;
        END IF;
        
        OPEN c1;
        LOOP
          FETCH c1 INTO v_media_id, v_long_text;
          EXIT WHEN c1%NOTFOUND;
          
            l_test := REGEXP_REPLACE(v_long_text,'[\x80-\xFF]','');
            
            IF (l_test != ' ')
            THEN
              dbms_output.put_line('Media: ' || v_media_id || ', Text: ' || v_long_text);
            END IF;
            
        END LOOP;
        
        CLOSE c1;
        
        dbms_output.put_line('END');
      END;
      My problem is that the list I get back contains row where æøå üä etc. exists. How do I get the list so I only contains rows where screwed up characters like ϱ exists?


      Thank you in advance.

      /Kenneth
        • 1. Re: Get rows not containing UTF-8 characters.
          922753
          Find out the ascii value of the characters you are looking for and check for those characters using instr and chr function
          instr(<column_name>, chr(26)) > 0

          HTH,
          ~Yogesh
          • 2. Re: Get rows not containing UTF-8 characters.
            581144
            Hi,

            Thank you for your reply. At the moment I cannot get to the legacy system to see the original character so If I could do Im looking for a solution where I can get a list of the characters not part of the UTF-8 character set.

            /Kenneth
            • 3. Re: Get rows not containing UTF-8 characters.
              Etbin
              Take a look at [url http://docs.oracle.com/cd/E11882_01/server.112/e26088/functions034.htm#SQLRF00620]convert function.
              You might use it like
              where your_string != convert(your_string,'UTF8') /* not sure the UTF8 is correct */
              Regards

              Etbin
              • 4. Re: Get rows not containing UTF-8 characters.
                581144
                Hi,

                The convert function is actually really nice. But I cannot use it to solve my problem.

                I do not want to convert characters. I want to get a "list" of rows with non-utf8 characters in it, so the business can decide the next action.

                /Kenneth
                • 5. Re: Get rows not containing UTF-8 characters.
                  Etbin
                  I had in mind something like
                  select original_text,
                         convert(original_text,'UTF8') original_converted,
                         nullif(original_text,convert(original_text,'UTF8')) original_check 
                   from (select 'S' || convert('TRING','UTF8') original_text from dual union all
                         select 'Š' || convert('TRING','UTF8') original_text from dual
                        ) /* invalid utf8 character simulation */
                  Regards

                  Etbin

                  Edited by: Etbin on 13.3.2012 10:07
                  my database character set is EE8MSWIN1250
                  • 6. Re: Get rows not containing UTF-8 characters.
                    581144
                    Hi Etbin,

                    thank you for your example. I thought you ment something like that, so I have already tried comparing the original value with the converted one.

                    I don't understand your example fully.

                    If the original string differes from the converted one, the original will be display in the ORIGINAL_CHECK column... check.

                    When I run the query I get the following result:

                    STRING     STRING     
                    ŠTRING     ŠTRING     

                    Which means that Š is a UTF8 chacter.?.

                    I have tried with several different UTF16 characters instead (水) but with the same result... :/

                    /Kenneth
                    • 7. Re: Get rows not containing UTF-8 characters.
                      Etbin
                      When I run the query I get the following result:

                      STRING STRING
                      ŠTRING ŠTRING

                      Which means that Š is a UTF8 chacter.?.
                      That's because your database character is UTF8 (AL16UTF16 or AL32UTF8)
                      I get
                      ORIGINAL_TEXT ORIGINAL_CONVERTED             ORIGINAL_CHECK
                      ------------- ------------------------------ --------------
                      STRING        STRING                                       
                      ŠTRING        Ĺ TRING                        ŠTRING        
                      
                      2 rows selected.
                      Maybe http://docs.oracle.com/cd/E11882_01/server.112/e10729/ch12scanner.htm might help.

                      Regards

                      Etbin
                      • 8. Re: Get rows not containing UTF-8 characters.
                        chris227
                        Kenneth_ wrote:

                        Which means that Š is a UTF8 chacter.?.
                        I have tried with several different UTF16 characters instead (水) but with the same result... :/
                        Kenneth_ wrote:I do not want to convert characters. I want to get a "list" of rows with non-utf8 characters in it, so the business can decide the next action.
                        imho this is not a problem of "non-UTF" character. Greek letters as you mentioned in your first post are "part" of it, as Š and (水) are.

                        The difference betweeen utf-8 and utf-16 is not the letters representable but the bytes they use for the representation.

                        Your problem seems more to be a configuration problem of xml-publisher (or the xml-Document). There are some threads on this under the business suite forum.

                        regards chris
                        • 9. Re: Get rows not containing UTF-8 characters.
                          581144
                          Hi Chirs and Etbin,

                          I will look closer into the possibilities that you both mention and get back to you.

                          Thanks,

                          /Kenneth