1 2 3 Previous Next 41 Replies Latest reply on Nov 17, 2017 2:19 PM by Oracle Maniac Go to original post
      • 15. Re: Junk characters in XML files
        Oracle Maniac

        It has to be all automated, I cannot open the file

        • 16. Re: Junk characters in XML files
          Oracle Maniac

          Hi Paul,

           

           

          The requirement is below:

           

           

          1) I get XML files generated out of a Tool, this xml file has a reference to DTD, which make it difficult for me to load into DB using SQLLDR.

          I dont have an option to use Oracle server directory. The only option is to use the windows box where I get the XML files.

           

           

          2) I use Powershell script to Trim first two lines so that it can be consumed by the SQLLDR

           

           

          <?xml version="1.0" encoding="utf-8"?>

          <!DOCTYPE ichicsr SYSTEM "D:\Program Files (x86)\Oracle\DTDFiles\ABC.dtd"[]>

          <ichicsr lang="en">

           

           

           

           

          3) One the XML is loaded into oracle table, I use below query:

           

           

           

           

          with XML_TAB as 

              ( 

                select XMLDATA xml_data,sno,loaddate,filename  from table_xml-- where sno=539 

              ) 

          select distinct 

          x.sno, 

          x.loaddate processdate, 

          substr(x.filename, - instr(reverse(x.filename), '\') + 1)xml_file, 

          xtt.safetyreportid, 

          xt.medicinalproduct, 

          decode(xt.drugc,1,'Suspect')drugtype, 

          xt.formulation, 

          xt.activesubstance1 ingredient1, 

          xt.activesubstance2 ingredient2, 

          xt.activesubstance3 ingredient3, 

          xt.activesubstance4 ingredient4, 

          xt.activesubstance5 ingredient5, 

          nvl(xtt.occur,xtt.psource) country, 

          decode(xtt.serious,1,'Yes',2,'No')serious 

          from  xml_tab x 

                 join XMLTABLE('ichicsr/safetyreport' 

                                   PASSING x.xml_data 

                                   columns 

                                   occur    varchar2(10)   PATH 'occurcountry', 

                                   psource  varchar2(100)  path 'primarysourcecountry', 

                                   serious  varchar2(100)  path 'serious', 

                                   safetyreportid varchar2(100)  path 'safetyreportid', 

                                   drugs xmltype path 'patient' 

                                   )xtt on 1=1 

                  join xmltable('/patient/drug' 

                               PASSING xtt.drugs 

                               columns 

                               medicinalproduct      varchar2(100) path 'medicinalproduct', 

                               drugc                 varchar2(100) path 'drugcharacterization', 

                               formulation           varchar2(100) path 'drugdosageform', 

                               activesubstance1      varchar2(100) path 'activesubstance[1]/activesubstancename', 

                               activesubstance2      varchar2(100) path 'activesubstance[2]/activesubstancename', 

                               activesubstance3      varchar2(100) path 'activesubstance[3]/activesubstancename', 

                               activesubstance4      varchar2(100) path 'activesubstance[4]/activesubstancename', 

                               activesubstance5      varchar2(100) path 'activesubstance[5]/activesubstancename' 

                               ) xt on 1=1 

          where xt.drugc=1;

           

           

           

           

           

           

           

           

          Table Desc

          SNO      NOT NULL NUMBER        

          FILENAME          VARCHAR2(4000)

          LOADDATE          DATE          

          XMLDATA           XMLTYPE()     

           

           

           

           

          Control File:

          LOAD DATA 

          INFILE 'filelist.dat' 

          APPEND INTO TABLE table_xml 

             filename CHAR(1000), 

             xmldata  LOBFILE(filename) TERMINATED BY EOF, 

             sno      CHAR(10) "TABLE_XML_SEQ.NEXTVAL", 

             loaddate SYSDATE 

          • 17. Re: Junk characters in XML files
            Oracle Maniac

            Hi Paul,

             

             

            The requirement is below:

             

             

            1) I get XML files generated out of a Tool, this xml file has a reference to DTD, which make it difficult for me to load into DB using SQLLDR.

            I dont have an option to use Oracle server directory. The only option is to use the windows box where I get the XML files.

             

             

            2) I use Powershell script to Trim first two lines so that it can be consumed by the SQLLDR

             

             

            <?xml version="1.0" encoding="utf-8"?>

            <!DOCTYPE ichicsr SYSTEM "D:\Program Files (x86)\Oracle\DTDFiles\ABC.dtd"[]>

            <ichicsr lang="en">

             

             

             

             

            3) One the XML is loaded into oracle table, I use below query:

             

             

             

             

            with XML_TAB as 

                ( 

                  select XMLDATA xml_data,sno,loaddate,filename  from table_xml-- where sno=539 

                ) 

            select distinct 

            x.sno, 

            x.loaddate processdate, 

            substr(x.filename, - instr(reverse(x.filename), '\') + 1)xml_file, 

            xtt.safetyreportid, 

            xt.medicinalproduct, 

            decode(xt.drugc,1,'Suspect')drugtype, 

            xt.formulation, 

            xt.activesubstance1 ingredient1, 

            xt.activesubstance2 ingredient2, 

            xt.activesubstance3 ingredient3, 

            xt.activesubstance4 ingredient4, 

            xt.activesubstance5 ingredient5, 

            nvl(xtt.occur,xtt.psource) country, 

            decode(xtt.serious,1,'Yes',2,'No')serious 

            from  xml_tab x 

                   join XMLTABLE('ichicsr/safetyreport' 

                                     PASSING x.xml_data 

                                     columns 

                                     occur    varchar2(10)   PATH 'occurcountry', 

                                     psource  varchar2(100)  path 'primarysourcecountry', 

                                     serious  varchar2(100)  path 'serious', 

                                     safetyreportid varchar2(100)  path 'safetyreportid', 

                                     drugs xmltype path 'patient' 

                                     )xtt on 1=1 

                    join xmltable('/patient/drug' 

                                 PASSING xtt.drugs 

                                 columns 

                                 medicinalproduct      varchar2(100) path 'medicinalproduct', 

                                 drugc                 varchar2(100) path 'drugcharacterization', 

                                 formulation           varchar2(100) path 'drugdosageform', 

                                 activesubstance1      varchar2(100) path 'activesubstance[1]/activesubstancename', 

                                 activesubstance2      varchar2(100) path 'activesubstance[2]/activesubstancename', 

                                 activesubstance3      varchar2(100) path 'activesubstance[3]/activesubstancename', 

                                 activesubstance4      varchar2(100) path 'activesubstance[4]/activesubstancename', 

                                 activesubstance5      varchar2(100) path 'activesubstance[5]/activesubstancename' 

                                 ) xt on 1=1 

            where xt.drugc=1;

             

             

             

             

             

             

             

             

            Table Desc

            SNO      NOT NULL NUMBER        

            FILENAME          VARCHAR2(4000)

            LOADDATE          DATE          

            XMLDATA           XMLTYPE()     

             

             

             

             

            Control File:

            LOAD DATA 

            INFILE 'filelist.dat' 

            APPEND INTO TABLE table_xml 

               filename CHAR(1000), 

               xmldata  LOBFILE(filename) TERMINATED BY EOF, 

               sno      CHAR(10) "TABLE_XML_SEQ.NEXTVAL", 

               loaddate SYSDATE 

            • 18. Re: Junk characters in XML files
              BluShadow

              So, by removing:

               

              <?xml version="1.0" encoding="utf-8"?>

               

              from your XML file, you remove the information about the character encoding of the file, as people previously expected you were doing.

               

              The problem is that YOU are breaking the XML file.  The solution is to not do that.

              Sure, if you want to remove the unnecessary DOCTYPE line, do that, but ensure that the file is saved again with the proper character encoding.

              • 19. Re: Junk characters in XML files
                Oracle Maniac

                Hi BluShadow

                 

                 

                I am not removing those lines.  I have now followed what Billy suggested me

                 

                 

                1) Load XMLs in to Clob

                2) Replace Junks in your query

                3) Typecast into Xmltype and use

                 

                 

                select XMLType(replace((replace(replace(replace(replace(XMLDATA,'�'),'ï'),'»'),'¿')),

                '<!DOCTYPE ichicsr SYSTEM "D:\Program Files (x86)\Oracle\ESMService\DTDFiles\ABC.dtd"[]>')) xml_data,sno,loaddate,filename  from TABLE_XML

                 

                 

                 

                 

                 

                 

                However, everytime I process a new file, I get a new junk character which I am using in Replace.

                Any robust fix for this?

                 

                 

                Also, we have a Tool which can process all these files without any error. Seems that tool has way to handle

                all such junk characters (apparently that tool is built using C or Java)

                 

                 

                 

                 

                 

                 

                The latest error i receive is below:

                 

                 

                ORA-31011: XML parsing failed

                ORA-19202: Error occurred in XML processing

                LPX-00210: expected '<' instead of '�'

                Error at line 1

                ORA-06512: at "SYS.XMLTYPE", line 272

                ORA-06512: at line 1

                31011. 00000 -  "XML parsing failed"

                *Cause:    XML parser returned an error while trying to parse the document.

                *Action:   Check if the document to be parsed is valid.

                 

                 

                 

                 

                This latest junk character '�' - is diffent to what it appears here.

                The ASCII value is 15712189.

                • 20. Re: Junk characters in XML files
                  Oracle Maniac

                  Hi

                   

                  I have now followed what suggested me

                   

                   

                  1) Load XMLs in to Clob

                  2) Replace Junks in your query

                  3) Typecast into Xmltype and use

                   

                   

                  select XMLType(replace((replace(replace(replace(replace(XMLDATA,'�'),'ï'),'»'),'¿')),

                  '<!DOCTYPE ichicsr SYSTEM "D:\Program Files (x86)\Oracle\ESMService\DTDFiles\ABC.dtd"[]>')) xml_data,sno,loaddate,filename  from TABLE_XML

                   

                   

                   

                   

                   

                   

                  However, everytime I process a new file, I get a new junk character which I am using in Replace.

                  Any robust fix for this?

                   

                   

                  Also, we have a Tool which can process all these files without any error. Seems that tool has way to handle

                  all such junk characters (apparently that tool is built using C or Java)

                   

                   

                   

                   

                   

                   

                  The latest error i receive is below:

                   

                   

                  ORA-31011: XML parsing failed

                  ORA-19202: Error occurred in XML processing

                  LPX-00210: expected '<' instead of '�'

                  Error at line 1

                  ORA-06512: at "SYS.XMLTYPE", line 272

                  ORA-06512: at line 1

                  31011. 00000 -  "XML parsing failed"

                  *Cause:    XML parser returned an error while trying to parse the document.

                  *Action:   Check if the document to be parsed is valid.

                   

                   

                   

                   

                  This latest junk character '�' - is diffent to what it appears here.

                  The ASCII value is 15712189.

                  • 21. Re: Junk characters in XML files
                    AndrewSayer

                    Then your tool is more aware of what the character encoding should be, or it made a lucky guess.

                    What is the character set of the DB, can it even handle utf-8? Is the file really encoded with some UTF8 characterset, or is that just what the characters say when you open it in notepad?

                    • 22. Re: Junk characters in XML files
                      Oracle Maniac

                      I have now followed what Billy suggested me

                       

                       

                      1) Load XMLs in to Clob

                      2) Replace Junks in your query

                      3) Typecast into Xmltype and use

                       

                       

                      select XMLType(replace((replace(replace(replace(replace(XMLDATA,'�'),'ï'),'»'),'¿')),

                      '<!DOCTYPE ichicsr SYSTEM "D:\Program Files (x86)\Oracle\ESMService\DTDFiles\ABC.dtd"[]>')) xml_data,sno,loaddate,filename  from TABLE_XML

                       

                       

                       

                       

                       

                       

                      However, everytime I process a new file, I get a new junk character which I am using in Replace.

                      Any robust fix for this?

                       

                       

                      Also, we have a Tool which can process all these files without any error. Seems that tool has way to handle

                      all such junk characters (apparently that tool is built using C or Java)

                       

                       

                       

                       

                       

                       

                      The latest error i receive is below:

                       

                       

                      ORA-31011: XML parsing failed

                      ORA-19202: Error occurred in XML processing

                      LPX-00210: expected '<' instead of '�'

                      Error at line 1

                      ORA-06512: at "SYS.XMLTYPE", line 272

                      ORA-06512: at line 1

                      31011. 00000 -  "XML parsing failed"

                      *Cause:    XML parser returned an error while trying to parse the document.

                      *Action:   Check if the document to be parsed is valid.

                       

                       

                       

                       

                      This latest junk character '�' - is diffent to what it appears here.

                      The ASCII value is 15712189.

                      • 23. Re: Junk characters in XML files
                        Oracle Maniac

                        I have byepassed Powershell now

                         

                         

                        I have now followed what Billy suggested me

                         

                         

                        1) Load XMLs in to Clob

                        2) Replace Junks in your query

                        3) Typecast into Xmltype and use

                         

                         

                        select XMLType(replace((replace(replace(replace(replace(XMLDATA,'�'),'ï'),'»'),'¿')),

                        '<!DOCTYPE ichicsr SYSTEM "D:\Program Files (x86)\Oracle\ESMService\DTDFiles\ABC.dtd"[]>')) xml_data,sno,loaddate,filename  from TABLE_XML

                         

                         

                         

                         

                         

                         

                        However, everytime I process a new file, I get a new junk character which I am using in Replace.

                        Any robust fix for this?

                         

                         

                        Also, we have a Tool which can process all these files without any error. Seems that tool has way to handle

                        all such junk characters (apparently that tool is built using C or Java)

                         

                         

                         

                         

                         

                         

                        The latest error i receive is below:

                         

                         

                        ORA-31011: XML parsing failed

                        ORA-19202: Error occurred in XML processing

                        LPX-00210: expected '<' instead of '�'

                        Error at line 1

                        ORA-06512: at "SYS.XMLTYPE", line 272

                        ORA-06512: at line 1

                        31011. 00000 -  "XML parsing failed"

                        *Cause:    XML parser returned an error while trying to parse the document.

                        *Action:   Check if the document to be parsed is valid.

                         

                         

                         

                         

                        This latest junk character '�' - is diffent to what it appears here.

                        The ASCII value is 15712189.

                        • 24. Re: Junk characters in XML files
                          Paulzip

                          They are more than likely NOT junk characters, just UTF 8 characters that cannot or have not been encoded to your DB character set correctly.

                           

                          You are using hacky solutions to work around problems caused by other hacky approaches.

                           

                          What are your DB character set and National Character sets?

                          • 25. Re: Junk characters in XML files
                            Gaz in Oz

                            Some of those characters "�'),'ï'),'»'),'¿'" look like the UNICODE BOM watermark. This is to do with how th file was createdd/saved. https://en.wikipedia.org/wiki/Byte_order_mark

                            • 26. Re: Junk characters in XML files
                              BluShadow

                              You don't have to reply to each and every person giving the same answer.  We can all see the other responses.  This isn't an email system where you're responding to each person individually.

                               

                              As PaulZip says, by replacing those characters you think are "junky" is breaking it even further.

                              Those characters are likely part of the UTF-8 characterset that your XML is made from, and the tool you're using to look at the data is not showing the character properly because you're not doing it through a UTF-8 supported interface (or simply because you've already broken the XML by trying to "fix" it, so everything is now looking junky).

                               

                              Take a step back, and only remove the "DOCTYPE" part of the document, leave the rest of it alone.

                              The fact that other commercial tools etc. can open those XML's correctly shows that they are valid XML.  All you are doing it trashing it and then complaining it's broken.

                              • 27. Re: Junk characters in XML files
                                Oracle Maniac

                                NLS_NCHAR_CHARACTERSET UTF8

                                NLS_LANGUAGE AMERICAN

                                NLS_TERRITORY AMERICA

                                NLS_CURRENCY $

                                NLS_ISO_CURRENCY AMERICA

                                NLS_NUMERIC_CHARACTERS .,

                                NLS_CHARACTERSET AL32UTF8

                                NLS_CALENDAR GREGORIAN

                                NLS_DATE_FORMAT DD-MON-RR

                                NLS_DATE_LANGUAGE AMERICAN

                                NLS_SORT BINARY

                                NLS_TIME_FORMAT AM

                                NLS_TIMESTAMP_FORMAT DD-MON-RR HH.MI.SSXFF AM

                                NLS_TIME_TZ_FORMAT TZR

                                NLS_TIMESTAMP_TZ_FORMAT DD-MON-RR HH.MI.SSXFF AM TZR

                                NLS_DUAL_CURRENCY $

                                NLS_COMP BINARY

                                NLS_LENGTH_SEMANTICS CHAR

                                NLS_NCHAR_CONV_EXCP FALSE

                                NLS_RDBMS_VERSION 11.2.0.2.0

                                • 28. Re: Junk characters in XML files
                                  Paulzip

                                  I wouldn't choose to use SQLLdr for what you're doing, as I don't know how it handles XML / character sets etc. I suspect it doesn't correctly, so first things first, try creating your table with XMLData as BLOB column. BLOBs are binary data so won't screw up the character set,  Use Sqldr to load into that.

                                   

                                  Then  do something like this (untested)...

                                   

                                  create or replace function convert_to_clob(vblob blob, pCharsetID number default null) return clob is

                                    vclob         clob;

                                    vdest_offset  number := 1;

                                    vsrc_offset   number := 1;

                                    vlang_context number := dbms_lob.default_lang_ctx;

                                    vwarning      number;

                                  begin

                                    dbms_lob.createtemporary(vclob, true);

                                    dbms_lob.converttoclob(dest_lob     => vclob,

                                                           src_blob     => vblob,

                                                           amount       => dbms_lob.lobmaxsize,

                                                           dest_offset  => vdest_offset,

                                                           src_offset   => vsrc_offset,

                                                           blob_csid    => pCharsetID,

                                                           lang_context => vlang_context,

                                                           warning      => vwarning);

                                    return vclob;

                                  end convert_to_clob;

                                   

                                  Creaet your XML instance without the DTD like this...

                                  XMLType(regexp_replace(convert_to_clob(XMLData, nls_charset_id('AL32UTF8'), '<!DOCTYPE[^<]*>', '', 1, 1, 'i')))

                                  1 person found this helpful
                                  • 29. Re: Junk characters in XML files
                                    cormaco

                                    Hi Paulzip,

                                     

                                    it's not necessary to fall back to BLOB, SQL*LOADER can handle CLOB and XMLTYPE in any characterset quite well, if the characterset is specified correctly in the control file:

                                     

                                    Description of GUID-77C67B34-D947-4D78-8ED4-EFA280347861-print.eps follows