5 Replies Latest reply: Jun 11, 2014 5:06 AM by magicliver RSS

    Insert bfilename file.xml changes encoding entity to windows-1252

    magicliver

      Hi, I am running Oracle 11.2.0.1 and I am trying to insert xml from a .xml file in a windows directory into a table new_xmldata with a single xmltype field. The database characterset is set to AL32UTF8. The xml encoding is 'UTF-8' and this is in the header entity. I have checked the encoding by opening the file in XMLSpy and also if I open the file and 'save as', again the encoding is listed as UTF-8. Then I run the code

       

      insert into new_xmldata values (XMLTYPE(bfilename('IMPORT_DIRECTORY', 'file.xml'), nls_charset_id ('AL32UTF8')));
      
      

       

      Now when I select the data in the record the XML encoding entity declares the encoding as WINDOWS-1252.

       

      Why is the encoding changing and what can I do to prevent this? (Sorry I would like to provide the data as an example but I work on a secure site and cannot upload this?)

      Many thanks in advance to any that can help.

        • 1. Re: Insert bfilename file.xml changes encoding entity to windows-1252
          magicliver

          So I still have this issue, I have recreated the problem with some test data.

          So the following xml is saved as a UTF-8 text file - test.xml in a directory xml_transaction_dir

           

           

          <?xml version="1.0" encoding="UTF-8" ?>
          <record>
            <title>test</title>
          </record>
          

           

          I ingest the record in to table new_xmldata:


          INSERT INTO new_xmldata VALUES (xmltype(bfilename('XML_TRANSACTION_DIR', 'test.xml'), nls_charset_id ('AL32UTF8'));
          commit;
          

           

          select * from new_xmldata;
          

           

          XMLDATA

          --------

          <?xml version="1.0" encoding="WINDOWS-1252"?>

          <record>

            <title>test</title>

          </record>


          so i try the convert function:

           

          INSERT INTO new_xmldata VALUES (convert(xmltype(bfilename('XML_TRANSACTION_DIR', 'test.xml'), nls_charset_id ('AL32UTF8')), 'AL32UTF8'));
          commit;
          

           

          select * from new_xmldata;
          

           

           

          XMLDATA

          -----------

          <?xml version="1.0" encoding="WINDOWS-1252"?>

          <record>

            <title>test</title>

          </record>

           

          Now if I use the convert function in a select statement:

           

          select convert(xmldata, 'AL32UTF8') from new_xmldata;

           

          CONVERT(XMLDATA,'AL32UTF8')

          ----------------------------

          <?xml version="1.0" encoding="UTF-8"?>

          <record>

            <title>test</title>

          </record>


          This is what I want so I try an update:


          update new_xmldata set xmldata = convert(xmldata, 'AL32UTF8');
          commit;
          

           

           select * from new_xmldata;

           

          XMLDATA

          ------------

          <?xml version="1.0" encoding="WINDOWS-1252"?>

          <record>

            <title>test</title>

          </record>

           

          What am I doing wrong? I just want the encoding as UTF8? Really stumped and hope someone can offer some advice. Thanks

          • 2. Re: Insert bfilename file.xml changes encoding entity to windows-1252
            odie_63

            The encoding is adjusted to reflect that of your client environment (NLS_LANG).

            It is expected behaviour.

            See : Using Oracle XML DB

             

            You don't have to update anything, the XML is still stored in UTF-8 encoding in the database.

             

            To preserve the original encoding in the prolog regardless of your client's NLS, use an explicit serialization :

            select xmlserialize(document xmldata)

            from new_xmldata;

            • 3. Re: Insert bfilename file.xml changes encoding entity to windows-1252
              magicliver

              Thanks odie for the advice. I don't really get why the client encoding is displayed in the database in the XML. If I save a file in windows or xmlspy as UTF8 then the files encoding is still  UTF8 and not windows-1252 or ANSI. I don't get why when in the database the encoding of the client is displayed in the XML. The unformed XML output from XMLSerialize didn't help as my customer wants to see the standardized profile of XML with UTF-8 in the database and also fully-formed but your tip about the clients NLS_LANG got me to set that before logging in and now I can display the XML with the correct encoding. It would be great if you have any reply to this but if not, thanks for the help anyway.

              • 4. Re: Insert bfilename file.xml changes encoding entity to windows-1252
                odie_63

                I don't get why when in the database the encoding of the client is displayed in the XML.

                Because you're using a client tool to view the content of your database. You're not "in" the database.

                It makes sense that the XML encoding is adjusted to one that your client tool supports, or at least declared to be supported via the NLS_LANG setting.

                 

                The unformed XML output from XMLSerialize didn't help as my customer wants to see the standardized profile of XML with UTF-8 in the database and also fully-formed

                What do you mean by "unformed" and "fully-formed"?

                Are you referring to pretty-printing, with indentation and newlines?

                If so, nowhere in the XML specs it is said that a serialized representation of an XML document should appear like this.

                 

                Anyway, XMLSerialize has an INDENT option too :

                select xmlserialize(

                         document xmldata

                         as clob            -- (optional) CLOB is the default

                         indent

                       )

                from new_xmldata;

                NLS_LANG got me to set that before logging in and now I can display the XML with the correct encoding

                What's your client tool btw?

                Setting NLS_LANG to UTF-8 (AL32UTF8) makes sense if the tool actually supports UTF-8.

                Otherwise you may end up with only an "apparent" UTF-8 encoding in the prolog, and eventually corrupted characters in the content.

                 

                Read about NLS_LANG to understand the implications.

                • 5. Re: Insert bfilename file.xml changes encoding entity to windows-1252
                  magicliver

                  Thanks again for the enlightening response. I wish I had understood this earlier as I spent a lot of time puzzled by this. The way you have explained makes sense, I guess I just thought of myself as actually being 'in' the database whereas I am actually logging in to 'a client tool' as you point out. I am logging into sqlplus using windows command prompt but I am now setting NLS_LANG as I log in if I need to display the desired encoding.

                  And yes I did mean pretty-printed. I just needed to understand how I get to show my customer that the data is as they asked for and not displaying the windows encoding and formatted with the pretty printing. I can do that now and even better I understand why.

                  Thanks again for your help.