Forum Stats

  • 3,837,492 Users
  • 2,262,264 Discussions
  • 7,900,302 Comments

Discussions

how to verify the encoding of a blob file

Alibune
Alibune Member Posts: 86
edited Jun 15, 2018 6:18PM in APEX Discussions

Hi everyone!

I have a browse item on my application that get the xml file and put it into a xmltype column on my table and i'm using this process:

declare    l_xml      xmltype;    l_xml_blob apex_application_temp_files.blob_content%type; -- blob;    l_filename apex_application_temp_files.filename%type;  begin      select f.blob_content,f.filename      into l_xml_blob,l_filename      from apex_application_temp_files f      where f.name = :P201_XMLFILE;      --create xmltype from blob    l_xml := xmltype(l_xml_blob, nls_charset_id('UTF8'));--specify file encoding      --and now just insert the record    insert into xml_tab(id,filename,xml) values (xml_tab_seq.NEXTVAL,l_filename,l_xml);     end;  

but sometimes i'll get files with encoding that is not UTF-8, how can i veriffy the file encoding to define which type of code i will use?

I'm using Oracle APEX 5.1 and Oracle Database 11G XE.

Tagged:

Answers

  • Pavel_p
    Pavel_p Member Posts: 2,314 Gold Trophy
    edited Jun 15, 2018 9:20AM

    Hello,

    it's not up to your app/process (in fact, I have no idea how you could possibly determine file's encoding, in my opinion it's not possible) to determine its encoding. You need to tell the process file's encoding and then it can later handle it according to the codepage specified. Let's give another example - you have a plain text file and a viewer. The viewer may try to determine file's encoding to show special characters properly, but generally you need to tell the viewer that some special character should be presented some specified way. The only thing that is certain is that the lower part of ASCII table https://en.wikipedia.org/wiki/ASCII (e.g. characters with ASCII codes 0-127) are common for all languages, upper part is for special characters like "éířšěč" etc. And because there is only 128 such codes available (but there is way more such special characters in all the languages), we have various encodings to specify also the codepage as an additional information.

    So if you're processing files from external system(s), you need to know their encoding(s) as a part of the "contract" between your system (consumer) and the system that produces input files.

    Regards,

    Pavel

    Alibune
  • Alibune
    Alibune Member Posts: 86
    edited Jun 15, 2018 9:42AM

    Hi Pavel!

    The problem is that not always the files will come in UTF-8, i know that the right way was to define this with the client, but in this case it's not possible.

    It was you who posted this code on another topic that i've created, if i can't define the file encoding can i upload the XML independent of the encoding type?

    Because as it is if the file is on another encoding i am getting an error.

    Regards,

    Alibune

  • Mike Kutz
    Mike Kutz Member Posts: 6,195 Silver Crown
    edited Jun 15, 2018 10:09AM

    Aren't UTF files supposed to have a BOM in it?

    https://en.wikipedia.org/wiki/Byte_order_mark

    You might need a drop down of "encoding used" for UTF-8, ASCII7, etc.

    My $0.02

    MK

  • Anton Scheffer
    Anton Scheffer Member Posts: 1,950 Gold Trophy
    edited Jun 15, 2018 10:40AM

    You could check if your blob starts with something like

    <?xml version="1.0" encoding="ISO-8859-1"?>

    Use nls_charset_id( utl_i18n.map_charset( 'ISO-8859-1', utl_i18n.GENERIC_CONTEXT, utl_i18n.IANA_TO_ORACLE ) )

    to get oracle character set id

  • Pavel_p
    Pavel_p Member Posts: 2,314 Gold Trophy
    edited Jun 15, 2018 6:18PM

    Hi Alibune,

    yes, I noticed my code from the previous thread How to load a XML whole file into database. .

    it seems we got into some kind of an infinite loop where you keep saying that you need something and I keep responding that it's not possible.

    If systems should "talk" to each-other and understand each-other, then some kind of a contract is desperately needed. That's why we have things like protocols - TCP, HTTP(S) and so on... Simply something that defines the way information is exchanged between those systems. If you exchange plain text between systems (or some binary files, whatever), both systems need to know the structure, otherwise they simply do not talk to each-other. Fortunately XML structure is (or can be) strictly defined by XML schemes https://en.wikipedia.org/wiki/XML_Schema_(W3C) and also it's possible to define its encoding in XML prolog (as Anton mentioned), that should look like

    <?xml version="1.0" encoding="UTF-8"?>

    The default encoding is UTF-8 and if file's encoding is different, it should be specified. If it's not, you can just guess (using more or less sophisticated methods) what its encoding could possibly be. Please, could you post some sample (hashed) data how your xml files look like?

    Thanks,

    Pavel

This discussion has been closed.