This content has been marked as final. Show 9 replies
I don't have a straight answer for you, but I have had problems with the NULL (0x0) character before. This character is not valid in XML, but it can legitimately form the initial byte of certain characters when encoded in Unicode. I think XML Spy automatically saves documents as UTF-8 - if you are then importing this document into a database which does not have UTF-8 as the NLS_CHARACTERSET, you may be hitting conversion problems.
Thank you for the quick response. You are right, XML Spy is set to UTF-8, and Oracles NLS_CHARACTERSET is set to "WE8MSWIN1252". I am not too strong on character sets, so do you have any suggestions on the best way for me to test/solve this issue? I would guess the quickest would likely be to switch Oracle to UTF-8, but I have no idea how to do that. I am not a DBA, but I am responsible for the Development Database. To change the Character set, do I simply change the value of NLS_CHARACTERSET in the init.ora file for my database server?
There is a 3rd layer here, too, that could have a role. I used XML Spy to do the transformation and validate the document, but I used TOAD to put the schema into the table Oracle reads it from. Do you think I will need to look at the character set of Toad instead of XML Spy?
Ok, I did some research and found that switching Oracles Character Set is not the best idea. So, is there a way I can convert the schema I have to WE8MSWIN1252 so I can see what it is complaining about? The schema is over 550,000 characters, so there are issues arrising when I try to run queries and functions on the document in SQL.
Also, I checked XML Spy to see if it had WE8MSWIN1252 as a character set option and I didnt see it. Is this character set by chance known by another name that XML Spy might know? If I could do the conversion in XML Spy, I may be able to resolve the issue.
Ok, I have done some more research and it really looks like this is the issue. I found some documentation about CSSCAN and ran that on the table, first going from UTF-8 to WE8MSWIN1252, and it returned with "All character type application data remain the same in the new character set". I then switched it going from WE8MSWIN1252 to UTF-8, and it said there would be conversion. Now I am guessing that the CSSCAN doesnt factor in that UTF-8 may have several a single character that translates into several characters in WE8MSWIN1252.
Now, I guess my question would be, how do I get around this? I will need to convert the document I have in UTF-8 to WE8MSWIN1252 before I put it in the database. My co-worker suggested saving the document in Notepad, then just taking it from there to Oracle, but that will read the document the same way Oracle does and give me the invalid character still, right? Is there another name for WE8MSWIN1252 or a compatable character set that I could set XML Spy up with and validate the schema before I take it into Oracle? I could not find WE8MSWIN1252 with in the character set options with XML Spy.
There is a client character set and a database character set. Characters are converted between the client and the database character set and the charcter should exist in both client and database character set.
If there is no charcter you will see a default character ? to replace it.
WE8MS1252 is a subset of UTF8, therefore if you have a character that can't be converted you will receive the ? character.
Have you tried to set your sqlplus session NLS_LANG=America_American.WE8MS1252 to see if your client will run the conversion prior to attempting to place in the Oracle database?
Unicode support in Oracle is the following character sets
Hmmm, so every character that does not convert correctly becomes a "?" character? Would a "?" out of place in an XML Schema throw: LPX-00216: invalid character 0 (0x0)? If that is the case, its starting to sound like my error is not related to conversion.
I just did a search in my CLOB for "?" and the same number for the CLOB returned as there was in XML Spy. So, if "?" is what happens everytime it doesnt find a character, then its not a conversion problem? Do you have any ideas what else it could be? I also heard that sometimes it will convert it to an upside-down question mark. Is it possible it did that? What is the char(xxx) representation of that so I can do a search of my document for it?
Err, I confused myself. I still think its the conversion issue, though I dont think its missing a match. Heres an explaination of what I think it is.
Lets just say that in UTF-8, the character "A" is represented by: D3 H7. Now, in WE8MSWIN1252 lets say "A" is represented simply by D3, and H7 actually represents "&". When my document comes into Oracle, it is able to match every character set to something in its own character set, however, they arent matched up correctly. So, what I need to do is get an XML Schema validator that is able to debug a document and use Oracle's character set, so that it will hit that "&" throw an error, and show me where the error is for me to fix. I would do it by hand, but all of my client tools convert the document back to UTF-8 before displaying it to me, and the document is very large, so scanning the thing manually is not likely to produce results.
Does that sound plausable or do you think I am barking up the wrong tree?
Is it feasible for you to start with a smaller schema that
is valid, and then successively add elements until you
find the cause of the error? I realize that this approach
is time-consuming and tedious, but you might get lucky:)
That is an interesting idea. It might be hard for me to do because the schema was based on refs, so no element went more than 1 or 2 levels and simply referenced other elements that referenced other elements, etc. So, I will run into issues where my document is not valid simply because I havent added all the needed elements in. It may be worth looking into, though.
The backup plan I have been considering is writting a block that will loop through each character in the database and check it against a list of know valid characters. When I hit a character that is not valid, I would output the character, the location and a chunk of the document around the character so I can search on it later.