There is no fully reliable way to detect the character set of a file. You can check how well the task is performed by the lcsscan utility, which you will find in your database Oracle Home. This utility uses statistical analysis to guess the character set and language of text but results are not always good. See documentation here: http://docs.oracle.com/cd/E16655_01/server.121/e17750/ch11charsetmig.htm#NLSPG982
My recommendation is to imply the character set of the file based on its source. You should have a protocol (a contract) with providers of those files to get the files in prescribed character sets. Then you can use lcsscan as a quality verification tool to warn you about possible violations of the protocol.
Thank you Sergiusz : the lcsscan tools gives quite good results and it may be a solution for us.
The only issue is that lcsscan is only available on last oracle DB releases (10 and 11 ).
Although our database server is on release 11G, the EBS applications server is on oracle 8i due to Oracle Forms restrictions.
As the SQLLOAD is run from the applications server, I have to run the lcsscan tool by SSH on the DB server
So I let the question open a few days again in case someone knows a Linux command or tool that does the same control as lcsscan
Note: We are also considering the other solution you mention eg to assign to each sending system an agreed characterset .
Thank you again and best regards
If you can NFS mount the database binaries on the Apps server, you can use the 11g version of sql*loader. See these MOS Docs
You should double-check it with your account manager (sales person) from Oracle but I am pretty convinced you can install an additional Oracle Client matching your database server version (or even a newer version, if you have a Support contract) on the application server host. Then, set (and export) ORACLE_HOME to the path of the new Oracle Client home in a shell script and run lcsscan. An Oracle Client installation is not small but you have a GB or two spare, don't you? ;-)
As I said, I strongly recommend a protocol + lcsscan. Neither of the two methods alone is foolproof, though a protocol is usually more reliable, if file generation is done automatically at source.
Our issue is now solved by applying following logic:
If character set returned by Linux command file -i is utf-8
then the file is uploaded with oracle charset UTF8
else if the file is sent by our Polish company (very few files easy to recognize by the filename prefix)
then the file is uploaded with oracle charset EE8MSWIN1250
the file is uploaded with oracle charset WE8MSWIN1252
So now we use WE8MSWIN1252 as default characterset rather than WE8ISO8859P1.
We have much less issues because, af far as I understand, WE8MSWIN1252 is a superset of WE8ISO8859P1.
Thank you Sergiusz and Srini for your help.
Yes, WE8MSWIN1252 is a binary superset of WE8ISO8859P1. Whatever is a valid byte sequence in WE8ISO8859P1 is also a valid byte sequence in WE8MSWIN1252 with the same meaning.