3 Replies Latest reply: Sep 19, 2013 6:25 AM by ptoye

    How to detect character file encoding?

    ptoye

      I'm new to the nio classes, and had sort of assumed that as they're Unicode-oriented they'd automatically look at the magic numbers at the beginning of text files and do any conversions necessary.

       

      But I see that I have to tell Files.newBufferedReader what the encoding is. How (without using a byte editor) am I expected to know? Agreed, quite a lot of applications don't insert the magic numbers, but if even Notepad can do it, surely Java can read it? Or am I missing a class or method somewhere?

        • 1. Re: How to detect character file encoding?
          sabre150

          ptoye wrote: Agreed, quite a lot of applications don't insert the magic numbers

          A gross understatement. Very, very few applications prepend the BOM character encoding magic numbers to text files.

          ptoye wrote: but if even Notepad can do it, surely Java can read it?

          Notepad is not a good advertisement for anything, but I agree that I would have expected some option on InputStreamReader and OutputStreamWriter to handle the BOM magic numbers. Several years ago I posted on the Sun forums a much plagiarised class to strip the BOM prefix. Since then I have created but not published Reader and Writer classes that handle the BOM. I have not published them because I'm not happy with my approach, but at the moment I'm not actively working with Java so it's a case of 'out of sight, out of mind'. This thread may push me into working on them again.

          • 2. Re: How to detect character file encoding?
            ptoye

            Take your point about Notepad. It's not important now, as I've got round it. I was just a bit surprised that there wasn't already something in the nio package which did the work for me.

             

            There also seem to be quite a few encodings missing from java.nio.charset.StandardCharsets - the most important that I can think of being the various 8-bit extensions to ASCII. OK, they're not official standards, but are used by a lot of programs.
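
For what it's worth, StandardCharsets only lists the charsets every Java platform is required to support; others can usually be looked up by name with Charset.forName. A minimal sketch, assuming the name "windows-1252" as an example of an 8-bit ASCII extension (whether it is present depends on the JVM, hence the fallback; the class and method names here are made up for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetLookup {
    /**
     * Returns the named charset if this JVM supports it, otherwise the fallback.
     * Hypothetical helper, not part of the JDK.
     */
    static Charset lookup(String name, Charset fallback) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalArgumentException e) {
            // Thrown for a null or syntactically illegal charset name.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // "windows-1252" is an assumed example name; it is not guaranteed on every JVM.
        Charset cs = lookup("windows-1252", StandardCharsets.ISO_8859_1);
        System.out.println("Using charset: " + cs.name());
    }
}
```

The returned Charset can then be passed to Files.newBufferedReader(path, charset) in the usual way.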

            • 3. Re: How to detect character file encoding?
              ptoye

              OK, if not Notepad, then look at the .NET API.

               

              StreamReader Constructor (String, Encoding, Boolean) (System.IO) is a good place to start.

              http://msdn.microsoft.com/en-us/library/system.text.encoding.getpreamble.aspx gives a fair rundown of the pros and cons.

               

              What's needed is something like the first of these, but I can see how it's difficult to do this with the current BufferedReader class as returned by newBufferedReader. Certainly one imperative is that the charset used for any translation is available to the user, which isn't currently the case.
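
To make the idea concrete, here is a rough sketch of the kind of thing meant: peek at the first few bytes, pick a charset from the BOM if there is one, and hand back both the reader and the charset that was chosen. This is not an existing JDK API; BomSniffer, Result and open are names invented for the example, only the UTF-8 and UTF-16 BOMs are recognised (UTF-32 is ignored, so a UTF-32LE file would be misread as UTF-16LE), and a file without a BOM simply gets the caller's fallback charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {
    /** Hypothetical result type: the reader plus the charset that was chosen. */
    static final class Result {
        final BufferedReader reader;
        final Charset charset;
        final boolean hadBom;

        Result(BufferedReader reader, Charset charset, boolean hadBom) {
            this.reader = reader;
            this.charset = charset;
            this.hadBom = hadBom;
        }
    }

    /** Opens a reader over raw, using the BOM if present, else the fallback charset. */
    static Result open(InputStream raw, Charset fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {            // read up to 3 bytes, tolerating short reads
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }

        Charset charset = fallback;
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }

        // Push back everything that was read beyond the BOM so the decoder sees it.
        in.unread(head, bomLength, n - bomLength);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        return new Result(reader, charset, bomLength > 0);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        Result r = open(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1);
        System.out.println(r.charset + " hadBom=" + r.hadBom + " line=" + r.reader.readLine());
    }
}
```

For a real file the stream would come from Files.newInputStream(path) rather than a byte array. The point is that the caller gets the charset back, which newBufferedReader has no way of reporting.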

               

              Also, similar changes are needed for output streams to add the preambles.
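
On the output side, the workaround for now seems to be writing the preamble by hand before wrapping the stream. A minimal sketch for UTF-8 only, assuming a made-up helper called utf8WithBom (nothing like it exists in the JDK as far as I know):

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

class BomWriter {
    /** Writes the UTF-8 BOM (EF BB BF) to out, then returns a UTF-8 writer over it. */
    static BufferedWriter utf8WithBom(OutputStream out) throws IOException {
        out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
        return new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (BufferedWriter w = utf8WithBom(sink)) {
            w.write("hi");
        }
        // Dump the bytes in hex so the preamble is visible.
        StringBuilder hex = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
    }
}
```

In practice the stream would come from Files.newOutputStream(path). Note that the plain "UTF-16" charset already writes a big-endian BOM when encoding, so it is mainly UTF-8 where this has to be done by hand.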

               

              I've submitted this as an RFE, so we'll see what happens.

               

              Message was edited by: ptoye - added comment about RFE.

               

              [much later] I submitted the RFE and was given a bug number of 9002718, but it doesn't seem to be in the database. Has anyone here any idea how this happens?

               

              Message was edited by: ptoye about non-appearance of submitted bug.