3 Replies. Latest reply: Oct 27, 2011 4:30 AM by Dimitar Slavov-Oracle

    JAXP XSLT transformation UTF-8 issue

    Dimitar Slavov-Oracle
      Hi all,

      I have an XML file generated from a Java app. The file is valid and is encoded in UTF-8.
      I have an XSLT template created with Altova StyleVision. The template produces an RTF output file from the above XML.
      In the Java app there are two variants of the XSLT transformation: with and without Saxon. I have simplified the code; it is almost the same in the two variants.

      -----------------------------------------------------------------------
      File outputFile = new File("path");
      if (saxonToBeUsed) {
          // Variant 1: Saxon, loaded through its own class loader
          ClassLoader saxonClassLoader = SaxonLoader.getInstance(saxonPath);
          TransformerFactory transFact = TransformerFactory.newInstance("net.sf.saxon.TransformerFactoryImpl", saxonClassLoader); //$NON-NLS-1$
          Transformer trans = transFact.newTransformer(xsltSource);
          trans.setOutputProperty("encoding", "UTF-8"); //$NON-NLS-1$
          StreamResult res = new StreamResult(outputFile);
          trans.transform(xmlSource, res);
      } else {
          // Variant 2: default JAXP transformer
          TransformerFactory transFact = TransformerFactory.newInstance();
          Transformer trans = transFact.newTransformer(xsltSource);
          trans.setOutputProperty("encoding", "UTF-8"); //$NON-NLS-1$
          StreamResult res = new StreamResult(outputFile);
          // Same result -> StreamResult res = new StreamResult(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
          // Same result -> StreamResult res = new StreamResult(new PrintWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8")));
          trans.transform(xmlSource, res);
      }
      -----------------------------------------------------------------------
      The XML file contains Arabic or Chinese text. The RTF is generated.
      When the RTF and XML are opened with a text editor or a web browser, the Arabic/Chinese text is readable.

      The issue is that when the RTF is opened with MS Word/WordPad/OpenOffice Writer:
      - Saxon generation -> the RTF file can be opened and the Arabic/Chinese text is normal, i.e. readable. English text is readable too.
      - JAXP generation -> the RTF file can be opened but the Arabic/Chinese text is scrambled. English text is readable.

      All three files (XML, XSLT, RTF) are in UTF-8.
      I have installed Chinese and Arabic TrueType fonts. When I create a Word document manually, I can type some of their characters.

      Please suggest a solution.
      Thanks
        • 1. Re: JAXP XSLT transformation UTF-8 issue
          jtahlborn
          how are xmlSource and xsltSource opened?

          also, what is the actual byte-level difference in the 2 result files?

          Edited by: jtahlborn on Oct 24, 2011 11:49 AM
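
The byte-level comparison asked about above could be done with a small helper such as the following. This is only a sketch: the class name `ByteDiff` and the method `firstDifference` are made up for illustration, and the two file paths are expected as command-line arguments.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class ByteDiff {

    /** Returns the offset of the first differing byte, or -1 if both arrays are identical. */
    static int firstDifference(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        // One array is a strict prefix of the other: they differ where the shorter one ends.
        return a.length == b.length ? -1 : common;
    }

    /** Formats up to 'count' bytes starting at 'from' as hex, e.g. "D9 85 D8 AD". */
    static String hex(byte[] data, int from, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < Math.min(data.length, from + count); i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ByteDiff <fileA> <fileB>");
            return;
        }
        byte[] a = Files.readAllBytes(Paths.get(args[0]));
        byte[] b = Files.readAllBytes(Paths.get(args[1]));
        int at = firstDifference(a, b);
        if (at < 0) {
            System.out.println("files are byte-identical");
        } else {
            System.out.println("first difference at offset " + at);
            System.out.println("A: " + hex(a, at, 16));
            System.out.println("B: " + hex(b, at, 16));
        }
    }
}
```

Looking at the raw bytes rather than at rendered text shows whether the two outputs really differ in encoding or only in how a viewer interprets them.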
          • 2. Re: JAXP XSLT transformation UTF-8 issue
            Dimitar Slavov-Oracle
            Hi jtahlborn,

            The sources are opened like this:

            Source xmlSource = new StreamSource(new InputStreamReader(new FileInputStream(xmlFile), "UTF-8"));
            Source xsltSource = new StreamSource(new InputStreamReader(new FileInputStream(xsltFile), "UTF-8"));

            After a long search I found out that something is wrong with the XSLT templates, but I don't know what. There are two identical templates, one for XSLT 1.0 and one for XSLT 2.0. Both are UTF-8 files and both declare <xsl:output method="text" encoding="UTF-8"/>.
            The template for XSLT 1.0 does not produce proper UTF-8 output.

            Thanks.
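
As a side note on how the sources are opened: wrapping the file in an `InputStreamReader` fixes the decoding up front, whereas a `StreamSource` built from a `File` lets the XML parser detect the encoding from the byte order mark or the XML declaration, and also sets a system ID against which relative `xsl:import`/`xsl:include` references can be resolved. The sketch below shows that alternative; the class `SourceOpening` and its methods are hypothetical names, and it is not claimed to fix the RTF problem described in this thread.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

final class SourceOpening {

    /** Opens a file as a Source and leaves encoding detection to the XML parser. */
    static Source open(File file) {
        return new StreamSource(file);
    }

    /** Runs an identity transformation and returns the text content of the root element. */
    static String rootText(File xmlFile) throws TransformerException {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(open(xmlFile), result);
        Document doc = (Document) result.getNode();
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Write a tiny UTF-8 sample with Arabic text, then read it back through the transformer.
        File tmp = File.createTempFile("sample", ".xml");
        tmp.deleteOnExit();
        String arabic = "\u0645\u062D\u0645\u0648\u062F";
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><title>" + arabic + "</title>";
        Files.write(tmp.toPath(), xml.getBytes(StandardCharsets.UTF_8));
        System.out.println("round trip intact: " + arabic.equals(rootText(tmp)));
    }
}
```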
            • 3. Re: JAXP XSLT transformation UTF-8 issue
              Dimitar Slavov-Oracle
              Hi jtahlborn,

              I found the difference between the two RTFs. This is one table cell containing random Arabic text:

              XSLT 1.0 generated:
              {\*\bkmkstart محمود_شمام_لرويترز:_القذافي_قتل_في_هجوم_للمجلس_الانتقالي}{\fs16 محمود شمام لرويترز: القذافي قتل في هجوم للمجلس الانتقالي}

              XSLT 2.0 generated:
              {\*\bkmkstart محمود_شمام_لرويترز:_القذافي_قتل_في_هجوم_للمجلس_الانتقالي} {\fs16\u1605?\u1581?\u1605?\u1608?\u1583?\u32?\u1588?\u1605?\u1575?\u1605?\u32?\u1604?\u1585?\u1608?\u1610?\u1578?\u1585?\u1586?\u58?\u32?\u1575?\u1604?\u1602?\u1584?\u1575?\u1601?\u1610?\u32?\u1602?\u1578?\u1604?\u32?\u1601?\u1610?\u32?\u1607?\u1580?\u1608?\u1605?\u32?\u1604?\u1604?\u1605?\u1580?\u1604?\u1587?\u32?\u1575?\u1604?\u1575?\u1606?\u1578?\u1602?\u1575?\u1604?\u1610?}

              But I don't know how to fix it. Both files are generated with Saxon and the same Java code; only the XSLTs are different.

              Any ideas ?
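
The two outputs above differ in how the cell text is written: the XSLT 2.0 result uses RTF `\uN?` escapes (a decimal character code followed by a `?` fallback character), while the XSLT 1.0 result writes the Arabic characters raw. RTF is a 7-bit ASCII format, so readers such as Word do not decode raw UTF-8 bytes as UTF-8. One possible workaround, sketched here under assumptions, is to apply the same escaping in Java, for example to the text values before or after the transformation. The class `RtfUnicodeEscaper` and its `escape` method are hypothetical names; this sketch leaves ASCII characters unescaped (the XSLT 2.0 template above escapes them as well) and does not handle line breaks or other control characters.

```java
final class RtfUnicodeEscaper {

    /** Escapes text for RTF: ASCII passes through, RTF control characters are backslash-escaped, the rest becomes \uN?. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        // Iterate over UTF-16 code units; RTF writes one \uN escape per code unit.
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                // Characters with special meaning in RTF need a backslash prefix.
                out.append('\\').append(c);
            } else if (c < 0x80) {
                out.append(c);
            } else {
                // \uN takes a signed 16-bit decimal, so code units above 32767 become negative.
                out.append("\\u").append((int) (short) c).append('?');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // First word of the sample cell above, written with Java Unicode escapes.
        System.out.println(escape("\u0645\u062D\u0645\u0648\u062F"));
        // prints \u1605?\u1581?\u1605?\u1608?\u1583?
    }
}
```

The printed sequence matches the first five escapes in the XSLT 2.0 output quoted above.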