3 Replies. Latest reply: Oct 27, 2011 2:30 AM by Dimitar Slavov

JAXP XSLT transformation UTF-8 issue

Dimitar Slavov Journeyer
Hi all,

I have an XML file generated by a Java app. The file is valid and encoded in UTF-8.
I have an XSLT template created with Altova StyleVision. The template produces an RTF output file from the XML above.
In the Java app there are two variants of the XSLT transformation: with and without Saxon. I have simplified the code; it is almost the same in both variants.

-----------------------------------------------------------------------
File outputFile = new File("path");
if (saxonToBeUsed) {
    // Saxon variant: load the Saxon TransformerFactory through its own class loader.
    ClassLoader saxonClassLoader = SaxonLoader.getInstance(saxonPath);
    TransformerFactory transFact = TransformerFactory.newInstance("net.sf.saxon.TransformerFactoryImpl", saxonClassLoader); //$NON-NLS-1$
    Transformer trans = transFact.newTransformer(xsltSource);
    trans.setOutputProperty("encoding", "UTF-8"); //$NON-NLS-1$
    StreamResult res = new StreamResult(outputFile);
    trans.transform(xmlSource, res);
} else {
    // JAXP variant: use the default TransformerFactory.
    TransformerFactory transFact = TransformerFactory.newInstance();
    Transformer trans = transFact.newTransformer(xsltSource);
    trans.setOutputProperty("encoding", "UTF-8"); //$NON-NLS-1$
    StreamResult res = new StreamResult(outputFile);
    // Same result -> StreamResult res = new StreamResult(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
    // Same result -> StreamResult res = new StreamResult(new PrintWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8")));
    trans.transform(xmlSource, res);
}

-----------------------------------------------------------------------
The XML file contains Arabic or Chinese text. The RTF is generated.
When the RTF and XML files are opened in a text editor or web browser, the Arabic/Chinese text is readable.

The issue appears when the RTF is opened with MS Word, WordPad, or OpenOffice Writer:
- Saxon generation -> the RTF file opens and the Arabic/Chinese text is normal, i.e. readable. English text is readable too.
- JAXP generation -> the RTF file opens but the Arabic/Chinese text is scrambled. English text is readable.

All three files (XML, XSLT, RTF) are in UTF-8.
I have installed Chinese and Arabic TrueType fonts. When I create a Word document manually, I can type some of their characters.

Please suggest a solution.
Thanks
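
To narrow down whether the JAXP branch itself writes correct UTF-8, the transformation above can be reduced to a self-contained check. The following is only a minimal sketch under stated assumptions: the class `JaxpUtf8Check`, its methods, and the one-template stylesheet are hypothetical stand-ins (the real StyleVision template is not shown in the post). It runs the JDK's default `TransformerFactory` with `method="text"` and returns the raw bytes so they can be inspected as hex.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of the original application.
final class JaxpUtf8Check {

    // Toy XSLT 1.0 stylesheet (an assumption): copies the text of <r> to a text output.
    private static final String XSLT =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\" encoding=\"UTF-8\"/>"
        + "<xsl:template match=\"/\"><xsl:value-of select=\"/r\"/></xsl:template>"
        + "</xsl:stylesheet>";

    // Transforms the given XML with the default JAXP factory and returns the bytes written.
    static byte[] transform(String xml) {
        try {
            Transformer trans = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            trans.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            trans.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    // Formats bytes as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One Arabic letter (U+0645) wrapped in a root element.
        System.out.println(hex(transform("<r>\u0645</r>")));
    }
}
```

The UTF-8 encoding of U+0645 is the two bytes `d9 85`. If the check prints exactly those bytes, the transformer is writing valid UTF-8 and the cause probably lies elsewhere, for example in how RTF readers interpret the file.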
  • 1. Re: JAXP XSLT transformation UTF-8 issue
    jtahlborn Expert
    how are xmlSource and xsltSource opened?

    also, what is the actual byte-level difference in the 2 result files?
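
    One way to answer the byte-level question is to locate the first offset at which the two result files differ and dump a few bytes around it. This is a minimal sketch, not taken from the thread: the class `ByteDiff` and its methods are hypothetical names, and the two file paths are expected as command-line arguments.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for comparing two generated RTF files byte by byte.
final class ByteDiff {

    // Returns the index of the first differing byte, the shorter length if one
    // array is a prefix of the other, or -1 if both arrays are identical.
    static int firstMismatch(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : n;
    }

    // Formats up to count bytes starting at from as space-separated hex pairs.
    static String hexWindow(byte[] data, int from, int count) {
        int start = Math.max(0, from);
        int end = Math.min(data.length, start + count);
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ByteDiff <first.rtf> <second.rtf>");
            return;
        }
        byte[] a = Files.readAllBytes(Paths.get(args[0]));
        byte[] b = Files.readAllBytes(Paths.get(args[1]));
        int at = firstMismatch(a, b);
        if (at < 0) {
            System.out.println("files are identical");
        } else {
            System.out.println("first difference at offset " + at);
            System.out.println("first : " + hexWindow(a, at, 16));
            System.out.println("second: " + hexWindow(b, at, 16));
        }
    }
}
```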

    Edited by: jtahlborn on Oct 24, 2011 11:49 AM
  • 2. Re: JAXP XSLT transformation UTF-8 issue
    Dimitar Slavov Journeyer
    Hi jtahlborn,

    The sources are opened like this:

    Source xmlSource = new StreamSource(new InputStreamReader(new FileInputStream(xmlFile), "UTF-8"));
    Source xsltSource = new StreamSource(new InputStreamReader(new FileInputStream(xsltFile), "UTF-8"));

    After a long search I found that something is wrong with the XSLT templates, but I don't know what. There are two identical templates, one for XSLT 1.0 and one for XSLT 2.0. Both are UTF-8 files and both declare <xsl:output method="text" encoding="UTF-8"/>.
    The XSLT 1.0 template does not produce proper UTF-8 output.

    Thanks.
  • 3. Re: JAXP XSLT transformation UTF-8 issue
    Dimitar Slavov Journeyer
    Hi jtahlborn,

    I found the difference between the two RTF files. This is one table cell containing some random Arabic text:

    XSLT 1.0 generated:
    {\*\bkmkstart محمود_شمام_لرويترز:_القذافي_قتل_في_هجوم_للمجلس_الانتقالي}{\fs16 محمود شمام لرويترز: القذافي قتل في هجوم للمجلس الانتقالي}

    XSLT 2.0 generated:
    {\*\bkmkstart محمود_شمام_لرويترز:_القذافي_قتل_في_هجوم_للمجلس_الانتقالي} {\fs16\u1605?\u1581?\u1605?\u1608?\u1583?\u32?\u1588?\u1605?\u1575?\u1605?\u32?\u1604?\u1585?\u1608?\u1610?\u1578?\u1585?\u1586?\u58?\u32?\u1575?\u1604?\u1602?\u1584?\u1575?\u1601?\u1610?\u32?\u1602?\u1578?\u1604?\u32?\u1601?\u1610?\u32?\u1607?\u1580?\u1608?\u1605?\u32?\u1604?\u1604?\u1605?\u1580?\u1604?\u1587?\u32?\u1575?\u1604?\u1575?\u1606?\u1578?\u1602?\u1575?\u1604?\u1610?}

    But I don't know how to fix it. Both files were generated with Saxon and the same Java code; only the XSLT templates differ.

    Any ideas?
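
    The two fragments above differ in how the cell text is written: the XSLT 1.0 output contains the raw Arabic characters, while the XSLT 2.0 output writes each character as an RTF Unicode control word (`\uN` followed by a `?` fallback character). If that difference is indeed what the RTF readers stumble over, one possible workaround is to escape the text in Java before it reaches the XSLT 1.0 template, or in a post-processing step. The sketch below is an illustration under assumptions, not a confirmed fix: the class `RtfUnicodeEscaper` and its `escape` method are hypothetical, it leaves plain ASCII unescaped (the XSLT 2.0 output above escapes spaces and colons as well), and it assumes the RTF default of one fallback character per Unicode control word.

```java
// Hypothetical helper that rewrites text so it can be embedded in an RTF document.
final class RtfUnicodeEscaper {

    // Escapes RTF special characters and writes every non-ASCII UTF-16 code unit
    // as an RTF Unicode control word with '?' as the fallback character.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                // Backslash and braces are RTF syntax and need a leading backslash.
                sb.append('\\').append(c);
            } else if (c < 0x80) {
                // Plain ASCII is passed through unchanged.
                sb.append(c);
            } else {
                // RTF expects a signed 16-bit decimal value, so code units above
                // 32767 become negative numbers after the cast to short.
                sb.append("\\u").append((int) (short) c).append('?');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two Arabic letters (U+0645, U+062D) followed by ASCII text in braces.
        System.out.println(escape("\u0645\u062d {ok}"));
    }
}
```

    Because Java strings are UTF-16, characters outside the Basic Multilingual Plane are emitted as two consecutive control words, one per surrogate.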
