
Java EE (Java Enterprise Edition) General Discussion


SAX (xerces) problem

805006, Apr 14 2009 (edited Jun 8 2009)
I have a big problem with Apache Xerces2 Java.

I have to parse and extract data from very large XML files (100 MB to 20 GB). Because the files are so large, I have to use a SAX parser.

If I use the internal Xerces in any update of JDK/JRE 1.6, the whole document is loaded into memory. I found a related bug report at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6536111 . I am not sure that fix will solve my problem, and it has not been delivered yet. According to the bug report it is going to be delivered with JDK 6 update 14 in mid-May 2009.

I thought maybe the problem was with the internal SAX parser, so I started using the Xerces source directly (the latest version, 2.9.1). At this point I discovered that parsing takes more time and needs 24 bytes per node. Some XML files have 80,000,000 nodes, which would take 1.5 to 2 GB of RAM that I don't have. Even if I had that much RAM I could not use it on 32-bit Windows (OS limits).
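
For what it's worth, one way to make sure the standalone Xerces parser is picked up (instead of the copy bundled inside the JDK) is to name its JAXP factory explicitly; a minimal sketch, assuming xercesImpl.jar is on the classpath:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class UseStandaloneXerces
{
  public static void main(String[] args) throws Exception
  {
    // Name the standalone Xerces factory explicitly so JAXP does not fall back
    // to the copy bundled inside the JDK.
    System.setProperty("javax.xml.parsers.SAXParserFactory",
        "org.apache.xerces.jaxp.SAXParserFactoryImpl");
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(new File(args[0]), new DefaultHandler()); // no-op handler, pure streaming
  }
}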

Has anyone got an idea or a solution?

Thanks..

Comments

jtahlborn
kalyon wrote:
I thought maybe the problem was with the internal SAX parser, so I started using the Xerces source directly (the latest version, 2.9.1). At this point I discovered that parsing takes more time and needs 24 bytes per node. Some XML files have 80,000,000 nodes, which would take 1.5 to 2 GB of RAM that I don't have. Even if I had that much RAM I could not use it on 32-bit Windows (OS limits).
if i understand you correctly, you are using the sax parser, but storing the whole document in memory as it is parsed? if so, you are doing exactly what a dom parser does, and of course you will run out of memory. the point of using a sax parser is that you process the document as you read it, and never store the entire parsed document in memory at one time.
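
A minimal sketch of that event-driven style: the handler below does its work per element as the file streams past and retains nothing (the counting is just a stand-in for real processing):

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingCount extends DefaultHandler
{
  private long elements;

  // called once per start tag as it streams past; nothing is kept afterwards
  public void startElement(String uri, String local, String qName, Attributes atts)
  {
    elements++; // do the real per-element work here instead
  }

  public static void main(String[] args) throws Exception
  {
    StreamingCount handler = new StreamingCount();
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(new File(args[0]), handler);
    System.out.println("elements seen: " + handler.elements);
  }
}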
805006
I am using a SAX parser and I do not want to store the whole document in memory. But the internal SAX parser in JDK/JRE 1.6 has a bug which stores the whole document in memory.

I prefer a SAX parser for the reason you mentioned (in order not to store the whole document).
jtahlborn
kalyon wrote:
I am using a SAX parser and I do not want to store the whole document in memory. But the internal SAX parser in JDK/JRE 1.6 has a bug which stores the whole document in memory.

I prefer a SAX parser for the reason you mentioned (in order not to store the whole document).
i understand all that. i looked at the bug you referenced and i understand there is a bug in the jdk. i was addressing the second part of your original post, where it seemed like you were trying to work around the problem by using the latest xerces parser (which seems like a reasonable idea). if you are indeed having problems aside from the jdk bug, maybe you could re-post that question.
843834
I have been developing an application that, amongst other features, transforms big XML files using XSLT and SAX.
When using JDK 1.6.0_12 I got Out of Memory errors, and after a long search I ended up reading this post and checking (as suggested) the new features in update 14 to see if it would really help with my issue. I was feeling optimistic, but when I finally upgraded to update 14 (JDK 1.6.0_14) and ran the application I still got the old Out of Memory error when parsing big XML files. In fact, I have checked that memory consumption is exactly the same as with the previous update (12).

Here is the most relevant part of my code, in case there's something wrong with it:
[...]

javax.xml.transform.sax.SAXTransformerFactory tFactory =
    (javax.xml.transform.sax.SAXTransformerFactory) javax.xml.transform.sax.SAXTransformerFactory.newInstance();

TransformerHandler transHand = tFactory.newTransformerHandler(new StreamSource(xsl));
Transformer transformer = transHand.getTransformer();

// pass each map entry through as a stylesheet parameter
if (map != null)
{
    Iterator it = map.entrySet().iterator();
    while (it.hasNext())
    {
        Entry e = (Entry) it.next();
        transformer.setParameter(e.getKey().toString(), e.getValue().toString());
    }
}

OutputStream fos = null;
try
{
    fos = new BufferedOutputStream(new FileOutputStream(rutaXmlOut));
    StreamResult flujoSalida = new StreamResult(fos);
    SAXSource flujoEntrada = new SAXSource(new InputSource(new BufferedReader(FileUtil.readReader(xml_in))));
    transformer.transform(flujoEntrada, flujoSalida);

    xml_out = new File(rutaXmlOut);
}
finally
{
    if (fos != null) // guard against a NullPointerException if the constructor threw
    {
        fos.close();
    }
}

return xml_out;
Is it really impossible to perform average SAX transformations with Sun's JDK? I don't think so; there's got to be another way...
PLEASE help me, this is an extremely important part of my application. I'm new to SAX but I'm already pretty disappointed with its performance. The code above works fine with small XML documents (although memory consumption is always relatively high) but crashes with big ones, consuming absurdly high amounts of memory. It behaves just as if it were NOT using SAX for the transformations.

Is there any way to make XSLT transformations on big XML files with SAX (or similar) without running out of memory? Should I try [yet] another [older] JDK? Any workarounds or different technologies to get the job done?
805006
As you can see on that page (http://www.java.com/en/download/inc/windows_new_xpi.jsp), update 14 has not been officially released yet, so I do not know how you got it.

But I can suggest a few things that work for me.
1. Use Java 1.5.3. It does not have the SAX problem. Be sure that you are using the exact version you want, because if you do not give the exact path the OS uses the default (newer) one. Giving the path before java makes it explicit, something like "C:\Program Files\java 1.5.3\bin\java".
2. If you have a chance to make the files smaller, do it. Decrease both the file size and the node count, because the SAX parser internally consumes 24 bytes per node for validation.
3. Use the JVM monitoring tools (e.g. VisualVM) that come with Java 1.6 or above to analyse the application. Try to figure out whether the problem happens because of the SAX parser or other parts of your code.

kalyon/Istanbul
843834
Hi kalyon, thanks for your response.
Update 14 is indeed released and officially out on Sun's webpage :), you can find it and download it from Sun's download center just as I did. You should always check the source instead of other pages! ;)
I was afraid I'd have to go back to a previous release of the JDK to solve my problem. But it's very strange that I haven't seen any more complaints on the net about SAX malfunctioning in JDK 1.6.x, and about the problem not being solved in the latest release even though it appeared in the "solved bugs" list of Sun's official 1.6.0u14 release notes. I'm also a newbie to SAX and I was hoping there'd be something wrong with my code or application configuration instead of the JDK itself, but as far as I can tell it seems to be an unsolved bug, still ignored by the community.
I guess that in the end I'll have to go back to JDK 1.5; what a shame... I hope that step back doesn't have any more consequences for the rest of my application.
BTW, it's not possible to make my files smaller, but they aren't HUGE anyway (they're never bigger than a few hundred megs and most of the time they're much smaller than that), and if they were processed the way I expected SAX to work, they wouldn't take more than 20 megs of memory (from what I've read). At the moment, that part of the application is consuming more than 50 megs for big files... it seems that it tries to load them into memory (typical DOM behaviour?) instead of processing them via streaming (SAX).
Thanks anyway for your suggestion, I'll probably check out version 1.5.3.
DrClap
Isana wrote:
Is it really impossible to perform average SAX transformations with Sun's JDK?
What do you mean by "SAX transformations"? I don't understand that phrase. SAX is just a parser, it doesn't do transformations.
Is there any way to make XSLT transformations on big XML files with SAX (or similar) without running out of memory?
If you're doing XSLT, then it will build a tree in memory from your XML document. It must do that because there is no guarantee that the transformer will have all of the data it requires if it just reads it sequentially. It makes no difference which parser you used to parse your XML document, that's a separate step from transformation.
843834
Hi DrClap, thanks for your answer. As you can check in my code above, SAX is actually being used for XSLT transformations:
javax.xml.transform.sax.SAXTransformerFactory tFactory =
    (javax.xml.transform.sax.SAXTransformerFactory) javax.xml.transform.sax.SAXTransformerFactory.newInstance();
I didn't mean that SAX does them; rather, it parses the file throwing events in such a way that it doesn't need the whole file in memory. These events are then carried to the transformer, which uses the XSLT file to produce the result. Or at least that's what I understood from reading the docs (as I said, I'm new to XML transformations).
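
For reference, that pipeline can be wired up explicitly through JAXP's SAXTransformerFactory; a minimal sketch with hypothetical file names (note the caveat in the comment):

import java.io.File;
import java.io.FileInputStream;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxToXsltPipeline
{
  public static void main(String[] args) throws Exception
  {
    SAXTransformerFactory stf =
        (SAXTransformerFactory) TransformerFactory.newInstance();
    TransformerHandler handler =
        stf.newTransformerHandler(new StreamSource(new File("transform.xsl")));
    handler.setResult(new StreamResult(new File("out.xml")));

    // the parser streams events into the handler; note that the transformer
    // behind the handler may still buffer the whole document internally
    XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    reader.setContentHandler(handler);
    reader.parse(new InputSource(new FileInputStream("in.xml")));
  }
}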


Yesterday I tried to change to JDK 1.5u3, which is supposed to run transformations using SAX (as a parser, yes) with no memory waste, but I read that I should also change my Tomcat to an older version. I was already unhappy about having to go back to a previous release of the JDK, let alone changing more things to older, more error-prone and less functional versions.

So, my big question remains, and I will be eternally grateful to anyone who answers it :)

Is there a way to perform XML transformations (using XSL) from Java with little memory usage, using the latest JDK? I don't mind the technology, as long as it's documented, easy to use, can be called from Java, and uses an XML input and an XSLT file to produce a modified XML.

I thought that the main advantage of SAX was that it didn't take much memory to parse the files, but in my case memory is being wasted. Instead of reading a portion of XML, processing it using the XSL file and writing the result to another XML via streaming, it seems like it's loading the whole XML file into memory in some way or another (DOM?).
Please, can someone answer my question and post some code, or at least link to a page where I can find a 100% tested, 100% working solution? I read that StAX and SAX are the better options, but as I said there's significant memory waste when using SAX. I'm having such a headache with this... and it's probably the most important part of my application!
Tolls
The only thing that comes to mind is STX, though I have no idea if that's still considered a candidate for anything. Joost is a Java implementation of it. Joost has recently been updated, so work is still being done in this area. I think Apache might have something in their libraries somewhere for STX as well.

Note, it's not XSLT. It has its own (related) syntax, so you'll have to rewrite your XSL files. It's been three or four years since I looked at this, so I can't guarantee it's suitable.

Edited by: Tolls on 03-Jun-2009 12:28
Bouncy fingers.
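
For illustration only, since this hasn't been verified against the current Joost release: Joost plugs into the standard TrAX API, so (assuming its documented factory class name) using it from Java might look something like:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StxSketch
{
  public static void main(String[] args) throws Exception
  {
    // Ask JAXP for Joost's TrAX factory instead of the default XSLT engine
    // (joost.jar must be on the classpath; class name per Joost's TrAX docs).
    System.setProperty("javax.xml.transform.TransformerFactory",
        "net.sf.joost.trax.TransformerFactoryImpl");
    Transformer t = TransformerFactory.newInstance()
        .newTransformer(new StreamSource("transform.stx")); // an STX sheet, not XSLT
    // STX processes events as they stream past, so the input is never fully in memory.
    t.transform(new StreamSource("big.xml"), new StreamResult("out.xml"));
  }
}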
843834
Fact is, we are using XSLT transformations extensively, and for integrity purposes (and also to avoid abandoning such a well-known standard) we want to stick to it.
Thanks anyway for your suggestion, Tolls, STX seems to be a very interesting tool for future purposes.

Does anyone have any ideas involving XSLT, streaming (or at least not loading the whole document in memory) and XML? Or did anyone succeed in using SAX with an acceptable use of memory? Any help will be greatly appreciated.
jtahlborn
Isana wrote:
Does anyone have any ideas involving XSLT, streaming (or at least not loading the whole document in memory) and XML? Or did anyone succeed in using SAX with an acceptable use of memory? Any help will be greatly appreciated.
i think DrClap already answered your question:
DrClap wrote:
If you're doing XSLT, then it will build a tree in memory from your XML document. It must do that because there is no guarantee that the transformer will have all of the data it requires if it just reads it sequentially. It makes no difference which parser you used to parse your XML document, that's a separate step from transformation.
Tolls
Isana wrote:
Fact is, we are using XSLT transformations extensively, and for integrity purposes (and also to avoid abandoning such a well-known standard) we want to stick to it.
Thanks anyway for your suggestion, Tolls, STX seems to be a very interesting tool for future purposes.
That is the problem. I used it once for a client, to solve a memory problem with the transformation of some large report which was killing their system. Copious notes and the like were required, though, to ensure that anyone who came afterwards knew what I'd done.

To do this for a bedded-in project, involving numerous XSLs, would be a chore, to say the least.
DrClap
Isana wrote:
Does anyone have any ideas involving XSLT, streaming (or at least not loading the whole document in memory) and XML?
I believe the Saxon product has optimizations which can do streaming transformations in some cases. Of course not all transformations are amenable to streaming, so this might not work for you. But it might be worth looking into.
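
A minimal sketch of such a streamed transform using Saxon's s9api (an API from Saxon releases much later than this thread; it assumes Saxon-EE and a stylesheet that declares <xsl:mode streamable="yes"/>; file names are hypothetical):

import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.Serializer;
import net.sf.saxon.s9api.Xslt30Transformer;

public class SaxonStreamingSketch
{
  public static void main(String[] args) throws Exception
  {
    // Streaming requires Saxon-EE features, hence Processor(true).
    Processor proc = new Processor(true);
    Xslt30Transformer t = proc.newXsltCompiler()
        .compile(new StreamSource(new File("transform.xsl")))
        .load30();
    Serializer out = proc.newSerializer(new File("out.xml"));
    // applyTemplates pushes events straight from the parser through the stylesheet.
    t.applyTemplates(new StreamSource(new File("big.xml")), out);
  }
}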
843834
Thank you both, Tolls and DrClap, for your help. I'll take a look at Saxon, but I'm still intrigued why nobody is complaining about a tool (SAX) that's almost a standard for stream parsing... and yet doesn't work for XSLT transformations! Maybe you were right after all when you said stream processing might not be possible for my XSLT file, but I doubt it, because the XML represents "sort of" a table and is therefore made up of thousands of structurally identical <row> elements which can be individually transformed... I can't think of anything more suitable for streaming transformation.
Thanks again for your time.

Edited: in the end I've decided to parse the document using SAX (which in my tests uses almost no memory at all and performs lightning-fast) and then apply an XSL transformation for each parsed node (I can do that in my case). But transforming a document will still be a huge problem, in terms of memory usage, for those who don't have a repeating pattern in their XMLs, although I guess 99% of the time there'll be one for big/huge documents.


I think this will be very useful for other programmers facing the same problem I had. This code divides an XML file into several different files according to a repeating pattern (in this case InsurancePolicyData/Record) using SAX, and then processes each chunk of XML separately, optimizing the use of memory:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.Writer;

import org.dom4j.io.SAXReader;

public class SingleThreadSplitXSLT
{

  public static void main(String[] args) 
    throws Exception
  {
    if (args.length != 3)
    {
      System.err.println(
        "Error: Please provide 3 inputs:” + 
        “ inputXML XSLT outputXML");
      System.exit(-1);
    }

    long startTimeMs = System.currentTimeMillis();
    
    File xmlFile = new File(args[0]);
    File xsltFile = new File(args[1]);
    BufferedWriter outputWriter = new 
      BufferedWriter(new FileWriter(args[2]));

    styleDocument(xmlFile, xsltFile, outputWriter);

    outputWriter.close();

    long executionTime = 
      System.currentTimeMillis() - startTimeMs;
    System.err.println("Successful transformation took " 
      + executionTime);
  }

  
  public static void styleDocument(File xmlFile, 
    File xsltFile, Writer outputWriter)
    throws Exception
  {

    // start the output file
    outputWriter.write(
      "<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
    outputWriter.write("<InsurancePolicyData>");

    // read the input file incrementally
    SAXReader reader = new SAXReader();
    reader.addHandler( "/InsurancePolicyData/Record", 
      new SplitFileElementHandler(
        xsltFile, outputWriter));

    reader.read(xmlFile);

    // finish output file
    outputWriter.write("</InsurancePolicyData>");
    
  }
}
(I found it at http://www.devx.com/xml/Article/34677/1954)

That's exactly what I was looking for, hope it helps others as well :)

Edited by: Isana on Jun 4, 2009 7:56 AM
jtahlborn
i wouldn't recommend using that code in production, however, as it is a recipe for generating broken xml files. the code "claims" the output file is in "utf-8" but in fact writes using the currently configured platform default encoding (FileWriter gives you no choice about that). in general, writing xml "by hand" (e.g. string concatenation) is full of potential pitfalls.
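
a minimal fix for that particular pitfall, keeping the rest of the posted splitting code unchanged, is to declare the encoding on an OutputStreamWriter:

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

// replaces: new BufferedWriter(new FileWriter(args[2]))
// OutputStreamWriter lets you name the encoding the XML declaration promises
Writer outputWriter = new BufferedWriter(
    new OutputStreamWriter(new FileOutputStream(args[2]), "UTF-8"));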
DrClap
Isana wrote:
I'm still intrigued why nobody is complaining about a tool (SAX) that's almost a standard for stream parsing... and yet doesn't work for XSLT transformations!
It seems you still haven't understood that parsing and transforming are two separate operations. Just because a particular parser works sequentially and doesn't build a tree in memory, it doesn't follow that transforming should do that as well. They are two separate things and complaining that a transformer doesn't work identically to a particular parser makes no sense at all.
843834
Yes, Dr Clap, after reading some more this afternoon I finally understood what you were trying to say from the beginning. Sorry for my lack of knowledge and understanding of all this stuff :)

jtahlborn, thanks for pointing that out, but I'd be happier if you brought me an alternative solution for my problem :)

I've already had some problems with the code I posted in my last post. The first problem was that I couldn't isolate pieces of XML completely, because some of them were related to other parts of the document. Secondly, even when making the simplest XSLT transformation, memory still seems to leak somehow and I don't understand why. I've seen that parsing with SAX takes very little memory and just about 2 seconds of processing time, but when I add transformations, even when they are only applied to little pieces of XML, the memory usage goes off the chart (more than 10 times the amount needed to parse), the processing takes almost a minute at 99% of the processor, and finally the application crashes with the classic "Out of memory" exception.

I guess I'm doing something wrong, but I'm starting to think about doing things in a different manner. Maybe use some other technology to make the changes, but XSLT is the standard, it's what's been used all through the application, and it's easier to maintain in future versions, so I still don't know which way to go. I'm even beginning to consider increasing the memory requirements of my app so that it can support the plain SAX-parsing-plus-XSLT-transformation approach (the first piece of code I posted). It has always worked fine, but it consumes more than 120 megs of RAM per transformation, and since the application will be multiuser, memory requirements could exceed 1 GB, maybe even 2... and that has always seemed crazy to me and very hard to defend in front of my bosses. For the love of god, all I'm doing is reading a text file through streaming and changing some parts, then writing the result to another file! It's crazy to use 100% of the processor capacity and take ALL the available memory in the machine for such a thing!! It seems even crazier when I've seen that parsing with SAX costs the machine almost NOTHING in terms of memory and time.

I'm starting to get tired of this.... I should have been a soccer player or a F1 driver. I'd be richer, I'd have more fun, less headaches and a beautiful girlfriend!
Have a nice weekend.

Edited by: Isana on Jun 4, 2009 4:34 PM
Tolls
All I can suggest now is that you probably want to look at your XSLT and see how it's doing its transformation. Could you "streamline" it somehow so that, from the transformation engine's point of view (and I'm thinking Saxon here), it doesn't have to load the whole (or a large part of the) document in one go (which is what is happening at the moment)? Saxon has some clever stuff behind it whereby it doesn't hold onto information it knows is no longer needed, to try and reduce the memory footprint... but this requires that the XSLT is written so that it doesn't have to grab this stuff.

There are probably some Saxon-based articles around on how to go about doing that. If you are simply altering the odd detail in a node, which isn't reliant on data in other nodes, then this might be the route to take.
DrClap
Another suggestion: if your transformation is really amenable to being done in a single sequential pass through the XML document, you could read the XML document with a StAX reader and write the transformed version with a StAX writer, with the transformation logic written in Java instead of XSLT.

Yes, I know that's inconsistent with all of your other XSLT work which (apparently) is working just fine. But sometimes when you run into the limits of a scalability problem, you have to stop fighting that problem and use a different solution. This is another solution which you might want to consider.
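
A minimal sketch of that approach with a trivial stand-in transformation (upper-casing text content; file names hypothetical). This bare-bones version ignores namespaces, comments, and processing instructions:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class StaxTransformSketch
{
  public static void main(String[] args) throws Exception
  {
    XMLStreamReader in = XMLInputFactory.newInstance()
        .createXMLStreamReader(new FileInputStream("in.xml"));
    XMLStreamWriter out = XMLOutputFactory.newInstance()
        .createXMLStreamWriter(new FileOutputStream("out.xml"), "UTF-8");

    out.writeStartDocument("UTF-8", "1.0");
    // events pass through one at a time, so memory usage stays constant
    while (in.hasNext())
    {
      switch (in.next())
      {
        case XMLStreamConstants.START_ELEMENT:
          out.writeStartElement(in.getLocalName());
          for (int i = 0; i < in.getAttributeCount(); i++)
            out.writeAttribute(in.getAttributeLocalName(i), in.getAttributeValue(i));
          break;
        case XMLStreamConstants.CHARACTERS:
          out.writeCharacters(in.getText().toUpperCase()); // the "transformation"
          break;
        case XMLStreamConstants.END_ELEMENT:
          out.writeEndElement();
          break;
      }
    }
    out.writeEndDocument();
    out.close();
    in.close();
  }
}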
843834
Thank you both for your suggestions. I'm already dealing with other issues of my application, but I'll check both options as soon as I get back to my XSL nightmare :)
Yes, I know that modifying the XSL so that it can process a part of the XML instead of the whole document is a good way to go, but it's not going to be easy. That will, however, be the first option on my list. If I don't succeed, I'll probably check another technology (yes, I had heard of StAX before), but at this point we're trying to stick to the well-known standard of XSLT transformations.