5 Replies Latest reply on Dec 6, 2006 3:32 AM by 807607

    Extending included SAX parser in Java 5.

      Hi all,
      I have a somewhat unique problem and I could use some help. I'm tackling a project to integrate a mainframe system with a J2EE processing cluster. The mainframe produces a data file that is formatted via template "overlays". Since this process is very old, I cannot create a template that uses XML to delimit the data elements, and I don't want to perform screen scraping. Also, the text in the tags will not be correctly escaped. Instead I can create tags based on the same principle. I will probably use the format below for tags.
      {{{tag-name}}}My data goes here{{{/tag-name}}} 
      Since this is basically a structured tagged document, I would like to extend the existing SAX parser to use the tag syntax of {{{tag-name}}} for the opening tag and {{{/tag-name}}} for the closing tag. I would then also remove the parsing of escaped characters in the body. Is this feasible using the SAX parser, or will I need to create my own from scratch and just use the existing event interfaces?

        • 1. Re: Extending included SAX parser in Java 5.
          How about this: write a java.io.InputStream subclass that wraps another InputStream, and basically just passes the inner stream's content through, untouched, except that it scans for {{{ or }}} and replaces them with < and >.

          Then you could create an instance of one of these streams around whatever stream you use to read the mainframe data, and pass it to SAX.
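          A minimal sketch of such a wrapper, assuming the data is in an ASCII-compatible single-byte encoding (so { and } are one byte each) and that {{{ and }}} never occur as literal data. The class name TripleBraceInputStream is made up for illustration, and skip() and available() are left at their inherited behaviour, so they are not delimiter-aware.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical wrapper: rewrites "{{{" as "<" and "}}}" as ">" while streaming. */
class TripleBraceInputStream extends FilterInputStream {

    TripleBraceInputStream(InputStream in) {
        // Two bytes of pushback are enough to undo a failed three-byte match.
        super(new PushbackInputStream(in, 2));
    }

    @Override
    public int read() throws IOException {
        PushbackInputStream pin = (PushbackInputStream) in;
        int first = pin.read();
        if (first != '{' && first != '}') {
            return first; // ordinary byte, or -1 at end of stream
        }
        int second = pin.read();
        if (second != first) {
            if (second != -1) pin.unread(second);
            return first; // a lone brace passes through untouched
        }
        int third = pin.read();
        if (third != first) {
            // Restore the original order: unread the later byte first.
            if (third != -1) pin.unread(third);
            pin.unread(second);
            return first;
        }
        return first == '{' ? '<' : '>';
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // Route bulk reads through read() so the substitution is never bypassed.
        if (len == 0) return 0;
        int count = 0;
        while (count < len) {
            int b = read();
            if (b == -1) break;
            buf[off + count++] = (byte) b;
        }
        return count == 0 ? -1 : count;
    }
}
```

          Something like SAXParserFactory.newInstance().newSAXParser().parse(new TripleBraceInputStream(rawStream), handler) would then hand the rewritten bytes to the stock parser, where rawStream and handler stand for whatever stream and DefaultHandler you already have.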
          • 2. Re: Extending included SAX parser in Java 5.
            So, don't extend or re-invent SAX at all; just scrub the input to it.
            • 3. Re: Extending included SAX parser in Java 5.
              I thought of that initially too, but unfortunately that still doesn't fix the encoding problem. I could replace '{{{' with '<' in the templates; that is not the issue. Here is a more concise example. Let's say I have the following basic output.

              Description: Barnes & Noble
              Amount: 25.50

              If this came out in well-formed XML it would look like this
                   <description>Barnes &amp; Noble</description>
              However, if I were to just use XML tags in my template on the mainframe, this is what I would get.
                   <description>Barnes & Noble</description>
              As you can see, the "&" is not correctly XML encoded by the mainframe. When I run this through the SAX parser, it will not work because the "&" is not part of an escape sequence. I have a rudimentary implementation of org.xml.sax.XMLReader and it works with the SAX content handler implementation, but it seems like a somewhat clunky solution. The only other option I can think of is to run the whole file through sed on the mainframe to encode everything.
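              The failure is easy to reproduce with the parser bundled in the JDK. A small sketch, with class and method names made up for illustration: the helper reports whether the stock SAX parser accepts a piece of text, and it should accept the escaped line above but reject the one with the bare "&".

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper for checking what the stock SAX parser will accept. */
class BareAmpersandDemo {

    /** Returns true if the text parses as well-formed XML, false on a parse error. */
    static boolean parses(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors, e.g. a "&" that starts no entity reference.
            return false;
        }
    }
}
```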
              • 4. Re: Extending included SAX parser in Java 5.
                I don't think this is feasible with any XML parser.

                I mean you don't have tags. You don't have valid data. There isn't very much that is XMLish about this.

                Just write your own parser.
                • 5. Re: Extending included SAX parser in Java 5.
                  If the only other problem is entities...then make the wrapper InputStream also quote entities. If the differences between XML and this format are a handful of simple little quirks like this, then scrubbing the input in a component separate from the parser is still, I think, the cleanest solution.
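                  A rough sketch of the entity-quoting part, assuming the mainframe never escapes anything (so every &, < and > in the raw data is literal text) and the encoding is ASCII-compatible. The class name EntityEscapingInputStream is made up for illustration. It would have to run on the raw {{{tag}}} data, before the braces are turned into angle brackets, otherwise it would escape the tags as well.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical wrapper: expands literal &, < and > into XML entity references. */
class EntityEscapingInputStream extends FilterInputStream {
    private byte[] pending = new byte[0]; // entity bytes still to be handed out
    private int pos = 0;

    EntityEscapingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (pos < pending.length) {
            return pending[pos++]; // drain the entity currently being emitted
        }
        int b = in.read();
        String entity;
        switch (b) {
            case '&': entity = "&amp;"; break;
            case '<': entity = "&lt;"; break;
            case '>': entity = "&gt;"; break;
            default: return b; // ordinary byte, or -1 at end of stream
        }
        pending = entity.getBytes(StandardCharsets.US_ASCII);
        pos = 1;
        return pending[0];
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // Route bulk reads through read() so no byte skips the escaping.
        if (len == 0) return 0;
        int count = 0;
        while (count < len) {
            int b = read();
            if (b == -1) break;
            buf[off + count++] = (byte) b;
        }
        return count == 0 ? -1 : count;
    }
}
```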

                  If you have something that's different from XML in a hundred different little ways, but looks vaguely like XML if you stand back thirty feet and squint at it...then ignore the surface similarities. It doesn't matter if it looks vaguely like XML if there are a thousand painful little differences. In that case don't even bother trying to subclass SAX; just write your own lexer and parser.
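                  For the write-your-own route, here is a bare-bones sketch that still reuses the existing SAX event interfaces, as the original question suggests. It assumes the whole file fits in a String, that tags look like {{{name}}} and {{{/name}}} with no attributes, and that body text is passed through verbatim. BraceTagParser is a made-up name; a fuller version would implement org.xml.sax.XMLReader and reject text outside the root element.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical hand-written parser for {{{tag}}} documents that fires SAX events. */
class BraceTagParser {
    // Group 1 is "/" for a closing tag; group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("\\{\\{\\{(/?)([A-Za-z][A-Za-z0-9_.-]*)\\}\\}\\}");

    static void parse(String text, ContentHandler handler) throws SAXException {
        Deque<String> open = new ArrayDeque<String>();
        Matcher m = TAG.matcher(text);
        int last = 0;
        handler.startDocument();
        while (m.find()) {
            emitText(text, last, m.start(), handler); // body text before this tag
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                open.push(name);
                handler.startElement("", name, name, new AttributesImpl());
            } else {
                if (open.isEmpty() || !open.pop().equals(name)) {
                    throw new SAXException("Unexpected closing tag: " + name);
                }
                handler.endElement("", name, name);
            }
            last = m.end();
        }
        emitText(text, last, text.length(), handler);
        if (!open.isEmpty()) {
            throw new SAXException("Unclosed tag: " + open.peek());
        }
        handler.endDocument();
    }

    // Raw text goes straight to characters(); no entity decoding is attempted.
    private static void emitText(String text, int from, int to, ContentHandler handler)
            throws SAXException {
        if (to > from) {
            char[] chars = text.substring(from, to).toCharArray();
            handler.characters(chars, 0, chars.length);
        }
    }
}
```

                  Because the body text is never entity-decoded, a line such as {{{description}}}Barnes & Noble{{{/description}}} reaches the content handler with its "&" intact.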