5 Replies. Latest reply: Dec 5, 2006 7:32 PM by 807607

Extending included SAX parser in Java 5.

807607 Newbie
Hi all,
I have a somewhat unusual problem and I could use some help. I'm tackling a project to integrate a mainframe system with a J2EE processing cluster. The mainframe produces a data file that is formatted via template "overlays". Since this process is very old, I cannot create a template that uses XML to delimit the data elements, and I don't want to resort to screen scraping. Also, the text inside the tags will not be correctly escaped. Instead, I can create tags based on the same principle. I will probably use the format below for tags.
{{{tag-name}}}My data goes here{{{/tag-name}}} 
Since this is basically a structured tagged document, I would like to extend the existing SAX parser to use the tag syntax of
{{{.*}}}
for opening and
{{{\/.*}}}
for the closing tag. I would then also remove the parsing of escaped characters in the body. Is this feasible using the SAX parser, or will I need to create my own from scratch and just use the existing event interfaces?

Todd
  • 1. Re: Extending included SAX parser in Java 5.
    807607 Newbie
    How about this: write a java.io.InputStream subclass that wraps another InputStream, and basically just passes the inner stream's content through, untouched, except that it scans for {{{ or }}} and replaces them with < and >.

    Then you could create an instance of one of these streams around whatever stream you use to read the mainframe data, and pass it to SAX.
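A minimal sketch of the wrapper stream suggested above. The class name BraceTagInputStream is hypothetical, and the code assumes a single-byte, ASCII-compatible encoding so that braces can be matched byte by byte. It only rewrites the delimiters; it does not escape anything in the data.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical wrapper: passes bytes through untouched, except that
// "{{{" becomes '<' and "}}}" becomes '>'.
class BraceTagInputStream extends FilterInputStream {
    private final PushbackInputStream pb;

    BraceTagInputStream(InputStream source) {
        // Two bytes of pushback are enough to look ahead for a triple brace.
        super(new PushbackInputStream(source, 2));
        this.pb = (PushbackInputStream) in;
    }

    @Override
    public int read() throws IOException {
        int b = pb.read();
        if (b != '{' && b != '}') {
            return b; // ordinary byte or end of stream
        }
        int b2 = pb.read();
        if (b2 != b) {
            if (b2 != -1) pb.unread(b2);
            return b; // a lone brace is plain data
        }
        int b3 = pb.read();
        if (b3 == b) {
            return b == '{' ? '<' : '>'; // triple brace found
        }
        // Only a double brace: push the look-ahead back in reverse order.
        if (b3 != -1) pb.unread(b3);
        pb.unread(b2);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // Route bulk reads through read() so the replacement always applies.
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int b = read();
            if (b == -1) break;
            buf[off + n++] = (byte) b;
        }
        return n == 0 ? -1 : n;
    }
}
```

An instance could then be handed to the stock parser, for example SAXParserFactory.newInstance().newSAXParser().parse(new BraceTagInputStream(rawStream), handler), where rawStream and handler stand for whatever stream and DefaultHandler the application already has.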
  • 2. Re: Extending included SAX parser in Java 5.
    807607 Newbie
    So, don't extend or re-invent SAX at all; just scrub the input to it.
  • 3. Re: Extending included SAX parser in Java 5.
    807607 Newbie
    I thought of that initially too, but unfortunately that still doesn't fix the encoding problem. I could replace '{{{' with '<' in the templates; that is not the issue. Here is a more concise example. Let's say I have the following basic output.

    Transaction
    Description: Barnes & Noble
    Amount: 25.50

    If this came out as well-formed XML, it would look like this:
    <transaction>
         <description>Barnes &amp; Noble</description>
         <amount>25.50</amount>
    </transaction>
    However, if I were to just use XML tags in my template on the mainframe, this is what I would get.
    <transaction>
         <description>Barnes & Noble</description>
         <amount>25.50</amount>
    </transaction>
    As you can see, the '&' is not correctly XML-encoded by the mainframe. When I run this through the SAX parser, it fails because the '&' is not part of an escape sequence. I have a rudimentary implementation of org.xml.sax.XMLReader, and it works with the SAX content handler implementation; it just seems like a somewhat clunky solution. The only other option I can think of is to run the whole file through sed on the mainframe to encode everything.
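The failure described above is easy to reproduce with the parser bundled in the JDK. This is only a small check, with a hypothetical class name (WellFormedCheck); it relies on DefaultHandler rethrowing fatal errors, which is its documented default behaviour.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether the JDK's SAX parser accepts the text.
final class WellFormedCheck {
    static boolean isWellFormed(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // A bare '&' that does not start a reference is a fatal error.
            return false;
        }
    }
}
```

With this helper, the unescaped description line is rejected, while the same line with &amp; in place of the bare ampersand is accepted.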
  • 4. Re: Extending included SAX parser in Java 5.
    807607 Newbie
    I don't think this is feasible with any XML Parser.

    I mean you don't have tags. You don't have valid data. There isn't very much that is XMLish about this.

    Just write your own parser.
  • 5. Re: Extending included SAX parser in Java 5.
    807607 Newbie
    If the only other problem is entities, then make the wrapper InputStream also quote entities. If the differences between XML and this format are a handful of simple little quirks like this, then scrubbing the input in a component separate from the parser is still, I think, the cleanest solution.
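A rough sketch of what scrubbing plus entity quoting could look like, done on a whole String for brevity rather than on a stream. The class and method names (BraceTagScrubber, toXml, parse) are hypothetical. It assumes the {{{tag}}} delimiters from the original post, so every literal '&', '<' or '>' in the input is data and can be escaped before the delimiters are converted, and it assumes the data itself never contains "{{{" or "}}}".

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical scrubber: escape the data first, then turn the
// triple-brace delimiters into real XML angle brackets.
final class BraceTagScrubber {
    static String toXml(String raw) {
        // '&' must be escaped first so the later escapes are not re-escaped.
        String escaped = raw.replace("&", "&amp;")
                            .replace("<", "&lt;")
                            .replace(">", "&gt;");
        return escaped.replace("{{{", "<").replace("}}}", ">");
    }

    // Scrubs the raw text and feeds the result to the unmodified JDK SAX parser.
    static void parse(String raw, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(toXml(raw))), handler);
    }
}
```

For the transaction example, toXml turns {{{description}}}Barnes & Noble{{{/description}}} into the well-formed element with &amp; in it, and the handler then receives the original text with a plain ampersand.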

    If you have something that's different from XML in a hundred different little ways, but looks vaguely like XML if you stand back thirty feet and squint at it, then ignore the surface similarities. It doesn't matter if it looks vaguely like XML if there are a thousand painful little differences. In that case don't even bother trying to subclass SAX; just write your own lexer and parser.
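For the write-your-own route, here is a minimal sketch of a lexer that reuses only the existing SAX event interfaces, as the original question proposed. BraceTagParser is a hypothetical name, the tag pattern is an assumption based on the {{{tag-name}}} format in the first post, and there is no nesting check or error handling.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical lexer: scans for {{{name}}} and {{{/name}}} and reports
// them, plus the raw text in between, as SAX events. Nothing is unescaped.
final class BraceTagParser {
    // Group 1 is "/" for a closing tag; group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("\\{\\{\\{(/?)([^{}]+)\\}\\}\\}");

    static void parse(String input, ContentHandler handler) throws SAXException {
        char[] chars = input.toCharArray();
        Matcher m = TAG.matcher(input);
        int pos = 0;
        handler.startDocument();
        while (m.find()) {
            emitText(chars, pos, m.start(), handler);
            String name = m.group(2).trim();
            if (m.group(1).isEmpty()) {
                handler.startElement("", name, name, new AttributesImpl());
            } else {
                handler.endElement("", name, name);
            }
            pos = m.end();
        }
        emitText(chars, pos, chars.length, handler);
        handler.endDocument();
    }

    private static void emitText(char[] chars, int from, int to, ContentHandler handler)
            throws SAXException {
        if (to > from) {
            handler.characters(chars, from, to - from);
        }
    }
}
```

Because the text between tags is passed to characters() untouched, an unescaped '&' such as the one in "Barnes & Noble" reaches the content handler as-is, and any existing ContentHandler or DefaultHandler implementation can be reused unchanged.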