4 Replies Latest reply: Jun 18, 2007 11:58 PM by DrClap

    Parsing HTML file into Document object?

    807605
      How do I parse an HTML document into a Document object?

      I have written some code that works with XML documents but not with HTML files, even with validation off. Here is the code:
      [...]
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance ();
            if (validate)
            {
               factory.setValidating(true);
               factory.setIgnoringElementContentWhitespace(true);
            }
            try
            {
               DocumentBuilder builder = factory.newDocumentBuilder ();
               if (validate) builder.setErrorHandler (new Reporter ());
               Document document = builder.parse (new File (filename));
               return document;
            }
      [...]
      When running with an HTML file I get the following error:
      [Fatal Error] sample1.html:70:4: The element type "br" must be terminated by the matching end-tag "</br>".
      The element type "br" must be terminated by the matching end-tag "</br>".


      Can anyone help? Do I need to use different classes for parsing HTML documents?
        • 1. Re: Parsing HTML file into Document object?
          807605
          You'll have to use something different. Take a look at this article:
          http://www.samspublishing.com/articles/article.asp?p=31059&rl=1
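
          To see why something different is needed, here is a minimal sketch using only the standard JAXP classes from the original post. It suggests that the XML parser rejects an unclosed <br> whether or not validation is on, because the input is not well-formed XML. The class name XmlStrictnessDemo and the helper parsesAsXml are made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original poster's code.
class XmlStrictnessDemo {

    // Returns true if the markup is well-formed XML, false if the parser rejects it.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and stays silent otherwise,
            // so nothing is printed to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Well-formedness errors (such as an unclosed <br>) end up here.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML-style line break: not well-formed XML.
        System.out.println(parsesAsXml("<p>line one<br>line two</p>"));
        // XHTML-style self-closing tag: well-formed XML.
        System.out.println(parsesAsXml("<p>line one<br/>line two</p>"));
    }
}
```

          Note that validation is never switched on here. The failure comes from well-formedness checking, which an XML parser cannot skip.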
          • 2. Re: Parsing HTML file into Document object?
            807605
            There are a lot of projects you can use to parse HTML into a DOM object; which one fits depends on the context.

            As an example, JTidy builds a DOM tree from HTML while making that HTML "tidy". It is useful if you only need the DOM structure for read-only purposes, or if you don't mind that the HTML you pretty-print afterwards will differ from the input because it has been standardized.

            Another useful library is Cobra. What is nice about it is that it does not introduce extra changes when you print the DOM after making your own modifications. It may still make some changes, though, such as pruning invalid markup like closing tags with no matching opening tag.

            For more information on the other alternatives, look at these URLs:

            http://www.cafeconleche.org/books/xmljava/chapters/ch09s05.html
            http://ccil.org/~cowan/XML/tagsoup/
            http://sourceforge.net/projects/mozillaparser
            http://jerichohtml.sourceforge.net/doc/index.html
            • 3. Re: Parsing HTML file into Document object?
              807605
              OK, I took a look at JTidy, which seems fairly popular for this sort of thing. However, I am totally at a loss as to how to use it in my code.

              I ran the build.bat file in the download but had the following error:
              Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/tools/ant/Main

              The page http://jtidy.sourceforge.net/howto.html talks about importing org.w3c.tidy.Tidy but this doesn't seem to work. It also mentions a jar file with the name jtidy-{version}.jar; the only jar file in the zip file is Tidy.jar, which seems like it is the right one - however I tried copying it to my working directory and I still can't run anything...

              Can anyone provide any help with this?! How do I get my code to recognise org.w3c.tidy... ?
              • 4. Re: Parsing HTML file into Document object?
                DrClap
                How do I get my code to recognise org.w3c.tidy... ?
                First you find what jar file it's in. Probably it's in that Tidy.jar file, so look in there to see if it is. (Hint: Jar files are structured exactly like Zip files.)
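
                If you would rather check from code than open the archive by hand, the lookup can be sketched with the standard java.util.jar classes. This is only an illustration: the class name JarInspector and the helper containsClass are invented here, and Tidy.jar is assumed to be in the current directory.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.JarFile;

// Hypothetical helper, named for this example only.
class JarInspector {

    // Returns true if the jar has an entry for the given fully qualified class name.
    static boolean containsClass(File jarFile, String className) {
        // Inside a jar, org.w3c.tidy.Tidy is stored as org/w3c/tidy/Tidy.class.
        String entryName = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarFile)) {
            return jar.getEntry(entryName) != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the jar path is the first argument, defaulting to Tidy.jar.
        File jar = new File(args.length > 0 ? args[0] : "Tidy.jar");
        System.out.println(containsClass(jar, "org.w3c.tidy.Tidy"));
    }
}
```

                The same listing is available from the command line with "jar tf Tidy.jar", which prints every entry in the archive.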

                When you find the right jar file, include it in your classpath.
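
                Copying the jar into the working directory is not enough on its own, because a jar is only searched when it is named on the classpath. A quick way to confirm the classpath is right is a small check like the following sketch. ClasspathCheck and isOnClasspath are names made up for this example, and the jar name in the comments is assumed from the earlier post.

```java
// Hypothetical check, named for this example only.
class ClasspathCheck {

    // Returns true if the named class can be loaded from the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            // "false" means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Compile, then run with the jar named explicitly, for example:
        //   Windows:    java -cp .;Tidy.jar ClasspathCheck
        //   Unix-like:  java -cp .:Tidy.jar ClasspathCheck
        String name = args.length > 0 ? args[0] : "org.w3c.tidy.Tidy";
        System.out.println(name + (isOnClasspath(name) ? " found" : " NOT found"));
    }
}
```

                The same -cp option has to be passed to javac when compiling code that imports org.w3c.tidy.Tidy, otherwise the import will not resolve.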