14 Replies Latest reply: Dec 23, 2009 7:39 AM by 807580

    html paser of regular expression

    800344
      Dear all,

      I know some of you will think my problem can be solved by an open-source html parser, but I tested the parsers listed at http://java-source.net/open-source/html-parsers and failed to find one that meets my requirement, as I explain below.

      I would like to parse an HTML file and fetch the hyperlinks from it.

      I wrote the following regular expression and it works in most cases:
      .*(src|href|url|action)\s*=\s*["|']?(.*?)["|'|\s?|>].*
      However, I have run into two problems so far:

      1. For "<a href="directory.html">Directory</a> | <a href="a-z.html">A - Z</a>", I expected to fetch "directory.html" and "a-z.html" but I got only the last one.

      2. I expected to exclude "http://www.javaeye.com/upload.jpg" in "<img alt="subwayline13" class="logo" src="http://www.javaeye.com/upload.jpg" title="subject" />". I have not yet found a solution for this.

      I would therefore appreciate some new advice.

      Merry Christmas and Happy New Year!

      Pengyou
        • 1. Re: html paser of regular expression
          800344
          Sorry, the title should be " html parser or regular expression".
          • 2. Re: html paser of regular expression
            DrClap
            So... you can't get an HTML parser to work, and you can't get regular expressions to work. If it were me, I would just use an HTML parser. I would be able to find all of the links using that. But if you're asking what you should use, given the results so far, I would recommend getting a new programmer.
            • 3. Re: html paser of regular expression
              807580
              pengyou wrote:
              I know some of you will think my problem can be solved by an open-source html parser, but I tested the parsers listed at http://java-source.net/open-source/html-parsers and failed to find one that meets my requirement, as I explain below.
              I don't see your requirement. I see a regular expression that you say doesn't work. What is your requirement?
              I would like to parse an HTML file and fetch the hyperlinks from it.
              My recommendation, then, is to use an HTML parser. That is the purpose for which they are designed; namely, parsing HTML.

              ~
              • 4. Re: html paser of regular expression
                796440
                pengyou wrote:
                I would like to parse an HTML file and fetch the hyperlinks from it.
                I'm sure that any HTML parser can do that.
                However, I have run into two problems so far:
                1. You're not using an HTML parser to parse HTML.

                2. You're using regex to parse HTML.
                I would therefore appreciate some new advice.
                I'm going to keep giving you the same advice: Use an HTML parser for parsing HTML, not regex.
                • 5. Re: html paser of regular expression
                  jschellSomeoneStoleMyAlias
                  pengyou wrote:
                  Dear all,

                  I know some of you will think my problem can be solved by an open-source html parser, but I tested the parsers listed at http://java-source.net/open-source/html-parsers and failed to find one that meets my requirement, as I explain below.
                  Then you did something wrong when you were using the parser.
                  I would like to parse an HTML file and fetch the hyperlinks from it.

                  I wrote the following regular expression and it works in most cases:
                  .*(src|href|url|action)\s*=\s*["|']?(.*?)["|'|\s?|>].*
                  However, I have run into two problems so far:

                  1. For "<a href="directory.html">Directory</a> | <a href="a-z.html">A - Z</a>", I expected to fetch "directory.html" and "a-z.html" but I got only the last one.

                  2. I expected to exclude "http://www.javaeye.com/upload.jpg" in "<img alt="subwayline13" class="logo" src="http://www.javaeye.com/upload.jpg" title="subject" />". I have not yet found a solution for this.

                  I would therefore appreciate some new advice.
                  Same advice as before.
                  1. Use an existing html parser correctly.
                  2. Write your own HTML parser. An actual parser. A parser would be part of your solution, not the entire solution.

                  And more advice: do not attempt to use regexes to parse HTML, or XML for that matter. The reason is that by the time you get it right, if ever, you will have built a parser. So start with one right away.

                  I suspect that your actual problem is that you don't know what a parser is and what it should do. So you think that a "parser" should give you the result you want rather than giving you tokens. A parser parses a source based on a grammar and produces tokens. A token is not an image file until you further interpret a particular token that way.

                  Finally note that in the above I said you could build your own parser if you wanted. But then you must in fact build a parser. If you do it correctly then you are going to end up with something that is functionally equivalent to one of the existing parsers. If you do it wrong then it won't.
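
To make the "tokens first, interpretation second" point concrete, here is a minimal sketch. It uses the HTML parser bundled with the JDK (`javax.swing.text.html.parser.ParserDelegator`), swapped in only because it needs no extra library; it is not one of the parsers discussed in this thread. The names `AnchorLinkCollector` and `anchorHrefs` are hypothetical. The parser reports each start tag through a callback, and the callback is where the interpretation happens: only the `href` of an `<a>` tag is treated as a link, so the `src` of an `<img>` is never collected.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: the parser hands over tags, the callback interprets them.
final class AnchorLinkCollector {

    /** Returns the href value of every <a> start tag, in the order reported. */
    static List<String> anchorHrefs(String html) {
        final List<String> links = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Interpretation step: only anchors count as hyperlinks here.
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
            // <img> is reported through handleSimpleTag, which is left at its
            // default no-op implementation, so image sources are ignored.
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Reading from a String should not fail; rethrow unchecked if it does.
            throw new IllegalStateException("could not read HTML input", e);
        }
        return links;
    }
}
```

Calling `AnchorLinkCollector.anchorHrefs(...)` on the two sample fragments from the first post should yield both anchor targets and skip the image URL. Which tags and attributes count as "links" is a decision made in the callback, not by the parser.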
                  • 6. Re: html paser of regular expression
                    800344
                    Thanks to all of you.

                    I have only limited experience with XML parsing, using JDOM.

                    Now I need to fetch some links from an HTML page and reject others. For this purpose, I saw nothing wrong with using a regular expression. However, since all of you say an HTML parser is better, I should invest more time in that.

                    Pengyou
                    • 7. Re: html paser of regular expression
                      807580
                      pengyou wrote:
                      Now I need to fetch some links from an HTML page and reject others. For this purpose, I saw nothing wrong with using a regular expression.
                      It's not a bad idea at the outset. When you run into all the subtle problems associated with it, however, it should be clear why "using regular expressions" is not a simple or holistic solution to your problem. It's also important to realize that the bulk of your problem (i.e., parsing HTML) has already been solved. The remaining portion (i.e., selecting certain hyperlinks) is a simpler problem on which you can focus.
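
One of those subtle problems can be reproduced directly. The sketch below assumes the pattern from the first post is applied with `Matcher.find()`. Because the pattern starts with a greedy `.*`, a single match swallows the whole line and backtracks only as far as the last attribute, so only one link per line is ever reported. The class `GreedyPrefixDemo`, the helper `findAll`, and the trimmed pattern `UNANCHORED` are illustrative names and assumptions, not code from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyPrefixDemo {
    // The pattern from the first post, written as a Java string literal.
    static final String ORIGINAL =
            ".*(src|href|url|action)\\s*=\\s*[\"|']?(.*?)[\"|'|\\s?|>].*";

    // Illustrative variant (an assumption, not from the thread): the leading
    // and trailing ".*" are dropped and the character classes are tidied.
    static final String UNANCHORED =
            "(src|href|url|action)\\s*=\\s*[\"']?(.*?)[\"'\\s>]";

    /** Collects group 2 of every match that find() reports. */
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        while (matcher.find()) {
            hits.add(matcher.group(2).trim());
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<a href=\"directory.html\">Directory</a> | <a href=\"a-z.html\">A - Z</a>";
        System.out.println(findAll(ORIGINAL, html));   // [a-z.html]
        System.out.println(findAll(UNANCHORED, html)); // [directory.html, a-z.html]
    }
}
```

The trimmed pattern only removes this one symptom. It still matches `src` on an `<img>` tag and any text that merely looks like an attribute, which is the kind of case a parser handles for you.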
                      However, since all of you say an HTML parser is better, I should invest more time in that.
                      That sounds like a good plan.

                      ~
                      • 8. Re: html paser of regular expression
                        EJP
                        I know some of you will think my problem can be solved by an open-source html parser
                        Not some of us. All of us. BTW the one to use is NekoHTML, as under the hood it uses the same Apache Xerces XML parser we already know and love in the JDK. It is fabulous.
                        I wrote the following regular expression and it works in most cases:
                        .*(src|href|url|action)\s*=\s*["|']?(.*?)["|'|\s?|>].*
                        The following XPath expression will work in all cases:
                        "@src|@href|@url|@action"
                        XML DOM plus XPath is the correct technology to use here.
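
As a rough sketch of the DOM plus XPath approach, the example below uses only the JDK's own XML parser and `javax.xml.xpath`, so it assumes the input is already well-formed XHTML. For real-world HTML the DOM would have to come from an error-correcting parser such as NekoHTML, which is not shown here. The `//` prefix on each attribute test is an assumption added in this sketch so that the expression selects matching attributes anywhere in the document rather than only on the context node. `XPathLinkExtractor` and `linkAttributes` are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates an attribute-selecting XPath over a DOM.
final class XPathLinkExtractor {

    // Assumption: "//" is prepended so attributes are found at any depth.
    static final String LINK_ATTRIBUTES = "//@src | //@href | //@url | //@action";

    /** Returns the values of all src, href, url and action attributes. */
    static List<String> linkAttributes(String xhtml) {
        try {
            // Plain JDK XML parser: this only works for well-formed input.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(LINK_ATTRIBUTES, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            // Parser configuration, SAX, I/O and XPath errors all end up here.
            throw new IllegalArgumentException("could not extract link attributes", e);
        }
    }
}
```

To keep anchors and drop images, the expression can be narrowed, for example to `//a/@href`; element-name case then has to match whatever the DOM-producing parser emits.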
                        • 9. Re: html paser of regular expression
                          796440
                          ejp wrote:
                          XML DOM plus XPath is the correct technology to use here.
                          Doesn't that assume XHTML? Or does Neko maybe do some magic transmogrification before delegating to the underlying XML utilities?
                          • 10. Re: html paser of regular expression
                            EJP
                            It puts the Apache parser into error-correcting mode.
                            • 11. Re: html paser of regular expression
                              jschellSomeoneStoleMyAlias
                              ejp wrote:
                              I know some of you will think my problem can be solved by an open-source html parser
                              Not some of us. All of us. BTW the one to use is NekoHTML, as under the hood it uses the same Apache Xerces XML parser we already know and love in the JDK. It is fabulous.
                              Interesting.

                              The license looks good for commercial applications.

                              Are there other specific observations about it that make it good? Or better than others?
                              • 12. Re: html paser of regular expression
                                EJP
                                IMO the architecture makes it better than others. It is essentially the Apache parser with a line of code that turns on its error-correcting mode, so it is practically all very mature and well-tested code.

                                I also tried JTidy but it's only distributed as a 1.6 binary which I can't use just at the moment. I've been using NekoHTML for several months on all kinds of HTML with zero problems.
                                • 13. Re: html paser of regular expression
                                  800344
                                  I tested my regular-expression-based solution against the Cobra HTML parser. It is now evident that my regular-expression-based solution performs better than an HTML parser for my specific purpose.
                                       // Regex-based version: collects group 2 (the attribute value) of every match.
                                       // PATTERN_FOR_LINK is defined elsewhere (not shown).
                                       public static List<String> parseHtml(String inputHtml) {
                                           List<String> links = new ArrayList<String>();
                                           Pattern pattern = Pattern.compile(PATTERN_FOR_LINK,
                                                   Pattern.CASE_INSENSITIVE);
                                           Matcher matcher = pattern.matcher(inputHtml);
                                           while (matcher.find()) {
                                               links.add(matcher.group(2).trim());
                                           }
                                           return links;
                                       }

                                       // Parser-based version: builds a DOM with Cobra and reads its link collection.
                                       public static List<String> parseHtml2(String inputHtml) {
                                           List<String> links = new ArrayList<String>();
                                           InputSource inputSource = new InputSource(new StringReader(inputHtml));
                                           UserAgentContext uacontext = new SimpleUserAgentContext();
                                           DocumentBuilder builder = new DocumentBuilderImpl(uacontext);
                                           HTMLCollection htmlCollection;
                                           try {
                                               HTMLDocument document = (HTMLDocumentImpl) builder.parse(inputSource);
                                               htmlCollection = document.getLinks();
                                           } catch (SAXException e) {
                                               log.info("error - SAXException caught at parseHtml2");
                                               return links; // parsing failed: return the empty list instead of hitting a null collection
                                           } catch (IOException e) {
                                               log.info("error - IOException caught at parseHtml2");
                                               return links; // same as above
                                           }

                                           for (int i = 0; i < htmlCollection.getLength(); i++) {
                                               Node node = (Node) htmlCollection.item(i);
                                               links.add(node.toString());
                                           }

                                           return links;
                                       }
                                  • 14. Re: html paser of regular expression
                                    807580
                                    pengyou wrote:
                                    I tested my regular-expression-based solution against the Cobra HTML parser. It is now evident that my regular-expression-based solution performs better than an HTML parser for my specific purpose.
                                    So you've solved your problem, then?

                                    Continue discussion here, in duplicate thread: [http://forums.sun.com/thread.jspa?threadID=5421194&tstart=0]

                                    ~