3 Replies. Latest reply: May 3, 2007 7:28 PM by 807606

    regex help: parsing href and .mp3

    807606
      Hi, I need help parsing a href links in HTML. For instance, I need to find <a href='link'>text</a> and then return the link and the text. I also need to find all mp3 links. I accomplished both in PHP, but porting it to Java is harder than I thought it would be. Here's my PHP code for it; can anyone help me?
      function get_links($url, $mp3search) {
         if ($mp3search==false)
            $preg = "/a[\s]+[^>]*?class=l[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/";
         else
            $preg = "/a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?(mp3))[\"\']+.*?>"."([^<]+|.*?)?<\/a>/i";
         preg_match_all(trim($preg),
                        file_get_contents($url), $out, PREG_PATTERN_ORDER);
         $keys = $out[1];
         if ($mp3search==false)
            $values = $out[2];
         else
            $values = $out[3];
         array_walk($values, 'remove_html');
         return (array_combine_emulated($keys, $values));
      }
      This would be used like so:

      If I want to grab all links from http://chris-malcolm.com, I would do
      get_links("http://chris-malcolm.com", false)
      and if I want all mp3s from chris-malcolm.com, I would do
      get_links("http://chris-malcolm.com", true)
      Of course, in Java I would just pass in a string to be searched, rather than a URL whose HTML source is fetched with file_get_contents().

      Message was edited by:
      cjm771
        • 1. Re: regex help: parsing href and .mp3
          jschellSomeoneStoleMyAlias
          Unless the XML/HTML is very strictly formatted (machine generated), you should use an XML/HTML parser rather than trying to build one piecemeal.
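
As a rough illustration of the parser route, here is a minimal sketch that collects href/text pairs with the HTML parser that ships in the JDK's Swing packages. The class name AnchorCollector and its collect() helper are made up for this example, and the choice of the Swing parser is an assumption; any other HTML parser could be used the same way.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: records each <a href=...> and the text inside it.
class AnchorCollector extends HTMLEditorKit.ParserCallback {
    private final Map<String, String> links = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String currentHref; // non-null only while inside an <a> that has an href

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            currentHref = (href == null) ? null : href.toString();
            text.setLength(0);
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Text inside nested tags such as <b> also arrives here.
        if (currentHref != null) {
            text.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.A && currentHref != null) {
            links.put(currentHref, text.toString().trim());
            currentHref = null;
        }
    }

    // Parses the given HTML source and returns href -> link text, in document order.
    static Map<String, String> collect(String html) throws IOException {
        AnchorCollector callback = new AnchorCollector();
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return callback.links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href='song.mp3'>My Song</a></body></html>";
        System.out.println(collect(html));
    }
}
```

Filtering for mp3 links then becomes an ordinary string test on the keys, for example key.toLowerCase().endsWith(".mp3").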
          • 2. Re: regex help: parsing href and .mp3
            807606
            Short answer: too many forward-slashes, not enough backslashes.

            PHP takes Perl-style regex literals, which use '/' as their (default) delimiter, and reproduces them inside PHP string literals. It also retains the Perl-style modifiers, meaning the 'i', 's', 'm', etc., following the closing delimiter. In Java, you drop the regex-specific delimiters and replace the modifiers with symbolic constants like Pattern.CASE_INSENSITIVE, which are passed as the second argument to the Pattern.compile() factory method. They can also be included in the regex itself in the form of inline modifiers, like (?i).
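
To make the two options concrete, here is a small sketch of the same case-insensitive match written both ways. The class name ModifierDemo, the helper endsWithMp3() and the sample pattern are invented for illustration.

```java
import java.util.regex.Pattern;

class ModifierDemo {
    // PHP equivalent: preg_match('/mp3$/i', $s)

    // The 'i' modifier passed as the second argument to Pattern.compile().
    static final Pattern WITH_FLAG = Pattern.compile("mp3$", Pattern.CASE_INSENSITIVE);

    // The same modifier written inline at the start of the regex.
    static final Pattern INLINE = Pattern.compile("(?i)mp3$");

    static boolean endsWithMp3(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(endsWithMp3(WITH_FLAG, "Song.MP3")); // true
        System.out.println(endsWithMp3(INLINE, "Song.MP3"));    // true
        System.out.println(endsWithMp3(INLINE, "song.ogg"));    // false
    }
}
```

Note that neither pattern has the leading and trailing '/' that the PHP version needs.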

            The other big difference is that PHP is much more lenient than Java when it comes to backslashes in string literals. In a Java string literal, any backslash that isn't part of a recognized escape sequence like \t, \\ or \" is flagged as an error. That means, in order to use a regex escape sequence like \w, you have to escape the backslash to get it through the string literal.
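
A quick way to see what the regex engine actually receives is to print the string. The sketch below uses made-up names (BackslashDemo, WORD, QUOTED) and is only meant to show the escaping.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // The compiler turns "\\w+" into the three characters \w+ before the
    // regex engine ever sees them. Writing "\w+" would not compile.
    static final String WORD = "\\w+";

    // A double quote takes one backslash, and only for the string literal's
    // sake; the regex engine receives a bare " character.
    static final String QUOTED = "\"[^\"]*\"";

    public static void main(String[] args) {
        System.out.println(WORD.length());                // 3
        System.out.println(QUOTED);                       // "[^"]*"
        System.out.println(Pattern.matches(WORD, "mp3")); // true
    }
}
```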

            So, for the most part, converting PHP regexes to Java means dropping the forward slashes (or whatever other regex delimiter you were using) and doubling all the backslashes. The exception is the double-quote character; it still has to be escaped, and you only use one backslash. Here's a first cut at translating your regexes.
            String urlRegex = "a\\s+[^>]*?class=l\\s+[^>]*?href\\s?=[\\s'\"]+(.*?)['\"]+.*?>[^<]*</a>";
            
            String mp3Regex = "(?i)a\\s+[^>]*?href\\s?=[\\s'\"]+(.*?(mp3))['\"]+.*?>[^<]*</a>";
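
For completeness, here is one way these strings might be used to get link/text pairs back, roughly what preg_match_all plus array_combine did on the PHP side. This is a sketch under assumptions: the class LinkFinder and the method mp3Links() are invented names, and the link text is wrapped in an extra capturing group that the regexes above do not have.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkFinder {
    // Same as the mp3 regex above, except that the link text is wrapped in a
    // capturing group (an addition for this sketch).
    // Group 1 = URL, group 2 = "mp3", group 3 = link text.
    static final Pattern MP3 = Pattern.compile(
        "(?i)a\\s+[^>]*?href\\s?=[\\s'\"]+(.*?(mp3))['\"]+.*?>([^<]*)</a>");

    // Returns URL -> link text for every match, in the order found.
    static Map<String, String> mp3Links(String html) {
        Map<String, String> links = new LinkedHashMap<>();
        Matcher m = MP3.matcher(html);
        while (m.find()) {
            links.put(m.group(1), m.group(3).trim());
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href='song.mp3'>My Song</a> <a href='page.html'>Page</a>";
        System.out.println(mp3Links(html)); // {song.mp3=My Song}
    }
}
```

One caveat: because (.*?) can run past the closing quote, a non-mp3 link that comes before an mp3 link on the same line gets swallowed into group 1. Replacing .*? with [^'\"]*? inside the URL group would be a tighter choice.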
            Acknowledgements to Jeffrey Friedl, with special thanks for adding a PHP chapter to the third edition of The Book. ^_^
            • 3. Re: regex help: parsing href and .mp3
              807606
              Thanks, works perfectly! Now all I need is to figure out how to append rows to JTables, heh!