8 Replies Latest reply: Mar 24, 2010 9:08 AM by 796283

RegEx To Parse Links From Text

800581 Newbie
I'm attempting to identify URLs in Strings to wrap them in HTML code so they're clickable when displayed in a JTextPane. [This post|http://forums.sun.com/thread.jspa?messageID=2818053#2818053] has been helpful, but it doesn't quite suit my requirements. The linked example only detects links that contain the "http://" text. I would also like to handle ones that do not have this. Both of the following examples should be detected:
www.java.com
http://www.java.com
I've tried running a second replacement after the initial one, just to deal with links that begin with "www." (and aren't preceded by a forward slash), but at the moment it isn't producing the output I want. SSCCE:
package tester2;

import java.awt.BorderLayout;
import java.awt.Color;
import java.awt.Desktop;
import java.io.IOException;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import javax.swing.JFrame;
import javax.swing.JOptionPane;
import javax.swing.JScrollPane;
import javax.swing.JTextPane;
import javax.swing.SwingUtilities;
import javax.swing.event.HyperlinkEvent;
import javax.swing.event.HyperlinkListener;
import javax.swing.event.HyperlinkEvent.EventType;

public class TextPaneDocumentTesting extends JFrame {
     /**
      * 
      */
     private static final long serialVersionUID = 1144986476433185349L;
     
     private JTextPane textPane;
     
     private final String NEW_LINE_STRING = "<br>";
     private final Pattern HTTP_PATTERN = Pattern.compile("(http://[^ ]+)"), WWW_PATTERN = Pattern.compile("([^/]www.[^ ]+)");
     
     public static void main(String[] args) {
          new TextPaneDocumentTesting();
     }
     
     private TextPaneDocumentTesting() {
          SwingUtilities.invokeLater(new Runnable() {
               public void run() {
                    setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
                    setLayout(new BorderLayout());

                    textPane = new JTextPane();
                    textPane.setContentType("text/html");
                    textPane.setEditable(false);

                    textPane.addHyperlinkListener(new HyperlinkListener() {     
                         /*
                          * (non-Javadoc)
                          * @see javax.swing.event.HyperlinkListener#hyperlinkUpdate(javax.swing.event.HyperlinkEvent)
                          */
                         public void hyperlinkUpdate(HyperlinkEvent e) {
                              handleHyperlinkUpdate(e);
                         }
                    });
                    
                    List<OneMessage> messages = getMessages();

                    StringBuilder toInsert = new StringBuilder();

                    for(int i=0; i<messages.size(); i++) {
                         toInsert.append(messages.get(i).getDisplayMessage());
                         
                         if(i < messages.size()-1) {
                              toInsert.append(NEW_LINE_STRING);
                         }
                    }

                    textPane.setText(toInsert.toString());
                    textPane.getCaret().setDot(0);
                    
                    JScrollPane textScrollPane = new JScrollPane();
                    textScrollPane.getViewport().add(textPane);
                    
                    add(textScrollPane, BorderLayout.CENTER);
                    setSize(400, 400);
                    setLocationByPlatform(true);
                    setVisible(true);
               }
          });
     }
     
     private List<OneMessage> getMessages() {
          List<OneMessage> messageList = new ArrayList<OneMessage>();
          
          messageList.add(new OneMessage("Foo", Color.black));
          messageList.add(new OneMessage("http://www.google.com", Color.red));
          messageList.add(new OneMessage("Bar", Color.blue));
          messageList.add(new OneMessage("www.java.com", Color.black));
          messageList.add(new OneMessage("dummy url: www.", Color.red));
          messageList.add(new OneMessage("http://www.java.com http://www.google.com", Color.gray));
          messageList.add(new OneMessage("The quick brown fox jumps over the lazy dog", Color.black));
          messageList.add(new OneMessage("Here's a link: http://www.google.com blah", Color.black));
          messageList.add(new OneMessage("and another: www.google.com blah", Color.black));

          return messageList;
     }
     
     private void handleHyperlinkUpdate(HyperlinkEvent e) {
          if(e.getEventType() == EventType.ACTIVATED) {
               URL clickedURL = e.getURL();
               
               if(clickedURL != null) {
                    try {
                         if(!Desktop.isDesktopSupported()) {
                              throw new UnsupportedOperationException("desktop isn't " +
                                        "supported - cannot open URL");
                         }

                         Desktop desktop = Desktop.getDesktop();

                         if(!Desktop.getDesktop().isSupported(Desktop.Action.BROWSE)) {
                              throw new UnsupportedOperationException("browse isn't " +
                                        "supported - cannot open URL");
                         }

                         try {
                              desktop.browse(clickedURL.toURI());
                         } catch(IOException ioe) {
                              ioe.printStackTrace();
                              JOptionPane.showMessageDialog(this, 
                                        ioe.getClass().getSimpleName()+": "+ioe.getMessage(), 
                                        ioe.getClass().getSimpleName(), 
                                        JOptionPane.ERROR_MESSAGE);
                         }
                    } catch(URISyntaxException urise) {
                         urise.printStackTrace();
                         JOptionPane.showMessageDialog(this, 
                                   urise.getClass().getSimpleName()+": "+urise.getMessage(), 
                                   urise.getClass().getSimpleName(), 
                                   JOptionPane.ERROR_MESSAGE);
                    }
               } else {
                    System.out.println("clickedURL is null");
               }
          }
     }
     
     private String getHyperlinkString(String input) {
          String afterHTTPReplacement = HTTP_PATTERN.matcher(input).replaceAll("<a href=\"$1\" target=_blank>$1</a>");
          String afterWWWReplacement = WWW_PATTERN.matcher(afterHTTPReplacement).replaceAll("<a href=\"http://$1\" target=_blank>http://$1</a>");
          
          return afterWWWReplacement;
     }
     
     private class OneMessage {
          private String messageText;
          private Color textColour;

          public OneMessage(String messageText, Color textColour) {               
               this.messageText = messageText;
               this.textColour = textColour;
          }
          
          public String getDisplayMessage() {
               return "<font color=\"rgb("+textColour.getRed()+", "+textColour.getGreen()+", "+textColour.getBlue()+")\">"+getHyperlinkString(messageText)+"</font>";
          }
     }
}
The problems with the current output:

The 4th message ("www.java.com") is ignored completely.
In the last message ("and another: www.google.com blah") the space between the colon and the URL is removed and there is an extra space inserted between the "http://" and "www.google.com".

To solve the first issue I tried changing WWW_PATTERN to "([^/]*www.[^ ]+)", to allow zero or more non-forward-slash characters before the "www", but this gave unexpected results.
The second problem is linked to the first: the pattern currently requires one character that isn't a forward slash before the "www", so for a link preceded by a space the matched group contains that leading space. I don't see a way to trim this leading space from the match.

I'd appreciate any pointers here, please.
  • 1. Re: RegEx To Parse Links From Text
    791266 Explorer
    Post a short example that only contains the regexp parts, and an example message. There's no reason to post a whole Swing app.
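A regex-only reduction of the original post might look like the sketch below. The two patterns and the replacement strings are copied from the SSCCE; the class name `LinkRegexRepro` and the method name `link` are made up for illustration.

```java
import java.util.regex.Pattern;

// Regex-only reduction of the original SSCCE: no Swing, just the two patterns.
class LinkRegexRepro {
    // Both patterns are copied unchanged from the original post.
    static final Pattern HTTP_PATTERN = Pattern.compile("(http://[^ ]+)");
    static final Pattern WWW_PATTERN = Pattern.compile("([^/]www.[^ ]+)");

    // Same two-step replacement as getHyperlinkString in the original post.
    static String link(String input) {
        String afterHttp = HTTP_PATTERN.matcher(input)
                .replaceAll("<a href=\"$1\" target=_blank>$1</a>");
        return WWW_PATTERN.matcher(afterHttp)
                .replaceAll("<a href=\"http://$1\" target=_blank>http://$1</a>");
    }

    public static void main(String[] args) {
        // Problem 1: nothing precedes "www", so [^/] has no character to consume.
        System.out.println(link("www.java.com"));
        // Problem 2: [^/] consumes the space, so the space ends up inside $1.
        System.out.println(link("and another: www.google.com blah"));
    }
}
```

The first call returns its input unchanged and the second produces a link whose text is "http:// www.google.com", which are the two symptoms described in the question. Both follow from `[^/]` always consuming one character; a zero-width lookbehind such as `(?<!/)` is one way to check the preceding character without capturing it.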
  • 2. Re: RegEx To Parse Links From Text
    YoungWinston Expert
    amp88 wrote:
    I'm attempting to identify URLs in Strings to wrap them in HTML code so they're clickable when displayed in a JTextPane. [This post|http://forums.sun.com/thread.jspa?messageID=2818053#2818053] has been helpful, but it doesn't quite suit my requirements. The linked example only detects links that contain the "http://" text. I would also like to handle ones that do not have this. Both of the following examples should be detected:
    www.java.com
    http://www.java.com
    Well, strictly speaking, the first is a domain name, not a URL, so I think, as kajbj said, you need to define your terms of reference better.
    I Googled "regex for URLs" and got a pile of hits, including some incredibly arcane stuff.

    One approach might be to split up the components of a URL (protocol, domain name, IP address, subdirectories, target name and filters (I've probably missed some out)), solve them separately and then create a regex that puts them all together the way you want. You'll still have some issues you need to solve - for example, I have no idea if URLs have a standard for IPv6 address encoding - but it may provide a 95% solution.

    Winston
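The component-by-component approach described above could be sketched roughly as follows. Every piece here is an assumption chosen for illustration (the names `SCHEME`, `HOST`, `PORT`, `PATH` and the class `UrlPatternParts` are hypothetical), and each piece is deliberately simplistic: IP addresses, IPv6 literals and single-label hosts such as localhost are not handled.

```java
import java.util.regex.Pattern;

// Builds one URL-ish pattern out of separately defined parts.
class UrlPatternParts {
    // Optional http:// or https:// prefix.
    static final String SCHEME = "(?:https?://)?";
    // One or more dot-terminated labels followed by an alphabetic final label.
    static final String HOST = "(?:[A-Za-z0-9-]+\\.)+[A-Za-z]{2,}";
    // Optional :port.
    static final String PORT = "(?::\\d+)?";
    // Optional path/query: a slash followed by anything except whitespace, <, > and ".
    static final String PATH = "(?:/[^\\s<>\"]*)?";

    static final Pattern URL = Pattern.compile(SCHEME + HOST + PORT + PATH);

    // True only if the whole string matches the combined pattern.
    static boolean looksLikeUrl(String s) {
        return URL.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "www.java.com", "http://www.java.com",
            "https://www.google.com/search?q=stuff", "Foo", "www."
        };
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeUrl(s));
        }
    }
}
```

Keeping the parts separate makes it easier to tighten one of them (for example the host rule) without touching the rest.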
  • 3. Re: RegEx To Parse Links From Text
    796283 Newbie
    Here's a regex that worked when I tried it. Your replacement formatting didn't work for me, however (Java's tutorial uses a Matcher rather than pattern.match):
    //matches http:// urls, www. urls, and should match https:// urls
    private final Pattern HTTP_PATTERN = Pattern.compile("(http[s]?://)?(www\\.[\\S]+)");
    Where you were using your regex to build your html string, I added:
              Matcher matcher = HTTP_PATTERN.matcher(input);
              boolean found = false;
              System.out.println("===============================");
              while(matcher.find()) {
                   System.out.println("Found match: "+matcher.group());
                   found = true;
              }
              if(!found) {
                   System.out.println("Did not find a match for: "+input);
              }
    And this is what I got:
    ===============================
    Did not find a match for: Foo
    ===============================
    Found match: http://www.google.com
    ===============================
    Did not find a match for: Bar
    ===============================
    Found match: www.java.com
    ===============================
    Did not find a match for: dummy url: www.
    ===============================
    Found match: http://www.java.com
    Found match: http://www.google.com
    ===============================
    Did not find a match for: The quick brown fox jumps over the lazy dog
    ===============================
    Found match: http://www.google.com
    ===============================
    Found match: www.google.com
    Edit:
    matcher.start() and matcher.end() should help you with your string replacement.
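As a rough sketch of the start()/end() suggestion, the matches can be copied into a StringBuilder along with the untouched text between them. The pattern is the one from this reply; the class name `LinkWrapper` and method name `wrapLinks` are hypothetical, and the sketch assumes the input needs no HTML escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Wraps each match in an anchor tag using Matcher.start() and Matcher.end().
class LinkWrapper {
    // Pattern from the reply above: optional http(s):// followed by www.something
    static final Pattern LINK = Pattern.compile("(http[s]?://)?(www\\.[\\S]+)");

    static String wrapLinks(String input) {
        Matcher m = LINK.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0; // index just past the previous match
        while (m.find()) {
            // Copy the plain text between the previous match and this one.
            out.append(input, last, m.start());
            String shown = m.group();
            // Group 1 is null when the scheme is missing, so add one for the href.
            String href = (m.group(1) == null) ? "http://" + shown : shown;
            out.append("<a href=\"").append(href).append("\">")
               .append(shown).append("</a>");
            last = m.end();
        }
        // Copy whatever follows the final match.
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrapLinks("and another: www.google.com blah"));
        System.out.println(wrapLinks("http://www.java.com http://www.google.com"));
    }
}
```

Because nothing outside the match is consumed, the space before a "www." link stays in the surrounding text instead of ending up inside the anchor.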

    Edited by: bogdana on Mar 23, 2010 9:00 AM
  • 4. Re: RegEx To Parse Links From Text
    800581 Newbie
    Thank you, much appreciated.
  • 5. Re: RegEx To Parse Links From Text
    800581 Newbie
    It turns out I jumped the gun a bit. Your solution works for all of the examples I gave above (plus HTTPS as you suggested). However, it doesn't work when the URL doesn't contain www. For example: http://google.com. Is it possible to identify the http part or the www part (so that all of the following examples would be detected):
    http://www.google.com
    www.java.com
    http://www.java.com http://www.google.com
    www.google.com blah
    https link: https://www.paypal.com
    http://google.com
  • 6. Re: RegEx To Parse Links From Text
    791266 Explorer
    Try this pattern.
    "(http[s]?://)?([\\w]+\\.[\\S]+)"
    Note that it might give false positives, just as the original. E.g. www..some...thing is a match.


    Kaj
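A quick way to see both the hits and the false positives is to run the pattern over a few strings and list every match. This is only a sketch: the pattern is the one given in this reply, while the class name `BroaderPatternCheck`, the method `findAll` and the "e.g. this" sample are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Lists every match of the broader pattern in a given string.
class BroaderPatternCheck {
    // Pattern from the reply above: optional scheme, then word chars, a dot, and non-space chars.
    static final Pattern LINK = Pattern.compile("(http[s]?://)?([\\w]+\\.[\\S]+)");

    static List<String> findAll(String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = LINK.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://google.com",
            "https link: https://www.paypal.com",
            "www.google.com blah",
            "www..some...thing",   // false positive mentioned in the reply
            "e.g. this"            // ordinary prose containing a dot also matches
        };
        for (String s : samples) {
            System.out.println(s + " -> " + findAll(s));
        }
    }
}
```

Anything of the form word-dot-nonspace matches, so abbreviations and class names in running text will be linked as well.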
  • 7. Re: RegEx To Parse Links From Text
    800581 Newbie
    kajbj wrote:
    Try this pattern.
    Thanks, works :)

    kajbj wrote:
    Note that it might give false positives, just as the original. E.g. www..some...thing is a match.
    I'm willing to live with a few false positives as long as it captures legal ones.
  • 8. Re: RegEx To Parse Links From Text
    796283 Newbie
    "(http[s]?://)?(([\\w]+\\.)+[\\w]+)"
    I tried this regex, but the results weren't that much better.

    http://www.google.com - found
    www.google.com - found
    http://google.com - found
    https://www.google.com/search?q=stuff - found
    http://www....something...com - not found
    java.util.ArrayList - found
    java..util.arraylist - found: util.arraylist
    http://www..something.com - found: something.com
    http://www.something..com - found: http://www.something
    http://www.something.com. - found: http://www.something.com (note lack of period at end)

    You could try this; it is only slightly more restrictive and still gives false positives (in a different form).
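The results listed above can be reproduced with a small harness like the sketch below. `HOST_ONLY` is the pattern from this reply; `WITH_PATH` is an assumed extension that is not from the thread, added only to show that the original stops matching at the end of the host name. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Reports the first match of a pattern in a string, or null if there is none.
class StricterPatternCheck {
    // Pattern from the reply above: dot-separated runs of word characters.
    static final Pattern HOST_ONLY =
            Pattern.compile("(http[s]?://)?(([\\w]+\\.)+[\\w]+)");
    // Assumption, not from the thread: also accept an optional /path?query tail.
    static final Pattern WITH_PATH =
            Pattern.compile("(http[s]?://)?(([\\w]+\\.)+[\\w]+)(/\\S*)?");

    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "https://www.google.com/search?q=stuff",
            "http://www....something...com",
            "java..util.arraylist",
            "http://www.something.com."
        };
        for (String s : samples) {
            System.out.println(s + " -> " + firstMatch(HOST_ONLY, s)
                    + " | with path: " + firstMatch(WITH_PATH, s));
        }
    }
}
```

With `HOST_ONLY`, the Google search URL is found but the match ends at "https://www.google.com", so the path and query would fall outside any generated link; the `WITH_PATH` variant keeps them, at the cost of also swallowing trailing punctuation that follows a path.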