Java EE (Java Enterprise Edition) General Discussion

java.net.MalformedURLException: no protocol

843834Feb 19 2008 — edited Feb 21 2008
Hi,

I'm trying to parse an HTML web page into a W3C DOM Document, but I'm getting the following exception:
java.net.MalformedURLException: no protocol

The code is:
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;

.
.
.

        URL u; 
        InputStream is = null; 
        DataInputStream dis; 
        String s;
        StringBuffer xmlFeed = new StringBuffer();

        try {
            u = new URL("http://www.google.com"); 
            is = u.openStream(); 
            dis = new DataInputStream(new BufferedInputStream(is)); 
            while ((s = dis.readLine()) != null) {
                xmlFeed.append(s);
            }
        } catch (Exception ex) {
            System.out.println("no good");
        }

// So far so good.....

        try {
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            docBuilderFactory.setIgnoringComments(true);
            docBuilderFactory.setValidating(false);
            DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
            Document doc = docBuilder.parse(xmlFeed.toString());  // The exception is caught here.....
        } catch (Exception ex) {
            System.out.println(ex.getMessage());
        }
Can anyone offer some assistance?
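
For context on the error itself: DocumentBuilder.parse(String) treats its argument as a URI that points at the document, not as the markup to parse, so handing it the downloaded text makes the parser try to open that text as a URL, which fails with "no protocol". Below is a minimal sketch of parsing markup that is already in memory by wrapping it in an InputSource. The class name ParseFromString and the sample XML are made up for illustration, and the input still has to be well-formed XML for this to succeed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original post.
public class ParseFromString {

    // Parses XML held in a String. The InputSource/StringReader wrapper tells
    // the parser that this is document content rather than a URI to fetch.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Sample input chosen for illustration only.
        Document doc = parse("<feed><title>hello</title></feed>");
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "feed"
    }
}
```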

Comments

843834
OK, I figured out that I needed to parse the DataInputStream, so:
        try {
            u = new URL("http://www.google.com"); 
            is = u.openStream(); 
            dis = new DataInputStream(new BufferedInputStream(is)); 
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            docBuilderFactory.setIgnoringComments(true);
            docBuilderFactory.setValidating(false);
            DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
            Document doc = docBuilder.parse(dis);  // The exception is caught here.....
        } catch (Exception ex) {
            System.out.println(ex.getMessage());
        }
But now I get this exception:
class org.xml.sax.SAXParseException
The entity name must immediately follow the '&' in the entity reference.

Any idea?
843834
Or in other words...

What is the best way to parse a URL and ignore all SAXParseExceptions?
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse("http://www.gogle.com");
Thanks.
DrClap
n_k wrote:
What is the best way to parse a URL and ignore all SAXParseExceptions?
The best way is to not use an XML parser. XML parsers are required to throw an exception and stop parsing as soon as they encounter malformed XML.

And since (according to your example) you're trying to parse HTML, you would be better off using an HTML parser that produces a DOM.
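
To make the point about strictness concrete, here is a small sketch showing the standard JAXP parser rejecting a bare '&' (common in HTML text and query strings) and accepting the escaped form. The class name StrictXmlDemo and the sample strings are illustrative assumptions, not taken from the thread.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not part of the original thread.
public class StrictXmlDemo {

    // Returns true if the markup parses as XML, false if the parser
    // stops with a SAXParseException (that is, the input is malformed).
    static boolean isWellFormed(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // A bare '&' is not a valid entity reference, so parsing fails.
        System.out.println(isWellFormed("<p>Tom & Jerry</p>"));     // false
        // The escaped form is well-formed XML.
        System.out.println(isWellFormed("<p>Tom &amp; Jerry</p>")); // true
    }
}
```

There is no setting that makes the parser skip such errors and carry on, which is why catching and ignoring the exception does not yield a usable Document.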
843834
Thanks.

I used HTML Parser (http://htmlparser.sourceforge.net/) and it works fine.
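
The thread does not show the code used with that library, so the sketch below does not attempt to reproduce its API. As a stand-in, it uses the callback-based HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator) to illustrate the general idea of a lenient parser: it accepts markup with a bare '&' and an unclosed tag instead of throwing. Note that this JDK parser reports events through callbacks and does not build a W3C DOM. The class name LenientHtmlDemo and the sample markup are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class; a stand-in for the third-party parser mentioned above.
public class LenientHtmlDemo {

    // Collects the name of every start tag the parser reports.
    static List<String> startTags(String html) throws IOException {
        final List<String> tags = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                tags.add(tag.toString());
            }
        };
        // The third argument tells the parser to ignore charset declarations in the markup.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return tags;
    }

    public static void main(String[] args) throws IOException {
        // Sample markup with a bare '&' and an unclosed <p>, which an XML parser would reject.
        String html = "<html><body><p>Tom & Jerry <a href=\"x?a=1&b=2\">link</a></body></html>";
        System.out.println(startTags(html));
    }
}
```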

Locked on Mar 20 2008
Added on Feb 19 2008