1 Reply Latest reply: Mar 21, 2008 12:27 PM by 807591

    Java and XPath Question

    807591
      I am writing a program with Java and XPath, but it doesn't work like I want it to.
      public void run() {
           String url = "http://en.wikipedia.org/wiki/List_of_American_actresses";
           String regex = "http://en.wikipedia.org/wiki/.*";
           Scraper scraper = new WikiActorScraper(regex, xpath);
           if (scraper.checkURL(url)) {
                try {
                     String content = scraper.processSite(url);
                     DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
                     domFactory.setNamespaceAware(true); // never forget this!
                     DocumentBuilder builder = domFactory.newDocumentBuilder();
                     InputStream input = new ByteArrayInputStream(content.getBytes("utf-8"));
                     Document doc = builder.parse(input);

                     XPathFactory factory = XPathFactory.newInstance();
                     XPath xpath1 = factory.newXPath();
                     XPathExpression expr = xpath1.compile("/*");       // this one works
                     // XPathExpression expr = xpath1.compile("/html"); // this one finds nothing
                     Object result = expr.evaluate(doc, XPathConstants.NODESET);

                     NodeList nodes = (NodeList) result;
                     System.out.println(nodes.getLength());
                     for (int i = 0; i < nodes.getLength(); i++) {
                          System.out.println(nodes.item(i).getNodeName());
                     }
                } catch (SAXException e) {
                     e.printStackTrace();
                } catch (IOException e) {
                     e.printStackTrace();
                } catch (ParserConfigurationException e) {
                     e.printStackTrace();
                } catch (XPathExpressionException e) {
                     e.printStackTrace();
                }
           }
      }
      Now when I evaluate the first XPath expression "/*" I get the root node, which is "html", but the second expression "/html" returns nothing instead of returning the html node. Can somebody help me?

      cheers,
      liam21c
        • 1. Re: Java and XPath Question
          807591
          >
          Now when I evaluate the first xpath expression "/*" I get the root node, which is "html", but the second expression "/html" returns nothing instead of returning the html node. can somebody help me?
          >

          Actually, you answered the question in your code:
          domFactory.setNamespaceAware(true); // never forget this!
          And yet you did forget it, because you didn't create a namespace resolver:
          XPathFactory factory = XPathFactory.newInstance();
          XPath xpath1 = factory.newXPath();
          //   XPathExpression expr = xpath1.compile("/*");
          //   XPathExpression expr = xpath1.compile("/html");
          Object result = expr.evaluate(doc, XPathConstants.NODESET);
          But if you look at your example URL, the <html> element is in fact namespaced:
          <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
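          In XPath 1.0, an unprefixed name in an expression always means an element in *no* namespace, so "/html" can never match a namespaced &lt;html&gt;. If you want to keep the parser namespace-aware, you need to register a namespace resolver and use a prefix in the expression. Here's a minimal standalone sketch (not your Scraper code; the "xh" prefix is just a name I picked):

```java
import java.io.ByteArrayInputStream;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XhtmlXPathDemo {
    public static void main(String[] args) throws Exception {
        // A tiny stand-in for the Wikipedia page's namespaced root element.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head/><body/></html>";
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true); // never forget this!
        Document doc = f.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes("utf-8")));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Map the "xh" prefix to the XHTML namespace so expressions can use it.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "xh".equals(prefix) ? "http://www.w3.org/1999/xhtml"
                                           : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String namespaceURI) { return null; }
            public Iterator<String> getPrefixes(String namespaceURI) {
                return Collections.<String>emptyList().iterator();
            }
        });

        // Unprefixed "/html" means "html in no namespace" and matches nothing;
        // "/xh:html" matches the namespaced root element.
        NodeList plain = (NodeList) xpath.evaluate("/html", doc, XPathConstants.NODESET);
        NodeList prefixed = (NodeList) xpath.evaluate("/xh:html", doc, XPathConstants.NODESET);
        System.out.println(plain.getLength());     // 0
        System.out.println(prefixed.getLength());  // 1
    }
}
```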
          Probably your best solution, though, is to parse HTML documents with a non-namespace-aware parser, since most sites aren't anywhere near as strict about namespaces as Wikipedia. In fact, you'll probably run into problems parsing arbitrary HTML with an XML parser at all, because most sites are downright sloppy in their standards adherence.
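          For completeness, here's a minimal sketch of that non-namespace-aware route. With setNamespaceAware(false) (the factory default), the xmlns declaration is treated as an ordinary attribute, so plain "/html" matches:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class NoNamespaceDemo {
    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(false); // xmlns is now just another attribute
        Document doc = f.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes("utf-8")));

        // No namespace resolver needed: element names have no namespace now.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("/html", doc, XPathConstants.NODESET);
        System.out.println(nodes.getLength()); // 1 — plain "/html" matches
    }
}
```

          Note this only sidesteps the namespace problem; a real-world HTML page that isn't well-formed XML will still make the parse itself fail.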