6 Replies Latest reply: Jan 1, 2007 8:34 AM by camickr

Read in HTML file, now need to get rid of the HTML part

807607 Newbie
I'm able to connect to a website and read in the HTML fine. I now need to be able to get rid of the HTML coding, which will leave me with only the text.

So if I was to read in this website:

http://news.bbc.co.uk/1/hi/world/africa/6220797.stm

Then all I want is a string that holds the story text.

What is the best way of approaching this? Are there any built-in classes that will help me get rid of the HTML coding?

Thanks
  • 1. Re: Read in HTML file, now need to get rid of the HTML part
    camickr Expert
    I now need to be able to get rid of the HTML coding, which will leave me with only the text.
    http://forum.java.sun.com/thread.jspa?forumID=57&threadID=637059
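    The code in the linked thread is not reproduced here. As a rough sketch of the general idea, assuming the Swing HTML parser is good enough for the job: read the HTML into an HTMLDocument and then ask the document for its text. The class name HtmlToText and the method extractText() are made up for illustration and are not taken from the linked thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, named for illustration only.
class HtmlToText
{
    // Parses the HTML supplied by the reader and returns the text
    // held in the resulting document, without the markup.
    static String extractText(Reader reader)
        throws IOException, BadLocationException
    {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();

        // Ignore any charset directive in the page so that it
        // does not interrupt parsing.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(reader, doc, 0);

        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception
    {
        // Small inline sample; a real run would pass a reader on the page.
        String html = "<html><body><h1>Headline</h1>"
            + "<p>Story <b>text</b> here.</p></body></html>";
        System.out.println(extractText(new StringReader(html)));
    }
}
```

    This returns all the text in the page (menus, captions and so on), not only the story, so further filtering would still be needed.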
  • 2. Re: Read in HTML file, now need to get rid of the HTML part
    807607 Newbie
    You see this line...
    Reader rd = getReader(args[0]);

    What exactly is in args[0]?

    Is it a string that represents the whole HTML file?
  • 3. Re: Read in HTML file, now need to get rid of the HTML part
    796365 Newbie
    See this tutorial for the answer:
    http://java.sun.com/docs/books/tutorial/essential/environment/cmdLineArgs.html
  • 4. Re: Read in HTML file, now need to get rid of the HTML part
    807607 Newbie
    You see this line...
    Reader rd = getReader(args[0]);
    What exactly is in args[0]?
    Is it a string that represents the whole HTML file?
    No, it is the URL (actually URI) that you want to read the HTML from - see the getReader() method in camickr's code.
  • 5. Re: Read in HTML file, now need to get rid of the HTML part
    807607 Newbie
    I can't get the code to compile. It says it cannot find symbol method getReader(java.lang.String). Has it been taken out of the 1.5 API?

    How can I adapt the code so it works with 1.5?
  • 6. Re: Read in HTML file, now need to get rid of the HTML part
    camickr Expert
    Keep your postings in the original thread. There is no need to post in the other thread and clutter it as well.
    it says it cannot find symbol method getReader(java.lang.String),
    Did you copy the entire code? The getReader() method was included in the example.
    I now need to be able to get rid of the HTML coding, which will leave me with only the text.
    Your question was how to get all the text.
    is there a way of having more control over what it takes out, because I only want the main text article?
    Now your question is how to get the "main text article". Well, the question is how do you define the "main text article"?

    Is this some procedure you want to automate? If so, how can you identify the start and end of the text in the HTML? Here is a similar example that shows you how to iterate through the Elements in the Document. Basically, it displays all the values found in a table in CSV format. Your problem is how to determine when you've found the "main text article" element. Note the code includes the getReader() method as well.
    import java.io.*;
    import java.net.*;
    import java.util.*;
    import javax.swing.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    
    class GetCSV
    {
         public static void main(String[] args)
              throws Exception
         {
              // Create a reader on the HTML content

              Reader reader = getReader( args[0] );

              // Parse the HTML

              EditorKit kit = new HTMLEditorKit();
              HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
              doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
              kit.read(reader, doc, 0);
    
              // Iterate through the elements of the HTML document.
    
              int columnCount = 0;
              ElementIterator it = new ElementIterator(doc);
              Element element = null;
    
              while ( (element = it.next()) != null )
              {
                   String elementName = element.getName();
    
                   if ("table".equals(elementName))
                   {
                        System.out.println("\nNew Table");
                        columnCount = 0;
                   }
                   else if ("tr".equals(elementName))
                   {
                        if (columnCount > 0) System.out.println();
                        columnCount = 0;
                   }
                   else if ("td".equals(elementName))
                   {
                        if (columnCount > 0) System.out.print(",");
    
                        int start = element.getStartOffset();
                        int end = element.getEndOffset();
                        System.out.print( doc.getText(start, end - start - 1) );
                        columnCount++;
                   }
              }
         }
    
         // Returns a reader on the HTML data. If 'uri' begins
         // with "http:" or "https:", it's treated as a URL; otherwise,
         // it's assumed to be a local filename.
         static Reader getReader(String uri)
              throws IOException
         {
              // Retrieve from Internet.
              if (uri.startsWith("http:") || uri.startsWith("https:"))
              {
                   URLConnection conn = new URL(uri).openConnection();
                   return new InputStreamReader(conn.getInputStream());
              }
              // Retrieve from file.
              else
              {
                   return new FileReader(uri);
              }
         }
    }
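    The same iteration could be pointed at paragraph elements instead of table cells. The sketch below assumes, purely for illustration, that the article text sits in ordinary <p> elements; whether that holds for a given page has to be checked against its HTML. The class name GetParagraphs and the method paragraphs() are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical variation on GetCSV, named for illustration only.
class GetParagraphs
{
    // Returns the trimmed text of every non-empty <p> element.
    static List<String> paragraphs(Reader reader)
        throws IOException, BadLocationException
    {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(reader, doc, 0);

        List<String> result = new ArrayList<String>();
        ElementIterator it = new ElementIterator(doc);
        Element element;

        while ( (element = it.next()) != null )
        {
            // Assumption: explicit paragraphs are reported as "p".
            if ("p".equals(element.getName()))
            {
                int start = element.getStartOffset();
                int end = element.getEndOffset();
                // Leave out the newline that ends the paragraph.
                int length = Math.max(0, end - start - 1);
                String text = doc.getText(start, length).trim();
                if (text.length() > 0) result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception
    {
        String html = "<html><body><div>Menu</div>"
            + "<p>First para.</p><p>Second para.</p></body></html>";
        for (String p : paragraphs(new StringReader(html)))
            System.out.println(p);
    }
}
```

    Deciding which of the returned paragraphs belong to the story is still up to you, for example by looking at the enclosing element or its attributes.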
    Or is this a more manual procedure where you enter the starting and ending text strings? If so, then you can just search the text you found in the first example and then substring out the relevant text.
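    A minimal sketch of that manual approach, assuming you already have the extracted text in a String and know a marker that precedes the story and one that follows it. The class name ArticleSlice, the method between() and the marker strings in the usage example are all hypothetical.

```java
// Hypothetical helper class, named for illustration only.
class ArticleSlice
{
    // Returns the text between the first occurrence of startMarker and
    // the next occurrence of endMarker after it, trimmed of surrounding
    // whitespace. Returns null if either marker is not found.
    static String between(String text, String startMarker, String endMarker)
    {
        int start = text.indexOf(startMarker);
        if (start < 0) return null;

        // Skip past the start marker itself.
        start += startMarker.length();

        int end = text.indexOf(endMarker, start);
        if (end < 0) return null;

        return text.substring(start, end).trim();
    }

    public static void main(String[] args)
    {
        // Made-up page text and markers.
        String pageText = "Menu Home START The story. END Footer";
        System.out.println(between(pageText, "START", "END"));
    }
}
```

    The markers have to be strings that reliably appear on every page you process, otherwise the method returns null and you need a fallback.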