8 Replies Latest reply: Oct 18, 2006 3:15 AM by 807598

    strip content from custom tags from html files

    807598
      Hello,

      I need to extract content from HTML files in order to create XML files with that data.

      For example, inside an HTML file there will be a tag such as <betreft>text</betreft>, and I need to pull it out of the HTML file. Do you guys have any suggestions?


      This is what I came up with. It extracts the <betreft> tag and shows it on the screen:
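           // needs: import java.io.BufferedReader; import java.io.FileReader;
           // process(String, StringBuffer) is not shown in this post;
           // presumably it appends each line to sb.
           BufferedReader in = null;
           StringBuffer sb = new StringBuffer();
           try {
                in = new BufferedReader(new FileReader("05113310.htm"));
                String str;
                while ((str = in.readLine()) != null) {
                     process(str, sb);
                }
                in.close();
           } catch (Exception e) {
                e.printStackTrace(); // do not swallow I/O errors silently
           }

           System.out.println(sb.toString());
           int start = sb.indexOf("<betreft>");   // was hard-coded as 587
           int end = sb.indexOf("</betreft>");    // was hard-coded as 992
           System.out.println(start);
           System.out.println(end);
           char betreftTag[] = new char[500];
           // copy everything from <betreft> up to and including </betreft>
           int length = end + "</betreft>".length() - start;
           sb.getChars(start, start + length, betreftTag, 0);
           // print the copied characters once (the loop printed the whole array 500 times)
           System.out.print(new String(betreftTag, 0, length));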


      But I don't like that I need a char array to store the result of getChars(). What if the tag content is longer than 500 characters, for example? Taking 10,000 to be safe is not good practice, I think.

      Any suggestions on this first draft?

      Thanks
        • 1. Re: strip content from custom tags from html files
          791266
          Hi,

          You can use regular expressions if you know that the betreft tags won't be nested, and if it's OK to remove them even when they are inside comments.

          Kaj
          • 2. Re: strip content from custom tags from html files
            807598
            but i don't like it that I need to use an array of
            chars to store the result of getChars ... (what if
            the tagcontent is longer than 500 for example... and
            taking 10.000 to be safe is not a good practice i
            think ...)

            any suggestions on this first draft ???

            thanks
            Instead of getChars() use a String's substring() method. (Note that a StringBuffer also has a substring() method that returns a String):
            public class AATest1
            {
                 private static void process(StringBuffer sb, String str)
                 {
                      //process str here
                      String result = str.substring(3,7);
                      
                      sb.append(result);
                 }
            
                 
                 public static void main(String[] args)
                 {
                      StringBuffer betreftTag = new StringBuffer();
                      String str = "hello world";
            
                      process(betreftTag, str);
            
                      System.out.println(betreftTag);
                      
                 }
            }
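Applied to the betreft case, the same idea might look like the sketch below. It is only an illustration: the class name BetreftSubstring and the method extract() are made up for this example, and it assumes the buffer contains a single, non-nested <betreft> element.

```java
class BetreftSubstring {

    // Returns the text between the first <betreft> and the next </betreft>,
    // or null if either tag is missing. No fixed-size char array is needed,
    // because substring() returns a String of exactly the right length.
    static String extract(StringBuffer sb) {
        String open = "<betreft>";
        String close = "</betreft>";
        int start = sb.indexOf(open);
        if (start < 0) {
            return null;
        }
        int contentStart = start + open.length();
        int end = sb.indexOf(close, contentStart);
        if (end < 0) {
            return null;
        }
        return sb.substring(contentStart, end);
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer("<html><betreft>some subject</betreft></html>");
        System.out.println(extract(sb)); // prints "some subject"
    }
}
```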
            for example inside an html file there will be
            a tag called <betreft>text</betreft> i need to
            strip this out of the html file ... do you guys
            have any suggestions ???
            You can use regular expressions. If you don't know how to use them, they can be a bit daunting, but you have to jump in and get your feet wet at some point. There are lots of issues when reading html, though. For instance, if you look for an opening tag <betreft> and a closing tag </betreft>, what happens here:

            <betreft>text<betreft>inner text</betreft>text</betreft>

            Or, if you have two tags adjacent to each other:

            <betreft>text</betreft><betreft>other text</betreft>

            a greedy pattern could match everything from the first opening tag to the last closing tag.
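A small sketch of both cases, using a reluctant quantifier (.*?) so that each match stops at the nearest closing tag. The class name BetreftRegex and the method contents() are invented for this illustration. It copes with adjacent tags, but it is not a fix for nesting, as the second call shows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetreftRegex {

    // DOTALL lets '.' match line breaks, so content spanning lines is found too.
    private static final Pattern BETREFT =
            Pattern.compile("<betreft>(.*?)</betreft>", Pattern.DOTALL);

    // Collects the content of every <betreft>...</betreft> pair, left to right.
    static List<String> contents(String html) {
        List<String> result = new ArrayList<String>();
        Matcher m = BETREFT.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Adjacent tags: each one is matched separately.
        System.out.println(contents("<betreft>text</betreft><betreft>other text</betreft>"));
        // prints [text, other text]

        // Nested tags: the match ends at the first closing tag, which is wrong.
        System.out.println(contents("<betreft>text<betreft>inner text</betreft>text</betreft>"));
        // prints [text<betreft>inner text]
    }
}
```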
            • 3. Re: strip content from custom tags from html files
              807598
              Hello, thanks for the answer.


              Now I have strings that still contain HTML code: HTML character references (for example &#xxxx;) and regular HTML tags (<a blabla></a>).

              Is there a way to clean these up?

              Also, a break has to become a newline character, and so on.
              • 4. Re: strip content from custom tags from html files
                807598
                "Parsing" html files is very difficult. Using regular expressions is one way to do it, but it's very challenging.

                The best way is probably to read in the html file as a DOM document. That way you can step through the hierarchy of tags and identify tags by their position on the page, and then grab their content.
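A rough sketch of the DOM idea with the JDK's built-in XML parser. It only works if the input is well-formed XML (for example clean XHTML); sloppy HTML would first need a tolerant parser or a cleanup step. The class name BetreftDom and the method firstBetreft() are invented for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BetreftDom {

    // Parses the markup into a DOM tree and returns the text of the first
    // <betreft> element, or null if there is none.
    // Assumption: the input is well-formed; otherwise parsing fails.
    static String firstBetreft(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = doc.getElementsByTagName("betreft");
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><betreft>some subject</betreft></body></html>";
        System.out.println(firstBetreft(page)); // prints "some subject"
    }
}
```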
                • 5. Re: strip content from custom tags from html files
                  807598
                  Using java.util.Scanner can help you locate the pattern in the file.
                  You can test the following code:
                  import java.util.Scanner;

                  public class ScannerTest {

                       public static void main(String[] args) {
                            System.out.print("input: ");
                            try {
                                 Scanner s = new Scanner(System.in);
                                 String token;
                                 // find "<...>" with an optional "/" and zero or more
                                 // word characters in it; only the current line is searched
                                 while ((token = s.findInLine("</?\\w*>")) != null) {
                                      System.out.println("found " + token);
                                 }
                            } catch (Exception e) {
                                 System.out.println("scan exc");
                            }
                       }
                  }
                  • 6. Re: strip content from custom tags from html files
                    807598
                    "Parsing" html files is very difficult. Using
                    regular expressions is one way to do it, but it's
                    very challenging.
                    yes, there are a lot of ways and pitfalls ... every parsing job is different

                    The best way is probably to read in the html file as
                    a DOM document. That way you can step through the
                    hierarchy of tags and identify tags by their position
                    on the page, and then grab their content.
                    It is very crappy HTML, and when I parsed it with JTidy it gave a lot of errors, so I doubt I would be able to use this method for the 47,000 HTML documents (luckily loosely based on only two templates) I have to process.

                    I'm sure of just one thing: all the data I need is in custom tags, thank god for that one! So I process the HTML files, discarding the HTML code and focusing on the specific tags I need.


                    All I need to do now is strip the HTML tags out of the strings (buffers).
                    Before that I need to convert breaks to newlines etc. I also found a way to convert HTML character references (with htmlparser.decode()).


                    I will probably use regex for that.
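For the break and character-reference steps, a plain-JDK sketch could look like the one below. It is not the htmlparser call mentioned above. The class name HtmlTextCleanup and both method names are invented here, and it assumes breaks are written as <br>, <br/> or <br /> and that only decimal references such as &#233; need decoding (named entities like &amp; are left alone).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlTextCleanup {

    // <br>, <BR/>, <br /> ... in any letter case
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");
    // decimal character references such as &#233;
    private static final Pattern NUMERIC_REF = Pattern.compile("&#(\\d{1,7});");

    static String breaksToNewlines(String s) {
        return BR.matcher(s).replaceAll("\n");
    }

    static String decodeNumericReferences(String s) {
        Matcher m = NUMERIC_REF.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // leave references that are not valid Unicode code points untouched
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "caf&#233;<br>second line<BR />third line";
        System.out.println(decodeNumericReferences(breaksToNewlines(raw)));
    }
}
```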

                    Message was edited by:
                    drgonzo120
                    • 7. Re: strip content from custom tags from html files
                      807598
                      regex...


                      So, let's say that after having replaced all breaks with newlines, I want to get rid of all the tags in a string. A tag is defined as follows:

                      It begins with "<", immediately followed by a letter (upper or lower case), and it always closes with ">" or "/>".

                      In between, anything can appear.

                      I don't want to use this on a whole document, just on strings.


                      thanks for the replies
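One possible regex for that definition is sketched below. The class name TagStripper and the method stripTags() are invented for this illustration. Two assumptions go beyond the definition given above: an optional "/" is allowed right after "<" so that closing tags such as </a> are removed as well, and no attribute value contains a ">" character.

```java
import java.util.regex.Pattern;

class TagStripper {

    // "<", an optional "/", a letter, then anything up to the next ">".
    // A tag ending in "/>" is covered too, because "/" is matched by [^>]*.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    static String stripTags(String s) {
        return TAG.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String in = "see <a href=\"x.htm\">this <b>page</b></a><hr/> if 1 < 2";
        System.out.println(stripTags(in)); // prints "see this page if 1 < 2"
    }
}
```

A "<" that is not followed by a letter (as in "1 < 2") is left alone, which is why the letter requirement in the definition is useful.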

                      Message was edited by:
                      drgonzo120
                      • 8. Re: strip content from custom tags from html files
                        807598
                        hello,

                        i'm looking for a free UML-tool... i need to create class model & sequence diagram ...