0 Replies. Latest reply: Aug 1, 2012 12:28 PM by 588697

Removing Tags from HTML

588697 Newbie
Environment: 11gR2 SE1 running on RHEL and Windows Server 2008 R2 Standard

I have to write a little scraper. The plan is to store the scraped page as a CLOB and then extract the plain text (including, in some cases, getting rid of everything between SCRIPT and STYLE tags or everything between HEAD tags) for further processing.
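For illustration only, here is a minimal sketch of the kind of custom extraction described above, written in Python with the standard library's html.parser rather than anything Oracle-specific. The names TextExtractor and html_to_text are made up for this example, and the default set of skipped elements (SCRIPT, STYLE, HEAD) is an assumption taken from the plan above. It is not the Oracle Text approach and makes no claim about how the earlier project did it.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring text inside selected elements.

    Hypothetical helper for illustration; not part of any Oracle API.
    """

    def __init__(self, skip_tags=("script", "style", "head")):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.skip_tags = set(skip_tags)
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling the handlers.
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html, skip_tags=("script", "style", "head")):
    """Return the text of an HTML string with tags and skipped elements removed."""
    parser = TextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    page = (
        "<html><head><title>T</title><style>p{}</style></head>"
        "<body><p>Hello <b>world</b></p>"
        "<script>var x = 1;</script></body></html>"
    )
    print(html_to_text(page))
```

Passing a different skip_tags tuple (for example only "script" and "style") keeps the HEAD text. Note that this sketch joins adjacent text nodes without inserting a separator, so text from neighbouring block elements can run together; a real parser for this project would likely need its own rule for that.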

There was a similar requirement on another project I had to work on a good while back (I think it was a 10gR2 database) and we used Oracle Text machinations to convert the HTML to plain text: it worked very well.

I think the requirements on this project will need a custom parser anyway. However, with respect to the thread "Unable to generate plaintext version of document", can an SME please confirm that there is no longer an Oracle Text way to strip the tags out of an HTML document and return it as plain text?

Thank You!
-Tom
