Environment: 11gR2 SE1 running on RHEL and Windows Server 2008 R2 Standard
I have to write a little scraper. The plan is to store the scraped page as a CLOB and then extract the plain text (including, in some cases, getting rid of everything between SCRIPT and STYLE tags or everything between HEAD tags) for further processing.
There was a similar requirement on another project I had to work on a good while back (I think it was a 10gR2 database) and we used Oracle Text machinations to convert the HTML to plain text: it worked very well.
I think the requirements on this project will need a custom parser anyway. But with respect to the thread "Unable to generate plaintext version of document": can an SME please confirm that there is no longer an Oracle Text way to strip the tags out of an HTML document and return it as plain text?
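For reference, the kind of custom extraction I have in mind could be sketched roughly as below. This is Python rather than PL/SQL, purely as an illustration of the logic, and it uses only the standard library `html.parser` module. The names `TextExtractor` and `html_to_text` are hypothetical, and the default set of skipped tags (SCRIPT, STYLE, HEAD) is an assumption taken from the plan described above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, ignoring anything inside the skipped tags."""

    # Assumed default: drop everything between SCRIPT, STYLE and HEAD tags.
    SKIP_TAGS = {"script", "style", "head"}

    def __init__(self, skip_tags=None):
        super().__init__(convert_charrefs=True)
        self.skip_tags = self.SKIP_TAGS if skip_tags is None else set(skip_tags)
        self._skip_depth = 0   # > 0 while inside at least one skipped tag
        self._chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments with a space, then collapse runs of whitespace.
        # Note: this also puts a space where inline tags split a word.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html, skip_tags=None):
    """Return the plain text of an HTML string (hypothetical helper)."""
    parser = TextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = (
        "<html><head><title>T</title><style>p{}</style></head>"
        "<body><script>var x=1;</script><p>Hello</p> <p>world</p></body></html>"
    )
    print(html_to_text(page))
    # Keep the HEAD content but still drop SCRIPT and STYLE:
    print(html_to_text(page, skip_tags={"script", "style"}))
```

Passing `skip_tags` covers the "in some cases" part of the requirement, since HEAD can be kept or dropped per call. The same depth-counter approach should carry over to a PL/SQL implementation working on the CLOB, though I have not tried that.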