This discussion is archived
10 Replies Latest reply: May 8, 2012 3:51 AM by chris227 RSS

Query to extract HTML tag with data

Nimish Garg Guru
Currently Being Moderated
Hi All,

I have a string.
'<HTML><HEAD>THIS IS HEAD.</HEAD><BODY>THIS IS BODY.<P>THIS IS P1.</P>NIMISH<P>THIS IS P2.</P></BODY></HTML>'

I want to extract a html tag including its opening & closing tab with data as

if i say P1
then the output should be 
'<P>THIS IS P1.</P>'

for P2
then the output should be 
<P>THIS IS P2.</P>

please help me in writing this query with regular expression

i have tried it as following but it is not giving desired result:
WITH T AS
(
SELECT
    '<HTML><HEAD>THIS IS HEAD.</HEAD><BODY>THIS IS BODY.<P>THIS IS P1.</P>NIMISH<P>THIS IS P2.</P></BODY></HTML>' STR
FROM    
    DUAL
)
SELECT REGEXP_SUBSTR(STR, '<P>.+P2.+</P>') FROM T


Thanks & Regards
Nimish Garg
Edited by: Nimish Garg on May 7, 2012 5:49 PM
  • 1. Re: Query to extract HTML tag with data
    BillyVerreynne Oracle ACE
    Currently Being Moderated
    This only really works for XHTML. There are often tags that do not need to use a closing tag - which makes parsing very complex. Never mind HTML pages that often contains errors and unclosed tags (which still renders correctly due to how the browser creates the DOM).

    Yes, do-able for a simplistic example as you have given above. But you will almost never never find that simplistic and easily parsed HTML in the wild.

    Which raises the question as to what is your requirement? Why do you want to parse HTML?

    Better to have a good idea what the road to travel entails, before setting off on a journey.
  • 2. Re: Query to extract HTML tag with data
    Nimish Garg Guru
    Currently Being Moderated
    My requirement is to extract a <tag>data</tag> from a HTML/XML string
    where data contains any specified value.

    I have written following query for finding tag which contains P2.
    SELECT * FROM (
    select regexp_substr ( txt, '[^>]+', 1, level) || '>' data from 
    (SELECT '<HTML><HEAD>THIS IS HEAD.</HEAD><BODY>THIS IS BODY.<P>THIS IS P1.</P>NIMISH<P>THIS IS P2.</P></BODY></HTML>' TXT FROM DUAL) 
    CONNECT BY level <= length (txt) - length (replace (txt, '>'))
    )
    WHERE regexp_LIKE ( data, 'P2')
    please suggest a better query with XML or REGEXP.
  • 3. Re: Query to extract HTML tag with data
    918949 Pro
    Currently Being Moderated
    cann't write correct sql in
     block :( , i have output select regexp_substr ( txt, '<' ||TAG_|| '>.*?</' ||TAG_|| '>', 1, level) data1                                                                                                                                                                                                                                                                                        
  • 4. Re: Query to extract HTML tag with data
    chris227 Guru
    Currently Being Moderated
    what about
    select 
    regexp_substr (
      '<HTML><HEAD>THIS IS HEAD.</HEAD><BODY>THIS IS BODY.<P>THIS IS P1.</P>NIMISH<P>THIS IS P2.</P></BODY></HTML>'
    , '<P>.*?</P>'
    ) s from dual
    
    S
    <P>THIS IS P1.</P>
     
    ok missed requirement, solution already given above

    regards

    Edited by: chris227 on 07.05.2012 06:30

    Edited by: chris227 on 07.05.2012 06:34
    cleared nonsens
  • 5. Re: Query to extract HTML tag with data
    sybrand_b Guru
    Currently Being Moderated
    Please look up the function EXTRACTVALUE in the SQL Language reference manual, which you seem to avoid using like hell.

    -------------
    Sybrand Bakker
    Senior Oracle DBA
  • 6. Re: Query to extract HTML tag with data
    MichaelS Guru
    Currently Being Moderated
    if i say P1
    then the output should be
    '&lt;P&gt;THIS IS P1.&lt;/P&gt;'

    for P2
    then the output should be
    &lt;P&gt;THIS IS P2.&lt;/P&gt;
    Assuming your html can be interpreted as valid xml:
    SQL> with t as (
     select '<HTML><HEAD>THIS IS HEAD.</HEAD><BODY>THIS IS BODY.<P>THIS IS P1.</P>NIMISH<P>THIS IS P2.</P></BODY></HTML>' html_string from dual
    )
    --
    --
    select x.* from t, xmltable('//P' passing xmltype(html_string)) x
    /
    COLUMN_VALUE          
    ----------------------
    <P>THIS IS P1.</P>    
    <P>THIS IS P2.</P>    
    
    2 rows selected.
  • 7. Re: Query to extract HTML tag with data
    chris227 Guru
    Currently Being Moderated
    for any tag
    select 
    regexp_substr (
      '<HTML><HEAD>THIS IS HEAD.</HEAD><BODY>THIS IS BODY.<P>THIS IS P1.</P>NIMISH<P>THIS IS P2.</P></BODY></HTML>'
    , '<[^<>]+>[^>]*?P2.*?</[^<>]+>'
    ) s from dual
    
    S
    <P>THIS IS P2.</P>
  • 8. Re: Query to extract HTML tag with data
    BillyVerreynne Oracle ACE
    Currently Being Moderated
    Nimish Garg wrote:
    My requirement is to extract a <tag>data</tag> from a HTML/XML string
    where data contains any specified value.
    HTML is not XML.

    And that is a critical distinction to make. HTML parsing is horribly complex. XML is quite easy. For HTML you have to code your own parser in PL/SQL. XML can be parsed using the XMLTYPE class/data type in PL/SQL.

    So if you need to find a single specific tag in HTML - I would not try to treat it as XML. I may not even try to use regular expressions.

    I would do a basic substring search for the start of the tag. Read the data following the tag. Ensure that there are no nested or embedded tags in the data. Until the end tag is read. Because HTML is that much abused - and because that is an accepted norm as parsers used by browsers deals with that abuse without complaining.

    Proper HTML is mostly a myth in my experience of "screen scraping" web servers for data extraction as they do not have web services supplying the data.
  • 9. Re: Query to extract HTML tag with data
    Nimish Garg Guru
    Currently Being Moderated
    Thanks for your solution,
    Can you please explain how your regular expression is working
     
    '<[^<>]+>[^>]*?P2.*?</[^<>]+>' 
  • 10. Re: Query to extract HTML tag with data
    chris227 Guru
    Currently Being Moderated
    It searches first for the occurence of a tag, which is defined by (&lt;,any non &lt;&gt; cardinality 1-n,&gt;).
    This must be follwed by any no gt; character (not mandantory) follwed by the search term. This is to get over the greediness of the regexp, since without it, all from the start would be returned ( so in fact the first ? is not needed, i guess).
    P2 could be followed by any characters until the next end tag. This is ensured by the second ?, which defines the star as non-greedy and thereby no catching all characters until the end.
    The pattern will not work with a &gt; before the search term, but there shouldnt be any unescaped &lt;, &gt; outside tags in the html by definition anyway.

    regards

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points