Forum Stats

  • 3,768,556 Users
  • 2,252,810 Discussions
  • 7,874,618 Comments

Discussions

RETURN Nth word and Nth character from a string

BeefStu
BeefStu Member Posts: 283 Blue Ribbon


I have a query that can get the  Nth word, second word (separated by spaces) 'Two'. How can I get the first character of the Nth word  in my example I want the 'T' from the word 'Two' also to be returned


Select REGEXP_SUBSTR('one Two three four','\w+',1,2) wordN  from dual

Expected output:
Two T


Tagged:

Best Answer

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 41,208 Red Diamond
    edited Oct 28, 2021 1:59AM Accepted Answer

    Hi, @BeefStu

    Select REGEXP_SUBSTR('one Two three four','\w+',1,2) wordN  from dual
    

    You're close! You said you want the words separated by spaces, but \w' will get words separated by other chracters (like commas) as well. '\S' (with a capital S) means any non-space character. To get the x-th character, use SUBSTR.

    If table_x contains a string column called str, you can get the n-th word and the x-th character of that word like this:

    WITH  got_wordn  AS
    (
    	SELECT REGEXP_SUBSTR ( str
    			     , '\S+'
    			     , 1
    			     , &word_num
    			     ) AS wordn
    	FROM	table_x
    )
    SELECT  wordn
    ,	SUBSTR (wordn, &char_num, 1)	AS charx
    FROM	got_wordn
    ;
    

    Of course, you don't need to use substitution variables like &word_num and &char_num. You can use any kind of expression in those places.

Answers

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 41,208 Red Diamond
    edited Oct 28, 2021 1:59AM Accepted Answer

    Hi, @BeefStu

    Select REGEXP_SUBSTR('one Two three four','\w+',1,2) wordN  from dual
    

    You're close! You said you want the words separated by spaces, but \w' will get words separated by other chracters (like commas) as well. '\S' (with a capital S) means any non-space character. To get the x-th character, use SUBSTR.

    If table_x contains a string column called str, you can get the n-th word and the x-th character of that word like this:

    WITH  got_wordn  AS
    (
    	SELECT REGEXP_SUBSTR ( str
    			     , '\S+'
    			     , 1
    			     , &word_num
    			     ) AS wordn
    	FROM	table_x
    )
    SELECT  wordn
    ,	SUBSTR (wordn, &char_num, 1)	AS charx
    FROM	got_wordn
    ;
    

    Of course, you don't need to use substitution variables like &word_num and &char_num. You can use any kind of expression in those places.

  • BeefStu
    BeefStu Member Posts: 283 Blue Ribbon

    @Frank Kulash perfect thanks. To satisfy my curiosity could my original test CASE be expanded to include the substr too? For example, find the Nth word and from that word find the Nth character in one statement? I just couldn't get that too work so I posted the question. Thanks for your help and expertise its greatly appreciated

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 41,208 Red Diamond

    Hi, @BeefStu

    So, you want a single regular expression that returns a given character from a given word, is that right?

    Here's one way:

    SELECT REGEXP_SUBSTR ( str
    		     , '(\S+\s+){' || TO_CHAR (&word_num - 1)
    		    		   || '}.{'
    				   || TO_CHAR (&char_num - 1)
    				   || '}(.)'
    		     , 1
    		     , 1
    		     , NULL
    		     , 2
    		     ) AS charx
    FROM	table_x
    ;
    

    If &word_num = 1 and &char_num = 2, then the second argument to REGEXP_SUBSTR is

     '(\S+\s+){0}.{1}(.)'
    
  • mathguy
    mathguy Member Posts: 10,155 Blue Diamond

    The solution proposed by Mr. Kulash for this last request is in the correct general direction, but it has one significant mistake. After reading past the first n - 1 words (defined as non-empty substrings separated by whitespace), it starts counting characters regardless of whether they are "word" characters or whitespace. This causes incorrect results when the n'th word is shorter than m characters - the query will return the m'th character counting from the first character of the n'th word, regardless of whether that m'th character is still in the same word. It may be a whitespace character, or a word character from a different word further down (to the right) in the input string.

    This is easy to fix. Instead of "dot" use \S in all places. Or, if you really meant \w for "word characters" (meaning that words may also be delimited by dash, comma, etc. - not just by whitespace), you can rewrite the whole thing using \w and \W instead of \S and \s. Something like this (using bind variables for n and m, instead of substitution variables):

    select str,
           regexp_substr(str, '\w+', 1, :n) as wordN,
           regexp_substr(str,
             '^\W*(\w+\W+){' || to_char(:n - 1) || '}\w{' || to_char(:m - 1) || '}(\w)'
             , 1, 1, null, 2)
             as wordN_charM
    from   (select 'one Two three four' as str from dual);
    


    Note also that I anchored the regexp at the beginning of the input string (so I had to allow for 0 or more non-word characters at the beginning, too). If n is larger than the number of words in the string, the regexp matching starting from the beginning of the string will fail. If we don't anchor at the beginning of the string (with the ^ anchor), the regexp engine will try again, attempting to match from the second character, then from the third, ... - wasting a huge amount of time for no reason. The regexp engine doesn't use logic (like we do), to figure out that the additional attempts will also automatically fail - it will just blindly try them all.

    For testing, try all combinations where m may be 1, or 2, or the word count, or GREATER THAN the word count, and n may be 1, or 2, or the length of the n'th word, or GREATER THAN the length of the n'th word. Make sure you get the correct answer in all cases (including NULL when n is too great or m is too great).