8 Replies Latest reply: Jan 17, 2008 7:21 PM by 800351

    A problem with regex and special characters

    807603
      Hello,
      I am using regex in my application but I have a problem with special characters. Here is an explanation of what I am doing:

      I have a piece of text that I want to parse, replacing every occurrence of a given word with a tag that wraps the word found.

      so that: go Going Go to gOschool by bus and to learn and to play GO Go
      after replacing the word "go" (case insensitive and only at word boundaries) should be:
      <start>go<end> Going <start>Go<end> to gOschool by bus and to learn and to play <start>GO<end> <start>Go<end>

      Consider the following code, and call the method with the parameter "go?".
      The Matcher finds a weird match in the word "G?oing" with only the letter G!
      It also ignores the "?" in the pattern completely.

      Any clue about what is happening would be much appreciated.

      private static String replaceMatches(String strToFind)
      {
          String resultArticle = "";
          String article = " " + "go? G?oing Go? to gOschool by bus and to learn and to play GO? Go?*" + " ";

          strToFind = "\\b" + strToFind + "\\b";
          String linkPart1 = "<start>";
          String linkPart2 = "<end>";

          Pattern p = null;
          try {
              p = Pattern.compile(strToFind, Pattern.CASE_INSENSITIVE);

              Matcher m = p.matcher(article);
              String[] res = p.split(article);

              int i = 0;
              //System.out.println("result of split: " + res.length);
              while (m.find())
              {
                  resultArticle += (res[i] + " ");
                  resultArticle += linkPart1;
                  resultArticle += m.group().trim();
                  resultArticle += (linkPart2 + " ");
                  i++;
              }
              if (i < res.length)
                  resultArticle += res[i];
              //System.out.println("result of match: " + i);
              System.out.println(article);
              //System.out.println(resultArticle.trim()+scripts);
          }
          catch (PatternSyntaxException ex) {}
          return resultArticle.trim();
      }
      Thanks

        • 1. Re: A problem with regex and special characters
          796440
          You're misunderstanding how split works. You should read its docs more carefully. It sounds like replaceAll is more like what you want.
          m.replaceAll("<start>$0<end>");
          
          gives
          
           <start>go<end>? G?oing <start>Go<end>? to gOschool by bus and to learn and to play <start>GO<end>? <start>Go<end>?* 
          By the way, why
          " "+"go?" + " "
          
          instead of
          " go? "
          ?
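
To make the point about split concrete, here is a minimal sketch of what Pattern.split returns for a short input. The class and method names (SplitDemo, pieces) are made up for illustration and do not come from the thread. The sketch assumes the plain pattern \bgo\b rather than "go?".

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original post.
class SplitDemo {

    // Returns the text between matches of the whole word "go" (any case).
    static String[] pieces(String text) {
        Pattern p = Pattern.compile("\\bgo\\b", Pattern.CASE_INSENSITIVE);
        return p.split(text);
    }

    public static void main(String[] args) {
        // "go" matches at the very start and at the very end of the input.
        // split keeps the empty piece before the first match but drops
        // trailing empty pieces, so the pieces are "" and " Going ".
        String[] parts = pieces("go Going Go");
        System.out.println(parts.length);
        System.out.println(Arrays.toString(parts));
    }
}
```

split throws the matched text away and silently discards trailing empty strings, so the number of pieces does not reliably line up with the number of matches found by Matcher.find(). Stitching the two back together by index is fragile, which is why replaceAll with $0 is the simpler route.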
          • 2. Re: A problem with regex and special characters
            807603
            Because split will not work when trying to replace the first word if I don't add a space at the beginning.

            replaceAll will replace all the occurrences in the text with only one word, without taking into consideration the case of the word I need to replace.

            If I use replaceAll(article, strToFind) the output will be:

            <start>go?<end> G?oing <start>go?<end> to gOschool by bus and to learn and to play <start>go?<end> <start>go?<end>

            while the original string is
            go? G?oing Go? to gOschool by bus and to learn and to play GO? Go?*

            which is not what I want, as I need to keep the case of the words I am replacing.
            • 3. Re: A problem with regex and special characters
              796440
              tarek.mamdouh wrote:
              because split will not work when trying to replace the first word if I don't add a space at the beginning.
              Split doesn't work anyway. And my question wasn't why you add spaces (which you really don't need to do), but why you add them with " " + "go" rather than just " go".
              replaceAll will replace all the occurrences in the text with only one word, without taking into consideration the case of the word I need to replace.
              No.

              >
              If I use replaceAll(article, strToFind) the output will be:

              <start>go?<end> G?oing <start>go?<end> to gOschool by bus and to learn and to play <start>go?<end> <start>go?<end>
              No. I showed you the actual output of an actual replaceAll.
              which is not what I want, as I need to keep the case of the words I am replacing
              The replaceAll I showed you does that.

              Please study the examples given and read the docs carefully rather than making claims based on inaccurate guesses.
              • 4. Re: A problem with regex and special characters
                800351
                so that: go Going Go to gOschool by bus and to learn and to play GO Go
                after replacing the word "go" (case insensitive and only at word boundaries) should be:
                <start>go<end> Going <start>Go<end> to gOschool by bus and to learn and to play <start>GO<end> <start>Go<end>
                public class Tarek{
                
                  public static void main(String[] args){
                    String text 
                     = "go Going Go to gOschool by bus and to learn and to play GO Go";
                
                    System.out.println(text.replaceAll("(?i:\\bgo\\b)", "<start>$0<end>"));
                  }
                }
                • 5. Re: A problem with regex and special characters
                  807603
                  Thanks all for your replies :)

                  I am sorry, I just remembered why I added the spaces: at first I was using \s rather than \b to mark the end of a word.
                  I think they are useless here.

                  But part of my problem still exists, which is the special characters problem.
                  article = "go? G?oing Go? to gOschool by bus and to learn and to play GO? Go?";
                  article.replaceAll("(?i:\\bgo?\\b)","<start>$0<end>")
                  will output the following:
                  <start>go<end>? <start>G<end>?oing <start>Go<end>? to gOschool by bus and to learn and to play <start>GO<end>? <start>Go<end>?

                  The question mark in the pattern is ignored when replacing. Also "<start>G<end>?oing" is not a match for the regex as per my little knowledge in regex.
                  • 6. Re: A problem with regex and special characters
                    796440
                    tarek.mamdouh wrote:
                    The question mark in the pattern is ignored when replacing.
                    No, it's not.

                    Question mark means "zero or one of the preceding." So "go?" will match both "g" and "go".
                    Also "<start>G<end>?oing" is not a match for the regex as per my little knowledge in regex.
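                    Yes, it is a match, for the same reason: the "o" is optional, so "G" alone satisfies "go?", and there is a word boundary between "G" and "?".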

                    You'll do much better at this stuff if, instead of assuming the advice you're given is wrong or that regex is wrong, you assume that your assumptions are wrong, and you do a little research to find out just how they're wrong. You admit you don't know much about regex. You've already shown one place where you made an incorrect assumption and were corrected on it, and yet you go on to make another unfounded assertion based on what you think you know, rather than reading a little bit.
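
The explanation in reply 6 can be taken one step further. If the intent was to treat the question mark as a literal character (an assumption, since the thread never settles this), the search word has to be escaped, for example with Pattern.quote. Note that \b no longer works after the "?", because "?" and the following space are both non-word characters and there is no boundary between them. The sketch below swaps \b for lookarounds instead. The names LiteralWordTagger and tag are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original thread.
class LiteralWordTagger {

    // Wraps every case-insensitive occurrence of `word`, taken literally,
    // in <start>...<end>, provided it is not glued to a word character
    // on either side.
    static String tag(String text, String word) {
        // Pattern.quote turns regex metacharacters such as '?' into literals.
        // (?<!\w) and (?!\w) stand in for \b, which would fail after '?'.
        String regex = "(?<!\\w)" + Pattern.quote(word) + "(?!\\w)";
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        // $0 is the text that actually matched, so its case is preserved.
        return p.matcher(text).replaceAll("<start>$0<end>");
    }

    public static void main(String[] args) {
        String article = "go? G?oing Go? to gOschool by bus and to learn and to play GO? Go?";
        System.out.println(tag(article, "go?"));
    }
}
```

With this pattern "G?oing" is left alone, because the literal sequence "go?" never occurs in it, while "go?", "Go?" and "GO?" are wrapped together with their question marks.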
                    • 7. Re: A problem with regex and special characters
                      807603
                      Dear jverd,

                      I am not assuming that I am right or my assumptions are right. That's why I posted a question here, because there is something missing for me.
                      Your help corrected a mistake of mine and I appreciate that; all I am asking for is an explanation of what is happening. Yes I missed that "?" means zero or one of the preceding because of my little knowledge in regex, and I admitted this before. I think you were a little bit offensive, but thanks anyway for your help.
                      • 8. Re: A problem with regex and special characters
                        796440
                        tarek.mamdouh wrote:
                        Dear jverd,

                        I am not assuming that I am right or my assumptions are right
                        Yes you are. You were given answers by people who know more than you about this subject, and you simply assumed they were wrong.
                        Yes I missed that "?" means zero or one of the preceding because of my little knowledge in regex
                        It's not a problem that you missed it. The problem is that you just assumed that the answers given were wrong and that regex was behaving incorrectly, rather than doing research to find out why what you saw here was different from what you assumed.
                        I think you were a little bit offensive
                        I find it frustrating when someone asks for help, I give them a correct answer, and they tell me the answer is wrong.
                        , but thanks anyway for your help.
                        You're welcome. Next time please question your own assumptions more. It'll go better for everyone. :-)