18 Replies. Latest reply: Jul 15, 2007 5:44 PM by 807605

    String replacing and pattern matching

    807605
      Hi everybody,
      I'm pretty sure my earlier post got deleted in the system maintenance, so I'm going to repost, just to be sure.

      I'm writing a version of a grep program, which can take either a string of words separated by spaces or a string containing punctuation marks. What I need to do is compile a pattern that replaces each space with a predefined list of punctuation marks, so that any run of those punctuation marks can match in that place.

      So, as an example:
      Query string: This is a test
      Should match: This is a test; This, is--a. test; This: is, a test; or any variant, basically, all the words of the query, in order, with any number of punctuation marks in it.

      I was looking at java.util.regex, and this is what I have so far:
      String query = args[1];
      Pattern pattern = Pattern.compile( query.replaceAll( " ", "[ ,:�()�����\"]+" ) );
      Matcher matcher = pattern.matcher( element );
      if( matcher.matches() )  {
            chapterCounter++;
      }
      Something about this isn't working, though--I never get matches. Can anyone see where I'm going wrong?

      EDIT: I'm realizing that this probably isn't clear, since I haven't gotten responses yet, so I'll try to pare it down to the essentials: Instead of matching a string that contains spaces, I need to match a string of words that can have any number of punctuation marks in place of the simple spaces. The key is that the words are the same, and in the same order.

      Thank you!
      Jezzica85

      Message was edited by:
      jezzica85
        • 1. Re: String replacing and pattern matching
          796440
          For one thing, you probably want " +" or even "\\s+" instead of just " ", so that one or more spaces (or one or more of any whitespace characters) can be replaced by one or more punctuation marks. Also, you might want "\\p{Punct}". Also, "-" in the middle of a character class such as [x-y] means "x through y". If you want a literal "-", it has to go at the beginning or the end. Either one may be okay. I forget.
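To illustrate the hyphen rule, here is a minimal sketch. The class name `HyphenInClassDemo` and the helper `allInClass` are made up for this example and are not from the thread. As far as java.util.regex is concerned, a hyphen at either the start or the end of a character class is taken literally.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are assumptions for illustration only.
class HyphenInClassDemo {
    // True if the whole input consists only of characters from the given class.
    static boolean allInClass(String charClass, String input) {
        return Pattern.matches(charClass + "+", input);
    }

    public static void main(String[] args) {
        // Hyphen at the end of the class: treated as a literal hyphen.
        System.out.println(allInClass("[,:-]", "-,-")); // true
        // Hyphen at the start of the class: also a literal hyphen.
        System.out.println(allInClass("[-,:]", "-:-")); // true
        // Hyphen in the middle: the range a through c, which excludes '-'.
        System.out.println(allInClass("[a-c]", "b"));   // true
        System.out.println(allInClass("[a-c]", "-"));   // false
    }
}
```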

          Start with this (it's in beanshell, which is a Java scripting shell.)
          bsh % str = "this is a test";
          <this is a test>
          bsh % t1 = "this-is-a-test";
          <this-is-a-test>
          bsh % rep = str.replaceAll(" +", "\\\\p{Punct}\\+");
          <this\p{Punct}+is\p{Punct}+a\p{Punct}+test>
          bsh % p = Pattern.compile(rep);
          <this\p{Punct}+is\p{Punct}+a\p{Punct}+test>
          bsh % m = p.matcher(t1);
          <java.util.regex.Matcher[pattern=this\p{Punct}+is\p{Punct}+a\p{Punct}+test region=0,14 lastmatch=]>
          bsh % m.matches();
          <true>
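For anyone not familiar with beanshell, the session above could be written as plain Java roughly like this. It is only a sketch: the class name `PunctQueryDemo` and the method `compileQuery` are made up here, and the replacement string drops the backslash before `+` because only `\` and `$` are special in a replaceAll() replacement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are assumptions for illustration only.
class PunctQueryDemo {
    // Turn each run of spaces in the query into the regex \p{Punct}+
    static Pattern compileQuery(String query) {
        // "\\\\" in the replacement produces one literal backslash in the result.
        String regex = query.replaceAll(" +", "\\\\p{Punct}+");
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        Pattern p = compileQuery("this is a test");
        // prints this\p{Punct}+is\p{Punct}+a\p{Punct}+test
        System.out.println(p.pattern());

        Matcher m = p.matcher("this-is-a-test");
        System.out.println(m.matches()); // true
    }
}
```

Note that this pattern no longer matches the original "this is a test", because a space is not in \p{Punct}. The character class has to list whitespace as well if plain spaces should still match.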
          • 2. Re: String replacing and pattern matching
            807605
            Thank you jverd,
            I've never used that shell before, so I really don't know what your code means, sorry. In the pattern I put down, that's actually an em dash, not a hyphen, but do you think it could still be causing a problem? I'll try that thing you were suggesting with the spaces and see if it works. Was I instantiating the pattern right?

            EDIT: I think I'm still missing something; neither the \\s nor the + signs work.

            Jezzica85

            Message was edited by:
            jezzica85
            • 3. Re: String replacing and pattern matching
              796440
              > Thank you jverd,
              > I've never used that shell before, so I really don't know what your code means,
              It's Java code. You can ignore the bsh % prompt and the <...> responses.

              > In the pattern I put down, that's actually an em dash, not a hyphen, but do you think it could still be causing a problem?
              I wouldn't think so, no.

              Start simple--maybe just comma and colon or something--and make that work (using my code as a starting point). Once that works, add additional punct marks (assuming \p{Punct} doesn't cover the chars you want).
              • 4. Re: String replacing and pattern matching
                807605
                Thanks again jverd,
                I will try and get this working now; I think I have enough to go on.

                Jezzica85
                • 5. Re: String replacing and pattern matching
                  796440
                  Cool. Good luck!

                  Post again if you get stuck, but hopefully the proof of concept is close enough to what you're trying to do that you can close the gap.
                  • 6. Re: String replacing and pattern matching
                    796440
                    > > Thank you jverd,
                    > > I've never used that shell before, so I really don't know what your code means,
                    > It's Java code. You can ignore the bsh % prompt and the <...> responses.
                    By the way, you might want to play with it. www.beanshell.org. I find it very handy for quick, interactive tests of Java code.
                    • 7. Re: String replacing and pattern matching
                      807605
                      \p{Punct} only matches punctuation characters in the 7-bit ASCII range. In other words, it will not match the em-dash, en-dash, curly quotes, or other fancy punctuation characters that Microsoft apps keep trying to sneak into our documents. Even listing those characters explicitly, as you did in your original post, isn't safe, because it requires you to use a certain encoding when you save and compile the source file. You could list them by their Unicode escapes (\u2018 for Left Single Quote, \u2014 for Em Dash), but you'd probably be better off using the Unicode "punctuation" property, \p{P} (which can also be written as \pP). Combined with jverd's advice, that leaves you with:
                        Pattern pattern = Pattern.compile( query.replaceAll( "\\s+", "[\\\\s\\\\pP]+" ) );
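A small check of the difference between the two properties, as a sketch only. The class name `UnicodePunctDemo` and the method `compileQuery` are made up for this example, and it assumes the pattern is compiled without the UNICODE_CHARACTER_CLASS flag, which is the default.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are assumptions for illustration only.
class UnicodePunctDemo {
    // Replace each run of whitespace in the query with [\s\pP]+
    static Pattern compileQuery(String query) {
        return Pattern.compile(query.replaceAll("\\s+", "[\\\\s\\\\pP]+"));
    }

    public static void main(String[] args) {
        // Em dash written as a Unicode escape, so the source file encoding does not matter.
        String emDash = "\u2014";

        System.out.println(emDash.matches("\\p{Punct}")); // false: ASCII punctuation only
        System.out.println(emDash.matches("\\p{P}"));     // true: Unicode punctuation property

        Pattern p = compileQuery("This is a test");
        System.out.println(p.matcher("This\u2014is, a test").matches()); // true
    }
}
```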
                      • 8. Re: String replacing and pattern matching
                        807605
                        Hi everybody,
                        I'm still having trouble with this pattern matching; it doesn't match anything, whether it has spaces or punctuation marks. I think I'm doing something small and silly. Here's the code I have so far, essentially.
                        String query = args[1];               
                        Pattern pattern = Pattern.compile( query.replaceAll( "\\s+", "[\\\\s\\\\pP]+" ) );
                        Matcher matcher = pattern.matcher( element );
                        if( matcher.matches() ) {
                                System.out.println( element );
                        }
                        I'm sure I'm still doing something wrong, but I don't know what. If anybody would mind giving me some more help, I'd be very glad.

                        Thanks,
                        Jezzica85
                        • 9. Re: String replacing and pattern matching
                          796440
                          Provide a small but complete program, including input and output, as well as what output you expected to get.
                          • 10. Re: String replacing and pattern matching
                            807605
                            It seems \pP doesn't match all of the ASCII punctuation characters. In order to cover all of them as well as the fancy ones from cp1252, you need to use both forms:
                              Pattern p = Pattern.compile( query.replaceAll( "\\s+", "[\\\\s\\\\pP\\\\p{Punct}]+" ) );
                            But this is getting ridiculous; you're probably better off doing this:
                              Pattern p = Pattern.compile( query.replaceAll( "\\s+", "[_\\\\W]+" ) );
                            That will also match non-ASCII letters and digits, but I'm guessing your data won't contain any of those anyway.
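To see which ASCII characters fall through the gap, something like the following sketch can be used. The class name `AsciiPunctGapDemo`, the constant `ASCII_PUNCT` and the helper `unmatched` are made up for this example. The characters that \pP skips are the ones Unicode classifies as symbols rather than punctuation.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are assumptions for illustration only.
class AsciiPunctGapDemo {
    // The 32 ASCII characters covered by \p{Punct}.
    static final String ASCII_PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    // Returns the characters of input that the single-character regex does not match.
    static String unmatched(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (!p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ASCII characters that \pP alone misses: $+<=>^`|~
        System.out.println(unmatched("\\pP", ASCII_PUNCT));
        // Both forms together, and [_\W], leave nothing unmatched (empty lines).
        System.out.println(unmatched("[\\pP\\p{Punct}]", ASCII_PUNCT));
        System.out.println(unmatched("[_\\W]", ASCII_PUNCT));
        // The caveat: by default \W also matches a non-ASCII letter such as e-acute.
        System.out.println("\u00e9".matches("[_\\W]")); // true
    }
}
```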
                            • 11. Re: String replacing and pattern matching
                              807605
                              Thanks everybody,
                              Here's the small program with the input and output you asked for.

                              Jezzica85
                              import java.util.regex.*;
                              
                              public static void main( String args[] ) {
                                   String[] tests = { "This is a test", "This-is-a,test", "This Is A test...", "This,IS.a:test;", "This isa test", "This is a tests" };
                              
                                   String query = "This is a test";
                                   Pattern pattern = Pattern.compile( query.replaceAll( "\\s+", "[\\\\s\\\\pP]+" ) );
                                   for( int i = 0; i < tests.length; i++ ) {
                                        Matcher matcher = pattern.matcher( element );
                                        if( matcher.matches() ) {
                                             System.out.println( tests[i] );
                                        }
                                   }
                              }
                              
                              // Right now, this program doesn't print anything out. It should print out indices 0-3 in tests, but not 4 and 5.
                              Message was edited by:
                              jezzica85
                              • 12. Re: String replacing and pattern matching
                                796440
                                > Thanks everybody,
                                > Here's the small program with the input and output you asked for.
                                I asked for that to help me figure out what was wrong, but it seems uncle_alice already figured it out. Did you try what he suggested?
                                • 13. Re: String replacing and pattern matching
                                  807605
                                  That code doesn't even compile because you've got a bad variable name in there, but once I fixed that I got matches on the first two strings. If you want it to match the third one, you'll have to make the generated regex case insensitive and allow for non-word characters at the end as well as between every pair of words. You can't do that as part of the replaceAll() though, because the trailing characters are optional. Once I did all that I got matches on the first four strings.
                                      String[] tests = { "This is a test", "This-is-a,test", "This Is A test...",
                                                         "This,IS.a:test;", "This isa test", "This is a tests" };
                                     
                                      String query = "This is a test";
                                      query = query.trim().replaceAll( "\\s+", "[\\\\s\\\\pP]+" ) + "[\\s\\pP]*";
                                      Pattern pattern = Pattern.compile( query, Pattern.CASE_INSENSITIVE );
                                  
                                      for ( String test : tests ) {
                                        Matcher matcher = pattern.matcher( test );
                                        if ( matcher.matches() ) {
                                          System.out.println( matcher.group() );
                                        }
                                      }
                                  With this particular data \pP works as expected, but if you need to match '<', '>', '=', or any of a few other ASCII punctuation characters, you'll need to add \p{Punct} to the character class or just use the regex I suggested in my previous reply.
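The trailing "[\\s\\pP]*" is needed because matches() only succeeds when the pattern covers the entire input. If the input is a longer line of text and the phrase may appear anywhere in it, find() is an alternative. That is an assumption about the intended use and was not proposed in this thread. The sketch below uses made-up names (`MatchesVsFindDemo`, `compileQuery`, `wholeLine`, `anywhere`) and adds \b word boundaries, also an addition of this example, so that "This is a tests" is still rejected.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are assumptions for illustration only.
class MatchesVsFindDemo {
    // Build the phrase regex with word boundaries at both ends, ignoring case.
    static Pattern compileQuery(String query) {
        String regex = "\\b" + query.trim().replaceAll("\\s+", "[\\\\s\\\\pP]+") + "\\b";
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // matches(): the pattern must cover the whole line.
    static boolean wholeLine(Pattern p, String line) {
        return p.matcher(line).matches();
    }

    // find(): the pattern may match anywhere inside the line.
    static boolean anywhere(Pattern p, String line) {
        return p.matcher(line).find();
    }

    public static void main(String[] args) {
        Pattern p = compileQuery("This is a test");
        String line = "He said: this, is a TEST...";

        System.out.println(wholeLine(p, line));             // false
        System.out.println(anywhere(p, line));              // true
        System.out.println(anywhere(p, "This is a tests")); // false
        System.out.println(anywhere(p, "This isa test"));   // false
    }
}
```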
                                  • 14. Re: String replacing and pattern matching
                                    807605
                                    Hi uncle_alice,
                                    I'm so sorry, but I cannot seem to get this to work. I can't think of anything else to do but post my raw code. There must be something else I'm doing that I don't realize. This all compiles and runs fine, so maybe you can see. Again, I'm sorry I'm so dense, and thanks for bearing with me! One of these days, maybe I'll get this regex stuff.

                                    Jezzica85
                                    // The first part of the program splits up chapter lines
                                    
                                    import java.io.BufferedReader;
                                    import java.io.FileReader;
                                    import java.util.ArrayList;
                                    import java.util.Date;
                                    import java.util.LinkedHashSet;
                                    import java.util.List;
                                    import java.util.regex.Matcher;
                                    import java.util.regex.Pattern;
                                    
                                    public class PhraseGrep {
                                         private static LinkedHashSet<String> occurrences = new LinkedHashSet<String>();
                                         
                                         public static void main( String args[] ) {
                                              try {
                                                   List<List<String>> lines = new ArrayList<List<String>>();
                                                   BufferedReader reader = new BufferedReader( new FileReader( args[0] ) );
                                                   String query = args[1];
                                                   
                                                   query = query.trim().replaceAll( "\\s+", "[\\\\s\\\\pP]+" ) + "[\\s\\pP]*";
                                                   Pattern pattern = Pattern.compile( query, Pattern.CASE_INSENSITIVE );
                                    
                                                   int chapters = 0;
                                                   List<String> chapterLines = new ArrayList<String>();
                                    
                                                   String line = "";
                                                   while( ( line = reader.readLine() ) != null ) {
                                                        if( line.length() > 8 && line.substring( 0, 8 ).equalsIgnoreCase( "CHAPTER " ) ) {
                                                             if( chapters != 0 ) {
                                                                  lines.add( chapterLines );
                                                             }
                                                             
                                                             chapterLines = new ArrayList<String>();
                                                             chapters++;
                                                        } else if( line.length() > 0 ) {
                                                             chapterLines.add( line );
                                                        }
                                                   }
                                                   lines.add( chapterLines );
                                                    
                                    // Matching starts here
                                                   int times = 0;
                                                   for( int i = 0; i < lines.size(); i++ ) {
                                                        List<String> singleChapter = lines.get( i );
                                                        int chapterCounter = 0;
                                                        
                                                        for( String element: singleChapter ) {
                                                             Matcher matcher = pattern.matcher( element );
                                                             if( matcher.matches() ) {
                                                                  chapterCounter++;
                                                                  if( chapterCounter == 1 ) {
                                                                       occurrences.add( "Chapter " + Integer.toString( i + 1 ) );
                                                                  }
                                                                  
                                                                  for( int j = 0; j < element.length() - query.length(); j++ ) {
                                                                       if( element.charAt( j ) == query.charAt( 0 ) ) {
                                                                            String testString = element.substring( j, j + query.length() );
                                                                            Matcher matcher2 = pattern.matcher( testString );
                                                                            if( matcher2.matches() ) {
                                                                                 times++;
                                                                            }
                                                                       }
                                                                  }
                                                                  
                                                                  occurrences.add( element );
                                                             }
                                                        }
                                                   }
                                                   
                                                   System.out.println( "File: " + args[0] );
                                                   System.out.println( "Date: " + ( new Date() ) );
                                                   System.out.println( "Chapters: " + lines.size() );
                                                   System.out.println( "Query string: \"" + args[1] + "\"" );
                                                   
                                                   System.out.println( "Occurrences of query string (may be more than one in each line): " + times );
                                                   System.out.println();
                                                   
                                                   int counter = 1;
                                                   for( String element: occurrences ) {
                                                        if( ( element.startsWith( "Chapter" ) ) && ( counter > 1 ) ) {
                                                             System.out.println();
                                                        }
                                                        System.out.println( element );
                                                        counter++;
                                                   }
                                                   
                                              } catch( Exception e ) {
                                                   e.printStackTrace();
                                              }
                                         }
                                    }