9 Replies Latest reply: Jul 2, 2010 12:48 AM by 807580

    why can't i match this pattern?

    807580
      I am removing some text from an xml file. This is one technique I use:
      data = data.replaceAll(Pattern.compile("<[a-z/].*?>").pattern(), "");
      This has worked, except that there is one pattern it does not match. Here it is:
      def = "<a href=\"?se=on&sm=1&gr=ml&qt=%B5%F5%B5%B6&sv=KO&lp=0&item_id=07458900\">";

      As a test, I matched letter by letter until the regular expression stopped matching.

      def = def.replaceAll(Pattern.compile("<a href............").pattern(), ""); // match
      def = def.replaceAll(Pattern.compile("<a href.............").pattern(), ""); // no match [i added just one "."]

      This means I correctly match the string as far as: <a href="?se=on&sm=

      So, I can match the string as far as 19 total characters (12 wildcards and 7 literal characters). One more "." and nothing matches anymore. Any advice is welcome.
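The letter-by-letter test described above can be automated. What follows is only a minimal sketch, assuming the sample string is the one quoted in this thread; the class name WildcardProbe and the method maxWildcards are made up for illustration. Every attempt starts from the same unmodified input, so one attempt cannot affect the next.

```java
import java.util.regex.Pattern;

public class WildcardProbe {
    // Literal prefix from the original post; the dots are appended to it.
    static final String PREFIX = "<a href";

    // Sample string as quoted in the thread (assumed to match the real data).
    static final String INPUT =
        "<a href=\"?se=on&sm=1&gr=ml&qt=%B5%F5%B5%B6&sv=KO&lp=0&item_id=07458900\">";

    // Returns the largest number of "." wildcards that can follow PREFIX
    // while the pattern still finds a match in input, or -1 if even the
    // bare prefix does not occur.
    static int maxWildcards(String input) {
        int best = -1;
        StringBuilder regex = new StringBuilder(Pattern.quote(PREFIX));
        for (int dots = 0; dots <= input.length(); dots++) {
            if (!Pattern.compile(regex.toString()).matcher(input).find()) {
                break; // this many dots no longer fits
            }
            best = dots;
            regex.append('.'); // try one more wildcard next time
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("Input length:  " + INPUT.length());
        System.out.println("Max wildcards: " + maxWildcards(INPUT));
    }
}
```

If the data is what it appears to be, the wildcard count should only stop growing when the pattern becomes longer than the string itself. A smaller number would point at the data (for example an embedded line break, which "." does not match by default) rather than at the regex.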

      footnote
      The files I am working with came from a web server that uses the EUC-JP charset. My default character encoding is UTF-8, and the files are saved in UTF-8 format. I query the server using EUC-JP. If you look closely at the string, it contains the EUC-JP encoding (in hex) of 2 Japanese characters:
      B5-F5
      B5-B6
      I don't do any "decoding" of the bytes received from the server. Somehow, I just save them to a file, and that file's format is UTF-8. Then I perform pattern matching on that file. Now, because I can match everything else that I've tried, I don't think it's a systemic character encoding issue, because then no matching should work. But maybe there is a tiny subset of characters that gets fouled up by the EUC-JP / UTF-8 conversion. I am new to charsets, though, and I really don't know. Thanks.

      Edited by: rerf on Jul 1, 2010 8:24 PM
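One part of the charset question can be checked directly. The percent escapes in the sample string (%B5%F5%B5%B6) are themselves plain ASCII characters, so if the file really contains the string as quoted in this thread, its bytes are the same in UTF-8 and in US-ASCII and no conversion should be able to damage it. A minimal sketch of that check, assuming the quoted string is accurate; the class and method names are made up for illustration, and it says nothing about what the real file contains:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCheck {
    // Sample string as quoted in the thread (assumed to match the real data).
    static final String INPUT =
        "<a href=\"?se=on&sm=1&gr=ml&qt=%B5%F5%B5%B6&sv=KO&lp=0&item_id=07458900\">";

    // True if every char in s is in the 7-bit ASCII range.
    static boolean isPureAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // True if encoding s as UTF-8 and as US-ASCII yields identical bytes.
    static boolean sameBytesInBothCharsets(String s) {
        return Arrays.equals(
            s.getBytes(StandardCharsets.UTF_8),
            s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println("Pure ASCII: " + isPureAscii(INPUT));
        System.out.println("Same bytes: " + sameBytesInBothCharsets(INPUT));
    }
}
```

Running the same two checks on the string actually read from the file would show whether it differs from what was posted.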
        • 1. Re: why can't i match this pattern?
          699554
          rerf wrote:
          def = "<a href=\"?se=on&sm=1&gr=ml&qt=%B5%F5%B5%B6&sv=KO&lp=0&item_id=07458900\">";
          def = def.replaceAll(Pattern.compile("<a href............").pattern(), ""); // match
          def = def.replaceAll(Pattern.compile("<a href.............").pattern(), ""); // no match [i added just one "."]
          Do you know what the replaceAll method does? Do you know what happens when you assign a variable a value? Clearly def only contains a single match which is replaced in line 2 and the resulting String is assigned to the variable def. Therefore def now contains no matches, so of course line 3 will not find a match, because there are no more matches left to find!

          By the way, I've never seen the integration of Pattern.compile().pattern() used inside the replaceAll method of String. You can simply just add the regular expression as a String. Patterns are useful to avoid the overhead of recompiling the same pattern for multiple matches, which makes your use of it a little strange to me.

          Mel
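To make the point about reassignment concrete, here is a small sketch that runs the two statements one after the other, exactly as posted, and then runs the second pattern alone against the untouched string. The class and method names are hypothetical; only the string and the two patterns come from the post.

```java
public class SequentialReplaceDemo {
    // Sample string and patterns as posted in the thread.
    static final String INPUT =
        "<a href=\"?se=on&sm=1&gr=ml&qt=%B5%F5%B5%B6&sv=KO&lp=0&item_id=07458900\">";
    static final String TWELVE_DOTS = "<a href............";
    static final String THIRTEEN_DOTS = "<a href.............";

    // Both statements run back to back: the first call removes the only
    // "<a href" in the string, so the second call has nothing to match.
    static String sequential() {
        String def = INPUT;
        def = def.replaceAll(TWELVE_DOTS, "");
        def = def.replaceAll(THIRTEEN_DOTS, "");
        return def;
    }

    // The second pattern applied on its own to the original string.
    static String isolated() {
        return INPUT.replaceAll(THIRTEEN_DOTS, "");
    }

    public static void main(String[] args) {
        System.out.println("Sequential: " + sequential());
        System.out.println("Isolated:   " + isolated());
    }
}
```

In the sequential case the second replaceAll leaves the string unchanged, which looks like "no match" even though the pattern is fine.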
          • 2. Re: why can't i match this pattern?
            807580
            I don't understand why you are using
            data = data.replaceAll(Pattern.compile("<[a-z/].*?>").pattern(), "");
            rather than
            data = data.replaceAll("<[a-z/].*?>", "");
            though both forms work for me.

            I can only imagine that the data is not as you presented it. Can you post it as the original bytes, hex encoded?
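For anyone who wants to produce such a hex dump, the following is one possible sketch. The class and method names are made up, and it assumes the text is already in a String and that UTF-8 is the encoding of interest; for the raw bytes of a file, the byte array read from the file can be passed to toHex directly.

```java
import java.nio.charset.StandardCharsets;

public class HexDump {
    // Formats each byte as two upper-case hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    // Convenience wrapper: hex dump of the UTF-8 encoding of a String.
    static String hexOfUtf8(String s) {
        return toHex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // prints: 3C 61 20 68 72 65 66 3D
        System.out.println(hexOfUtf8("<a href="));
    }
}
```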
            • 3. Re: why can't i match this pattern?
              807580
              Melanie_Green wrote:
              rerf wrote:
              def = "<a href=\"?se=on&sm=1&gr=ml&qt=%B5%F5%B5%B6&sv=KO&lp=0&item_id=07458900\">";
              def = def.replaceAll(Pattern.compile("<a href............").pattern(), ""); // match
              def = def.replaceAll(Pattern.compile("<a href.............").pattern(), ""); // no match [i added just one "."]
              Do you know what the replaceAll method does?
              I think so.
              Do you know what happens when you assign a variable a value?
              A lot of things happen.
              Clearly def only contains a single match which is replaced in line 2 and the resulting String is assigned to the variable def.
              I guess to highlight the similarity of the expressions (the difference being one period), I put them side-by-side. Sorry I did not make this simple enough for you. Given the context of the surrounding comments, it is pretty clear. In the test code, one is always commented out resulting in dramatically different results. What do you think?
              Therefore def now contains no matches, so of course line 3 will not find a match, because there are no more matches left to find!

              By the way, I've never seen the integration of Pattern.compile().pattern() used inside the replaceAll method of String.
              Why not?
              You can simply just add the regular expression as a String. Patterns are useful to avoid the overhead of recompiling the same pattern for multiple matches, which makes your use of it a little strange to me.
              Why? It is a good way to write self-documenting code. Unless there is a need for blazing speed, I'd write code that programmers could easily understand. Let the HotSpot compilers worry about optimization. Please focus more on clearly written code that even non-Java guys can easily understand.
              Mel
              Thanks.
              • 4. Re: why can't i match this pattern?
                807580
                sabre150 wrote:
                I don't understand why you are using
                data = data.replaceAll(Pattern.compile("<[a-z/].*?>").pattern(), "");
                rather than
                data = data.replaceAll("<[a-z/].*?>", "");
                though both forms work for me.
                Yeah. I am just starting at this. And I was having trouble. So, I decided not to use any shortcuts. I wanted to remove the possibility of the error being my misuse of a shortcut.
                I can only imagine that the data is not as you presented it. Can you post it at the original bytes hex encoded?
                Yes! Thanks for the interest.

                But I am going on vacation tomorrow at 8 am. I gotta sleep. And I won't have internet until next Tuesday. I will work on this issue on my laptop over the holiday and get back to you Tuesday (if I haven't already solved this). Again, thanks.
                • 5. Re: why can't i match this pattern?
                  699554
                  rerf wrote:
                  Melanie_Green wrote:
                  Clearly def only contains a single match which is replaced in line 2 and the resulting String is assigned to the variable def.
                  I guess to highlight the similarity of the expressions (the difference being one period), I put them side-by-side. Sorry I did not make this simple enough for you.
                  Simplicity is not the issue; clarity is. We can't all read minds.
                  resulting in dramatically different results.
                  I beg to differ.
                  What do you think?
                  import java.util.regex.Matcher;
                  import java.util.regex.Pattern;

                  public class Foo {
                       public static void main(String[] args) {
                            final String def = "<a href=\"?se=on&sm=1&gr=ml&qt=%B5%F5%B5%B6&sv=KO&lp=0&item_id=07458900\">";
                            
                            Pattern p1 = Pattern.compile("<a href............");
                            Pattern p2 = Pattern.compile("<a href.............");
                            
                            Matcher m1 = p1.matcher(def);
                            Matcher m2 = p2.matcher(def);
                            
                            System.out.println("Def equals: ");
                            System.out.println(def);
                            System.out.println();
                            
                            System.out.println("P1 matches: ");
                            while(m1.find()) {
                                 System.out.println(m1.group());
                            }
                            System.out.println();
                            
                            System.out.println("P2 matches: ");
                            while(m2.find()) {
                                 System.out.println(m2.group());
                            }
                            System.out.println();
                       }
                  }
                  Output
                  Def equals: 
                  <a href="?se=on&sm=1&gr=ml&qt=%B5%F5%B5%B6&sv=KO&lp=0&item_id=07458900">
                  
                  P1 matches: 
                  <a href="?se=on&sm=
                  
                  P2 matches: 
                  <a href="?se=on&sm=1
                  This is testing each pattern in isolation (i.e. the underlying String is never changed).
                  By the way, I've never seen the integration of Pattern.compile().pattern() used inside the replaceAll method of String.
                  Why not?
                  Because it's such a long-winded way when you can simply use the String as the regular expression.
                  You can simply just add the regular expression as a String. Patterns are useful to avoid the overhead of recompiling the same pattern for multiple matches, which makes your use of it a little strange to me.
                  Why? It is a good way to write self-documenting code. Unless there is a need for blazing speed, I'd write code that programmers could easily understand. Let the HotSpot compilers worry about optimization. Please focus more on clearly written code that even non-Java guys can easily understand.
                  I completely disagree. I think you have made it harder to read, and I don't see how this is self-documenting code. All this documents is that the String is a pattern, and everyone with at least 6 months of Java experience should already know that the argument to one of the most commonly used methods is a regular expression.

                  Mel
                  • 6. Re: why can't i match this pattern?
                    807580
                    Melanie_Green wrote:
                    This is testing each pattern in isolation (i.e. the underlying String is never changed).
                    Now I do appreciate your writing that code. I didn't run it yet, but I take your word. So, I am going to focus on character encoding this weekend. And then on regular expressions. I will prep a really thorough explanation and post it Tuesday. Thanks for the interest as well.
                    • 7. Re: why can't i match this pattern?
                      807580
                      Melanie_Green wrote:
                      This is testing each pattern in isolation (i.e. the underlying String is never changed).
                      btw: I think Strings are always immutable?
                      • 8. Re: why can't i match this pattern?
                        699554
                        rerf wrote:
                        This is testing each pattern in isolation (i.e. the underlying String is never changed).
                        btw: I think Strings are always immutable?
                        Correct, but I used the final keyword, which means the variable def cannot be reassigned to refer to a different object; in general, the state of the object it refers to could still change. Since Strings are immutable, as you mentioned, the String is effectively a constant (i.e. it cannot change).

                        Mel
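The difference between a final reference and an immutable object can be seen in a few lines. This is just an illustrative sketch with made-up class and method names: a final StringBuilder can still be modified through its reference, while replaceAll on a String returns a new String and leaves the original as it was.

```java
public class FinalVsImmutable {
    // final only prevents reassigning sb; the StringBuilder itself is mutable.
    static String mutateThroughFinalReference() {
        final StringBuilder sb = new StringBuilder("abc");
        sb.append("def"); // allowed: changes the object's state
        // sb = new StringBuilder(); // would not compile: sb is final
        return sb.toString();
    }

    // replaceAll never changes the String it is called on; it returns a new one.
    static boolean replaceAllLeavesOriginalUntouched() {
        final String s = "<a href=\"x\">";
        String stripped = s.replaceAll("<[a-z/].*?>", "");
        return s.equals("<a href=\"x\">") && stripped.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(mutateThroughFinalReference());       // prints: abcdef
        System.out.println(replaceAllLeavesOriginalUntouched()); // prints: true
    }
}
```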
                        • 9. Re: why can't i match this pattern?
                          807580
                          rerf wrote:
                          Yeah. I am just starting at this. And I was having trouble. So, I decided not to use any shortcuts. I wanted to remove the possibility of the error being my misuse of a shortcut.
                          Sorry, but the form you are using is more likely to result in errors. Creating a Pattern object from a regex String, then just extracting that regex String again (and throwing away the Pattern object you have just created) to pass to replaceAll(), which immediately has to create a new Pattern object from the String, is just silly.
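As a sketch of the two points made in this thread: Pattern.compile(regex).pattern() simply hands back the regex String it was given, and a Pattern is worth keeping only when it is actually reused, in which case the replacement goes through a Matcher. The class name, the TAG constant and the helper methods below are made up for illustration; the regex is the one from the original post.

```java
import java.util.regex.Pattern;

public class CompiledPatternDemo {
    // Compiled once and reused for every call to stripTags.
    private static final Pattern TAG = Pattern.compile("<[a-z/].*?>");

    // Removes every match of TAG using the precompiled Pattern.
    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    // pattern() returns the String the Pattern was compiled from,
    // so compile(regex).pattern() is just a detour back to regex.
    static boolean roundTripIsIdentity(String regex) {
        return Pattern.compile(regex).pattern().equals(regex);
    }

    public static void main(String[] args) {
        System.out.println(roundTripIsIdentity("<[a-z/].*?>")); // prints: true
        System.out.println(stripTags("<b>bold</b> text"));      // prints: bold text
    }
}
```

For a one-off replacement, data.replaceAll("<[a-z/].*?>", "") does the same job with less ceremony.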