13 Replies Latest reply: Oct 18, 2011 2:13 PM by 835847

    Regex pattern

    835847
      I am using (?!^)\\b with String.split(), which returns every token and delimiter in the string.

      I would also like it to separate a ');' pair into ')' and ';'.

      Is there any way to modify my expression to do so?
        • 1. Re: Regex pattern
          835847
          I need help with regular expression pattern matching please.

          I would like to separate words, punctuation, parentheses, and other characters using split().

          So far I have come up with string.split("(?!^)\\b|;") which separates things nicely and lops off the end of line.

          I would like it to ignore spaces and return just the things desired above.

          Can you help? Seems hard to find folks with regex experience.
          • 2. Re: Regex pattern
            Kayaman
            Always Learning wrote:
            Seems hard to find folks with regex experience.
            And can you guess why that is...? : )
            There are a couple of regex wizards on the forums, but I'm not one of them.
            • 3. Re: Regex pattern
              796440
              Always Learning wrote:
              I need help with regular expression pattern matching please.

              I would like to separate words, punctuation, parentheses, and other characters using split().
              You'll need to provide a much more precise description than that.
              • 4. Re: Regex pattern
                sabre150
                Always Learning wrote:
                I need help with regular expression pattern matching please.

                I would like to separate words, punctuation, parentheses, and other characters using split().

                So far I have come up with string.split("(?!^)\\b|;") which separates things nicely and lops off the end of line.
                You can't have tested it very thoroughly, since it can't possibly get even close to "separate things nicely".

                >
                I would like it to ignore spaces and return just the things desired above.
                I really don't understand this.

                >
                Can you help? Seems hard to find folks with regex experience.
                There are plenty of us here who can help with regex, but most, like me, expect a better requirement specification. Please spend some time providing one.
                • 5. Re: Regex pattern
                  835847
                  My apologies gentlemen. Allow me to post what I have done thus far and also present it in code tags so the forum doesn't start pattern matching things itself.

                  Here is an example entry:
                  create table instructor (ID integer, name string);
                  What I would like to do is cleanly produce each word, punctuation mark, and parenthesis as a separate component; i.e. every word - ( - , - ) - ; all separate entities.

                  What I have come up with so far that is fairly strong is
                  "(?!^)\\b|;"
                  This cleanly separates every element but excludes the ; because it binds to ) and I have been unable to separate them. I also have to trim the ( token because of leading whitespace.
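
A minimal sketch of what this pattern does on the example entry, for anyone following along. The class name WordBoundarySplitDemo and its split helper are made up for illustration. Because \b is a zero-width match, the whitespace and punctuation between words stay inside the returned pieces, and there is no word boundary between ')' and ';' since neither is a word character. The ';' disappears because the "|;" alternative treats it as a delimiter, and delimiters are not returned by split().

```java
import java.util.Arrays;
import java.util.List;

final class WordBoundarySplitDemo {
    // Splits at every word boundary except the start of the line,
    // or at a literal semicolon (which is consumed as a delimiter).
    static List<String> split(String line) {
        return Arrays.asList(line.split("(?!^)\\b|;"));
    }

    public static void main(String[] args) {
        String line = "create table instructor (ID integer, name string);";
        // Brackets make the embedded whitespace visible.
        for (String token : split(line)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

On the example entry this yields pieces such as " (" and ", " with the whitespace still attached, which is why the '(' token needs trimming, and the final piece is ")" with no ';' after it.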
                  • 6. Re: Regex pattern
                    sabre150
                    The best I can come up with at the moment is
                          String[] tokens = line.replaceAll("\\s+(\\p{Punct})","$1").split("\\s+|(?<=\\p{Punct})|(?=\\p{Punct})");
                    I used the punctuation character set, but you can replace that with whatever set of punctuation characters you want. Of course, this will all fail if any of your tokens contains a punctuation character, which is likely to happen when your SQL contains a string literal. I couldn't find a way of getting rid of the empty token without taking the two-stage approach. I'm sure it can be done but since I'm not getting paid for this ...
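
A small sketch of the two-stage approach applied to the example entry, next to the split-only variant, to show where the empty token comes from. The class and method names (TwoStageSplitDemo, twoStage, splitOnly) are hypothetical; only the two patterns are taken from the line above.

```java
import java.util.Arrays;
import java.util.List;

final class TwoStageSplitDemo {
    // Split on runs of whitespace, or at the zero-width position
    // just after or just before any punctuation character.
    private static final String SPLIT = "\\s+|(?<=\\p{Punct})|(?=\\p{Punct})";

    // Stage 1 deletes whitespace that directly precedes punctuation; stage 2 splits.
    static List<String> twoStage(String line) {
        return Arrays.asList(line.replaceAll("\\s+(\\p{Punct})", "$1").split(SPLIT));
    }

    // Stage 2 alone, for comparison.
    static List<String> splitOnly(String line) {
        return Arrays.asList(line.split(SPLIT));
    }

    public static void main(String[] args) {
        String line = "create table instructor (ID integer, name string);";
        System.out.println(twoStage(line));
        System.out.println(splitOnly(line));
    }
}
```

With split alone, the space before '(' is matched by \s+ and the lookahead then matches again immediately in front of '(', so an empty string lands between "instructor" and "(". Removing that space first is what avoids it.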

                    Why are you not using an SQL parser for this?
                    • 7. Re: Regex pattern
                      835847
                      Wow, that is quite a regular expression, and much more significant than I expected. To answer your question, I am not using an SQL parser because I need to recursively descend on syntax that is not necessarily SQL compatible. It is pseudo science and requires a custom tokenizer and parser for a school project.

                      I will credit your handle in my code for such a beautiful set of regular expressions.

                      Edited by: Always Learning on Oct 18, 2011 9:51 AM
                      • 8. Re: Regex pattern
                        835847
                        Tested your regular expression and I am indeed satisfied. Tell me where to mail the check, you earned it. I am also not concerned about punctuation within a string literal, as there is nothing I can do about it anyway, and I devised a way to reconstruct the string literal, so it is not a problem. Well thought out and well approached solution. Thank you.
                        • 9. Re: Regex pattern
                          sabre150
                          Always Learning wrote:
                          Tested your regular expression and I am indeed satisfied. Tell me where to mail the check, you earned it. I am also not concerned about punctuation within a string literal, as there is nothing I can do about it anyway, and I devised a way to reconstruct the string literal, so it is not a problem.
                          Sorry but you cannot reconstruct the literal since the replaceAll() part throws away spaces before a punctuation character and multiple spaces are removed by the split().
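
To make the point concrete, here is a short sketch run against a made-up input containing a quoted literal. The input string and the class name LiteralLossDemo are assumptions for illustration; the patterns are the ones posted earlier in the thread.

```java
import java.util.Arrays;
import java.util.List;

final class LiteralLossDemo {
    // Same two-stage tokenization as posted earlier in the thread.
    static List<String> tokenize(String line) {
        return Arrays.asList(
                line.replaceAll("\\s+(\\p{Punct})", "$1")
                    .split("\\s+|(?<=\\p{Punct})|(?=\\p{Punct})"));
    }

    public static void main(String[] args) {
        // Hypothetical input: the literal holds a space before the comma
        // and two spaces after it.
        String line = "name = 'Smith ,  Jr'";
        System.out.println(tokenize(line));
    }
}
```

The literal comes back as the separate tokens ', Smith, a comma, Jr, and '. Nothing in that list records that there was one space before the comma and two after it, so the original text cannot be rebuilt from the tokens.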

                          Having done this sort of thing in the past I believe it will be much much better to adapt one of the free SQL parsers or if you can't do that then write your own tokenizer that doesn't use regular expressions.
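
One possible shape for a hand-written tokenizer that does not use regular expressions, as a rough sketch only. Everything about the grammar here is an assumption, since the thread never specifies it: words are runs of letters, digits, and underscores; string literals are delimited by single quotes with no escape handling; every other non-whitespace character is a token by itself. The class name HandTokenizer is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class HandTokenizer {
    // Assumed word characters: letters, digits, underscore.
    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace between tokens is skipped
            } else if (isWordChar(c)) {
                int start = i;
                while (i < input.length() && isWordChar(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i)); // word or number
            } else if (c == '\'') {
                // Assumed literal syntax: single quotes, no escapes.
                int end = input.indexOf('\'', i + 1);
                if (end < 0) {
                    throw new IllegalArgumentException(
                            "unterminated string literal at index " + i);
                }
                tokens.add(input.substring(i, end + 1)); // kept verbatim, quotes included
                i = end + 1;
            } else {
                tokens.add(String.valueOf(c)); // any other character stands alone
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("create table instructor (ID integer, name string);"));
        System.out.println(tokenize("name = 'Smith ,  Jr'"));
    }
}
```

Because the scanner decides what to do one character at a time, a quoted literal can be copied out whole with its internal spacing intact, which is the part the split-based approach cannot do.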
                          • 10. Re: Regex pattern
                            835847
                            I have just come to that realization after speaking so quickly. I am discouraged from using StringTokenizer, which would make my life simple. While I can use it and it is not deprecated, it is suggested I use split() or StreamTokenizer. I am at a bit of a loss now.
                            • 11. Re: Regex pattern
                              796440
                              Always Learning wrote:
                              I have just come to that realization after speaking so quickly. I am discouraged from using StringTokenizer, which would make my life simple. While I can use it and it is not deprecated, it is suggested I use split() or StreamTokenizer. I am at a bit of a loss now.
                              He didn't suggest using StringTok or StreamTok. He's saying you should write your own tokenizer.
                              • 12. Re: Regex pattern
                                835847
                                jverd wrote:
                                Always Learning wrote:
                                I have just come to that realization after speaking so quickly. I am discouraged from using StringTokenizer, which would make my life simple. While I can use it and it is not deprecated, it is suggested I use split() or StreamTokenizer. I am at a bit of a loss now.
                                He didn't suggest using StringTok or StreamTok. He's saying you should write your own tokenizer.
                                I agree with you and with his assessment. I have done that by modifying a previously written tokenizer for my own needs. I suppose if you need it done right, do it yourself. I had wanted to go with mainstream Java implementations, but in the end I am using my own.
                                • 13. Re: Regex pattern
                                  835847
                                  In the end, modifying an existing tokenizer is what I did. I appreciate the guidance within this thread pointing me to do something I figured I would eventually need to do if I could not use Java's tokenizers.