This discussion is archived
22 Replies. Latest reply: Mar 21, 2008 3:44 AM by 807591

Regex help wanted, split into three groups, escape the split character

807591 Newbie
Hi,

I've been racking my brain over this problem, which is probably not that hard. I want to split an input string into three parts, with a comma (,) between each part. For example,
a,b,c
should be split in three groups, 'a', 'b', and 'c'.

However, the input string might also read
a\,,b,c
, or worse
a\\,b,c
, or
a\\\,,b,c
.

How would you go about tackling this problem? Thanks.
  • 1. Re: Regex help wanted, split into three groups, escape the split character
    807591 Newbie
    Use a tokenizer with both
    "\\"  
    and
    ","
    (and
    "\," 
    ?) as delimiters
  • 2. Re: Regex help wanted, split into three groups, escape the split character
    DarrylBurke Guru Moderator
    I would use a simple two step process.
    String str = "a\\,,b,c"; // the string a\,,b,c; could have more backslashes
    str = str.replace ("\\", ""); // or replaceAll ("\\\\", "")
    String [] splitStr = str.split (",");
    If you want a one-liner, take a look at regexes and Pattern.split.

    db
  • 3. Re: Regex help wanted, split into three groups, escape the split character
    807591 Newbie
    Yeah, I'm looking for a one-liner, and am looking at patterns. I just found out about lookaheads and lookbehinds (which I think I need), but I think Java won't let me use lookbehinds with quantifiers. I need to split at a comma that is not preceded by an odd number of backslashes. Something like
    Pattern p = Pattern.compile("(?<!(\\\\\\\\)*\\\\),");
    I'm going to try something like that right now, but I think it'll fail, since I don't think the * is allowed in the lookbehind, and I still don't capture all the text before the , (which is not preceded by an odd number of backslashes) ... Hmm, trial and error :)

    Edit:
    As I thought.
    String input = "a,b,c";
    String [] parts = input.split("(?<!(\\\\\\\\)*\\\\),");
    yields the error
    Exception in thread "main" java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 12
    (?<!(\\\\)*\\),
    ^
         at java.util.regex.Pattern.error(Pattern.java:1650)
         at java.util.regex.Pattern.group0(Pattern.java:2415)
         at java.util.regex.Pattern.sequence(Pattern.java:1715)
         at java.util.regex.Pattern.expr(Pattern.java:1687)
         at java.util.regex.Pattern.compile(Pattern.java:1397)
         at java.util.regex.Pattern.<init>(Pattern.java:1124)
         at java.util.regex.Pattern.compile(Pattern.java:817)
         at java.lang.String.split(String.java:2103)
         at java.lang.String.split(String.java:2145)
         at test.Test.main(Test.java:10)
    Edited by: drvdijk on Mar 21, 2008 10:39 AM
  • 4. Re: Regex help wanted, split into three groups, escape the split character
    807591 Newbie
    String[] splitString = "a\\,,b,c".split("(\\\\+,)?,");
  • 5. Re: Regex help wanted, split into three groups, escape the split character
    807591 Newbie
    That yields
    a
    b
    c
    While it should yield
    a,
    b
    c
    Edited by: drvdijk on Mar 21, 2008 10:48 AM
  • 6. Re: Regex help wanted, split into three groups, escape the split character
    807591 Newbie
    Not on my computer!
            {
                String[] splitString = "a\\,,b,c".split("(\\\\+,)?,");
                System.out.println(Arrays.toString(splitString));
            }
            {
                String[] splitString = "a\\\\\\,,b,c".split("(\\\\+,)?,");
                System.out.println(Arrays.toString(splitString));
            }
            {
                String[] splitString = "a\\\\,,b,c".split("(\\\\+,)?,");
                System.out.println(Arrays.toString(splitString));
            }
    yields

    [a, b, c]
    [a, b, c]
    [a, b, c]
  • 7. Re: Regex help wanted, split into three groups, escape the split character
    807591 Newbie
    Change the comma to something else, like an @:
              {
                String[] splitString = "a\\@@b@c".split("(\\\\+@)?@");
                System.out.println(Arrays.toString(splitString));
            }
            {
                String[] splitString = "a\\\\\\@@b@c".split("(\\\\+@)?@");
                System.out.println(Arrays.toString(splitString));
            }
            {
                String[] splitString = "a\\\\@b@c".split("(\\\\+@)?@");
                System.out.println(Arrays.toString(splitString));
            }
    That yields
    [a, b, c]
    [a, b, c]
    [a\\, b, c]
    while it should've yielded
    [a\@, b, c]
    [a\\\@, b, c]
    [a\\, b, c]
    It's a bit of a hell to escape backslashes in patterns, isn't it... I can't even tell for sure whether the expected output I just typed is 100% correct. I'm going to use a different escape character to clear things up. So, from now on the string should be split at the @, and the escape character is a -.
    So, the input "a@b@c" should yield "a", "b", and "c".
    The input "a-@b@c" should yield "a-@b" and "c"
    The input "a--@b@c" should yield "a--", "b", and "c".
    The input "a---@b@c" should yield "a---@b" and "c".

    Now, let's try your thing again sabre150 :)

    Edited by: drvdijk on Mar 21, 2008 11:05 AM
  • 8. Re: Regex help wanted, split into three groups, escape the split character
    807591 Newbie
    You have lost me. Why have you put @ chars in my regex!

    I have been through your original post again and my regex does exactly what I think you asked for. What am I missing?

    Edited by: sabre150 on Mar 21, 2008 10:13 AM
  • 9. Re: Regex help wanted, split into three groups, escape the split character
    807591 Newbie
    I lost myself a little as well ;-) I added @'s and -'s to the discussion because I think it is a lot clearer not to talk about \ as the escape character: when you want one in a String in Java, you have to write two backslashes \ \ (without the space, otherwise the forum eats them), and when you want to match a \ in a pattern, you have to write \ \ \ \ (without spaces, forum again). I changed the comma because the Arrays.toString you use displays the elements of the array separated by commas. Just for clarity :)
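To make the two escaping levels concrete, here is a small sketch. The class, constant and method names are made up for illustration; only `String.length` and `String.matches` from the standard library are used.

```java
public class BackslashLevels {
    // In Java source, "\\" is a String holding ONE backslash character.
    static final String ONE_BACKSLASH = "\\";

    // In Java source, "\\\\" holds TWO backslashes, which the regex engine
    // then reads as "match one literal backslash".
    static final String REGEX_FOR_BACKSLASH = "\\\\";

    // True if the input consists of exactly one backslash, checked via the regex.
    static boolean isSingleBackslash(String s) {
        return s.matches(REGEX_FOR_BACKSLASH);
    }

    public static void main(String[] args) {
        System.out.println(ONE_BACKSLASH.length());           // characters in the string
        System.out.println(REGEX_FOR_BACKSLASH.length());     // characters in the regex text
        System.out.println(isSingleBackslash(ONE_BACKSLASH)); // regex applied to the string
    }
}
```

So every backslash the regex should match costs four backslashes in the Java source, which is why switching to - as the escape character keeps the examples readable.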

    I got a little further with the split problem, by using a lookbehind that matches a single character (the escape character -), and a non-capturing group that matches an even number of escape characters:
    String [] parts = input.split("(?<!-)(?:(--)*)@");
    for (int i = 0; i < parts.length; i++) {
         System.out.println(parts[i]);
    }
    This yields almost the right results:
    input "a@b@c" yields "a", "b", and "c"
    input "a-@b@c" yields "a-@b" and "c" (yay!)
    input "a--@b@c" yields "a", "b", and "c". Why is the -- consumed?!?
    input "a---@b@c" yields "a---@b" and "c" (yay!)
    input "a----@b@c" yields "a", "b", and "c". Why is the ---- consumed?!?
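A note on why the dashes disappear: split discards everything the pattern matches, and a non-capturing group still consumes the text it matches; it merely does not store it in a numbered group. One possible workaround, sketched below, moves the even run of escape characters into a lookbehind so that only the @ is consumed. Java needs a bounded lookbehind, so the sketch assumes at most 50 escape pairs directly before a delimiter. The class name, the method name and that bound are illustrative choices, not something from the thread.

```java
import java.util.Arrays;

public class EscapedSplit {
    // Split at '@' only when it is preceded by an even number (0, 2, 4, ...)
    // of '-' escape characters. The run of dashes sits inside a lookbehind,
    // so it is checked but not consumed. {0,50} is an assumed upper bound:
    // Java rejects unbounded quantifiers (* or +) inside a lookbehind.
    private static final String DELIMITER = "(?<=(?<!-)(?:--){0,50})@";

    static String[] splitFields(String input) {
        return input.split(DELIMITER);
    }

    public static void main(String[] args) {
        String[] inputs = {"a@b@c", "a-@b@c", "a--@b@c", "a---@b@c", "a----@b@c"};
        for (String s : inputs) {
            System.out.println(s + " -> " + Arrays.toString(splitFields(s)));
        }
    }
}
```

The returned fields are still escaped (for example "a--" rather than "a-"), so a separate unescaping step would be needed afterwards.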
  • 10. Re: Regex help wanted, split into three groups, escape the split character
    807591 Newbie
    You have still lost me! Are you saying that with an even number of slashes you want to leave them in, but with an odd number you want to ignore them?

    P.S. Non-capturing groups are a red herring and will not help you. Lookbehind will only help if you have a finite lookbehind expression.
  • 11. Re: Regex help wanted, split into three groups, escape the split character
    800351 Newbie
    This OP has finally described the problem and requirements in replies #5 and #7, but we still haven't got a formal definition of the input string(s). Do only the backslash(es) directly before a would-be delimiter character matter?
  • 12. Re: Regex help wanted, split into three groups, escape the split character
    807591 Newbie
    hiwa wrote:
    This OP has finally described the problem and requirements in replies #5 and #7, but we still haven't got a formal definition of the input string(s). Do only the backslash(es) directly before a would-be delimiter character matter?
    I can only get a hint at what the OP is asking so until the OP defines the problem better I can't help any further.

    Bye.
  • 13. Re: Regex help wanted, split into three groups, escape the split character
    800351 Newbie
    I can't help any further
    Ditto.
  • 14. Re: Regex help wanted, split into three groups, escape the split character
    807591 Newbie
    I indeed did not state my expectations of the output clearly in the opening post. My apologies for that. And thanks for the help so far :-)

    The input string is a string that represents one or more fields, separated by a delimiter character, the @ character from reply 7 and further on. However, the fields themselves can also contain the @ character, but those will be escaped with an escape character (-). If the field contains the escape character, it is escaped itself (--).

    Examples. These are single fields:
    "a-@b", which unescaped reads "a@b".
    "a--b", which unescaped reads "a-b"
    "a---@b", which unescaped reads "a-@b".
    These are two fields:
    "a@b", the fields "a" and "b".
    "a-@@b", the fields (escaped) "a-@" and "b", or unescaped "a@" and "b"
    "a--@b", the fields (escaped) "a--" and "b", or unescaped "a-" and "b"

    So, what I want to do with the split function (or any other regular expression), is to get the string into an array of unescaped fields.

    What I got with the code I posted in reply 9 almost works, except that the non-capturing group still consumes the even number of escape characters, so split removes them along with the delimiter.
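The field rules described in this reply can also be handled without a regular expression. The following is a minimal sketch of a single-pass parser that splits at unescaped @ characters and unescapes the fields as it goes. The class and method names are hypothetical, and keeping a dangling escape character at the very end of the input is an assumption, since the thread does not say what should happen there.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldParser {
    private static final char DELIMITER = '@';
    private static final char ESCAPE = '-';

    // Splits the input at unescaped delimiters and returns the unescaped fields.
    static List<String> parseFields(String input) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch == ESCAPE && i + 1 < input.length()) {
                // Escape character: take the next character literally.
                current.append(input.charAt(++i));
            } else if (ch == DELIMITER) {
                // Unescaped delimiter: the current field is complete.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                // Ordinary character, or a dangling escape at the end of the
                // input, which this sketch keeps as-is (an assumption).
                current.append(ch);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String[] inputs = {"a-@b", "a--b", "a---@b", "a@b", "a-@@b", "a--@b"};
        for (String s : inputs) {
            System.out.println(s + " -> " + parseFields(s));
        }
    }
}
```

Unlike String.split, this sketch keeps trailing empty fields, so "a@" gives the two fields "a" and "".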