This discussion is archived
12 Replies Latest reply: Nov 29, 2012 3:10 PM by 801904 RSS

comma and quote separated tokens

801904 Newbie
Currently Being Moderated
hello everyone

I am new to Java.... So my apologies if this is old question (if so a pointer to the solution would be great!)

I am reading a CSV flat-file of string values separated by (either or) commas or a combination of comma's and double quotes, with spaces optionally present between separators.

The double-quotes also need to be escaped whenever there are two of them together i.e. "[hello ""world""]" becomes [hello "world"]

i can't seem to get the correct Regex formula to handle all cases, it handles some but not others.... and rapidly becomes unreadable too.... 8^)

what is the best/most efficient way of reading and splitting such CSV flat file into a collection of individual strings?

is there a standard Java6/7 API that can do it or do I have to hand-code it ?



example:

---------------------------------------------------------------------------------
hello, "this,is column 2",this is col3 ,"1",col4,"col5", "col7 ", this, is more, "[ "" ]"
---------------------------------------------------------------------------------

this needs to be broken into separate Strings:

hello
this,is column 2 *(note embeded comma)*
this is col3
1
col4
col5
col7
this
is more
["]

--
Finally, the files can be rather large, many GB in size and the number of columns in each line will NOT be the same i.e. some lines will have 1 or 2 tokens and others in the same file may have 100 tokens....

Many thanks in advance
Yuri
  • 1. Re: comma and quote separated tokens
    EJP Guru
    Currently Being Moderated
    I'd be surprised if you could handle quotes correctly with a single regular expression. In compilers it requires hand coding.
  • 2. Re: comma and quote separated tokens
    gimbal2 Guru
    Currently Being Moderated
    EJP wrote:
    I'd be surprised if you could handle quotes correctly with a single regular expression. In compilers it requires hand coding.
    Indeed. Regular expressions are not the solution to all problems (but they may be PART of the solution). I've written plenty of parsers/importers in my day and they require really gnarly specific handling to make them production-stable; generic least-effort parsing code simply does not cover what the big bad unpredictable outside world throws at it.
  • 3. Re: comma and quote separated tokens
    Kayaman Guru
    Currently Being Moderated
    gimbal2 wrote:
    (but they may be PART of the solution)
    Also part of the problem ;)
  • 4. Re: comma and quote separated tokens
    gimbal2 Guru
    Currently Being Moderated
    Kayaman wrote:
    gimbal2 wrote:
    (but they may be PART of the solution)
    Also part of the problem ;)
    I find it quite enjoyable that of late we are more in agreement ;)
  • 5. Re: comma and quote separated tokens
    sabre150 Expert
    Currently Being Moderated
    EJP wrote:
    I'd be surprised if you could handle quotes correctly with a single regular expression. In compilers it requires hand coding.
    JosAH was fond of referring to the "Pumping Lemma" in relation to the problems of using regular expression for parsing. I can't say it's applicable here but I know from experience that covering all the cases in one regular expression is going to be difficult at best. I can see nothing in what the OP posted to indicate that just about any one of the free Java CSV parsers would do the task but I leave it to the OP to test them.
  • 6. Re: comma and quote separated tokens
    801904 Newbie
    Currently Being Moderated
    ok,
    so it sounds like i just have to hand-code this solution for the best performance and maximum flexibility and maintainability ?
    thanks again.
  • 7. Re: comma and quote separated tokens
    aksarben Journeyer
    Currently Being Moderated
    I think a simple call to String.split() will handle most of what you want. All you have to is provide a simple regx that covers the delimiter variations.
  • 8. Re: comma and quote separated tokens
    801904 Newbie
    Currently Being Moderated
    thanks. Will do.
  • 9. Re: comma and quote separated tokens
    EJP Guru
    Currently Being Moderated
    I think a simple call to String.split() will handle most of what you want. All you have to is provide a simple regx that covers the delimiter variations.
    All you have to do is overcome the Pumping Lemma first.
  • 10. Re: comma and quote separated tokens
    Alex Geller Newbie
    Currently Being Moderated
    This answer may help to clarify whether or not Pumping Lemma has any relevance here which I think is not the case.

    The following grammar hopefully reflects what the OP meant:

    file: lines
    lines: line ('\n' line)*
    line: token (',' token)*
    token: plainToken|stringToken
    plainToken: plainChar+
    stringToken: '"' plainCharOrEscapedQuote'"'
    plainCharOrEscapedQuote: plainChar|'"' '"'
    plainChar: '[a-zA-Z0-9]'

    We transform the grammar for "token" into a right regular grammar:

    L0:'"' L1 |[a-zA-Z0-9] L4
    L1:'"' L2 |[a-zA-Z0-9] L3
    L2:'"' L3 |
    L3:'"' L2 |[a-zA-Z0-9] L3
    L4=[a-zA-Z0-9] L5|
    L5:[a-zA-Z0-9] L5|

    Drawn as a DFA using ASCII art:
    <pre>
    (0) --'"'--> (1) --'"'--> ((2)) --'"'--> (3) --'"'--> ((2))
    --'[a-zA-Z0-9]'--> (3)
    --'[a-zA-Z0-9]'--> (3)
    --'[a-zA-Z0-9]'--> ((4)) --'[a-zA-Z0-9]'--> ((5)) --'[a-zA-Z0-9]'--> ((5))
    </pre>


    Examples:
    abc: (0)-a->((4))-b->((5))-c->((5))
    "abc: (0)-'"'->(1)-a->(3)-b->(3) // not accepting, need more
    "abc": (0)-'"'->(1)-a->(3)-b->(3)-'"'->((2))
    "abc"a: (0)-'"'->(1)-a->(3)-b->(3)-'"'->((2)) // error: rejecting a
    "abc"": (0)-'"'->(1)-a->(3)-b->(3)-'"'->((2))-'"'->((3))
    "abc""a: (0)-'"'->(1)-a->(3)-b->(3)-'"'->((2))-'"'->((3))-a->((3))
    "": (0)-'"'->(1)-'"'-> ((2))
    ""a: (0)-'"'->(1)-'"'-> ((2)) //error: rejecting a
    """a": (0)-'"'->(1)-'"'->((2))-'"'->(3)-a->((3))

    Since "token" is regular so is "line" since regular languages are closed under concatenation and Kleene star.
    The same hold true for "lines" so that csv files are regular.
    A language being regular does not imply that a regular expression for that language can be constructed (only the opposite it true).
    I suspect though, that it is possible for this language.

    Edited by: Alex Geller on 28.11.2012 13:46
  • 11. Re: comma and quote separated tokens
    Alex Geller Newbie
    Currently Being Moderated
    Find below a little class using a regular expression to match csv files.
    I am not familiar with the capabilities of "Matcher" regarding the evaluation of a stream and retrieving the content of the capturing groups and I didn't spend much time to find out what is going on.
    If you have the line "one,two,three,four,five" then you will see "one" in capturing group 1 and "five" in group 3. The values "two","three" and "four" are also in group 3 but you will not see them since Matcher.find() advances to the token "five" overwriting the values in group 3. Most likely I am using the class the wrong way and/or misunderstanding the concept of capturing groups.
    In my previous post I had missed the fact that comma (',') characters can appear in the strings. I also assume that both empty strings '""' and empty fields ',,' are allowed.
    import java.util.regex.*;
    public class CsvAcceptorTest {
        public static void main(String[] args) {
    /*
    csvfile lines
    lines line ( '\n' line) *
    line token ( ',' token) *
    token plainToken | stringToken
    plainToken plainChar *
    stringToken '"' plainCharEscapedCharOrComma * '"'
    plainCharEscapedCharOrComma plainChar | '"' '"' | ','
    plainChar '[a-zA-Z0-9 ]'
    */
            Pattern p=Pattern.compile("(?:(?:([a-zA-Z0-9 ]*)|(\"(?:[a-zA-Z0-9 ]|\"\"|,)*\"))(?:,(?:([a-zA-Z0-9 ]*)|(\"(?:[a-zA-Z0-9 ]|\"\"|,)*\")))*)(?:\\n(?:(?:([a-zA-Z0-9 ]*)|(\"(?:[a-zA-Z0-9 ]|\"\"|,)*\"))(?:,(([a-zA-Z0-9 ]*)|(\"(?:[a-zA-Z0-9 ]|\"\"|,)*\")))*))*");
            assert p.matcher("abc").matches();
            assert p.matcher("abc,,efg").matches();
            assert p.matcher("one,two,three\nfour,five,six").matches();
            assert p.matcher("\"abc\"").matches();
            assert p.matcher("\"a\"\"bc\"").matches();
            assert p.matcher("\"a\"\"b,,c\"").matches();
            assert !p.matcher("\"").matches(); // not complete
            Matcher m=p.matcher("one,two,three\n\"Two words\",\"\"\"Huh\"\" he said\",\"A list of 1,2,3 and 4\"");
            assert m.matches();
            m.reset();
            while(m.find()) {
                for(int i=0;i<m.groupCount();i++) {
                    System.out.println("Group "+i+":"+m.group(i));
                }
                System.out.println();
            }
        }
    }
    {code}
    
    Edited by: Alex Geller on Nov 29, 2012 9:23 AM                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
  • 12. Re: comma and quote separated tokens
    801904 Newbie
    Currently Being Moderated
    thanks Alex!

    this sort of Regex is truly frightening!! 8^)

    I ended up hand-coding it and used the simple Replace and Split methods.
    There is more code (perhaps 50 lines?) but its simple to follow, easy to understand and does what I need (handles leading and trailing multiple emtpy strings as well as other conditions.
    It's not water-tight but it will suffice.
    I dont know if "ful-strength" Regex will outperform my own hand-code, may be worth a test too, as I need to process files with several hundred million records and tens of GB in size.

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points