12 Replies Latest reply on Nov 29, 2012 11:10 PM by 801904

    comma and quote separated tokens

      hello everyone

      I am new to Java.... So my apologies if this is old question (if so a pointer to the solution would be great!)

      I am reading a CSV flat-file of string values separated by (either or) commas or a combination of comma's and double quotes, with spaces optionally present between separators.

      The double-quotes also need to be escaped whenever there are two of them together i.e. "[hello ""world""]" becomes [hello "world"]

      i can't seem to get the correct Regex formula to handle all cases, it handles some but not others.... and rapidly becomes unreadable too.... 8^)

      what is the best/most efficient way of reading and splitting such CSV flat file into a collection of individual strings?

      is there a standard Java6/7 API that can do it or do I have to hand-code it ?


      hello, "this,is column 2",this is col3 ,"1",col4,"col5", "col7 ", this, is more, "[ "" ]"

      this needs to be broken into separate Strings:

      this,is column 2 *(note embeded comma)*
      this is col3
      is more

      Finally, the files can be rather large, many GB in size and the number of columns in each line will NOT be the same i.e. some lines will have 1 or 2 tokens and others in the same file may have 100 tokens....

      Many thanks in advance
        • 1. Re: comma and quote separated tokens
          I'd be surprised if you could handle quotes correctly with a single regular expression. In compilers it requires hand coding.
          • 2. Re: comma and quote separated tokens
            EJP wrote:
            I'd be surprised if you could handle quotes correctly with a single regular expression. In compilers it requires hand coding.
            Indeed. Regular expressions are not the solution to all problems (but they may be PART of the solution). I've written plenty of parsers/importers in my day and they require really gnarly specific handling to make them production-stable; generic least-effort parsing code simply does not cover what the big bad unpredictable outside world throws at it.
            • 3. Re: comma and quote separated tokens
              gimbal2 wrote:
              (but they may be PART of the solution)
              Also part of the problem ;)
              • 4. Re: comma and quote separated tokens
                Kayaman wrote:
                gimbal2 wrote:
                (but they may be PART of the solution)
                Also part of the problem ;)
                I find it quite enjoyable that of late we are more in agreement ;)
                • 5. Re: comma and quote separated tokens
                  EJP wrote:
                  I'd be surprised if you could handle quotes correctly with a single regular expression. In compilers it requires hand coding.
                  JosAH was fond of referring to the "Pumping Lemma" in relation to the problems of using regular expression for parsing. I can't say it's applicable here but I know from experience that covering all the cases in one regular expression is going to be difficult at best. I can see nothing in what the OP posted to indicate that just about any one of the free Java CSV parsers would do the task but I leave it to the OP to test them.
                  • 6. Re: comma and quote separated tokens
                    so it sounds like i just have to hand-code this solution for the best performance and maximum flexibility and maintainability ?
                    thanks again.
                    • 7. Re: comma and quote separated tokens
                      I think a simple call to String.split() will handle most of what you want. All you have to is provide a simple regx that covers the delimiter variations.
                      1 person found this helpful
                      • 8. Re: comma and quote separated tokens
                        thanks. Will do.
                        • 9. Re: comma and quote separated tokens
                          I think a simple call to String.split() will handle most of what you want. All you have to is provide a simple regx that covers the delimiter variations.
                          All you have to do is overcome the Pumping Lemma first.
                          • 10. Re: comma and quote separated tokens
                            Alex Geller
                            This answer may help to clarify whether or not Pumping Lemma has any relevance here which I think is not the case.

                            The following grammar hopefully reflects what the OP meant:

                            file: lines
                            lines: line ('\n' line)*
                            line: token (',' token)*
                            token: plainToken|stringToken
                            plainToken: plainChar+
                            stringToken: '"' plainCharOrEscapedQuote'"'
                            plainCharOrEscapedQuote: plainChar|'"' '"'
                            plainChar: '[a-zA-Z0-9]'

                            We transform the grammar for "token" into a right regular grammar:

                            L0:'"' L1 |[a-zA-Z0-9] L4
                            L1:'"' L2 |[a-zA-Z0-9] L3
                            L2:'"' L3 |
                            L3:'"' L2 |[a-zA-Z0-9] L3
                            L4=[a-zA-Z0-9] L5|
                            L5:[a-zA-Z0-9] L5|

                            Drawn as a DFA using ASCII art:
                            (0) --'"'--> (1) --'"'--> ((2)) --'"'--> (3) --'"'--> ((2))
                            --'[a-zA-Z0-9]'--> (3)
                            --'[a-zA-Z0-9]'--> (3)
                            --'[a-zA-Z0-9]'--> ((4)) --'[a-zA-Z0-9]'--> ((5)) --'[a-zA-Z0-9]'--> ((5))

                            abc: (0)-a->((4))-b->((5))-c->((5))
                            "abc: (0)-'"'->(1)-a->(3)-b->(3) // not accepting, need more
                            "abc": (0)-'"'->(1)-a->(3)-b->(3)-'"'->((2))
                            "abc"a: (0)-'"'->(1)-a->(3)-b->(3)-'"'->((2)) // error: rejecting a
                            "abc"": (0)-'"'->(1)-a->(3)-b->(3)-'"'->((2))-'"'->((3))
                            "abc""a: (0)-'"'->(1)-a->(3)-b->(3)-'"'->((2))-'"'->((3))-a->((3))
                            "": (0)-'"'->(1)-'"'-> ((2))
                            ""a: (0)-'"'->(1)-'"'-> ((2)) //error: rejecting a
                            """a": (0)-'"'->(1)-'"'->((2))-'"'->(3)-a->((3))

                            Since "token" is regular so is "line" since regular languages are closed under concatenation and Kleene star.
                            The same hold true for "lines" so that csv files are regular.
                            A language being regular does not imply that a regular expression for that language can be constructed (only the opposite it true).
                            I suspect though, that it is possible for this language.

                            Edited by: Alex Geller on 28.11.2012 13:46
                            • 11. Re: comma and quote separated tokens
                              Alex Geller
                              Find below a little class using a regular expression to match csv files.
                              I am not familiar with the capabilities of "Matcher" regarding the evaluation of a stream and retrieving the content of the capturing groups and I didn't spend much time to find out what is going on.
                              If you have the line "one,two,three,four,five" then you will see "one" in capturing group 1 and "five" in group 3. The values "two","three" and "four" are also in group 3 but you will not see them since Matcher.find() advances to the token "five" overwriting the values in group 3. Most likely I am using the class the wrong way and/or misunderstanding the concept of capturing groups.
                              In my previous post I had missed the fact that comma (',') characters can appear in the strings. I also assume that both empty strings '""' and empty fields ',,' are allowed.
                              import java.util.regex.*;
                              public class CsvAcceptorTest {
                                  public static void main(String[] args) {
                              csvfile lines
                              lines line ( '\n' line) *
                              line token ( ',' token) *
                              token plainToken | stringToken
                              plainToken plainChar *
                              stringToken '"' plainCharEscapedCharOrComma * '"'
                              plainCharEscapedCharOrComma plainChar | '"' '"' | ','
                              plainChar '[a-zA-Z0-9 ]'
                                      Pattern p=Pattern.compile("(?:(?:([a-zA-Z0-9 ]*)|(\"(?:[a-zA-Z0-9 ]|\"\"|,)*\"))(?:,(?:([a-zA-Z0-9 ]*)|(\"(?:[a-zA-Z0-9 ]|\"\"|,)*\")))*)(?:\\n(?:(?:([a-zA-Z0-9 ]*)|(\"(?:[a-zA-Z0-9 ]|\"\"|,)*\"))(?:,(([a-zA-Z0-9 ]*)|(\"(?:[a-zA-Z0-9 ]|\"\"|,)*\")))*))*");
                                      assert p.matcher("abc").matches();
                                      assert p.matcher("abc,,efg").matches();
                                      assert p.matcher("one,two,three\nfour,five,six").matches();
                                      assert p.matcher("\"abc\"").matches();
                                      assert p.matcher("\"a\"\"bc\"").matches();
                                      assert p.matcher("\"a\"\"b,,c\"").matches();
                                      assert !p.matcher("\"").matches(); // not complete
                                      Matcher m=p.matcher("one,two,three\n\"Two words\",\"\"\"Huh\"\" he said\",\"A list of 1,2,3 and 4\"");
                                      assert m.matches();
                                      while(m.find()) {
                                          for(int i=0;i<m.groupCount();i++) {
                                              System.out.println("Group "+i+":"+m.group(i));
                              Edited by: Alex Geller on Nov 29, 2012 9:23 AM                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                              1 person found this helpful
                              • 12. Re: comma and quote separated tokens
                                thanks Alex!

                                this sort of Regex is truly frightening!! 8^)

                                I ended up hand-coding it and used the simple Replace and Split methods.
                                There is more code (perhaps 50 lines?) but its simple to follow, easy to understand and does what I need (handles leading and trailing multiple emtpy strings as well as other conditions.
                                It's not water-tight but it will suffice.
                                I dont know if "ful-strength" Regex will outperform my own hand-code, may be worth a test too, as I need to process files with several hundred million records and tens of GB in size.