2 Replies Latest reply: Apr 10, 2007 11:48 PM by 807606

    String.split issue using regular expressions

      Hello All,

      First time posting, so please forgive me if I violate any etiquette....

      I have a standard csv text file that I am reading line by line. I thought each line was formatted as follows:

      FieldA, FieldB, FieldC, ......, FieldX

      and I was using str.split(",") to separate into tokens.

      However, I have found that some lines contain commas that are not supposed to be part of the parsing. For example, one line may look like

      FieldA, FieldB, "Field C has some commas, commas, and more commas in it", ...., FieldX

      Anytime that the line contains non-separating commas, the author is very careful to enclose the entire field containing the ignorable commas in double quotes. So, what I would like to do is to create a split expression that will split the line based on commas that are not inside of double quotes, but I have no idea how to do it. I have looked at the regex area of the tutorial and tried
      but it does not work.

      Any help appreciated!!
        • 1. Re: String.split issue using regular expressions
          For example, you can split on a plain "," and then join the affected fragments back into one string, starting at the fragment that begins with " and stopping at the fragment that ends with " :)
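          A minimal sketch of that split-then-rejoin idea, not a definitive implementation. The class name SplitAndRejoin and the helpers splitCsvLine and unquote are hypothetical, and the sketch assumes quoted fields never contain escaped quotation marks or line breaks:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitAndRejoin {
    // Hypothetical helper: split on every comma, then glue back together the
    // fragments that belong to one double-quoted field.
    // Assumption: no escaped quotes inside quoted fields.
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder pending = null; // non-null while inside a quoted field
        for (String part : line.split(",", -1)) {
            if (pending == null) {
                String t = part.trim();
                boolean opens = t.startsWith("\"");
                boolean closes = t.length() > 1 && t.endsWith("\"");
                if (opens && !closes) {
                    pending = new StringBuilder(part); // quoted field starts here
                } else {
                    fields.add(unquote(t));
                }
            } else {
                pending.append(',').append(part); // restore the comma that split() removed
                if (part.trim().endsWith("\"")) {
                    fields.add(unquote(pending.toString().trim()));
                    pending = null;
                }
            }
        }
        if (pending != null) {
            fields.add(pending.toString().trim()); // unterminated quote: keep the raw text
        }
        return fields;
    }

    // Strip one pair of enclosing double quotes, if present.
    static String unquote(String s) {
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    public static void main(String[] args) {
        String line =
            "FieldA, FieldB, \"Field C has some commas, commas, and more commas in it\", FieldX";
        for (String field : splitCsvLine(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

          Passing -1 to split() keeps trailing empty fields, which the one-argument form would silently drop.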
          • 2. Re: String.split issue using regular expressions
            The best way to parse CSV data is to use a dedicated tool, like the ones listed in this article. If you have to use regexes, or just want to learn how, a positive matching approach is preferable to split(). The following code, a modification of some sample code[1] that appears in The Book, assumes quoted fields in your data may contain escaped quotation marks in addition to commas, but may not contain line separators.
            import java.util.*;
            import java.util.regex.*;

            public class Test
            {
              public static void main(String... args)
              {
                String str =
                  "FieldA, FieldB, \"Field C with commas, commas, and more commas\", , FieldX";
                List<String> fields = parseCsvLine(str);
                int i = 0;
                for (String s : fields)
                {
                  System.out.printf("%nField %d: [%s]%n", i++, s);
                }
              }

              public static List<String> parseCsvLine(String line)
              {
                String regex =
                    "(?<=^|,)[ \t]*+"                 + // Optional leading whitespace,
                    "(?:"                             + // followed by either...
                    "\"([^\"]*+(?:\"\"[^\"]*+)*+)\""  + // ...a quoted field (group 1)...
                    "|"                               + // ...or...
                    "([^\",]*+)"                      + // ...some non-quoted text (group 2),
                    ")[ \t]*+";                         // and optional trailing whitespace.
                // Create a matcher for CSV fields, using the regex above.
                Matcher mMain = Pattern.compile(regex).matcher(line);
                // Create a matcher for doubled double-quotes
                Matcher mQuote = Pattern.compile("\"\"").matcher("");
                List<String> result = new ArrayList<String>();
                while (mMain.find())
                {
                  // If field was not quoted, take it as it is; if it was quoted,
                  // unescape any embedded quotation marks.
                  String field = (mMain.start(2) != -1) ? mMain.group(2).trim()
                               : mQuote.reset(mMain.group(1)).replaceAll("\"");
                  result.add(field);
                }
                return result;
              }
            }
            [1] http://regex.info/listing.cgi?ed=3&p=401
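            If a split()-based expression is still wanted, as the original question asks, one commonly used sketch is a lookahead that accepts a comma only when an even number of double quotes follows it on the line. The class name QuoteAwareSplit and the helper splitCsvLine are hypothetical. The sketch assumes quotes are balanced and fields contain no line separators, and it leaves the enclosing quotes on quoted fields:

```java
import java.util.regex.Pattern;

public class QuoteAwareSplit {
    // A comma is treated as a separator only if the rest of the line contains
    // an even number of double quotes, i.e. the comma sits outside any quotes.
    private static final Pattern SEPARATOR =
        Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    // Hypothetical helper: split on quote-aware commas and trim each field.
    static String[] splitCsvLine(String line) {
        String[] fields = SEPARATOR.split(line, -1); // -1 keeps trailing empty fields
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        return fields;
    }

    public static void main(String[] args) {
        String line =
            "FieldA, FieldB, \"Field C has some commas, commas, and more commas in it\", FieldX";
        for (String field : splitCsvLine(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

            The lookahead rescans the remainder of the line at every comma, so this is slower on long lines than the matching approach above, and the quotes still have to be stripped and unescaped afterwards.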