11 Replies. Latest reply: Aug 7, 2008 8:17 AM by 843785

    Comma Separated Values (CSV) Splitter. The guts of a CsvReader.

    800308
      Folks,

      I'm posting this in the hope that someone else will find it useful... took me a while to figure it out.
      package krc.utilz.io;
      
      import java.util.List;
      import java.util.ArrayList;
      
      /**
       * Comma Separated Values (CSV) Splitter. The guts of a CsvReader.
       * <p>
       * Splits a line of text into fields on a field-separator character. When the
       * field-separator character is quoted it is treated as data (not a separator).
       * The quote-character may be escaped using two consecutive quote-character.
       * <p>
       * <strong>Important Note:</strong><br>
       * This implementation DOES NOT support line-breaks within data-fields as
       * specified by the CSV standard. I chose not to support this because I don't
       * know how, and I don't need it, and because it's just silly!
       * If you really must put line-breaks within fields then I suggest that you
       * encode/decode them as &lt;br&gt;, or just use XML FFS!
       * <p>
       * <strong>Default Configuration:</strong><ul>
       * <li>the fieldSeperator is ','
       * <li>the quoteCharacter is '"'
       * <li>treatConsecutiveSeperatorsAsOne is false
       * <li>trimFields is true
       * </ul>
       * <p>
       * <strong>Usage Example:</strong><br>
       * <code>
       *  // "line" would normally be read from an input file.
       *  String line = "  10101179664,,\"SMITH, JOHN D.\",  \"\"\"SINGLE\"\"\",ACTIVE,\"JOHN \"\"THE DODGER\"\" SMITH\",19/06/1997,\"\"\"\"\"\",\"\"\"    \"\"\"";
       *  // something to verify the result against.
       *  String[] expected = new String[] {"10101179664","","SMITH, JOHN D.","\"SINGLE\"","ACTIVE","JOHN \"THE DODGER\" SMITH","19/06/1997","\"\"","\"    \""};
       *  // Get a splitter with default configuration. See the javadoc.
       *  CSVSplitter splitter = new CSVSplitter();
       *  // Split the given line into fields
       *  String[] fields = splitter.split(line);
       *  System.err.println("DEBUG: "+java.util.Arrays.toString(fields));
       *  // verify that the fields are as expected.
       *  assert java.util.Arrays.equals(fields, expected);
       * </code>
       * <strong>References:</strong><ul>
       * <li>http://supercsv.sourceforge.net/csvSpecification.html - A clear
       *  statement of what one person believes the CSV format should be. By the way,
       *  SuperCSV is amazing, but it's missing some configurations, especially the
       *  trimFields option, which is contentious and MUST therefore be configurable.
       * <li>http://en.wikipedia.org/wiki/Comma-separated_values - is a good summary.
       * <li>http://tools.ietf.org/html/rfc4180 - is a memo, NOT a recommendation, but
       *  THERE IS NO STANDARD!
       * </ul>
       * <strong>The Legals:</strong>
       * Please feel free to use and adapt this software however you see fit, except
       * if you are military, pseudo-military, para-military, a military contractor,
       * or a manufacturer of guns, bombs, war planes, tanks, war ships, or any other
       * instrument of mass murder, in which case I forbid you from retaining a
       * copy of this software in way shape or form. If you are George Bush please
       * consider suicide.
       *
       * This software is provided, as is, where is. The author accepts no liability
       * for loss or damage pursuant to its use, not even that implied by any
       * alleged fitness for merchantability.
       *
       * Please do not charge for this software. You'd be a prat to ask someone to
       * pay for something I gave you for free... except if you utilize this code
       * in a substantive project, of which this is but a wee insignificant component,
       * in which case it'd be nice if you sling some bucks at your preferred charity
       * on my behalf. Everyone's gotta eat.
       *
       * If you find a repeatable bug in this software then please write a test-case
       * and post it back here... maybe some kind soul will fix it.
       * If you fix a bug please post the new version here. If you enhance this
       * software then please post it back here.
       *
       * If you like this software then do a little jig.
       */
      
      public class CSVSplitter
      {
      
        /** The field-separator character. The default is ','. */
        public final char fieldSeperator;
      
        /** The quotation character. The default is '"'. Eg: 12345,"Builder, Bob",BUILDER,07 389 3896. */
        public final char quoteCharacter;
      
        /** if true then "cat,,,dog" is equivalent to "cat,dog". The default is false. */
        public final boolean treatConsecutiveSeperatorsAsOne;
      
        /** if true then " cat, dog  " is equivalent to "cat,dog". The default is true. */
        public final boolean trimFields;
      
        /**
         * Initialise the CSVSplitter with default values
         *   field-seperator=','
         *   quoteCharacter='"'
         *   treatConsecutiveSeperatorsAsOne=false
         *   trimFields=true
         *
         */
        public CSVSplitter() {
          this(',', '"', false, true);
        }
      
        /**
         * Initialise the CSVSplitter.
         * @param fieldSeperator char - the field-seperator character. The default is ','.
         * @param quoteCharacter char - The quotation character. The default is '"'.
         * @param treatConsecutiveSeperatorsAsOne boolean - if true then "cat,,,dog" is equivalent to "cat,dog".
         * @param trimFields boolean - if true then " cat, dog  " is equivalent to "cat,dog".
         */
        public CSVSplitter(char fieldSeperator, char quoteCharacter, boolean treatConsecutiveSeperatorsAsOne, boolean trimFields) {
          this.fieldSeperator = fieldSeperator;
          this.treatConsecutiveSeperatorsAsOne = treatConsecutiveSeperatorsAsOne;
          this.quoteCharacter = quoteCharacter;
          this.trimFields = trimFields;
        }
      
        public String[] split(String line) {
          List<String> tokens = new ArrayList<String>();
          char[] characters = line.toCharArray(); // an array allows for lookaround
          int n = characters.length; // (my one micro-optimization;-)
          boolean quoted = false;
          StringBuilder token = new StringBuilder();
          for (int i=0; i<n; i++) { // for each character in the line.
            char character = characters[i]; // the current character
            char next = i+1==n ? '\0' : characters[i+1]; // the next character
            boolean discard = false; // if discard then throw this character away
            if (character == quoteCharacter) {
              if ( !quoted ) {
                quoted = true;
                discard = true;
              } else {
                if (next==quoteCharacter) {
                  token.append(quoteCharacter);
                  i++; // and skip the next character
                  continue; // leaving quoted=true
                } else {
                  quoted = false;
                  discard = true;
                }
              }
            } else if (character == fieldSeperator) {
              if ( !quoted ) {
                if ( treatConsecutiveSeperatorsAsOne && token.length()==0 ) {
                  discard = true; // chuck subsequent fieldSeperator(s) ,,,
                } else {
                  // encountered an active field separator
                  tokens.add(asString(token));
                  token.setLength(0); // clear the token
                  discard = true;
                }
              }
            }
            if(!discard) {
              token.append(character);
            }
          }
          // append the final token to the result dealing with the special case of
          // an empty input line, which should return an empty list. The caller
          // should filter out empty lines if that is the required behaviour.
          String strtok = asString(token);
          if ( !(tokens.isEmpty() && strtok.length()==0) ) {
            tokens.add(strtok);
          }
          return tokens.toArray(new String[0]);
        }

        private String asString(StringBuilder token) {
          String strtok = token.toString();
          return (trimFields ? strtok.trim() : strtok);
        }

        public static void main(String[] args) {
          try {
            usageExample();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }

        private static void usageExample() {
          // "line" would normally be read from an input file.
          String line = " 10101179664,,\"SMITH, JOHN D.\", \"\"\"SINGLE\"\"\",ACTIVE,\"JOHN \"\"THE DODGER\"\" SMITH\",19/06/1997,\"\"\"\"\"\",\"\"\" \"\"\"";
          // something to verify the result against.
          String[] expected = new String[] {"10101179664","","SMITH, JOHN D.","\"SINGLE\"","ACTIVE","JOHN \"THE DODGER\" SMITH","19/06/1997","\"\"","\" \""};
          // Get a splitter with default configuration. See the javadoc.
          CSVSplitter splitter = new CSVSplitter();
          // Split the given line into fields
          String[] fields = splitter.split(line);
          System.err.println("DEBUG: "+krc.utilz.StringArrayz.join("|",fields));
          // verify that the fields are as expected.
          assert java.util.Arrays.equals(fields, expected);
        }

      }
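On the unsupported line-breaks-within-fields case: here is a minimal sketch (not part of the posted class) of one way a caller could stitch physical lines back into logical records before handing them to split(). It assumes the quote character is '"' and that an odd running count of quote characters means a quoted field is still open. The names LogicalLines, joinRecords and countQuotes are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LogicalLines {

  /** Counts occurrences of the quote character in the given text. */
  static int countQuotes(String text, char quote) {
    int count = 0;
    for (int i = 0; i < text.length(); i++) {
      if (text.charAt(i) == quote) count++;
    }
    return count;
  }

  /**
   * Joins physical lines into logical records. While the running number of
   * quote characters is odd, a quoted field is assumed to be still open, so
   * the next physical line is appended to the current record with a '\n'.
   */
  static List<String> joinRecords(List<String> physicalLines, char quote) {
    List<String> records = new ArrayList<String>();
    StringBuilder record = new StringBuilder();
    int quotes = 0;
    boolean open = false;
    for (String line : physicalLines) {
      if (open) record.append('\n'); // restore the line break inside the field
      record.append(line);
      quotes += countQuotes(line, quote);
      open = (quotes % 2 != 0);
      if (!open) {
        records.add(record.toString());
        record.setLength(0);
        quotes = 0;
      }
    }
    if (open) records.add(record.toString()); // unterminated quote: emit what we have
    return records;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("1,\"line one", "line two\",x", "2,plain,y");
    for (String record : joinRecords(lines, '"')) {
      System.out.println("[" + record.replace("\n", "\\n") + "]");
    }
  }
}
```

Escaped quotes come in pairs, so they do not disturb the parity check. The joined record keeps the '\n' inside the quoted field, and because split() appends every non-quote character it sees while quoted, that line break should come through as data.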


      Cheers. Keith.
        • 1. Re: Comma Separated Values (CSV) Splitter. The guts of a CsvReader.
          PhHein
          Keith, have you considered using [SDN Share|http://sdnshare.sun.com/] ?
          • 2. Re: Comma Separated Values (CSV) Splitter. The guts of a CsvReader.
            843785
            My Hero!

            I was just looking to do this using regex but it didn't seem to be possible. Your solution came as if it was requested.

            Thanks.
            • 3. Re: Comma Separated Values (CSV) Splitter. The guts of a CsvReader.
              843785
              enter_name wrote:
              My Hero!

              I was just looking to do this using regex but it didn't seem to be possible. Your solution came as if it was requested.

              Thanks.
              There are several open source Java csv parsers. Google finds them.
              • 4. Re: Comma Separated Values (CSV) Splitter. The guts of a CsvReader.
                800308
                have you considered using SDN Share ?
                I had no idea that such a thing exists... thank you.... I shall certainly do so in future.
                • 5. Re: Comma Separated Values (CSV) Splitter. The guts of a CsvReader.
                  843785
                  There are several open source Java csv parsers. Google finds them.
                  I like [Ostermiller's|http://ostermiller.org/utils/CSV.html].

                  ~
                  • 6. Re: Comma Separated Values (CSV) Splitter. The guts of a CsvReader.
                    800308
                    I was just looking to do this using regex but it didn't seem to be possible.
                    I'm sure it's possible using regex's... I just didn't know how (I'm no regex wizz) but I did have some idea of how to do it "manually", so I just did it "manually".

                    Sometimes doing things the easy way is too difficult.

                    http://forums.sun.com/thread.jspa?forumID=54&threadID=5318897&start=6 and the explanation at post 8...
                    and my fumbling attempts to demystify the explanation sprawled over the next few pages.
                    Your solution came as if it was requested.
                    I'm glad you found it useful... if you find any bugs in the course of the next few days please be sure to post back.... after that it can sink or swim on its own merits.

                    Cheers. Keith.
                    • 7. Re: Comma Separated Values (CSV) Splitter. The guts of a CsvReader.
                      800308
                      yawmark wrote:
                      There are several open source Java csv parsers. Google finds them.
                      I like [Ostermiller's|http://ostermiller.org/utils/CSV.html].

                      ~
                      I've used the [Apache Commons CSV|http://commons.apache.org/sandbox/csv/] parser in previous projects, but this time I needed to be able to suppress trimming of trailing spaces, for justifying fields in a report... I'm too dumb to follow the Apache code.

                      I took a look at SuperCsv, but it also insisted on eating white spaces... and the size of the code... it's a sledge-hammer for walnuts.

                      [Ostermiller's|http://ostermiller.org/utils/CSV.html] looks nice and simple (at face value)... do you know off hand if it has the ability to not trim leading/trailing whitespaces?

                      FYI...
                       * <strong>References:</strong><ul>
                       * <li>http://supercsv.sourceforge.net/csvSpecification.html - A clear
                       *  statement of what one person believes the CSV format should be. By the way,
                       *  SuperCSV is amazing, but it's missing some configurations, especially the
                       *  trimFields option, which is contentious and MUST therefore be configurable.
                       * <li>http://en.wikipedia.org/wiki/Comma-separated_values - is a good summary.
                       * <li>http://tools.ietf.org/html/rfc4180 - is a memo, NOT a recommendation, but
                       *  THERE IS NO STANDARD!
                       * </ul>
                      Also it'd be nice to replace this (one day when I grow up) with a regex implementation, for flexibility... then you could call it a WSV... Whatever Separated Values.

                      I'd also like a SV reader/writer pair which strips/adds line prefixes and suffixes... I've got a quick hack which does it specifically, but nothing generic enough to be reusable (except via the ole' copy-paste-edit).

                      Cheers. Keith.
                      • 8. Re: Comma Separated Values (CSV) Splitter. The guts of a CsvReader.
                        843785
                        corlettk wrote:
                        I was just looking to do this using regex but it didn't seem to be possible.
                        I'm sure it's possible using regex's...
                        At best it is not easy doing this using regex. One problem comes from nested ',' and '"' characters where the rules for splitting a line are fairly complex.
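For what it's worth, one commonly quoted regex approach is to split on a comma only when it is followed by an even number of quote characters. The sketch below assumes ',' and '"' as separator and quote, and balanced quotes on the line. RegexCsvSplit, split and unquote are illustrative names, not anything from this thread.

```java
import java.util.Arrays;

public class RegexCsvSplit {

  // Matches a comma only when an even number of quote characters follows it,
  // i.e. when the comma is not inside a quoted field.
  static final String COMMA_OUTSIDE_QUOTES = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

  /** Splits on unquoted commas; the quotes themselves are left in place. */
  static String[] split(String line) {
    return line.split(COMMA_OUTSIDE_QUOTES, -1); // -1 keeps trailing empty fields
  }

  /** Strips one pair of surrounding quotes and un-doubles escaped quotes. */
  static String unquote(String field) {
    String f = field.trim();
    if (f.length() >= 2 && f.startsWith("\"") && f.endsWith("\"")) {
      f = f.substring(1, f.length() - 1).replace("\"\"", "\"");
    }
    return f;
  }

  public static void main(String[] args) {
    String line = "12345,\"Builder, Bob\",\"JOHN \"\"THE DODGER\"\" SMITH\",";
    String[] raw = split(line);
    String[] fields = new String[raw.length];
    for (int i = 0; i < raw.length; i++) {
      fields[i] = unquote(raw[i]);
    }
    System.out.println(Arrays.toString(fields));
  }
}
```

The lookahead rescans the rest of the line for every comma, so this is quadratic in line length, and the quotes still have to be stripped in a second pass, which rather supports the point that it isn't easy.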
                        • 9. Re: Comma Separated Values (CSV) Splitter. The guts of a CsvReader.
                          843785
                          Also it'd be nice to replace this (one day when I grow up) with a regex implementation, for flexibility... then you could call it a WSV... Whatever Separated Values.
                          That'd be String.split(). ;o)

                          ~
                          • 10. Re: Comma Separated Values (CSV) Splitter. The guts of a CsvReader.
                            843785
                            yawmark wrote:
                            Also it'd be nice to replace this (one day when I grow up) with a regex implementation, for flexibility... then you could call it a WSV... Whatever Separated Values.
                            That'd be String.split(). ;o)

                            ~
                            But, as stated by Sabre, escaping the splitter is a real pain.
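To illustrate the pain with plain String.split(): it knows nothing about quoting, so a quoted separator is treated like any other, and by default trailing empty fields are silently dropped. A small demonstration (NaiveSplitDemo and naiveSplit are hypothetical names):

```java
import java.util.Arrays;

public class NaiveSplitDemo {

  /** Splits on every comma, quoted or not. */
  static String[] naiveSplit(String line) {
    return line.split(",");
  }

  public static void main(String[] args) {
    // The quoted "Builder, Bob" field is cut in two, giving four pieces.
    System.out.println(Arrays.toString(naiveSplit("12345,\"Builder, Bob\",BUILDER")));
    // Trailing empty fields are dropped unless a negative limit is passed.
    System.out.println(naiveSplit("cat,dog,,").length);
    System.out.println("cat,dog,,".split(",", -1).length);
  }
}
```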

                            Vincent
                            • 11. Re: Comma Separated Values (CSV) Splitter. The guts of a CsvReader.
                              800308
                              This post completes the WSVparser (Whatever Separated Values reader) picture.

                              The abstract helper krc.utilz.io.Filez class now exposes a parse method, which takes an implementation of LineParser... LineParser is implemented by a couple of [anonymous inner classes|http://mindprod.com/jgloss/anonymousclasses.html] which leverage an undercooked version of the [decorator pattern|http://en.wikipedia.org/wiki/Decorator_pattern] to create a "processing chain".

                              from: krc.utilz.io.test.FilezTest.java // usage examples
                                /**
                                 * Test the simple single parser usage of Filez.parse
                                 */
                                private static void testParse() {
                                  try {
                                    final String filename = "C:/Java/home/src/krc/xml/account/test/Accounts.csv";
                                    BufferedReader reader = new BufferedReader(new FileReader(filename));
                              
                                    Filez.LineParser lineParser = new Filez.LineParser() {
                                      private final krc.utilz.io.CSVSplitter splitter = new krc.utilz.io.CSVSplitter();
                                      @Override
                                      public String parseLine(String line, int lineNumber) throws java.text.ParseException {
                                        String[] fields = splitter.split(line);
                                        fields[1] = null;
                                        System.out.println(java.util.Arrays.toString(fields));
                                        return null; // no chaining here
                                      }
                                    };
                              
                                    Filez.parse(reader, lineParser);
                                  } catch (Exception e) {
                                    e.printStackTrace();
                                  }
                                }
                              
                                /**
                                 * Test the "chained parsers" usage of Filez.parse
                                 */
                                private static void testChainedParse() {
                                  try {
                                    final String filename = "C:/Java/home/src/krc/xml/account/test/Accounts.csv";
                                    BufferedReader reader = new BufferedReader(new FileReader(filename));
                              
                                    final Filez.LineParser innerLineParser = new Filez.LineParser() {
                                      private final krc.utilz.io.CSVSplitter splitter = new krc.utilz.io.CSVSplitter();
                                      @Override
                                      public String parseLine(String line, int lineNumber) throws java.text.ParseException {
                                        String[] fields = splitter.split(line);
                                        fields[1] = "";
                                        String result = java.util.Arrays.toString(fields);
                                        System.out.println("innerLineParser:"+result);
                                        return result; // for chaining
                                      }
                                    };
                              
                                    final Filez.LineParser outerLineParser = new Filez.LineParser() {
                                      @Override
                                      public String parseLine(String line, int lineNumber) throws java.text.ParseException {
                                        // http://mindprod.com/jgloss/anonymousclasses.html
                                        // you can access the calling method's FINAL variables inside an anonymous-inner-class.
                                        String parsed = innerLineParser.parseLine(line, lineNumber);
                                        parsed = parsed.substring(1,parsed.length()-1).replaceAll(", ",",").replaceAll(",,+",",");
                                        System.out.println("outerLineParser:"+parsed+"\n");
                                        return null; // no chaining here
                                      }
                                    };
                              
                                    Filez.parse(reader, outerLineParser);
                                  } catch (Exception e) {
                                    e.printStackTrace();
                                  }
                                }
                              The above isn't very practical... I still haven't actually figured out where you might use the parser chaining... maybe to implement a bunch of optional operations on a tab-separated-values file.

                              from: krc.utilz.io.Filez.java
                                /////////////////////////////////////////////////////////////////////////////
                                // Parsing files
                                /////////////////////////////////////////////////////////////////////////////
                              
                                /**
                                 * Parses each line from the given BufferedReader by calling the
                                 * parseLine() method of given LineParser to process each line.
                                 * At the end-of-file the reader is ALWAYS closed, and the number of
                                 * ParseException's encountered is returned.
                                 * <p>
                                 * <strong>Error Handling</strong><br>
                                 * The parse() method catches any ParseException's thrown by parseLine(),
                                 * increments the parseExceptionCount and keeps on chugging.
                                 * All other Exception's are fatal.
                                 * If you want parse() to keep-going then your parseLine() should catch
                                 * Exception (log it) and throw a new ParseException("", 0).
                                 * Note that the return-value of parse() is the number of ParseException's
                                 * <strong>caught</strong>.
                                 * If you want to completely ignore parsing errors (NOT recommended)
                                 * then just eat all Exception's in parseLine().
                                 * <p>
                                 *
                                 * @param reader BufferedReader - The input text-stream.
                                 * @param parser LineParser - The parser used to process each line.
                                 * @return int - parseExceptionCount: The number of ParseException's caught.
                                 */
                                public static int parse(BufferedReader reader, LineParser parser) throws java.text.ParseException {
                                  int lineNumber = 0;
                                  try {
                                    int parseExceptionCount = 0;
                                    try {
                                      String line = null;
                                      while ( (line=reader.readLine()) != null ) {
                                        try {
                                          parser.parseLine(line, ++lineNumber);
                                        } catch (ParseException e) {
                                          parseExceptionCount++;
                                        }
                                      }
                                    } finally {
                                      if(reader!=null)reader.close();
                                    }
                                    return parseExceptionCount;
                                  } catch (Exception e) {
                                    e.printStackTrace();
                                    throw new ParseException("Parse failed! cause="+e, lineNumber);
                                  }
                                }
                                /**
                                 * The LineParser interface is accepted by the parse method. It specifies
                                 * just the parseLine() method, which is called by parse() to process each
                                 * line. The parseLine() method takes and returns a humble String, so you can
                                 * create a chain of LineParsers with a constructor which takes a 
                                 * LineParser, then you call chainedParser.parseLine() in parseLine().
                                 * Note that parse() discards parseLine()'s return value. Your parseLine() 
                                 * should output to a Stream/Collection/whatever. It can't return anything
                                 * back through parse().
                                 */
                                public static interface LineParser {
                                  public String parseLine(String line, int lineNumber) throws java.text.ParseException;
                                }
                              I'm really rather proud of this... it's my first "conscious" foray into pattern-land... but if you know how to "do this better" then please please please feel free... I'm open to constructive critiques.... and I have a feeling that this could be done better... maybe with inheritance from an abstract LineProcessor?
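One possible shape for the abstract base class idea, as a sketch only: each step extends a small abstract class whose constructor takes the next LineParser, so a chain is built by nesting constructors rather than by anonymous inner classes. The LineParser interface is copied locally so the example compiles on its own; ChainedParser, StripPrefix, CollapseSeparators and demo are hypothetical names. Note that each step here passes its result forward to the next parser, which is the reverse of the inner/outer order used in the test above.

```java
import java.text.ParseException;

public class ParserChainSketch {

  /** Local copy of the interface posted above, so the sketch compiles alone. */
  interface LineParser {
    String parseLine(String line, int lineNumber) throws ParseException;
  }

  /** Base class: holds the next parser in the chain (null ends the chain). */
  static abstract class ChainedParser implements LineParser {
    private final LineParser next;

    ChainedParser(LineParser next) {
      this.next = next;
    }

    /** Transforms one line; subclasses implement just this step. */
    protected abstract String process(String line, int lineNumber) throws ParseException;

    @Override
    public final String parseLine(String line, int lineNumber) throws ParseException {
      String result = process(line, lineNumber);
      return next == null ? result : next.parseLine(result, lineNumber);
    }
  }

  /** Removes a fixed line prefix, or fails if the prefix is missing. */
  static class StripPrefix extends ChainedParser {
    private final String prefix;

    StripPrefix(String prefix, LineParser next) {
      super(next);
      this.prefix = prefix;
    }

    @Override
    protected String process(String line, int lineNumber) throws ParseException {
      if (!line.startsWith(prefix)) {
        throw new ParseException("Missing prefix on line " + lineNumber, 0);
      }
      return line.substring(prefix.length());
    }
  }

  /** Collapses runs of commas into a single comma. */
  static class CollapseSeparators extends ChainedParser {
    CollapseSeparators(LineParser next) {
      super(next);
    }

    @Override
    protected String process(String line, int lineNumber) {
      return line.replaceAll(",,+", ",");
    }
  }

  /** Runs one line through a two-step chain, reporting parse errors as text. */
  static String demo(String line) {
    LineParser chain = new StripPrefix("> ", new CollapseSeparators(null));
    try {
      return chain.parseLine(line, 1);
    } catch (ParseException e) {
      return "ERROR: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(demo("> cat,,,dog"));
    System.out.println(demo("no prefix here"));
  }
}
```

Because every step is still a LineParser, a chain like this could be handed to a parse(reader, parser) method unchanged.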

                              Cheers all.

                              PS: No, I can't post follow-up code to SDN Share... and anyway: google finds this one.