15 Replies. Latest reply: May 1, 2013 5:27 PM by 801904

ASCII character/string processing and performance - char[] versus String?

801904 Newbie
Hello everyone

I am a relative novice to Java; I have a procedural C programming background.

I am reading many very large (many GB) comma/double-quote separated ASCII CSV text files and performing various kinds of pre-processing on them, prior to loading into the database.

I am using Java 7 (the latest) and NIO.2.

The IO performance is fine.

My question is regarding the performance of indexed char[] arrays versus String and StringBuilder accessed through their charAt() methods.
I read a file one line/record at a time and then process it. Regex is not an option (too slow and it cannot handle all the cases I need to cover).

I noticed that accessing a single character of a given String (or StringBuilder) using String.charAt(i) is several times (5x or more?) slower than indexing directly into a char array.

My question: is this a correct observation regarding the charAt() versus char[] performance difference, or am I doing something wrong with the String class?

What is the best way (performance-wise) to process character strings in Java if I need to process them one character at a time?

Is there another approach that I should consider?

Many thanks in advance
  • 1. Re: ASCII character/string processing and performance - char[] versus String?
    Kayaman Guru
    While performance generally wouldn't be the deciding factor between char[] and String, in this case (even though short-lived objects are quite cheap) you seem to be doing such low-level character manipulation that there's no reason to turn the bytes you read into Strings at all.
  • 2. Re: ASCII character/string processing and performance - char[] versus String?
    jtahlborn Expert
    Kayaman wrote:
    While performance generally wouldn't be the deciding factor between char[] and String, in this case (even though short-lived objects are quite cheap) you seem to be doing such low-level character manipulation that there's no reason to turn the bytes you read into Strings at all.
    huh? why would the OP want to be working with bytes?
  • 3. Re: ASCII character/string processing and performance - char[] versus String?
    gimbal2 Guru
    yurib wrote:
    My question: is this correct observation re charAt() versus char[i] performance difference or am I doing something wrong in case of a String class?
    I can't believe it would be 5 times slower. String still wraps a char array; all that charAt() adds is range checking of the index. I would expect the following things to slow it down because you call it quite a large number of times:

    - method call overhead
    - additional range checking code being executed per iteration
    - the optimizer can probably be more efficient when working directly on an array

    But 5 times slower... no. Perhaps the way you measure performance is the source of your odd results.
  • 4. Re: ASCII character/string processing and performance - char[] versus String?
    rp0428 Guru
    >
    I am reading many very large (many GB) comma/double-quote separated ASCII CSV text files and performing various kinds of pre-processing on them, prior to loading into the database.

    I am using Java7 (the latest) and using NIO.2.

    The IO performance is fine.

    My question is regarding performance of using char[] arrays versus Strings and StringBuilder classes using charAt() methods.
    I read a file, one line/record at a time and then I process it. The regex is not an option (too slow and can not handle all cases I need to cover).

    I noticed that accessing a single character of a given String (or StringBuilder too) class using String.charAt(i) methods is several times (5 times+?) slower than referring to a char of an array with index.

    My question: is this correct observation re charAt() versus char[] performance difference or am I doing something wrong in case of a String class?

    What is the best way (performance) to process character strings inside Java if I need to process them one character at a time ?

    Is there another approach that I should consider?
    >
    Your post suggests you are suffering from CTD: compulsive tuning disorder. :D

    Nothing you said suggests that you actually HAVE a performance problem. And two things you said suggest that if you did have a performance problem it wouldn't be in the area you are concerned about:
    >
    prior to loading into the database
    . . .
    IO performance is fine
    >
    The relatively slow performance of those IO operations will 'swamp' any CPU processing you do to manipulate the data. Any file-based IO is at least tens of thousands of times slower than the slowest CPU you can get today.

    The best 'file to database' performance you could hope to achieve would be loading simple, 'known to be clean' records of a file into ONE table column defined, perhaps, as VARCHAR2(1000); that is, with NO processing of the record at all to determine column boundaries.

    That performance would be the standard you would measure all others against and would typically be in the hundreds of thousands or millions of records per minute.

    What you would find is that you can perform one heck of a lot of processing on each record without slowing that 'read and load' process down at all.

    I have often seen performance issues with reading files and loading data into a DB, but those performance problems have ALWAYS been due to the file operations that read the file, the JDBC code that loads the data, or both. The most common file performance issue is not using a buffered reader.

    The most common DB load issues are using auto commit, not using JDBC batching, not using BULK loading (e.g. not using an APPEND hint) and performing BULK inserts into a table that already has one or more indexes on it. DB loads are fastest when BULK features of the DB (Oracle in particular) are used and there are no indexes that need to be maintained for each row.
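    To make that concrete, here is a bare-bones sketch of the 'buffered read plus JDBC batch' pattern with auto commit off. The JDBC URL, table, column and batch size are purely illustrative:
     import java.io.BufferedReader;
     import java.nio.charset.StandardCharsets;
     import java.nio.file.Files;
     import java.nio.file.Paths;
     import java.sql.Connection;
     import java.sql.DriverManager;
     import java.sql.PreparedStatement;

     public class BatchLoadSketch {
          public static void main(String[] args) throws Exception {
               try (Connection con = DriverManager.getConnection(args[0]); // JDBC URL passed in
                    BufferedReader in = Files.newBufferedReader(Paths.get(args[1]), StandardCharsets.US_ASCII);
                    PreparedStatement ps = con.prepareStatement("INSERT INTO raw_lines (line) VALUES (?)")) {
                    con.setAutoCommit(false);          // no commit per row
                    int batched = 0;
                    String line;
                    while ((line = in.readLine()) != null) {
                         ps.setString(1, line);
                         ps.addBatch();
                         if (++batched % 1000 == 0) {  // flush every 1000 rows
                              ps.executeBatch();
                         }
                    }
                    ps.executeBatch();                 // flush the remainder
                    con.commit();
               }
          }
     }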

    So I am skeptical that you have an actual performance problem with any CPU-based processing that you are performing in between the reading and the DB loading that reduces performance below the basic 'read and load'. I am especially skeptical if all you are doing is simply identifying the field/column boundaries of each record so the data can be inserted into the proper column.

    To guard against CTD you should first confirm that you actually have a performance problem that you need to deal with.

    Post the code and/or more details about any actual problem you have and we can help you solve it. In particular post more information about the known format of the delimited files.

    1. Is every field enclosed in double quotes? If so, then why? Fields generally only need to be enclosed if the field delimiter itself (e.g. a COMMA) IS a character within the field. Using double quotes around EVERY field will bloat your file significantly and will affect performance by the simple fact that the file is so much bigger than it needs to be.

    2. Can the records have embedded record terminators? For example, can a file with records terminated by CRLF (carriage-return/linefeed) have one or more CR or LF characters embedded within a field of a record? If so, what escape mechanism is being used in the file that can be used to identify those?

    3. Do you have the metadata about the file prior to the load: number of fields/columns per record, datatype of each field, whether trailing NULLs are allowed, etc?

    I suggest you create a test case for that reference test I mentioned and make sure that you get your expected performance from that. Then compare the performance you are getting from your actual load to that reference.

    Then if you still actually have a performance issue post the details.
  • 5. Re: ASCII character/string processing and performance - char[] versus String?
    801904 Newbie
    Thanks. Yes, I accept that the performance difference could be due to the charAt() method call overhead more than the internal work of extra range checking, etc., but nevertheless the end result of the charAt() solution versus an indexed char[] array is about 3-5 times slower, in pure CPU time.

    I removed all IO from the program, generated a synthetic String of 1000 characters loaded with ASCII chars from space to ~ (decimal 32-126), and just executed a simple loop over each character using a char[] and using a String with charAt().

    The result is about 3-5 times slower using charAt() compared to char[] on my laptop.

    I realise that I am new to Java, hence the question: is this expected, and should I be using char[], using charAt() "better", or using some other technique to analyse my ASCII character strings?

    I am happy with the performance of my char[] solution, but it feels as if I have written yet another procedural C program, just using Java syntax!! 8^)

    Many thanks for your contributions.
  • 6. Re: ASCII character/string processing and performance - char[] versus String?
    801904 Newbie
    I am nowhere near the database yet; I am just looking at the performance of input string processing.
    I can read the input file very quickly and I can run multiple Java jobs reading files, or use the new async IO too (I have Java 7 and NIO.2 at my disposal).

    I don't have a database or file IO performance problem, at least not yet.

    I am only asking about performance and approach of using char[] arrays versus Java String classes.

    But since you asked.....

    The only thing that is fixed is that all input files are ASCII (not Unicode) characters in the range 'space' to '~' (decimal 32-126), plus common control characters like CR, LF, etc.

    My files have no metadata; some are comma delimited and some are comma and double-quote delimited together, to protect the embedded commas inside columns.
    The number of columns in a file is variable and each line in any one file can have a different number of columns. Ragged columns.
    There may be repeated null columns in any line, like ,,, or "","","" or any combination of the above.
    There may also be spaces between delimiters.
    The files may be UNIX/Linux terminated or Windows Server terminated (CR/LF or CR or LF).
    To make it even harder, there may be embedded LF characters inside the double-quoted columns too, which need to be caught and weeded out.
    Some numeric columns will also need processing to handle currency signs and numeric formats that are not valid for the database input.

    It does not feel like a job for regex (I want to be able to maintain the code, and complex regex is often 'write-only' code that a 9200 bps modem would be proud of!) and I don't think PL/SQL will be any faster or easier than Java for this sort of character-based work.
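    To give a feel for the kind of character-by-character work I mean, here is a rough sketch of a field splitter for this sort of quoted CSV. It is simplified: it assumes an embedded double quote (if one ever appears) is written as "" and it does not trim stray spaces around delimiters:
     import java.util.ArrayList;
     import java.util.List;

     public class QuotedCsvSketch {
          // Splits one logical record into fields. Commas and LF characters
          // inside a double-quoted field are kept as part of the field.
          static List<String> split(String record) {
               List<String> fields = new ArrayList<String>();
               StringBuilder field = new StringBuilder();
               boolean inQuotes = false;
               char[] chars = record.toCharArray();
               for (int i = 0; i < chars.length; i++) {
                    char c = chars[i];
                    if (inQuotes) {
                         if (c == '"') {
                              if (i + 1 < chars.length && chars[i + 1] == '"') {
                                   field.append('"'); // doubled quote -> literal quote
                                   i++;
                              } else {
                                   inQuotes = false;  // closing quote
                              }
                         } else {
                              field.append(c);        // comma or LF inside quotes stays in the field
                         }
                    } else if (c == '"') {
                         inQuotes = true;
                    } else if (c == ',') {
                         fields.add(field.toString());
                         field.setLength(0);
                    } else {
                         field.append(c);
                    }
               }
               fields.add(field.toString());
               return fields;
          }
     }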

    I have no control over the format of the incoming files; they are coming from all sorts of legacy systems, many from IBM mainframes or AS/400 series, for example. Others from Solaris and Windows.
    Some files will be small, some many GB in size.

    Bottom line: There are many ways of solving this problem, too big for a Forum post.

    This is why I just narrowed it down for the Forums to a specific issue, to help me learn more about Java and the use of char[] or Strings..... Nothing more.

    Thanks
  • 7. Re: ASCII character/string processing and performance - char[] versus String?
    Tolls Journeyer
    You're going to have to show us how you are timing this, because I've just cobbled together a quick test below and I get the following results:
    import java.util.Date;

     public class Scratch {
         public static void main(String args[]) {
              char minChar = 32;
              char maxChar = 126;
              char[] chars = new char[1000];
              char currentChar = minChar;
              for (int i = 0; i < 1000; i++) {
                   chars[i] = currentChar;
                   currentChar++;
                   if (currentChar > maxChar) {
                        currentChar = minChar;
                   }
              }
              String myString = String.copyValueOf(chars);
              Date startDate = new Date();
              timeString(myString);
              Date endDate = new Date();
              long stringTime = endDate.getTime() - startDate.getTime();
              startDate = new Date();
              timeChars(chars);
              endDate = new Date();
              long charTime = endDate.getTime() - startDate.getTime();
              System.out.println("stringTime = " + stringTime);
              System.out.println("charTime   = " + charTime);
         }
    
         private static void timeChars(char[] chars) {
              int length = chars.length;
              for (long i = 0; i < 10000000; i++) {
                   for (int j = 0; j < length; j++) {
                        char c = chars[j];
                   }
              }
         }
    
         private static void timeString(String myString) {
              int length = myString.length();
              for (long i = 0; i < 10000000; i++) {
                   for (int j = 0; j < length; j++) {
                        char c = myString.charAt(j);
                   }
              }
         }
    }
    stringTime = 42
    charTime = 30
    Which makes some sense as charAt does range checking.
  • 8. Re: ASCII character/string processing and performance - char[] versus String?
    801904 Newbie
    YES, you get the medal - THANKS!

    I fixed the problem by looking at your code - the CPU culprit turned out to be the String.length() method; it is the one consuming the CPU on repeated calls in the for loop.
    Not the charAt() method at all.

    Once I took that String.length() call out of the for loop and used an integer length local variable, as you have in your code, the performance is very close between the char array and the String charAt() approaches.
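    In code terms the fix was essentially this (a sketch; previously the loop condition called s.length() on every iteration):
     static long sumChars(String s) {
          long r = 0;
          int len = s.length();             // hoisted: was "i < s.length()" in the loop condition
          for (int i = 0; i < len; i++) {
               r = r + s.charAt(i);
          }
          return r;
     }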

    I was expecting the Java compiler/JVM/HotSpot/etc. to optimise that call for me, but it did not happen (I use the NetBeans 7.3 IDE, btw); perhaps there is some sort of "optimise" switch I don't know about.

    In any case, happiness.

    Thanks.
  • 9. Re: ASCII character/string processing and performance - char[] versus String?
    baftos Expert
    String is immutable, so once constructed the length cannot change. Therefore String.length() is just a simple getter, simpler even than charAt().
    I would suggest you reconsider your performance tests. I am not saying you are wrong, just that I am very surprised.
  • 10. Re: ASCII character/string processing and performance - char[] versus String?
    801904 Newbie
    Java code below; be my guest to critique/modify/explain.
    Sorry about the format, I don't know how to make the forum post preserve it.


    public class ComYuriDates
     {
          public static void main(String[] args)
          {
               char[] ca;
               String s;
               int i = 0, j = 32; // 32 is space ASCII in decimal

               ca = new char[1000];
               for (i = 0; i < ca.length; i++)
               {
                    ca[i] = (char) (j++);
                    if (j > 126)
                    {
                         j = 32;
                    }
               }
               s = new String(ca);

               System.out.println("char[]");
               System.out.println(ca);
               System.out.println("String");
               System.out.println(s);

               long et = 0, st = 0, k = 0, r = 0;
               int len = 0;

               System.out.println("char array one char at a time: ");
               st = System.currentTimeMillis();
               for (k = 0; k < 1000000L; k++)
               {
                    len = ca.length;
                    for (i = 0; i < len; i++)
                    {
                         r = r + ca[i]; // sum each char so the loop does real work
                    }
               }
               et = System.currentTimeMillis(); // stop the clock before printing

               System.out.println("FINISHED char array one char at a time: result " + r);
               System.out.println("Time " + (et - st));

               System.out.println("String one char at a time: ");
               st = System.currentTimeMillis();

               for (k = 0; k < 1000000L; k++)
               {
                    len = s.length();
                    for (i = 0; i < len; i++)
                    {
                         r = r + (char) s.charAt(i);
                    }
               }
               et = System.currentTimeMillis(); // stop the clock before printing

               System.out.println("FINISHED String one char at a time: result " + r);
               System.out.println("Time " + (et - st));
          }
     }

    // end of Java code

    Edited by: yurib on May 1, 2013 12:12 PM
  • 11. Re: ASCII character/string processing and performance - char[] versus String?
    rp0428 Guru
    >
    Once I took that String.length() method out of the 'for loop' and used integer length local variable, as you have in your code, the performance is very close between array of char and String charAt() approaches.
    >
    You are still worrying about something that is irrelevant in the greater scheme of things.

    It doesn't matter how fast the CPU processing of the data is if it is faster than you can write the data to the sink. The process is:

    1. read data into memory
    2. manipulate that data
    3. write data to a sink (database, file, network)

    The reading and writing of the data are going to be tens of thousands of times slower than any CPU you will be using. That read/write part of the process is the limiting factor of your throughput; not the CPU manipulation of step #2.

    Step #2 can only go as fast as steps #1 and #3 permit.

    Like I said above:
    >
    The best 'file to database' performance you could hope to achieve would be loading simple, 'known to be clean', record of a file into ONE table column defined, perhaps, as VARCHAR2(1000); that is, with NO processing of the record at all to determine column boundaries.

    That performance would be the standard you would measure all others against and would typically be in the hundreds of thousands or millions of records per minute.

    What you would find is that you can perform one heck of a lot of processing on each record without slowing that 'read and load' process down at all.
    >
    Regardless of the sink (DB, file, network) when you are designing data transport services you need to identify the 'slowest' parts. Those are the 'weak links' in the data chain. Once you have identified and tuned those parts the performance of any other step merely needs to be 'slightly' better to avoid becoming a bottleneck.

    That CPU part for step #2 is only rarely, if ever, the problem. Don't even consider it for specialized tuning until you demonstrate that it is needed.

    Besides, if your code is properly designed and modularized you should be able to 'plug n play' different parse and transform components after the framework is complete and in the performance test stage.
    >
    The only thing that is fixed is that all input files are ASCII (not Unicode) characters in range of 'space' to '~' (decimal 32-126) or common control characters like CR,LF,etc.
    >
    Then you could use byte arrays and byte processing to determine the record boundaries even if you then use String processing for the rest of the manipulation.

    That is what my framework does. You define the character set of the file and a 'set' of allowable record delimiters as Strings in that character set. There can be multiple possible record delimiters and each one can be multi-character (e.g. you can use 'XyZ' if you want).

    The delimiter set is converted to byte arrays and the file is read using RandomAccessFile, double-buffering and multiple mark/reset functionality. The buffers are then searched for one of the delimiter byte arrays and the location of the delimiter is saved. The resulting byte array is then saved as a 'physical record'.

    Those 'physical records' are then processed to create 'logical records'. The distinction is due to possible embedded record delimiters as you mentioned. One logical record might appear as two physical records if a field has an embedded record delimiter. That is resolved easily since each logical record in the file MUST have the same number of fields.

    So a record with an embedded delimiter will have fewer fields than required, meaning it needs to be combined with one or more of the following records.
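    A stripped-down sketch of that delimiter scan; the real framework does much more, and the names here are made up purely for illustration:
     public class DelimiterScanSketch {
          // Returns the index of the first occurrence of 'delimiter' within the
          // first 'length' bytes of 'buffer', starting at 'from', or -1 if absent.
          // A real implementation also has to handle a delimiter split across
          // two buffers (that is what the double-buffering and mark/reset are for).
          static int indexOf(byte[] buffer, int length, int from, byte[] delimiter) {
               outer:
               for (int i = from; i <= length - delimiter.length; i++) {
                    for (int j = 0; j < delimiter.length; j++) {
                         if (buffer[i + j] != delimiter[j]) {
                              continue outer;
                         }
                    }
                    return i;
               }
               return -1;
          }
     }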
    >
    My files have no metadata, some are comma delimited and some comma and double quote delimited together, to protect the embedded commas inside columns.
    >
    I didn't mean the files themselves needed to contain metadata. I just meant that YOU need to know what metadata to use. For example, you need to know that there should ultimately be 10 fields for each record. The file itself may have fewer physical fields due to TRAILING NULLCOLS, whereby all consecutive NULL fields at the end of a record do not need to be present.
    >
    The number of columns in a file is variable and each line in any one file can have a different number of columns. Ragged columns.
    There may be repeated null columns in any like ,,, or "","","" or any combination of the above.
    There may also be spaces between delimiters.
    The files may be UNIX/Linux terminated or Windows Server terminated (CR/LF or CR or LF).
    >
    All of those are basic requirements and none of them present any real issue or problem.
    >
    To make it even harder, there may be embedded LF characters inside the double quoted columns too, which need to be caught and weeded out.
    >
    That only makes it 'harder' in the sense that virtually NONE of the standard software available for processing delimited files take that into account. There have been some attempts (you can find them on the net) for using various 'escaping' techniques to escape those characters where they occur but none of them ever caught on and I have never found any in widespread use.

    The main reason for that is that the software used to create the files to begin with isn't written to ADD the escape characters but is written on the assumption that they won't be needed.

    That read/write for 'escaped' files has to be done in pairs. You need a writer that can write escapes and a matching reader to read them.

    Even the latest versions of Informatica and DataStage cannot export a simple one-column table that contains an embedded record delimiter and read it back properly. Those tools simply have NO functionality to let you even TRY to detect that embedded delimiters exist, let alone do anything about it by escaping those characters. I gave up back in the '90s trying to convince the Informatica folk to add that functionality to their tool. It would be simple to do.
    >
    Some numeric columns will also need processing to handle currency signs and numeric formats that are not valid for the database input.

    It does not feel like a job for regex (I want to be able to maintain the code, and complex regex is often 'write-only' code that a 9200 bps modem would be proud of!) and I don't think PL/SQL will be any faster or easier than Java for this sort of character-based work.
    >
    Actually, 'validating' that a string of characters conforms (or not) to a particular format is an excellent application of regular expressions. Though, as you suggest, the actual parsing of a valid string to extract the data is not well suited to regex. That is more appropriate for a custom format class that implements the proper business rules.
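    For example, checking whether a field 'looks like' a currency amount is a one-liner with java.util.regex (the pattern below is only illustrative):
     import java.util.regex.Pattern;

     public class CurrencyCheckSketch {
          // Accepts values like 123, -12.50, 1,234.56 or $999,999.00 (illustrative only).
          static final Pattern CURRENCY =
                    Pattern.compile("[+-]?\\$?\\d{1,3}(,\\d{3})*(\\.\\d{1,2})?");

          static boolean looksLikeCurrency(String field) {
               return CURRENCY.matcher(field).matches();
          }
     }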

    You are correct that PL/SQL is NOT the language to use for such string parsing. However, Oracle does support Java stored procedures, so that could be done in the database. I would only recommend pursuing that approach if you already needed to perform some substantial data validation or processing in the DB to begin with.
    >
    I have no control over format of the incoming files, they are coming from all sorts of legacy systems, many from IBM mainframes or AS/400 series, for example. Others from Solaris and Windows.
    >
    Not a problem. You just need to know what the format is so you can parse it properly.
    >
    Some files will be small, some many GB in size.
    >
    Not really relevant except as it relates to the need to SINK the data at some point. The larger the amount of SOURCE data the sooner you need to SINK it to make room for the rest.

    Unfortunately, the very nature of delimited data with varying record lengths and possible embedded delimiters means that you can't really chunk the file to support parallel read operations effectively.

    You need to focus on designing the proper architecture to create a modular framework of readers, writers, parsers, formatters, etc. Your concern with details about String versus array is way premature at best.

    My framework has been doing what you are proposing and has been in use for over 20 years by three different major international clients. I have never had any issues with the level of detail you have asked about in this thread.

    Throughput is limited by the performance of the SOURCE and the SINK. The processing in between has NEVER been an issue.

    A modular framework allows you to fine-tune or even replace a component at any time with just 'plug n play'. That is what Interfaces are all about. Any code you write for a parser should be based on an interface contract. That allows you to write the initial code using the simplest possible method and then later, if and ONLY if that particular module becomes a bottleneck, replace it with one that is more performant.

    Your initial code should ONLY use standard, well-established constructs until there is a demonstrated need for something else. For your use case that means String processing, not byte arrays (except for detecting record boundaries).
  • 12. Re: ASCII character/string processing and performance - char[] versus String?
    801904 Newbie
    Thanks, good feedback and in line with my personal experiences too. In the past I used C (and COBOL before that!!), and certainly properly coded C and even COBOL did not present a problem; it was always the source and/or target IO or network that were the bottlenecks.

    Curiosity question: is your framework 100% written in Java (what version?) or did you use some other languages, at least in parts?

    Is it safe to assume that Java 6 and Java 7 are 100% ready for all the 'heavy lifting' even in heavy batch oriented tasks, such as these?

    thanks
  • 13. Re: ASCII character/string processing and performance - char[] versus String?
    gimbal2 Guru
    yurib wrote:
    Is it safe to assume that Java 6 and Java 7 are 100% ready for all the 'heavy lifting' even in heavy batch oriented tasks, such as these?
    I have certainly processed (load, parse, store in database) millions of lines of data without any issues, using Java 5 and 6. It's not limited by the technology - only by the hardware, the runtime environment and especially the configuration of that environment.
  • 14. Re: ASCII character/string processing and performance - char[] versus String?
    rp0428 Guru
    >
    Curiosity question: is your framework 100% written in Java (what version?) or did you use some other languages, at least in parts?
    >
    Originally written in C in the mid-to-late '80s. That first version was byte-oriented ASCII/EBCDIC and was strictly file-to-file validation and transform. It was primarily used for mainframe-to-mainframe (IBM to 'other') data conversions and ASCII to EBCDIC. It also supported COBOL and mainframe data formats, since many of those had different flavors and variations among mainframes.

    Most of the files were fixed-width (not delimited) format, both flat and hierarchical. Hierarchical formats had records with different formats in the same file, with the record type indicated by a field in the file; think parent/child (one-to-many) records in the same file. Maybe 10% were delimited.

    In the late '90s it was converted to Java 1.2, once 1.3 was imminent. Still file-to-file, since most of the database work involved using DB tools for import/export of data.

    Current version uses 1.6 but only to stay current with Java not because of any need or use of new Java functionality. The only Java-related updates since 1.2 have been support of multi-threading, streams, character set conversions and the JDBC area.

    Non-Java-specific updates included support for XML file formats. These were needed as companies started sourcing and sinking data that used the emerging EDI standards. Banking, Telecom and the Airline industry use EDI formats extensively.

    Except for pure ASCII, trying to use byte processing of character-based data is way too complicated to be useful in a generic platform that strives to provide wide character set support. Anyone who has ever had to deal with things like BOMs (byte order marks) and different ENDIAN usage on different file systems knows what I mean. In some cases it is virtually impossible to auto-detect the character set being used due to such complications.

    That is why the most fundamental thing you need to know about a text-based data file, in order to process it properly, is the character set. The reason I use byte-oriented processing to detect record boundaries is that character readers often need to look ahead to know where the boundary really is. That can result in a partial read of the next character, which then needs to be dealt with.

    I convert delimiters expressed as character strings (or Unicode) to byte arrays and look for those; no lookahead to speak of. That provides the start/end of an item of interest (a physical record in my case), which I then use to source a byte array. THEN, based on the character set, it is trivial to convert that byte array to an appropriate String.

    NOTE: that means I can support files that might use multiple character sets.
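    In Java terms both directions of that conversion are one-liners; the character set below is just an example:
     import java.nio.charset.Charset;

     public class CharsetSketch {
          public static void main(String[] args) {
               Charset cs = Charset.forName("ISO-8859-1");       // whatever the file actually uses
               byte[] delimiterBytes = "\r\n".getBytes(cs);      // delimiter String -> bytes to scan for

               byte[] buffer = "field1,field2\r\n".getBytes(cs); // stand-in for a filled read buffer
               int start = 0, end = 13;                          // record boundaries found by the byte scan
               String record = new String(buffer, start, end - start, cs);
               System.out.println(record + " (delimiter is " + delimiterBytes.length + " bytes)");
          }
     }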
    >
    Is it safe to assume that Java 6 and Java 7 are 100% ready for all the 'heavy lifting' even in heavy batch oriented tasks, such as these?
    >
    Well now we are back to what was confusing me to begin with.

    Java 6 is more than capable. I haven't had any need or desire to update the framework to Java 7 since there is nothing in Java 7 that I need and most clients aren't there yet either.

    But what 'heavy lifting' are you even talking about?

    1. Read from a source
    2. Parse and manipulate some data
    3. Write to a sink

    I don't consider any of those to be heavy lifting or to be a 'heavy batch oriented' task.

    Divide and conquer - modular code - K.I.S.S.