1 2 3 4 Previous Next 48 Replies Latest reply: Sep 26, 2006 1:38 PM by jschellSomeoneStoleMyAlias RSS

    Faster split than String.split() and StringTokenizer?

    416044
      First I imrpoved performance of split by replacing the String.split() call with a custom method using StringTokenizer:
                      final StringTokenizer st = new StringTokenizer(string, separator, true);
                      String token = null;
                      String lastToken = separator; //if first token is separator
                      while (st.hasMoreTokens()) {
                          token = st.nextToken();
                          if (token.equals(separator)) {
                              if (lastToken.equals(separator)) { //no value between 2 separators?
                                  result.add(emptyStrings ? "" : null);
                              }
                          } else {
                              result.add(token);
                          }
                          lastToken = token;
                      }//next token
      But this is still not very fast (as it is one of the "hot spots" in my profiling sessions). I wonder if it can go still faster to split strings with ";" as the delimiter?
        • 1. Re: Faster split than String.split() and StringTokenizer?
          800322
          If anything can beat the tokenizer, then probably indexOf and substring.
          • 2. Re: Faster split than String.split() and StringTokenizer?
            800322
            If anything can beat the tokenizer, then probably
            indexOf and substring.
            Or a StringBuilder, filled while iterating over the String.
            • 3. Re: Faster split than String.split() and StringTokenizer?
              416044
              Yup, for simple splitting without escaping of separators, indexOf is more than twice as fast:
                  static private List<String> fastSplit(final String text, char separator, final boolean emptyStrings) {
                      final List<String> result = new ArrayList<String>();
                      
                      if (text != null && text.length() > 0) {
                          int index1 = 0;
                          int index2 = text.indexOf(separator);
                          while (index2 >= 0) {
                              String token = text.substring(index1, index2);
                              result.add(token);
                              index1 = index2 + 1;
                              index2 = text.indexOf(separator, index1);
                          }
                          
                          if (index1 < text.length() - 1) {
                              result.add(text.substring(index1));
                          }
                      }//else: input unavailable
                      
                      return result;
                  }
              Faster? ;-)
              • 4. Re: Faster split than String.split() and StringTokenizer?
                800322
                It was rather obvious - regex is complicated, and Tokenizer is just as quick as indexOf; at best it's almost the same, plus one additional method call, and a one-iteration loop.
                • 5. Re: Faster split than String.split() and StringTokenizer?
                  807569
                  I think the fastest yet would be too abandon Strings and use char arrays,
                  a for loop and System.arraycopy
                  • 6. Re: Faster split than String.split() and StringTokenizer?
                    807569
                    Sometimes I wonder Martin why you use Java at all? Why not
                    assembler? I am serious.
                    • 7. Re: Faster split than String.split() and StringTokenizer?
                      800322
                      I think the fastest yet would be too abandon Strings
                      and use char arrays,
                      a for loop and System.arraycopy
                      I'd have to look again to make sure, but I thought String is already using sub-arrays for substrings?
                      • 8. Re: Faster split than String.split() and StringTokenizer?
                        807569
                        I think the fastest yet would be too abandon
                        Strings
                        and use char arrays,
                        a for loop and System.arraycopy
                        I'd have to look again to make sure, but I thought
                        String is already using sub-arrays for substrings?
                        What I am thinking is to return a list (or array) of char arrays. Don't want
                        to waste time allocating whole new String objects that aren't needed
                        you know.

                        I dunno. All Martin's questions are of two varieties it seems:

                        - how big is x in memory? and how can I squeeze this down

                        - how can I hack up things to squeeze more performance out of
                        standard API features?

                        I mean this are both fine topics but to me suggest that Java is not really
                        the way to go here...
                        • 9. Re: Faster split than String.split() and StringTokenizer?
                          jschellSomeoneStoleMyAlias
                          But this is still not very fast (as it is one of the
                          "hot spots" in my profiling sessions). I wonder if it
                          can go still faster to split strings with ";" as the
                          delimiter?
                          Speeding up the code is only one approach.

                          Another approach is to not call the code in the first place. Look at the design and determine how to do the work without calling that piece of code, perhaps deferring it until needed.
                          • 10. Re: Faster split than String.split() and StringTokenizer?
                            camickr
                            I wrote my own, years ago, before regex was part of the SDK. I wrote it mostly to return an empty string when two delimiters where found consecutively. At the time it was about 30% faster then StringTokenizer (I made a few other assumptions as well). If you want to check it out then search the forum for "SimpleTokenizer" to find the download link.

                            I should mention that my largest test case was on a string of 200 characters that yielded 10 tokens. I did 200,000 iterations. Total time was half a second. So I don't see a big savings here. As suggested above looking at the design may be a better approach.
                            • 11. Re: Faster split than String.split() and StringTokenizer?
                              807569
                              String.split() has to recompile the regex and create new Pattern and Matcher objects each time it's called. If you do that stuff ahead of time, I think you'll notice that the split() approach speeds up considerably. You would be using Pattern's split() method in that case, not String's. I'm also assuming that the regex is well written, but the regexes people use with split() tend to be simple, which makes it difficult to screw them up too badly. :D
                              • 12. Re: Faster split than String.split() and StringTokenizer?
                                416044
                                I dunno. All Martin's questions are of two varieties
                                it seems:

                                - how big is x in memory? and how can I squeeze this
                                down

                                - how can I hack up things to squeeze more
                                performance out of
                                standard API features?

                                I mean this are both fine topics but to me suggest
                                that Java is not really
                                the way to go here...
                                "All" my questions? Hello? In my last 3200 posts I guess, the last 2 questions were of this type, the (latest) rest has nothing to do with it. It's that fast that people get squeezed into a tiny little drawer ... :-(
                                • 13. Re: Faster split than String.split() and StringTokenizer?
                                  800322
                                  "All" my questions? Hello? In my last 3200 posts I
                                  guess, the last 2 questions were of this type, the
                                  (latest) rest has nothing to do with it. It's that
                                  fast that people get squeezed into a tiny little
                                  drawer ... :-(
                                  You say you posted 3200 questions? :) I do recall some performance questions, but mostly related to BigInt or BigDecimal. I also remember a lot about 1.5 backwards incompatibility. But of course, I never remember much.
                                  • 14. Re: Faster split than String.split() and StringTokenizer?
                                    807569
                                    For example

                                    http://forum.java.sun.com/thread.jspa?threadID=618476
                                    http://forum.java.sun.com/thread.jspa?threadID=623619&start=0&tstart=0

                                    Basically you never seem to like what Java does or how it works etc. So I do wonder why you bother at all.
                                    1 2 3 4 Previous Next