17 Replies Latest reply: Dec 23, 2008 1:34 PM by jschellSomeoneStoleMyAlias

    Bug in javadocs for StringTokenizer

    jschellSomeoneStoleMyAlias
      The following text in the javadocs for the StringTokenizer class would appear to be a bug:


      StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.


      Regex, which is what split uses, is not a generally suitable solution for parsing.

      Regex certainly shouldn't be used for simple parsing where parsing is the goal.

      And Java has a perfectly acceptable alternative for simple parsing in the java.io.StreamTokenizer class.

      If the javadocs are going to provide advice on programming (which this is), then they should at least provide some cautions so that someone doesn't end up attempting to use String.split() on a 1 GB source file.
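For readers unfamiliar with java.io.StreamTokenizer, here is a minimal sketch of the streaming style described above. The class name StreamTokenizerSketch, the Totals holder, and the sample input are made up for illustration; only the StreamTokenizer calls come from the JDK. It relies on the tokenizer's default syntax table, which treats letters as word characters and parses numbers as doubles.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class StreamTokenizerSketch {

    /** Running totals only; no token text is kept after it has been seen. */
    static final class Totals {
        long words;
        double numberSum;
    }

    /** Reads tokens one at a time from the reader, so the input is never held in memory as a whole. */
    static Totals scan(Reader in) throws IOException {
        StreamTokenizer st = new StreamTokenizer(in);
        Totals totals = new Totals();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                totals.words++;               // st.sval holds the word; it is not retained here
            } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                totals.numberSum += st.nval;  // the number arrives already converted to a double
            }
            // Any other ttype is an ordinary character and is ignored in this sketch.
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample input; a FileReader wrapped in a BufferedReader would work the same way.
        Totals t = scan(new StringReader("alpha 10 beta 32 gamma"));
        System.out.println(t.words + " words, sum " + t.numberSum);
    }
}
```

Because each token is handled and dropped before the next one is read, memory use stays roughly constant regardless of input size, which is the contrast with calling String.split() on the whole source.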
        • 1. Re: Bug in javadocs for StringTokenizer
          EJP
          Regex, which is what split uses, is not a generally suitable solution for parsing.
          If you mean 'scanning', which I assume you do, I'm curious to know why not. All but the first two compilers I built used regular expressions for scanning, compiled via 'flex' or 'JavaCC'.

          OTOH, if you do mean 'parsing', then StringTokenizer isn't usable for that either, since it has no stack.
          • 2. Re: Bug in javadocs for StringTokenizer
            jschellSomeoneStoleMyAlias
            ejp wrote:
            Regex, which is what split uses, is not a generally suitable solution for parsing.
            If you mean 'scanning', which I assume you do, I'm curious to know why not. All but the first two compilers I built used regular expressions for scanning, compiled via 'flex' or 'JavaCC'.
            I am rather certain that they used regexes to identify tokens as they moved along the stream.

            That is far different than parsing the entire stream into tokens before doing anything at all with it.
            • 3. Re: Bug in javadocs for StringTokenizer
              807589
              Good point: for large inputs they should recommend Scanner instead of split().
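A rough sketch of what the Scanner suggestion could look like. The class name ScannerSketch, the method longestTokenLength, and the sample text are hypothetical; the Scanner calls themselves are from java.util. Scanner pulls tokens from the underlying Reader on demand, using whitespace as its default delimiter, so only the current token needs to be held as a String.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;

public class ScannerSketch {

    /** Walks the input token by token and returns the length of the longest token seen. */
    static int longestTokenLength(Reader in) {
        int longest = 0;
        try (Scanner sc = new Scanner(in)) {   // default delimiter is whitespace
            while (sc.hasNext()) {
                String token = sc.next();      // only this one token is materialized
                longest = Math.max(longest, token.length());
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; for a large file, pass a buffered FileReader instead.
        Reader in = new StringReader("to be or not to be, that is the question");
        System.out.println(longestTokenLength(in));
    }
}
```

Scanner still uses a regular expression for its delimiter, so this addresses the memory concern rather than the objection to regex as such.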
              • 4. Re: Bug in javadocs for StringTokenizer
                EJP
                I am rather certain that they used regexes to identify tokens as they moved along the stream.

                That is far different than parsing the entire stream into tokens before doing anything at all with it.
                I am rather more certain that this is a distinction without a difference. A compiler doesn't do anything with the input stream except scan it into tokens, then parse the tokens into sentences, &c.
                • 5. Re: Bug in javadocs for StringTokenizer
                  jwenting
                  uncle_alice wrote:
                  Good point: for large inputs they should recommend Scanner instead of split().
                  I don't think they've changed the JavaDoc since inserting what amounts to a deprecation warning.
                  • 6. Re: Bug in javadocs for StringTokenizer
                    jschellSomeoneStoleMyAlias
                    ejp wrote:
                    I am rather certain that they used regexes to identify tokens as they moved along the stream.

                    That is far different than parsing the entire stream into tokens before doing anything at all with it.
                    I am rather more certain that this is a distinction without a difference. A compiler doesn't do anything with the input stream except scan it into tokens, then parse the tokens into sentences, &c.
                    Yes it is different.

                    Several reasons.
                    1. Context changes the tokenization process.
                    2. The token is the goal not the string.
                    3. The javadocs suggestion leads to using non-trivial amounts of memory that would not otherwise be needed.

                    As an example, a parser might choose to keep an integer token as a primitive integer and discard the string. With the javadocs' solution, the string could not be discarded until the tokenization process was completed.
                    • 7. Re: Bug in javadocs for StringTokenizer
                      807589
                      jwenting wrote:
                      uncle_alice wrote:
                      Good point: for large inputs they should recommend Scanner instead of split().
                      I don't think they've changed the JavaDoc since inserting what amounts to a deprecation warning.
                      That's true. In the code sample they added, they imply that splitting on "\\s" is equivalent to using a StringTokenizer with the default delimiter, but it's not. They should have used "\\s+", because StringTokenizer collapses consecutive delimiter characters into one delimiter. I submitted a bug report for that, but they rejected it. Apparently, they don't even want to think about StringTokenizer any more.
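The difference is easy to see with an input containing two consecutive spaces. This is a small sketch, with the class name WhitespaceSplitSketch and the helper viaTokenizer invented for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class WhitespaceSplitSketch {

    /** Collects the tokens produced by StringTokenizer with its default delimiters. */
    static List<String> viaTokenizer(String s) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "a  b"; // two spaces between the letters

        System.out.println(Arrays.asList(input.split("\\s")));  // [a, , b]  (empty token between the spaces)
        System.out.println(Arrays.asList(input.split("\\s+"))); // [a, b]
        System.out.println(viaTokenizer(input));                // [a, b]
    }
}
```

Even "\\s+" is not a perfect match for StringTokenizer: if the input starts with whitespace, split() returns an empty first element, while StringTokenizer simply skips the leading delimiters.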
                      • 8. Re: Bug in javadocs for StringTokenizer
                        jwenting
                        uncle_alice wrote:
                        Apparently, they don't even want to think about StringTokenizer any more.
                        I don't blame them :)
                        • 9. Re: Bug in javadocs for StringTokenizer
                          jschellSomeoneStoleMyAlias
                          uncle_alice wrote:
                          jwenting wrote:
                          uncle_alice wrote:
                          Good point: for large inputs they should recommend Scanner instead of split().
                          I don't think they've changed the JavaDoc since inserting what amounts to a deprecation warning.
                          That's true. In the code sample they added, they imply that splitting on "\\s" is equivalent to using a StringTokenizer with the default delimiter, but it's not. They should have used "\\s+", because StringTokenizer collapses consecutive delimiter characters into one delimiter. I submitted a bug report for that, but they rejected it. Apparently, they don't even want to think about StringTokenizer any more.
                          Odd, because there isn't in fact anything wrong with it. Certainly not buggy and doesn't even have a downside such as Vector.

                          Probably someone just decided it wasn't cool enough.
                          • 10. Re: Bug in javadocs for StringTokenizer
                            807589
                            jschell wrote:
                            Certainly not buggy and doesn't even have a downside such as Vector.
                            Tokenizing strings such as "a,b,,c" can lead to unexpected results, depending on the requirements.
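To make the "a,b,,c" case concrete, here is a short sketch comparing the two approaches. The class name EmptyFieldSketch and the helper viaTokenizer are hypothetical names for this example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class EmptyFieldSketch {

    /** Collects the tokens StringTokenizer produces for the given delimiter characters. */
    static List<String> viaTokenizer(String s, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, delims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "a,b,,c";

        // StringTokenizer treats the two adjacent commas as one delimiter,
        // so the empty third field disappears.
        System.out.println(viaTokenizer(input, ","));        // [a, b, c]

        // split() keeps the empty field between the two commas.
        System.out.println(Arrays.asList(input.split(","))); // [a, b, , c]
    }
}
```

Whether three tokens or four fields is the "expected" result depends on the requirements: for word-like tokens the collapsing is convenient, for positional data such as CSV columns it silently shifts the fields.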

                            ~
                            • 11. Re: Bug in javadocs for StringTokenizer
                              jschellSomeoneStoleMyAlias
                              yawmark wrote:
                              jschell wrote:
                              Certainly not buggy and doesn't even have a downside such as Vector.
                              Tokenizing strings such as "a,b,,c" can lead to unexpected results, depending on the requirements.
                              And using split() will completely prevent that?
                              • 12. Re: Bug in javadocs for StringTokenizer
                                807589
                                Let's settle this once and for all: whoever can post the best argument, set to the music of "Greensleeves", wins.
                                • 13. Re: Bug in javadocs for StringTokenizer
                                  807589
                                  jschell wrote:
                                  And using split() will completely prevent that?
                                  It can, depending on the requirements.

                                  <greensleeves> "Alas, my love, you do me wrong..." </greensleeves>

                                  ~
                                  • 14. Re: Bug in javadocs for StringTokenizer
                                    jschellSomeoneStoleMyAlias
                                    yawmark wrote:
                                    jschell wrote:
                                    And using split() will completely prevent that?
                                    It can, depending on the requirements.
                                    And using split() can introduce problems.
                                    And using either StringTokenizer or split() can introduce problems that are solved using StreamTokenizer.
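One instance of a problem that split() can introduce, offered as an illustration and not necessarily the one the poster had in mind: with the default limit, split() silently drops trailing empty strings. The class name TrailingEmptySketch and the method fieldCount are invented for this sketch.

```java
import java.util.Arrays;

public class TrailingEmptySketch {

    /** Number of fields split() reports for a comma-separated line with the given limit. */
    static int fieldCount(String line, int limit) {
        return line.split(",", limit).length;
    }

    public static void main(String[] args) {
        String line = "a,b,,"; // four fields, the last two empty

        // Limit 0 is what the one-argument split(",") uses: trailing empty strings are removed.
        System.out.println(Arrays.asList(line.split(",")));     // [a, b]
        System.out.println(fieldCount(line, 0));                // 2

        // A negative limit keeps every field, including the trailing empty ones.
        System.out.println(Arrays.asList(line.split(",", -1))); // [a, b, , ]
        System.out.println(fieldCount(line, -1));               // 4
    }
}
```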