1 2 Previous Next 16 Replies Latest reply: May 22, 2009 9:14 AM by 807588 RSS

    Reason behind hashcode for String

    807588
      I want to know how the formula was designed for the hashcode of String class.
      Also as a continuation, is there a difference between the generation of hashcode in different versions of java (may be in other classes as well) ? :)
        • 1. Re: Reason behind hashcode for String
          684401
          The javadocs give a good description of this:
          Returns a hash code for this string. The hash code for a String object is computed as
           s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
           
          using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)
          If your question is why it's done that way, probably just because it's a good function to produce unique hashes. There are likely entire books (or at least chapters of books) devoted to how to do a really good hash function or how to prove it.

          As for whether it's the same between versions, the answer is very likely, but you probably shouldn't rely on that. If you need to compare hashes, use a well-known algorithm that's guaranteed to have the same value at all times, like MD5.
          • 2. Re: Reason behind hashcode for String
            807588
            JEisen wrote:
            If your question is why it's done that way, probably just because it's a good function to produce unique hashes.
            Not really -
            public class TwoStringWithTheSameHashCode
            {
                public static void main(String[] args)
                {
                    int hash1 = "ABCDEa123abc".hashCode();
                    int hash2 = "ABCDFB123abc".hashCode();
                    System.out.println(hash1 + " " + hash2);
                }  
            }
            • 3. Re: Reason behind hashcode for String
              807588
              sabre150 wrote:
              JEisen wrote:
              If your question is why it's done that way, probably just because it's a good function to produce unique hashes.
              Not really -
              public class TwoStringWithTheSameHashCode
              {
              public static void main(String[] args)
              {
              int hash1 = "ABCDEa123abc".hashCode();
              int hash2 = "ABCDFB123abc".hashCode();
              System.out.println(hash1 + " " + hash2);
              }  
              }
              do we have any hashing function implemented in java which gives us unique hashcodes.???
              and how about the complexity of such hashing techniques?
              • 4. Re: Reason behind hashcode for String
                807588
                sandeep_khurana wrote:
                do we have any hashing function implemented in java which gives us unique hashcodes.???
                What is the maximum number of possible discrete hashcodes? What is the the maximum number of possible discrete strings? How do these values compare?
                • 5. Re: Reason behind hashcode for String
                  807588
                  es5f2000 wrote:
                  sandeep_khurana wrote:
                  do we have any hashing function implemented in java which gives us unique hashcodes.???
                  What is the maximum number of possible discrete hashcodes? What is the the maximum number of possible discrete strings? How do these values compare?
                  When I were a lad I took a course in [Combinatorics |http://en.wikipedia.org/wiki/CombinaTorics], which is just the fancy name for "the art of counting". Our professor introduced the pigeon-hole principle this way: there are two men in this city with the same number of hairs on their head. Prove it! (There was a clarification where we ignored baldies.)
                  • 6. Re: Reason behind hashcode for String
                    684401
                    sabre150 wrote:-
                    public class TwoStringWithTheSameHashCode
                    {
                    public static void main(String[] args)
                    {
                    int hash1 = "ABCDEa123abc".hashCode();
                    int hash2 = "ABCDFB123abc".hashCode();
                    System.out.println(hash1 + " " + hash2);
                    }  
                    }
                    Good point. I mean, it's not impossible to get collisions. I have to admit I didn't really sit and analyze the function to try to calculate the probability. So the answer may very well be that it tends to spread the hashes out in practice. Well, at least, that's the whole idea, right? So maybe there is a better algorithm for it.

                    As for full uniqueness, that's impossible in general -- that's the definition of a hashcode, after all (unless your hashcode is just to return the same value). There are, however, industry-standard algorithms that generate pretty large sets of possible hashcodes (MD5, SHA1, SHA256...)

                    In the end, the best hashcode depends on your data set and what you're willing to sacrifice. If you have a data set of only 10 distinct values, you're (theoretically) not going to benefit much from a hash much at all. If you have millions/billions/etc of unique values (say, all strings up to 1024 chars), then a hashcode which pushes it into 256 bytes is going to collide eventually if you put every possible string into it. The trick is much collision you're okay with, how big your target space is, and how much work you want to do to get a hash.
                    • 7. Re: Reason behind hashcode for String
                      807588
                      JEisen wrote:
                      Good point. I mean, it's not impossible to get collisions. I have to admit I didn't really sit and analyze the function
                      Yea, its not even close. There are only 2^32 hashcodes, and since a single char is 2^16...
                      • 8. Re: Reason behind hashcode for String
                        796440
                        BigDaddyLoveHandles wrote:
                        When I were a lad I took a course in [Combinatorics |http://en.wikipedia.org/wiki/CombinaTorics], which is just the fancy name for "the art of counting". Our professor introduced the pigeon-hole principle this way: there are two men in this city with the same number of hairs on their head. Prove it! (There was a clarification where we ignored baldies.)
                        I first saw that in Paulos' Innumeracy. I like it.
                        • 9. Re: Reason behind hashcode for String
                          807588
                          jverd wrote:
                          BigDaddyLoveHandles wrote:
                          When I were a lad I took a course in [Combinatorics |http://en.wikipedia.org/wiki/CombinaTorics], which is just the fancy name for "the art of counting". Our professor introduced the pigeon-hole principle this way: there are two men in this city with the same number of hairs on their head. Prove it! (There was a clarification where we ignored baldies.)
                          I first saw that in Paulos' Innumeracy. I like it.
                          I have that book, it is a good one. I don't remember that specifically so I'm going to have to check that out.
                          My first reaction is "it depend on the population of the city."
                          But it likely I'm simply searching for the pathological case.
                          A little light re-reading while scarfing burgers this weekend.
                          • 10. Re: Reason behind hashcode for String
                            791266
                            When I were a lad I took a course in [Combinatorics |http://en.wikipedia.org/wiki/CombinaTorics], which is just the fancy name for "the art of counting". Our professor introduced the pigeon-hole principle this way: there are two men in this city with the same number of hairs on their head. Prove it! (There was a clarification where we ignored baldies.)
                            I've heard two people trying to answer that while they had a discussion on the bus. My spontaneous reaction / answer was that many babies get born with almost no hair, and many men die with almost no hair.
                            • 11. Re: Reason behind hashcode for String
                              807588
                              is there a difference between the generation of hashcode in different versions of java
                              Yes, there was already such a difference.

                              At the beginning not very character was considered in the computation, which saved some exectuion time but otherwise was not a good idea and this implementation flaw was rectified later.
                              • 12. Re: Reason behind hashcode for String
                                807588
                                kajbj wrote:
                                I've heard two people trying to answer that while they had a discussion on the bus. My spontaneous reaction / answer was that many babies get born with almost no hair, and many men die with almost no hair.
                                Simply, its just that the human head has 100,000 hairs and there are 9 million people in NYC (and at
                                least 1 million in the average city). You cant put 9 million lunchboxes into 100,000 cubbyholes.
                                • 13. Re: Reason behind hashcode for String
                                  791266
                                  TuringPest wrote:
                                  kajbj wrote:
                                  I've heard two people trying to answer that while they had a discussion on the bus. My spontaneous reaction / answer was that many babies get born with almost no hair, and many men die with almost no hair.
                                  Simply, its just that the human head has 100,000 hairs and there are 9 million people in NYC (and at
                                  least 1 million in the average city). You cant put 9 million lunchboxes into 100,000 cubbyholes.
                                  The text says "this city", and I guess that the average city has a population of less than 100.000 :)
                                  • 14. Re: Reason behind hashcode for String
                                    807588
                                    kajbj wrote:
                                    The text says "this city", and I guess that the average city has a population of less than 100.000 :)
                                    Depends on your definition of "city". And your country. How many cities have a population over a million in China?
                                    1 2 Previous Next