You Are What You Is: Defining Object Identity Blog

Version 2


    Object identity can be considered a rather academic topic, with "academic" taken in its negative sense. In this article, I try to show why having a strong understanding of what makes up your objects' identity can help you avoid a number of problems in your design and some tricky-to-find bugs.

    What is identity?

    Wikipedia's entry on the philosophical meaning of identity starts:

    In philosophy, identity is whatever makes an entity definable and recognizable, in terms of possessing a set of qualities or characteristics that distinguish it from entities of a different type. Or, in layman's terms, identity is whatever makes stuff the same or different.

    What makes up the identity of an object in an object-oriented system? A common answer is that "object identity" is defined by being in the same spot in memory, which is called "reference equality" and matches Java's == operator. One consequence is that if two references are identical, a change made through one is always visible through the other. This distinguishes identity from the notion of "being equal," which can change over time. In fact, multiple notions of being equal are possible, such as String.equals() and String.equalsIgnoreCase().
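
    As a small sketch of the difference (the class and method names here are made up for illustration), the following shows a change made through one reference becoming visible through the other, and two notions of equality giving different answers for the same pair of strings:

```java
class EqualityNotions {
    // Two variables, one object: a change made through one reference
    // is visible through the other.
    static String changeThroughAlias() {
        StringBuilder first = new StringBuilder("Hello");
        StringBuilder second = first;          // same object, not a copy
        second.append(", world");
        return first.toString();
    }

    public static void main(String[] args) {
        System.out.println(changeThroughAlias());           // Hello, world

        // Being equal is a separate question, with more than one possible answer.
        String lower = "hello";
        String upper = "HELLO";
        System.out.println(lower.equals(upper));            // false
        System.out.println(lower.equalsIgnoreCase(upper));  // true
    }
}
```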

    Using reference equality as the only notion of object identity seems a good solution at first, but unfortunately this point of view is a little naive. A number of scenarios exist where this approach does not fit the intended semantics. I'll discuss some of them.

    Keeping identity when your program stops
    What happens with objects that are needed after you shut down a program and restart it? Objects usually get serialized and then get deserialized when they are needed again. If you define their identity through reference equality you have to consider them different objects every time you load them. Of course that makes sense from a technical perspective and it matches scenarios such as loading an object twice, but it is hardly practical when explaining what happens in terms of business logic.
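
    A minimal sketch of this effect, assuming a hypothetical Invoice class and using an in-memory byte array to stand in for "shut down and restart": the restored object carries the same data but is a different object as far as reference equality is concerned.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationIdentity {
    // Hypothetical business object, used only for this illustration.
    static class Invoice implements Serializable {
        private static final long serialVersionUID = 1L;
        final String number;

        Invoice(String number) {
            this.number = number;
        }
    }

    // Serializes the object to a byte array and reads it back.
    static Invoice roundTrip(Invoice original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Invoice) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Invoice before = new Invoice("INV-1");
        Invoice after = roundTrip(before);
        System.out.println(before == after);                     // false
        System.out.println(before.number.equals(after.number));  // true
        // Invoice does not override equals(), so this is reference equality too.
        System.out.println(before.equals(after));                // false
    }
}
```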

    Having objects represent external data
    Objects can come from external sources, such as a relational database. The structures on which these objects are based often define their own identities such as using primary keys to identify rows in a table. If a program maps the same data onto two objects in memory, you get two distinct objects that refer to identical parts of the database and are equal in that sense. Technically this is a correct view, but is it useful when writing an application? Shouldn't both objects be considered the same?

    Having identifiers in the business world
    Quite similar to the last example, what happens if an application does not depend on external data, but the business side already has a strong notion of how to identify entities? Order numbers, passport IDs, and social security numbers are examples of this. If object identity is based solely on reference equality ("reference" in the object-oriented sense), then an application can have two objects with the same identifier from the business world. Unless extra measures are taken, these can have different values, which is most likely not acceptable.
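
    The following sketch illustrates the problem under stated assumptions: Order and its order number "A-1001" are invented for the example, and equals() is deliberately left at its default. Two objects carry the same business identifier, drift apart in value, and are both accepted by a set.

```java
import java.util.HashSet;
import java.util.Set;

class DuplicateOrders {
    // Hypothetical entity. equals() is not overridden, so identity
    // falls back to reference equality.
    static class Order {
        final String orderNumber;
        int quantity;

        Order(String orderNumber, int quantity) {
            this.orderNumber = orderNumber;
            this.quantity = quantity;
        }
    }

    static int distinctOrders() {
        Order loadedOnce = new Order("A-1001", 1);
        Order loadedAgain = new Order("A-1001", 1);
        loadedAgain.quantity = 5;   // only one of the two copies sees the change

        Set<Order> orders = new HashSet<>();
        orders.add(loadedOnce);
        orders.add(loadedAgain);
        return orders.size();
    }

    public static void main(String[] args) {
        // The business knows one order A-1001; the program holds two.
        System.out.println(distinctOrders());   // 2
    }
}
```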

    I propose that you make sure the way a program understands the identity of objects is the same as the notion of identity that comes out of the business context. In fact, even core parts of the JDK do not use reference equality to define object identity, as I will show.

    Object.equals() implements identity

    At first, Java seems to follow the common approach to define object identity through its references. Running a program like this:

        public class StringIdentity {
            public static void main(String[] args) {
                String a = new String("Hello");
                String b = new String("Hello");
                System.out.println(a == b);
                System.out.println(a.equals(b));
            }
        }

    returns this:

        false
        true

    The two variables do not refer to the same object, but the two String objects are considered equal because they contain the same values. Value equality can be checked, but reference equality is used to define the objects' identity. Is this always the case?

    One of the consequences of having distinct identities is that objects can be collected in a set. A set is a data structure that holds at most one instance of each item; as an example, you cannot have a set that contains an identical number twice.

    So how do sets relate to strings? I've established that a and b are not considered identical in Java, so you should be able to put them both into a set:

        import java.util.HashSet;
        import java.util.Set;

        public class StringIdentity {
            public static void main(String[] args) {
                String a = new String("Hello");
                String b = new String("Hello");
                Set<String> set = new HashSet<>();
                set.add(a);
                set.add(b);
                System.out.println(set.size());
            }
        }

    The result of running the code above is 1, not the 2 you would expect if the objects' identities were truly based on reference equality. If you swap the implementation from HashSet to TreeSet, the result will stay the same.
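
    For completeness, here is a sketch of the TreeSet variant (the class and method names are made up for this example). TreeSet decides on duplicates through compareTo() rather than equals(), but for String the two agree, so the result is the same:

```java
import java.util.Set;
import java.util.TreeSet;

class StringIdentityTreeSet {
    static int sizeAfterAddingEqualStrings() {
        String a = new String("Hello");
        String b = new String("Hello");
        Set<String> set = new TreeSet<>();
        set.add(a);
        set.add(b);   // rejected: compares as equal to an element already present
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeAfterAddingEqualStrings());   // 1
    }
}
```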

    Are the implementations of HashSet and TreeSet broken? Not at all. The JavaDoc of the Set interface actually starts like this:

    A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.

    HashSet and TreeSet behave correctly; the two objects a and b are equal according to their equals() method, so only one of them goes into the set.

    The JavaDoc also states that the Set interface models the mathematical set abstraction. A mathematical set cannot contain the same element twice, but it can contain two elements that are otherwise considered equal; an example of this is a set of toys with two balls of the same shade and size. By using equals() in the definition of the Set interface, the JDK defines the objects' identity through the equality relation implemented by equals(), which means value identity in the case of String and other classes overriding equals().

    As far as Set and similar collections are concerned, object identity is not always reference identity but is determined by Object.equals(), which defaults to standard object-oriented reference equality. It can also implement value equality or other equality relations, for example, reference equality based on external references such as a primary key of a database or an identifier from the business world.
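
    As a sketch of the last case, assume a hypothetical Customer entity whose identity is its database key. The key is immutable and is the only field equals() and hashCode() look at, so two in-memory copies of the same row count as one element, whatever happens to their other fields:

```java
import java.util.HashSet;
import java.util.Set;

class KeyBasedIdentity {
    // Hypothetical entity identified by an immutable primary key.
    static final class Customer {
        private final long id;   // primary key, defines identity
        private String name;     // mutable state, not part of identity

        Customer(long id, String name) {
            this.id = id;
            this.name = name;
        }

        void setName(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            }
            if (!(obj instanceof Customer)) {
                return false;
            }
            return id == ((Customer) obj).id;
        }

        @Override
        public int hashCode() {
            return Long.hashCode(id);
        }
    }

    static int distinctCustomers() {
        Customer first = new Customer(42L, "Ann");
        Customer second = new Customer(42L, "Ann");
        second.setName("Anne");   // does not affect equals() or hashCode()

        Set<Customer> customers = new HashSet<>();
        customers.add(first);
        customers.add(second);    // rejected: same key as first
        return customers.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctCustomers());   // 1
    }
}
```

    Note that this only settles which objects count as the same; keeping the two copies' values in sync is a separate concern.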

    Why think about identity?

    How does identity relate to developing software? Having a solid understanding of what makes two objects identical can be very important to getting an implementation to match the requirements; if two distinct objects in a system represent something the customer considers to be one, problems can arise. A simple example would be a user changing one of the objects while the program later retrieves the other one. The changes would seem to be lost, or, even worse, they may disappear and reappear depending on the way in which the objects were retrieved.

    Even without looking at any requirements or customer's expectations you can construct a scenario where not thinking about identity properly can break your code. I'll implement a simple little class for points in a plane. Two instances of this class should be considered the same if they describe the same point in the plane; in short, I want to use value identity. Here is an implementation:

        public class Point {
            private int x;
            private int y;

            public Point(int x, int y) {
                this.x = x;
                this.y = y;
            }

            public int getX() {
                return x;
            }

            public void setX(int x) {
                this.x = x;
            }

            public int getY() {
                return y;
            }

            public void setY(int y) {
                this.y = y;
            }

            // Value identity: two points are equal if their coordinates match.
            @Override
            public boolean equals(Object obj) {
                if (obj == null) {
                    return false;
                }
                if (obj.getClass() != this.getClass()) {
                    return false;
                }
                Point other = (Point) obj;
                return (other.x == this.x) && (other.y == this.y);
            }

            // Depends on the same mutable fields as equals().
            @Override
            public int hashCode() {
                int hash = 7;
                hash = 31 * hash + this.x;
                hash = 31 * hash + this.y;
                return hash;
            }
        }

    This class implements value identity and at the same time is mutable. This means the object can change identity; whenever a setter changes one of the members, effectively it becomes a different object. If you add the following main method to the Point class and run the code:

        // Requires: import java.util.HashSet; import java.util.Set;
        public static void main(String[] args) {
            Point a = new Point(5, 5);
            Set<Point> set = new HashSet<>();
            set.add(a);
            a.setX(8);                              // changes the hash code of a
            System.out.println(set.contains(a));
            set.add(a);
            System.out.println(set.size());
            set.remove(a);
            System.out.println(set.size());
            set.remove(a);
            System.out.println(set.size());
        }

    the result will be (with a probability very close to 1):

        false
        2
        1
        1

    What happened? Did I break HashSet?

    I actually did break it. After the second add(), the set contains two entries for which e1.equals(e2) holds; worse, both entries are the very same object in terms of reference identity. One of the invariants of the Set interface is therefore broken. In addition, you can run into the following problems:

    • Hard-to-find bugs since asking if the object is in the set produces an unexpected result
    • Memory leaks since remove() fails and the set keeps holding stale references
    • Performance issues if you use a hash structure as cache for such objects

    What happens is HashSet puts the objects in matching buckets, as does any hash structure. These buckets are determined by the hash code, that is, whatever hashCode() returns. When looking for an object during the execution of contains() the same approach is used: calculate hashCode() and look in the matching bucket. However, if the hash code changes in the meantime (as it did in my example), then the wrong bucket is checked (unless coincidentally the hash buckets match, which can happen but is not likely). This results in the behavior you have seen.
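
    To make the arithmetic concrete, here is a sketch that repeats the hash formula from the Point class and maps the result onto a table of 16 buckets. The bucket() method is a simplification assumed for illustration; actual HashSet implementations differ in detail.

```java
class BucketDemo {
    // Same formula as Point.hashCode() in the example above.
    static int pointHash(int x, int y) {
        int hash = 7;
        hash = 31 * hash + x;
        hash = 31 * hash + y;
        return hash;
    }

    // Simplified bucket selection for a table with 16 buckets.
    // Assumption: this is an illustration, not the JDK's exact algorithm.
    static int bucket(int hash) {
        return hash & 15;
    }

    public static void main(String[] args) {
        int before = pointHash(5, 5);   // hash when the point was added
        int after = pointHash(8, 5);    // hash after setX(8)
        System.out.println(before + " -> bucket " + bucket(before));   // 6887 -> bucket 7
        System.out.println(after + " -> bucket " + bucket(after));     // 6980 -> bucket 4
    }
}
```

    The object was filed under the first hash code, but every later lookup starts from the second one.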

    There are a number of ways to fix this:

    • Do not use hash structures. This means losing a lot of performance, and structures based on binary search suffer from similar problems.
    • Add callbacks so HashSet can update its layout on changes. This approach is feasible but complex; everything would need to be a proper JavaBean with PropertyChangeListeners or something similar.
    • Replace HashSet with something that uses System.identityHashCode() and == instead of hashCode() and equals(), therefore reverting to reference identity.
    • Get HashSet to rehash before it calls hashCode(), but its performance will be gone completely.
    • Implement equals(), but not hashCode(). This action generates its own set of troubles, as mentioned in Effective Java, Chapter 3 (see Resources).
    • Do not access any mutable fields in the implementation of equals().
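
    As a sketch of the option that reverts to reference identity, the JDK's IdentityHashMap can back a set through Collections.newSetFromMap(). MutablePoint is a stand-in invented for this example, with the same kind of value-based equals() and hashCode() as Point:

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class IdentitySetDemo {
    // Hypothetical mutable class with value-based equals() and hashCode().
    static class MutablePoint {
        int x;
        int y;

        MutablePoint(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public boolean equals(Object obj) {
            if (!(obj instanceof MutablePoint)) {
                return false;
            }
            MutablePoint other = (MutablePoint) obj;
            return x == other.x && y == other.y;
        }

        @Override
        public int hashCode() {
            return 31 * (31 * 7 + x) + y;
        }
    }

    // A set that compares elements with == and System.identityHashCode().
    static Set<MutablePoint> newIdentitySet() {
        return Collections.newSetFromMap(new IdentityHashMap<MutablePoint, Boolean>());
    }

    static boolean stillFoundAfterChange() {
        Set<MutablePoint> set = newIdentitySet();
        MutablePoint a = new MutablePoint(5, 5);
        set.add(a);
        a.x = 8;                  // value changes, reference does not
        return set.contains(a);
    }

    static int sizeWithTwoEqualPoints() {
        Set<MutablePoint> set = newIdentitySet();
        set.add(new MutablePoint(5, 5));
        set.add(new MutablePoint(5, 5));   // equal by value, but a different object
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(stillFoundAfterChange());    // true
        System.out.println(sizeWithTwoEqualPoints());   // 2
    }
}
```

    The mutation no longer loses the object, but the set now accepts two points with equal coordinates, so it no longer follows the equals()-based contract of the Set interface.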

    Only the last approach is problem-free. Skipping the hashCode() implementation generally is a bad idea, and the earlier approaches all suffer from the problem that they are local fixes. If the same type of usage pattern appears somewhere else, the same problems will arise again. In any case, the semantics of the scenario described are not clear: According to the Set documentation my objects have value identity, but what I expect in my little main method is reference identity, otherwise the answer "false" for the call to contains() would be right.

    It is feasible to leave equals() alone to keep reference identity consistent throughout a program, while implementing other equality relations in parallel under different names. While it is nearly impossible to keep anyone from using equals() in unexpected ways (it is declared on Object, after all), it is rather easy to call something else in the business logic of an application.
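
    A minimal sketch of this idea, with a hypothetical Marker class and a hypothetical method name hasSamePositionAs(): equals() and hashCode() stay as inherited from Object, so hash structures keep working when the object changes, and the value comparison is available under its own name.

```java
import java.util.HashSet;
import java.util.Set;

class NamedEquality {
    // Hypothetical mutable state object. equals() and hashCode() are
    // inherited from Object, so its identity is reference identity.
    static class Marker {
        private int x;
        private int y;

        Marker(int x, int y) {
            this.x = x;
            this.y = y;
        }

        void moveTo(int x, int y) {
            this.x = x;
            this.y = y;
        }

        // Value comparison under its own name, separate from equals().
        boolean hasSamePositionAs(Marker other) {
            return other != null && x == other.x && y == other.y;
        }
    }

    static boolean foundAfterMove() {
        Marker a = new Marker(5, 5);
        Set<Marker> set = new HashSet<>();
        set.add(a);
        a.moveTo(8, 5);           // does not change the inherited hash code
        return set.contains(a);
    }

    public static void main(String[] args) {
        Marker a = new Marker(5, 5);
        Marker b = new Marker(5, 5);
        System.out.println(a.equals(b));              // false
        System.out.println(a.hasSamePositionAs(b));   // true
        System.out.println(foundAfterMove());         // true
    }
}
```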

    Value objects

    If value identity is wanted for some objects, it can be quite useful to completely distinguish state objects and value objects. A state object is an object that stores mutable state and is identified through reference equality, very much a standard Java object with getters and setters. A value object is always immutable and is identified through value equality. In Java that means overriding equals() (and, with it, hashCode()) and using only private members with no setters. These members also have to be value objects themselves.

    This way the semantics tend to be clearer and value objects have a number of additional advantages:

    • They can be persisted and restored without having to think about accidental duplicates.
    • They can be sent across a network without any further need for remote calls.
    • They can be cached and shared whenever it seems suitable (as String.intern() does).
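
    A value object along these lines might look like the following sketch. ImmutablePoint and its withX() method are names chosen for this example; the point is that a "change" produces a new object instead of altering one that may already sit in a set.

```java
import java.util.HashSet;
import java.util.Set;

final class ImmutablePoint {
    private final int x;
    private final int y;

    ImmutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Instead of a setter: returns a new value, leaving this one untouched.
    ImmutablePoint withX(int newX) {
        return new ImmutablePoint(newX, y);
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof ImmutablePoint)) {
            return false;
        }
        ImmutablePoint other = (ImmutablePoint) obj;
        return x == other.x && y == other.y;
    }

    // Safe to base on the fields, because they can never change.
    @Override
    public int hashCode() {
        return 31 * (31 * 7 + x) + y;
    }

    public static void main(String[] args) {
        ImmutablePoint a = new ImmutablePoint(5, 5);
        Set<ImmutablePoint> set = new HashSet<>();
        set.add(a);

        ImmutablePoint moved = a.withX(8);   // a itself is unchanged
        System.out.println(set.contains(a));                          // true
        System.out.println(set.contains(moved));                      // false
        System.out.println(set.contains(new ImmutablePoint(5, 5)));   // true
    }
}
```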


    The notion of "identity" seems trivial at first but it can be important for the design and, consequently, the correct behavior of an object-oriented application. By implementing equals(), the Java programmer has the option to define a specific type of identity, a very powerful but dangerous thing to do. Whenever you implement equals(), consider the consequences carefully and avoid any mutable members, including members whose own value can change and members that refer to mutable objects.