A New and Consistent Way of Implementing equals and hashCode | Part 2


Introduction

In Part 1 of this topic, I’ve shown you several ways of implementing equals and hashCode. I wasn’t able to conclude with a definite answer to the question which way is the “best”. The Java 8 way of implementing these two methods looks very nice, but it remained an open question how much performance gain there is compared to the Java 7 way. Besides showing a more realistic usage example of implementing these two methods, this blog post will also answer that question.

Sample Customer Class

You probably might have noticed that Part 1 of this topic has been very similar in structure to Part 1 of my blog post about implementing Comparable. This has been done on purpose, and the first part of this blog post will do so, too. In particular, I want to show you the same demo class Customer which represents a customer from an online shop and contains different field types. Note that the field dateOfBirth may be null. For details or meanings of the other class fields, please refer to the second blog post about comparable objects.

import java.util.*;
import java.time.*;

public final class Customer {

    private String firstName;

    private String lastName;

    private LocalDate dateOfBirth;  /* May remain null. */

    private int orderCount;

    private double averageOrderValue;

    private boolean premiumCustomer;

    public Customer(String firstName, String lastName, LocalDate dateOfBirth,
                    int orderCount, double averageOrderValue, boolean premiumCustomer) {

        /* Check and adopt parameters. */
        Objects.requireNonNull(firstName);
        Objects.requireNonNull(lastName);
        this.firstName = firstName;
        this.lastName = lastName;
        this.dateOfBirth = dateOfBirth;
        this.orderCount = orderCount;
        this.averageOrderValue = averageOrderValue;
        this.premiumCustomer = premiumCustomer;
    }

One real-world scenario would be treating two customers being equal if their first names, last names, and dates of birth are equal. Using my Java 8 template of equals and hashCode, the two methods—using the three fields of types String, String, and LocalDate (i.e., all Objects)—look like this:

    @Override
    public boolean equals(Object otherObject) {
        if (this == otherObject) {
            return true;
        }

        if (!(otherObject instanceof Customer)) {
            return false;
        }

        Customer otherCustomer = (Customer) otherObject;

        return (Objects.equals(this.firstName, otherCustomer.firstName)
                && Objects.equals(this.lastName, otherCustomer.lastName)
                && Objects.equals(this.dateOfBirth, otherCustomer.dateOfBirth));
    }
    @Override
    public int hashCode() {
        int result = 17;
        result = 37 * result + Objects.hashCode(firstName);
        result = 37 * result + Objects.hashCode(lastName);
        result = 37 * result + Objects.hashCode(dateOfBirth);

        return result;
    }

Since it’s possible that more than one customer share the same first names, last names, and dates of birth in real life, one might think of including more fields for equality checks and hash code calculations. Using all six fields first name, last name, date of birth, order count, average order value, and premium customer of types String, String, LocalDate, int, double, and boolean, respectively, looks like this:

    @Override
    public boolean equals(Object otherObject) {
        if (this == otherObject) {
            return true;
        }

        if (!(otherObject instanceof Customer)) {
            return false;
        }

        Customer otherCustomer = (Customer) otherObject;

        return (Objects.equals(this.firstName, otherCustomer.firstName)
                && Objects.equals(this.lastName, otherCustomer.lastName)
                && Objects.equals(this.dateOfBirth, otherCustomer.dateOfBirth)
                && (this.orderCount == otherCustomer.orderCount)
                && (Double.compare(this.averageOrderValue,
                                   otherCustomer.averageOrderValue) == 0)
                && (this.premiumCustomer == otherCustomer.premiumCustomer));
    }
    @Override
    public int hashCode() {
        int result = 17;
        result = 37 * result + Objects.hashCode(firstName);
        result = 37 * result + Objects.hashCode(lastName);
        result = 37 * result + Objects.hashCode(dateOfBirth);
        result = 37 * result + Integer.hashCode(orderCount);
        result = 37 * result + Double.hashCode(averageOrderValue);
        result = 37 * result + Boolean.hashCode(premiumCustomer);

        return result;
    }

The definition of equality shouldn’t be taken lightly, as these simple examples already illustrate. There might be a “low-level” reason for using all fields for comparisons and hash code calculations, especially when it comes to storing such customer objects into (hash) sets. You don’t want an existing customer to become overwritten just because another one that shares the same first name, last name, and date of birth gets added to the set. Neither do you want to miss the “high-level” reason for being able to look up a customer with its first and last name only.

The requirements might be contradictory, which makes it even more important to spend your time and energy into finding the appropriate defintion and corresponding method implementations. Only you as the developer know what and when to expect equality.

By the way, I personally would equip the Customer class with a unique customer ID and base my equals and hashCode calculations solely on that field. This covers the “low-level” application. All searching and finding must then be implemented outside the class. This “high-level” application should be part of the business logic.

In case you want to test the different equals and hashCode method implementations, feel free to experiment with the provided main method that sets up 10 more or less different customers and pairwise checks them for equality. Note that there are two Ronald Hogans (but with a different date of birth) and two customers that share the same date of birth (Richard Leblanc and Suzanne Murray on October 6, 1975). Also note how the provided sample methods easily deal with possible null values.

    public static void main(String[] args) {

        /* Create and initialize customer array. */
        Customer[] customerArray = {
                new Customer("Richard", "Leblanc", LocalDate.of(1975, Month.OCTOBER, 6),
                             2, 498.30, false),
                new Customer("William", "Robinson", null,
                             2, 737.35, true),
                new Customer("Adella", "Wheeler", null,
                             7, 824.33, true),
                new Customer("Ronald", "Hogan", null,
                             2, 297.99, false),
                new Customer("Joseph", "Davis", LocalDate.of(1981, Month.NOVEMBER, 3),
                             9, 783.16, true),
                new Customer("Charles", "Quintana", LocalDate.of(1989, Month.JANUARY, 2),
                             4, 748.35, true),
                new Customer("Graham", "Reamer", LocalDate.of(1967, Month.MARCH, 16),
                             3, 255.96, false),
                new Customer("Suzanne", "Murray", LocalDate.of(1975, Month.OCTOBER, 6),
                             2, 199.49, false),
                new Customer("Ronald", "Hogan", LocalDate.of(1997, Month.NOVEMBER, 26),
                             2, 368.10, false),
                new Customer("Dolores", "Falco", null,
                             3, 556.41, true)
        };

        /* Find and print pairs of equal (but nonidentical) customers. */
        for (Customer customer1 : customerArray) {
            for (Customer customer2 : customerArray) {

                /* Skip identical customer pairs. */
                if (customer1 == customer2) {  /* reference equality */
                    continue;
                }
                
                /* Check for equality and print equal customer pairs. */
                if (customer1.equals(customer2)) {
                    System.out.println(customer1);
                    System.out.println("^^^^ equals vvvv");
                    System.out.println(customer2);
                    System.out.println();
                }
            }
        }
    }

    @Override
    public String toString() {
        return String.format("Name: %-7s %-8s | DoB: %10s | Orders: %d | AOV: %3.2f | Premium? %5b",
                             firstName, lastName, dateOfBirth,
                             Integer.valueOf(orderCount),
                             Double.valueOf(averageOrderValue),
                             Boolean.valueOf(premiumCustomer));
    }

Time Measurements | Value Class and Framework

Now let’s move on to a detailed performance analysis of the different equals and hashCode versions presented in my previous blog post. First of all, we need a sample class that can easily be filled with random data. I decided to provide the class ValueClass that consists—in its core—of three value fields of types boolean, byte, and double, respectively. They’re highlighted in the source code shown below. The class provides a public factory method newRandomInstance that creates a new instance of ValueClass and fills it with random data.

import java.util.*;
import java.util.concurrent.*;

public final class ValueClass {

    private static long equalsCallCount
            = 0;

    private static long hashCodeCallCount
            = 0;

    private boolean booleanField;

    private byte byteField;

    private double doubleField;

    private ValueClass(boolean booleanField, byte byteField, double doubleField) {
        this.booleanField = booleanField;
        this.byteField = byteField;
        this.doubleField = doubleField;
    }

    public static ValueClass newRandomInstance() {
        return new ValueClass(ThreadLocalRandom.current().nextBoolean(),
                              (byte) ThreadLocalRandom.current().nextInt(),
                              ThreadLocalRandom.current().nextDouble());
    }

    public static long getEqualsCallCount() {
        return equalsCallCount;
    }

    public static long getHashCodeCallCount() {
        return hashCodeCallCount;
    }

    public static void resetCallCounts() {
        equalsCallCount = 0;
        hashCodeCallCount = 0;
    }

    @Override
    public String toString() {
        return String.format("%5b | %4d | %1.15f",
                             Boolean.valueOf(booleanField),
                             Byte.valueOf(byteField),
                             Double.valueOf(doubleField));
    }

    /* [...] Other methods defined later. */

The reason why I’ve chosen these three types as field types lies in the fact that I want to allow the equals method to go beyond the first comparison. The chance that two (random) ValueClasses have the same booleanField is 50 %. This means that in 50 % of all cases, equals will also compare the second field, byteField. The chance of two byteFields being equal is 1/256 ≈ 0.39 %, so roughly 0.2 %, i.e., roughly every 500th object will be equal with respect to the fields booleanField and byteField. It’s very unlikely that finally the two doubleFields are equal as well, but it doesn’t matter for the time measurements, because the fields need to be compared nevertheless, no matter if the final answer is true (yes, they’re equal) or false (no, they’re not equal).

Before we can have a detailed look at the different equals and hashCode implementations, I need to provide you with some boilerplate code of the measuring framework.

import java.util.*;
import java.time.*;

public final class TimeMeasurements {

    private static final int RUN_COUNT
            = 100;

    private static final int ELEMENT_COUNT
            = 20_000_000;
//            = 20_000;

    private ValueClass[] randomElementArray
            = new ValueClass[ELEMENT_COUNT];

    /* Initial capacity and load factor prevent automatic capacity increase and rehashing. */
    private Set<ValueClass> randomElementSet
            = new HashSet<>((int) (ELEMENT_COUNT * 1.1), 1.0f);

    private LinkedList<Duration> durationList
            = new LinkedList<>();

    public TimeMeasurements() {

        /* Initialize random element array with random value class instances. */
        for (int elementIndex = 0; elementIndex < ELEMENT_COUNT; elementIndex++) {
            randomElementArray[elementIndex] = ValueClass.newRandomInstance();
        }
    }

    public static void main(String[] args) {
        TimeMeasurements timeMeasurements = new TimeMeasurements();

        /* Print run number, actually run, and print last duration. */
        for (int runNumber = 1; runNumber <= RUN_COUNT; runNumber++) {
            System.out.format("Run number: %3d", Integer.valueOf(runNumber));
            timeMeasurements.runHashSetMeasurement();
            timeMeasurements.printLastDuration();
        }

        timeMeasurements.printTotalResults();
    }

    private void runHashSetMeasurement() {

        /* Reset data structures for next run. */
        randomElementSet.clear();
        ValueClass.resetCallCounts();

        /* Add random array elements to set. */
        Instant startTime = Instant.now();
        for (int elementIndex = 0; elementIndex < ELEMENT_COUNT; elementIndex++) {
            randomElementSet.add(randomElementArray[elementIndex]);
        }
        Instant endTime = Instant.now();

        /* Add measured duration to duration list for later printout and average calculation. */
        durationList.add(Duration.between(startTime, endTime));

    /* [...] Other methods defined later. */

    private void printLastDuration() {
        Duration lastDuration = durationList.getLast();
        System.out.format(" | Duration: %,5.1f s | Duration per element: %,9.1f ns%n",
                          Double.valueOf(lastDuration.toMillis() / 1000.),
                          Double.valueOf(lastDuration.toNanos() / (double) ELEMENT_COUNT));
    }

    private void printTotalResults() {
        OptionalDouble averageTotalDurationNanos
                = durationList.stream().mapToLong(d -> d.toNanos()).average();
        System.out.println();
        System.out.format("Average duration per element: %,9.1f ns%n",
                          Double.valueOf(averageTotalDurationNanos.getAsDouble()
                                         / ELEMENT_COUNT));
        System.out.format("equals call count:   %,11d%n",
                          Long.valueOf(ValueClass.getEqualsCallCount()));
        System.out.format("hashCode call count: %,11d%n",
                          Long.valueOf(ValueClass.getHashCodeCallCount()));
    }
}

I won’t go through every line, just through the main points:

  • The framework first initializes a randomElementArray with ELEMENT_COUNT random elements of type ValueClass.
  • These elements are then added to a HashSet, one by one. Adding elements to a hash set will be a typical scenario that makes heavy use of the elements’ equals and hashCode methods. It’s only the adding-elements-to-hash-set part that gets time measured.
  • The hash set is initialized with a capacity of 1.1 × ELEMENT_COUNT and a load factor of 1.0. This 10% reserve together with the maximum load factor will prevent the hash set from automatically increasing its capacity during runtime and trigger a rehashing procedure. This would distort the time measurements (although it hasn’t shown to be severe in reality).
  • The time needed for adding the elements usually isn’t constant. Often, the garbage collector kicks in, which leads to different durations. Every adding procedure is repeated RUN_COUNT times. In my experiments, I’ve set RUN_COUNT to 100.
  • The framework prints out the duration and the duration per element after each run. At the very end, it shows the average duration per element. The average is calculated by looking up all previous durations that have been stored in durationList.
  • It is often helpful to see how often the equals or hashCode methods have been called. In order to do so, ValueClass contains static counters. By the way, increasing these counters doesn’t do any significant harm to the total runtime, so there is no need to comment out the XXXCallCount++ commands (see below).

Time Measurements | Java 7 Way of equals and hashCode

The Java 7 way of implementing equals and hashCode for ValueClass is shown below:

    /* equals | Java 7 way */
    @Override
    public boolean equals(Object otherObject) {
        equalsCallCount++;

        if (this == otherObject) {
            return true;
        }

        if (!(otherObject instanceof ValueClass)) {
            return false;
        }

        ValueClass otherValueClass = (ValueClass) otherObject;

        return (Objects.equals(Boolean.valueOf(this.booleanField),
                               Boolean.valueOf(otherValueClass.booleanField))
                && Objects.equals(Byte.valueOf(this.byteField),
                                  Byte.valueOf(otherValueClass.byteField))
                && Objects.equals(Double.valueOf(this.doubleField),
                                  Double.valueOf(otherValueClass.doubleField)));
    }
    /* hashCode | Java 7 way */
    @Override
    public int hashCode() {
        hashCodeCallCount++;

        return Objects.hash(Boolean.valueOf(booleanField),
                            Byte.valueOf(byteField),
                            Double.valueOf(doubleField));
    }

Running the time measurement framework on my machine (Java 8 on a MacBook Pro with 16 GB memory) results in an average duration per element of 362.9 ns. ELEMENT_COUNT was set to 20 million. The method hashCode has been called 20 million times by the hash set, equals has only been called 46,960 times. This makes sense, since a hash set only needs to draw upon the equals method if the hash codes are equal or collide. In this case, we can assume that there have been about 47 thousand hash code collisions out of 20 million elements, which is 0.235 %.

Time Measurements | Java 8 Way of equals and hashCode

The original question was whether the Java 8 way of the suggested implementations is faster than the Java 7 way. Here is the code using Java 8 features:

    /* equals | Java 8 way */
    @Override
    public boolean equals(Object otherObject) {
        equalsCallCount++;

        if (this == otherObject) {
            return true;
        }

        if (!(otherObject instanceof ValueClass)) {
            return false;
        }

        ValueClass otherValueClass = (ValueClass) otherObject;

        return ((this.booleanField == otherValueClass.booleanField)
                && (this.byteField == otherValueClass.byteField)
                && (Double.compare(this.doubleField, otherValueClass.doubleField) == 0));
    }
    /* hashCode | Java 8 way */
    @Override
    public int hashCode() {
        hashCodeCallCount++;

        int result = 17;
        result = 37 * result + Boolean.hashCode(booleanField);
        result = 37 * result + Byte.hashCode(byteField);
        result = 37 * result + Double.hashCode(doubleField);

        return result;
    }

This time, the average duration per element is 182.0 ns, which is significantly faster than the Java 7 version. The question is: Who is to blame—equals or hashCode?

Time Measurements | Combining the Java 7 and Java 8 Ways of equals and hashCode

As the section title implies, I tried to combine the Java 7 and Java 8 ways of these two methods. Here is the summary of all four measurements, including the two mentioned above:

equals hashCode Average Duration per Element
Java 7 way Java 7 way 362.9 ns
Java 8 way Java 8 way 182.0 ns
Java 7 way Java 8 way 181.3 ns
Java 8 way Java 7 way 357.0 ns

The results state it clearly that the hashCode method has a much greater impact on the runtime than the equals method. This might be due to the fact that hashCode is called about 425 times more often than equals.

Time Measurements | Isolating equals

What is the impact of equals? In order to isolate equals, let’s replace hashCode by a minimal version:

    /* hashCode | Minimal Version */
    @Override
    public int hashCode() {
        hashCodeCallCount++;

        return 0;
    }

Yes, returning 0 (or any other constant number you want) is a valid hash code function. It’s just not very efficient; actually, it’s a tremendously bad implementation, because it destroys all advantages of hash sets and makes them perform extremly poorly. It performs so badly that I needed to reduce ELEMENT_COUNT by a factor 1000 down to 20,000.

After running the time measurements, the number of equals call counts has increased to 200,253,039, the number of hashCode call counts has decreased to 20,000. This makes sense. The number of hashCode calls is trivial. Returning the same hash code all the times makes the hash set degenerate to a linked list. Every new element has to be equals-compared with all already-existing elements. The 2nd element thus needs 1 comparison, the 3rd element needs 2 comparisons, the 4th element 3 comparisons, the 5th element 4, and so on. This results in about 20,0002/2 comparisons. The counted 200,253,039 comparisons are very close to the theoretic approximation of 200 million.

On my machine, I measured the following average durations:

equals Average Duration per Element
Java 7 way 301,067.5 ns
Java 8 way 257,341.0 ns

The results show that the Java 8 version of the equals method implementation is faster.

I was curious whether this could be improved even further. The Double.compare part of the equality comparison looks suspicious. I was wondering whether this could be improved by replacing it with the peculiar doubleToLongBits method presented in the Java 6 way. Here is the code I call the “Java 6/8 way”:

    /* equals | Java 6/8  way */
    @Override
    public boolean equals(Object otherObject) {
        equalsCallCount++;

        if (this == otherObject) {
            return true;
        }

        if (!(otherObject instanceof ValueClass)) {
            return false;
        }

        ValueClass otherValueClass = (ValueClass) otherObject;

        return ((this.booleanField == otherValueClass.booleanField)
                && (this.byteField == otherValueClass.byteField)
                && (Double.doubleToLongBits(this.doubleField)
                    == Double.doubleToLongBits(otherValueClass.doubleField)));
    }

And here is the result: 243,310.5 ns average duration per element. This is even a little bit faster, but in my opinion not worth the effort. It is—in real world—not common to compare double values for equality, because it is very dangerous. Please refer to my blog post “Using BigDecimal as an Accurate Replacement for Floating-Point Numbers” for more information about the problems of floating-point numbers. Because using floating-point numbers for equality checks is discouraged, there is no reason to ponder about the most “efficient” way of dealing with floats or doubles inside the equals method.

Conclusion

This blog post has shown by simple time measurements that the Java 8 way of implementing equals and hashCode I suggested in Part 1 of this topic is not only very consistent, but also very efficient. At the same time, I must not forget to mention that the measured results are not very scientific. Several reasons, all beyond the scope of this blog post and usually also out of my control, make the measured durations fluctuate a lot: a) the implementation of HashSet, which has some very complicated internals and optimization hacks that make it hard to follow what exactly is going on and what weight the HashSet implementation has on the total time; b) the garbage collector, which does some heavy work in the background, even though I tried my best to reuse as many objects as possible; c) the ValueClass which has to fulfill the requirement of being easily created with random data, but probably won’t be the best example of a real-world class.

To solve part of the problems just mentioned, one could attach a profiler to it and have a even more detailed look at the different runtimes and their distribution among the several (internal) methods. However, I think there is no need to do so. The answer to which implementation is the “best” is clear. If one just remembers the main message that the implementation of a proper, efficient hashCode method is crucial and even more important than equals, then I’ve reached my goal with this two-part blog post.

Shortlink to this blog post: link.simplexacode.ch/5gih2019.01

Leave a Reply