Friday, July 29, 2011

Every instance is equal; some are more equal than others

A friend contacted me recently with a rather vexing problem he was experiencing with XML serialisation in .NET. A very simple class containing only integers and strings was causing an exception with the message:

“System.InvalidOperationException: A circular reference was detected while serializing an object of type Common.Models.Member”.

After a few minutes, the intertubes suggested to me that the problem was likely caused by classes in the inheritance hierarchy referencing each other, which they weren’t. Eventually, I happened upon an offhand suggestion that led to investigation of the class’s equality checking. Common.Models.Member inherited from Common.Models.Reference, the latter of which contained the following:

public override bool Equals(object obj) { Reference other = obj as Reference; if (other == null) return false; return this.id == other.id; }

In addition to the method itself, two factors set off alarm bells:

  • Reference was not abstract. Moreover, it contained a method to return a new and separate Reference instance from any subclass, using its id value.
  • An override of Equals was not present in any of Reference’s subclasses.

Right away, this meant that multiple separate instances of different types could be considered equal, and that usually makes me feel uneasy.

As it turned out, overriding Equals in Reference’s suptypes to avoid the above confusion allowed the classes to be proccessed by XmlSerializer, but this seemed to be symptomatic of a deeper problem: how should equality be properly implemented in .NET?

As usual, I fully intend to fall well short of attempting to answer this question definitively, as should be the case. Far smarter and more informed developers than I have tackled this issue and reached different conclusions. So here are some more thoughts to cloud the issue.

The how, the what, and the why

Before any meaningful discussion about equality comparing can begin, it is essential to be mindful of some basic properties of equality:

Reflexive
For any a, a = a
Symmetric
For any a and b, if a = b, then b = a
Transitory
For any a, b and c, if a = b and b = c, then a = c

As we will see soon, the symmetric property is quite easy to violate.

Now that we have a basic understanding of what equality is, the next question is how to check for it. The answer, not surprisingly when speaking of .NET, is “lots of different ways.” And, equally unsurprising, some of them can yield different results even when comparing the same objects.

They are (all belonging to System.Object):

  1. Object.ReferenceEquals(x, y)
  2. Object.Equals(x, y)
  3. (x == y)
  4. x.Equals(y)

Only ReferenceEquals always returns the same value irrespecetive of the order of x and y; the other three offer no such guarantee. This should be the dawning realisation that you should tread carefully here. ReferenceEquals, however, is extremely basic: it checks to see if its two arguments occupy the same location in memory, and that’s it. If this doesn’t sound like such a bad deal, consider that the following expression evaluates to false:

if (object.ReferenceEquals(42, 42)) // unreachable code...

There aren’t many situations where you would expect or want a value to not be equal to itself, so this method is of limited utility. Object.Equals(x, y), on the other hand, wraps a polymorphic call to x.Equals(y) after checking that x and y are not the same reference and that neither are null.

The equality operator (==) wraps a call to ReferenceEquals, unless overloaded (note, not overridden) in a subtype of Object. Because operators are static, they are not polymorphic, and so the following code worryingly evaluates to false:

Uri a = new Uri("http://hinternets.blogspot.com"); Uri b = new Uri("http://hinternets.blogspot.com"); if (a == b && (object)a == b) // unreachable code...

Fairly warned be yea, says I.

Common pitfalls

Sooner or later, you are probably going to create a type that implements custom equality checking. As many people before you have done, you may try something like this simplified example (null checking and other error handling is omitted for brevity):

public class T { public int ID; public T(int id) { ID = id; } public override bool Equals(object obj) { return obj is T && ID == ((T)obj).ID; } }

With a minimal subclass:

public class U : T { public string Data; public U(int id, string data) : base(id) { Data = data; } public override bool Equals(object obj) { return base.Equals(obj) && obj is U && Data == ((U)obj).Data; } }

Each type compares all its members, and even leverages existing functionality like all good object-oriented code should. To those who have spotted the impending disaster, congratulations. To those who haven’t, watch on in horror as the symmetric property leaves the building:

T a = new T(1); T b = new U(1, "boogers"); if (a.Equals(b) != b.Equals(a)) Console.WriteLine("This cannot be!");

What went wrong here? In this example (and the case that prompted this post) the most direct answer is because the is operator tells you whether the instance can be casted to the specified type. That means that it includes any base types or interfaces, and so any U is therefore a T. T.Equals returns true because the type check passes, and only the value of ID is compared; U.Equals returns false because T is not a U, and even if it somehow were, it would fail when comparing the value of Data.

The immediate solution is to replace the is check with GetType() == obj.GetType(). But should we look deeper?

Navel-gazing

Returning to the initial example, instead of simply patching the problem with an explicit type check, it is worth questioning the relationship between Reference and Member, and also whether Reference should even exist in the first place, considering that it consists of nothing other than an ID value and an overridden method.

At best, it should probably be marked abstract because it probably doesn’t convey enough information about anything to be useful on its own. Subtypes are invariably going to be compared in different ways in order to prevent this type of confusion, and rather than enter a murky world of incessent overriding of behaviour, it is probably much more flexible and less problematic to simply write:

if (a.ID.Equals(b.ID)) // do something...

And be done with it.

Just because one type’s members are a subset of another does not automatically mean that they should be related by inheritance. It may be convenient not to have to rewrite a little bit of code here and there, but sometimes the price you pay for blurring the lines of identity just isn’t worth it.

If this isn’t your cup of tea because you prefer a greater degree of encapsulation, then this behaviour can be rolled into custom comparer classes that implement IComparer<T> and IEqualityComparer<T>, both under the System.Collections.Generic namespace. This is handy for, if nothing else, avoiding the tiresome repetition of null checks on both operands.

public class UComparer : IEqualityComparer<U>, IComparer<U> { public static readonly ByID = new UComparer(); private UComparer() { }; public bool Equals(U x, U y) { return x.ID.Equals(y.ID); } public int GetHashCode(U obj) { return obj.ID.GetHashCode(); } public int Compare(U x, U y) { return x.ID.CompareTo(y.ID); } }

And the above example is rewritten:

if (UComparer.ByID.Equals(a, b)) // do something...

Additional comparers can be created for the purpose of ordering or collating a data type in different ways. These comparers can also be fed into collection classes such as SortedSet<T> and others that are sensitive to order or duplicate elements. Both they, and their methods, can be passed around:

Array.Sort(arrayOfU, UComparer.ByID.Compare);

Alternatively, when dealing with collections, a dictionary or or lookup (under the System.Linq namespace) can be used with the desired property extracted as the key. In the following example, an array of U instances are grouped on arbitrary members.

// Get the count of each distinct Data in descending order. var countByData = arrayOfU.ToLookup(u => u.Data) .Select(g => new { g.Key, Sum = g.Count() }) .OrderByDescending(t => t.Sum);

This only works well in simple cases, but it is ad hoc and therefore very flexible. It also avoids having to create a large amount of dedicated comparer classes to achieve the same result.

Only half the story

No discussion on equality is complete without addressing hash codes as well, since the two are tightly intertwined. This, however, will require another post, preferrably after a good night’s sleep or two, and, if I’m honest, a glass of red to steady my nerves.

There will be maths involved. Mark my words.

No comments:

Post a Comment