Friday, February 8, 2008

Core Data and uniqueness

Being such a newcomer to the many and varied wonders of Cocoa, I'm perennially accosted by the feeling that I *must* be missing something whenever I run into some apparently missing feature. So far, about half the time, my continued quest to learn how to achieve something is rewarded with some new epiphany - a new pattern, the discovery of some previously arcane knowledge. The other half of the time I cave in and achieve what I'm trying to accomplish with some belt and braces - still wondering whether I'll kick myself at some future point for having missed some elegant solution that was there all along.

So it is that I've recently been wondering about uniqueness in Core Data.

Now Apple are quite clear about the nature of Core Data (at least as it stands today). Despite the entity relationship models, the prepared parameterised fetch requests (queries) and the ability to use a SQL database for storage, Core Data is not a general purpose database. It is designed, of course, to provide an elegant way to persist your application's internal state in a way that is natural and requires the minimal amount of overhead (notwithstanding the need to conform to its design patterns).

I've been very impressed with Core Data (so far). I'm lucky enough to be beginning my Cocoa career with Leopard, and like a lot of the Mac OS X frameworks, Core Data in Leopard has clearly matured very nicely into a highly capable and general facility. At this time, my main data model spans a half dozen pages and uses a good many of the features available (inheritance, to-one/to-many relationships, delete rules and a little validation with more to come). For most of what my app does, modelling its data this way is clearly superior. However, I have been surprised by a couple of things that seem like omissions, but per the foregoing, leave me wondering whether I'm missing the 'right' way to approach the problem.

One such item is a need to store singleton instances of some entity. My application has global state that should be persisted, which is not necessarily per-user (though naturally I have some of those too). It would be nice to be able to create an entity to represent a unique object that will record this state - and declare to Core Data that there can be only one instance of this (at a given version of the model). Yet, I know of no way to achieve this. Of course, one can live without this formal uniqueness, and instead date-stamp an instance, and have the application 'housekeep' any excessive number of instances (perhaps clearing away all but the last written instance), but...

There's a rather more fundamental kind of uniqueness in data of course, that of unique keys - and again Core Data has no way that I know of to express that an attribute will contain a unique key. In Core Data, all the combinations (tuples) of matching data will be returned on a query, and one imagines that there can be no optimisations in how an underlying database is queried when such fundamental metadata is missing. I got a little excited when I first saw the "Indexed" check box on what happened to be a String attribute in the model designer, but looking up what this did revealed nothing more than the vague "Indicates that an attribute may be useful for searching".

Even if Core Data itself has no formal way of indicating uniqueness or key values, you certainly need to be able to determine this from time to time in your application. For example, if your model records "Customer" in various places, you are likely to have the same actual Customer represented by multiple instances of the Customer entity (one perhaps attached to a 'recent calls' part of the model, and one attached to 'sales'). Now, because you cannot formally uniquify the details of a particular Customer (with a key like 'customer number'), if you were to query for all the Customers in your data, you will end up with an array of Customers containing 'clones'. So, how do you turn an array of objects into an, er... set of unique objects (by some definition of unique)? Of course, normally a set collection does this quite handily, through the expedient of defining appropriate -hash and -isEqual: methods on the objects to ascribe identity. Cocoa certainly has such a thing in NSSet/NSMutableSet. So recovering uniques, even if the data framework can't, should be a walk in the park, right?

Well, in one of those unsatisfying moments I alluded to earlier, you soon stumble over what looks like a major flaw in Core Data. Core Data manages objects that derive from NSManagedObject, and the documentation clearly states that -hash and -isEqual: are reserved for Core Data's use (i.e. you cannot override these methods as you can, and often do, override them as an NSObject subclass). Oops. Try as I might, I have not yet found any canonical way to reasonably filter out uniques from an array/set of NSManagedObjects (whether obtained from a to-many relationship, a fetch request or any other way). The most obvious solution is barred, given the reliance of NSSet on the out-of-bounds -hash and -isEqual:, which left me scrabbling to think of how you are _supposed_ to achieve these ends.

My reading and thinking led me to realise that without the emergence of some new arcane method, I was going to have to invent. A number of really ugly approaches came to mind, but mostly they seemed horribly expensive (lots of shuffling of objects, constructing wrappers, whatever). What I really wanted was a more flexible NSSet. That got me to find the somewhat undocumented (at least in the current Cocoa guides) NSHashTable and NSMapTable. These seemed to be offered as lower-level forms of NSSet and NSDictionary, and the ability to handle non-object keys and values was vaunted (though in actual fact, when you consult the supported configuration options on these Cocoa classes, objects are about all that is "guaranteed" to work!). It seems that the motivation for adding these collections to the Cocoa level was mostly to provide for weakly referenced keys and/or objects, in the presence of the new GC. However, clicking about in the class docs led me first to the NSPointerFunctionsOptions used when initialising the collection, and then to a curious method, -pointerFunctions. The latter returns an object of type NSPointerFunctions, and there right in front of me was the documentation for a couple of the properties of this object: hashFunction and isEqualFunction. Bingo! Perhaps I could concoct a set-like collection that used custom functions for identity - rather than the fixed -hash/-isEqual: - and therefore get around the limitations of NSManagedObject?

Experimenting with NSHashTable and pointerFunctions was frustrating - and I still haven't successfully managed to get this to work. The NSPointerFunctions returned from a freshly created NSHashTable have writeable properties for the functions I needed to provide, and the prototypes of these functions are documented enough, but try as I might, my provided functions were never called when adding objects. Are NSPointerFunctions only good for reading in this release (despite the writeable properties)? I have no idea.

However, the research into the pointer functions served as a segue into the murky world of the underlying Core Foundation implementation of CF(Mutable)Set. I've taken hardly any time to bother looking at the CF stuff - mostly because Cocoa itself is so complete, but also because it seems like a strange non-OO world where one doesn't go unless out of desperation...

As it happens, CFMutableSet is what Apple calls "toll-free bridged" to NSMutableSet, meaning that the same address can be used via either C-style pointers or Cocoa (id) pointers and with either the appropriate C functions or messages. This is very nice, but what was important was that CFSetCreateMutable (the C 'constructor' for the mutable set) takes a CFSetCallBacks structure, which is the analogue of the NSPointerFunctions. Creating appropriate 'alternate' -hash and -isEqual: functions was straightforward, and moments later... success! The CF version of the mutable set allows these alternate 'callbacks' to be set up at construction time, and these are correctly called when objects are added (by sending the -addObject: message to the returned address cast to an 'id')!
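To make the idea concrete, here is a minimal sketch in plain C of a set whose notion of identity is supplied by the caller at construction time. It is not Core Foundation code and not the original implementation: `CallbackSet`, `SetCallbacks`, `set_create`, `set_add`, `set_destroy` and the `Customer` record are hypothetical names invented for the illustration. The real CFSetCallBacks structure also carries version, retain, release and copy-description members, which are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record standing in for a Customer managed object. */
typedef struct {
    int customer_number;   /* the attribute treated as the unique key */
    const char *name;
} Customer;

/* Caller-supplied identity functions, loosely modelled on the
   equal/hash members of CFSetCallBacks (an assumption, simplified). */
typedef struct {
    size_t (*hash)(const void *value);
    bool (*equal)(const void *a, const void *b);
} SetCallbacks;

typedef struct Node {
    const void *value;
    struct Node *next;
} Node;

#define BUCKET_COUNT 64

typedef struct {
    SetCallbacks callbacks;
    Node *buckets[BUCKET_COUNT];   /* separate chaining, fixed size */
    size_t count;
} CallbackSet;

/* Create an empty set that uses the given callbacks for identity. */
CallbackSet *set_create(SetCallbacks callbacks) {
    CallbackSet *set = calloc(1, sizeof *set);
    if (set != NULL) set->callbacks = callbacks;
    return set;
}

/* Add value unless an "equal" value is already present.
   Returns true if the value was added. */
bool set_add(CallbackSet *set, const void *value) {
    assert(set != NULL);
    size_t index = set->callbacks.hash(value) % BUCKET_COUNT;
    for (Node *n = set->buckets[index]; n != NULL; n = n->next) {
        if (set->callbacks.equal(n->value, value)) return false;
    }
    Node *node = malloc(sizeof *node);
    if (node == NULL) return false;
    node->value = value;
    node->next = set->buckets[index];
    set->buckets[index] = node;
    set->count++;
    return true;
}

/* Free the set's own bookkeeping; the stored values are not owned. */
void set_destroy(CallbackSet *set) {
    if (set == NULL) return;
    for (size_t i = 0; i < BUCKET_COUNT; i++) {
        Node *n = set->buckets[i];
        while (n != NULL) {
            Node *next = n->next;
            free(n);
            n = next;
        }
    }
    free(set);
}

/* Alternate identity: two Customers are "the same" when their
   customer numbers match, whatever else differs. */
size_t customer_hash(const void *value) {
    return (size_t)((const Customer *)value)->customer_number;
}

bool customer_equal(const void *a, const void *b) {
    return ((const Customer *)a)->customer_number ==
           ((const Customer *)b)->customer_number;
}
```

With `customer_hash` and `customer_equal` passed to `set_create`, adding two Customer records that share a customer number keeps only the first, which is the behaviour the custom callbacks are there to provide.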

Once I had had this mini-breakthrough, I was confident enough to create a 'new kind' of set collection back in Cocoa-land that constructs itself with CFSetCreateMutable, encapsulating the appropriate 'callbacks'. This collection expects to work with objects that conform to an 'alternate identity' protocol, which requires the implementation of -altHash and -isAltEqual:. Furthermore, I now have a subclass of NSManagedObject that adopts this protocol and is the common superclass for all my model entities, allowing me to create (albeit temporary) sets of unique instances derived from Core Data - according to the 'alternate identity functions' that they encode.
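The layering described here, where generic callbacks forward to identity methods that each object supplies, can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the Cocoa code: `AltIdentity`, `BaseObject`, `generic_hash`, `generic_equal`, `unique_by_alt_identity` and the `Customer` struct are all hypothetical, and a linear scan stands in for the set to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the 'alternate identity' protocol:
   a table of the two functions an object must provide. */
typedef struct {
    size_t (*alt_hash)(const void *self);
    bool (*is_alt_equal)(const void *self, const void *other);
} AltIdentity;

/* Hypothetical common base "class" shared by all model objects. */
typedef struct {
    const AltIdentity *identity;
} BaseObject;

typedef struct {
    BaseObject base;       /* first member, so a Customer can be viewed as a BaseObject */
    int customer_number;
    const char *name;
} Customer;

/* Generic callbacks of the shape a callback-driven set would accept;
   they simply dispatch to the object's own identity functions. */
size_t generic_hash(const void *value) {
    const BaseObject *obj = value;
    return obj->identity->alt_hash(obj);
}

bool generic_equal(const void *a, const void *b) {
    const BaseObject *x = a, *y = b;
    /* Objects of different kinds are never considered equal. */
    if (x->identity != y->identity) return false;
    return x->identity->is_alt_equal(x, y);
}

/* Copy the first occurrence of each distinct object into out and
   return how many were kept. out must have room for n pointers. */
size_t unique_by_alt_identity(const void *const *objects, size_t n,
                              const void **out) {
    assert(objects != NULL && out != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        bool seen = false;
        for (size_t j = 0; j < kept && !seen; j++) {
            /* Compare hashes first as a cheap rejection test. */
            seen = generic_hash(out[j]) == generic_hash(objects[i]) &&
                   generic_equal(out[j], objects[i]);
        }
        if (!seen) out[kept++] = objects[i];
    }
    return kept;
}

/* Customer's alternate identity: the customer number alone. */
size_t customer_alt_hash(const void *self) {
    return (size_t)((const Customer *)self)->customer_number;
}

bool customer_is_alt_equal(const void *self, const void *other) {
    return ((const Customer *)self)->customer_number ==
           ((const Customer *)other)->customer_number;
}

const AltIdentity CUSTOMER_IDENTITY = {customer_alt_hash, customer_is_alt_equal};
```

Each entity type supplies its own `AltIdentity` table, while the collection only ever sees the two generic functions, so one collection type serves every entity in the model.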

So I'm happy... but as I mentioned at the beginning of this piece, I still have that nagging doubt...

1 comment:

Rolf Hendriks said...

I ran into the same problem and have a roundabout solution:

All my objects derive from a base class, which defines an -objectWithValues:(NSDictionary*) method. The default implementation is to first look up an object matching the passed in keys + values, then create a new one only if it can't find an existing object. This ensures you never create duplicates / copies. This is not bulletproof, though, because you can still change an object's values and end up with duplicates that way.

Then I simplified and optimized the design by creating a unique 'ID' field in the base class, and having the default -objectWithValues: look up an existing object by ID instead. This is safer, as you should never change the ID of an object after creating it.