Saturday, January 26, 2008

Map and filter with NSArray

Having spent a good deal of my recent professional life getting seriously into (lazy) functional languages, and particularly the implementation of a rather nice (IMHO!) such language for the JVM called CAL, I got to toying with an implementation of map and filter over NSArrays.

OK, this must have been done a gazillion times, but I fancied the mini-challenge anyway (evidently too much time on my hands).

I had no intention of producing a lazy implementation (too bad), but rather strict implementations along the lines of the Smalltalk/Ruby/... collect and inject methods.

Being a dynamic language, Objective-C can easily handle the dispatching of messages to an object for each element of an array, and I decided to implement an NSArray category to add the requisite collect and inject methods. In Objective-C, to refer dynamically to a runtime method we can use a selector (type SEL). These can be constructed as literals by the compiler with the @selector pseudo function.
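For instance (an illustrative snippet, not part of the category itself), a selector can be captured by the compiler and then dispatched to any receiver at runtime:

SEL transform = @selector(uppercaseString);
NSString *greeting = @"hello";
NSString *shouted = [greeting performSelector:transform];   // same as [greeting uppercaseString]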

Unlike a functional language, where the mapping function that performs a single-element transform can be passed directly as a value (a function/lambda), in Objective-C we will have to pass a selector and an object that will be the target of the selector (the receiver that will perform a transformation). Languages like Ruby have lambda functions and allow the passing of procedures (Procs). Even Smalltalk, like Ruby, has blocks that can be passed in messages, but for the time being at least (until some generous benefactor adds blocks to Objective-C), we will have to stick to object-plus-selector as the means to reference an element transformer.

So, the most general collect method looks something like this:
- (NSArray *)collect:(SEL)selector on:(id)selObj with:(id)arg2

This asks the receiving array to collect the results of sending each of its objects to 'selector' on 'selObj', where this message also takes an extra parameter 'arg2' (besides the array object). The array constructed from all the objects returned is what is passed back.
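Fleshed out, it might look something like this (the category name and loop details here are just a sketch, and I'm assuming the transformer always returns a non-nil object):

#import <Foundation/Foundation.h>

@implementation NSArray (FunctionalAdditions)

- (NSArray *)collect:(SEL)selector on:(id)selObj with:(id)arg2
{
    NSMutableArray *results = [NSMutableArray arrayWithCapacity:[self count]];
    for (id element in self) {
        // Send each element (plus the context argument) to the transformer on
        // 'selObj', and collect whatever it returns.
        id transformed = [selObj performSelector:selector
                                      withObject:element
                                      withObject:arg2];
        [results addObject:transformed];
    }
    return results;
}

@end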

In practice, quite a few of the transforming methods we might want to call live on the objects in the original array themselves. For instance, if we wanted to upper-case all the strings in an array, the natural method to use is -uppercaseString, which all NSString instances respond to. This means we need a variant of collect that targets the selector invocation at the object currently being processed from the source array. While we're at it, we can add variants of these two methods that don't take an extra argument (in case we don't need to provide context to the transformer). That gives four variants of collect, which would seem to handle all transforming cases, so long as you don't mind bundling all contextual information into the one and only context object you can send in to the transformer from 'outside', and so long as the transformer you need accepts and returns objects (type id).
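The resulting category interface might be declared something like this (the exact method names for the variants are just a sketch); each variant's implementation is a small variation on the loop shown above:

@interface NSArray (FunctionalAdditions)
// Transformer lives on an external object and takes the element plus a context argument.
- (NSArray *)collect:(SEL)selector on:(id)selObj with:(id)arg2;
// As above, but with no extra context argument.
- (NSArray *)collect:(SEL)selector on:(id)selObj;
// Transformer lives on each element itself and takes a context argument.
- (NSArray *)collect:(SEL)selector with:(id)arg2;
// Transformer is a zero-argument method on each element.
- (NSArray *)collect:(SEL)selector;
@end

// e.g. upper-casing every string in an array:
NSArray *names = [NSArray arrayWithObjects:@"alpha", @"beta", nil];
NSArray *shouted = [names collect:@selector(uppercaseString)];   // ("ALPHA", "BETA")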

Flushed with success, I turned my attention to inject. This method needs to present each object from the array in turn to a selector, together with an 'accumulator' value that is threaded through the entire process, i.e. an object that starts off in some initial state and is updated by every call to the selector (which returns the new value of the accumulator). The result of inject is the final state of the accumulator, after all array items have been processed.

The most general prototype of this method is something like this:
- (id)inject:(SEL)selector on:(id)selObj withSeed:(id)accumulator

Once again, we can specialise this in a couple of ways:
1. To use each object from the source array as the target of the selector
2. To treat the first element of the source array as the seed for the accumulator

Another option could be to include a context object in the design, per the collect example above. I chose not to do this at this time.
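A sketch of the general inject (again inside the NSArray category), plus a small, entirely made-up 'Adder' object used as the reducer to sum an array of numbers; the argument order of the reducer (accumulator first, then element) is my own choice:

- (id)inject:(SEL)selector on:(id)selObj withSeed:(id)accumulator
{
    id acc = accumulator;
    for (id element in self) {
        // The reducer is handed the running accumulator and the next element,
        // and returns the new accumulator value.
        acc = [selObj performSelector:selector withObject:acc withObject:element];
    }
    return acc;
}

// A made-up reducer used to sum NSNumbers.
@interface Adder : NSObject
- (NSNumber *)add:(NSNumber *)acc to:(NSNumber *)n;
@end

@implementation Adder
- (NSNumber *)add:(NSNumber *)acc to:(NSNumber *)n
{
    return [NSNumber numberWithInt:([acc intValue] + [n intValue])];
}
@end

// e.g. summing an array of numbers:
NSArray *numbers = [NSArray arrayWithObjects:[NSNumber numberWithInt:1],
                                             [NSNumber numberWithInt:2],
                                             [NSNumber numberWithInt:3], nil];
NSNumber *total = [numbers inject:@selector(add:to:)
                               on:[[[Adder alloc] init] autorelease]
                         withSeed:[NSNumber numberWithInt:0]];   // total is 6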

Now, the CAL language (like some other functional languages) allows concurrent evaluation of list mapping. Although CAL's version is still lazy, I thought it would be fun to see what it would take to do the same in Objective-C/Cocoa.

The first inclination is to create threads for the various mappings from elements in the source array to elements (at the same index) in the destination (mutable, of course) array. These threads perform the simple task of dispatching the source object to a transformer (identified by a selector/object pair, as before). Parallel mapping is probably only valuable when the transforming function is relatively compute-intensive, but in any case it is an opportunity to spread the cost of transforming all the elements across the available compute resources.

Looking at the options for parallelising these transformations, one soon considers Leopard's new NSOperation. This class and NSOperationQueue implement a spaces-like pattern, where self-contained task objects are added to a space (the NSOperationQueue), from which processing agents (the CPUs) take them out and execute them using the logic in the operation's own -main method. Having never played with these Cocoa classes before, I thought it would be fun to try.

Using NSOperation is easy - particularly if you want to use the basic "non-concurrent" mode (somewhat badly named IMO), which creates the execution context (specifically the thread) for you. It's a simple matter of subclassing NSOperation, creating some private state (ivars) that describes the task - a specification that can be provided when the object is constructed - and overriding the -main method. The latter performs the operation using the specification encoded into the instance of the operation.
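A sketch of such a subclass - I'll call it TransformOp, the name I use below; the ivar and initialiser names are my own, and the transformer is identified by the same object/selector pair as in the sequential collect:

@interface TransformOp : NSOperation
{
    id target;                 // receiver of the transforming selector
    SEL selector;              // the transforming method
    id source;                 // the element to transform
    NSMutableArray *results;   // shared destination array (pre-filled with placeholders)
    NSUInteger index;          // where the transformed value should be deposited
}
- (id)initWithTarget:(id)aTarget selector:(SEL)aSelector source:(id)anObject
             results:(NSMutableArray *)anArray index:(NSUInteger)anIndex;
@end

@implementation TransformOp

- (id)initWithTarget:(id)aTarget selector:(SEL)aSelector source:(id)anObject
             results:(NSMutableArray *)anArray index:(NSUInteger)anIndex
{
    if ((self = [super init])) {
        target = [aTarget retain];
        selector = aSelector;
        source = [anObject retain];
        results = [anArray retain];
        index = anIndex;
    }
    return self;
}

- (void)dealloc
{
    [target release];
    [source release];
    [results release];
    [super dealloc];
}

- (void)main
{
    // The operation may run on a thread with no autorelease pool of its own.
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    // Apply the transformer to this operation's element.
    id transformed = [target performSelector:selector withObject:source];

    // NSMutableArray is not thread-safe, so serialise access to the shared
    // results array; the slot at 'index' holds a placeholder until now.
    @synchronized (results) {
        [results replaceObjectAtIndex:index withObject:transformed];
    }

    [pool release];
}

@end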

NSOperations can be started directly (by invoking their -start method), which in the case of the aforementioned "non-concurrent" mode (the behaviour of the default -start in NSOperation) will create a thread and execute -main on it. Thus, you are already provided with some benefit in the NSOperation encapsulation over creating your own NSThread and having it execute some object's method. However, we need some kind of operation controller that will start the right number of operations from an available 'pool', according to the compute resources available (number of CPUs/cores/threads-per-core). This is where NSOperationQueue comes in. Not only can this launch the requisite number of NSOperations to keep all your hardware optimally deployed, but it is a priority queue that is also aware of inter-operation dependencies if these are declared (i.e. an NSOperation can cite other NSOperations that it depends on, and that must be completed before it is allowed to run).
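For illustration (this fragment isn't part of the collect work), here are two trivial NSInvocationOperations with a dependency declared between them before they are handed to a queue:

NSOperationQueue *queue = [[[NSOperationQueue alloc] init] autorelease];
NSMutableArray *log = [NSMutableArray array];

// NSInvocationOperation is Cocoa's ready-made target/selector wrapper.
NSInvocationOperation *first =
    [[[NSInvocationOperation alloc] initWithTarget:log
                                          selector:@selector(addObject:)
                                            object:@"first"] autorelease];
NSInvocationOperation *second =
    [[[NSInvocationOperation alloc] initWithTarget:log
                                          selector:@selector(addObject:)
                                            object:@"second"] autorelease];

// 'second' may not start until 'first' has finished, regardless of queue capacity.
[second addDependency:first];
[queue addOperation:first];
[queue addOperation:second];
[queue waitUntilAllOperationsAreFinished];   // log is now ("first", "second")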

So, to implement a parallel collect, one might envisage a loop through the source array, creating an instance of a subclass of NSOperation (let's call it TransformOp), which is then added to an NSOperationQueue. This should work fine (and in this scenario there is no need for any fancy prioritisation or dependency management). However (and perhaps this was just overkill), what I preferred was a way to reduce the number of NSOperations being produced from the array ahead of those being executed (i.e. some sort of demand pump from the NSOperationQueue that would keep enough operations queued, but reduce the memory pressure of potentially creating thousands of operations for the whole array). Unfortunately, this is not (yet) a feature of NSOperationQueue. However, it occurred to me that you have everything you need to fairly easily create a blocking version of NSOperationQueue to achieve the main goal of reducing the number of operations in existence.
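For reference, the straightforward (un-throttled) version of that loop might look something like this, before turning to the blocking queue (a sketch, living in the same NSArray category and using the TransformOp sketched earlier; the method name is my own):

- (NSArray *)parallelCollect:(SEL)selector on:(id)selObj
{
    NSUInteger count = [self count];

    // Pre-fill the destination with placeholders so that each TransformOp can
    // deposit its result at the element's original index.
    NSMutableArray *results = [NSMutableArray arrayWithCapacity:count];
    NSUInteger i;
    for (i = 0; i < count; i++) {
        [results addObject:[NSNull null]];
    }

    NSOperationQueue *queue = [[NSOperationQueue alloc] init];
    i = 0;
    for (id element in self) {
        TransformOp *op = [[TransformOp alloc] initWithTarget:selObj
                                                     selector:selector
                                                       source:element
                                                      results:results
                                                        index:i++];
        [queue addOperation:op];
        [op release];
    }

    // Block until every transformation has completed.
    [queue waitUntilAllOperationsAreFinished];
    [queue release];
    return results;
}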

To cause a provider of operations to block when adding a new operation, one can subclass NSOperationQueue and override the -addOperation: method. I created a "BlockingOperationQueue" class to do this. In order to do the actual blocking, this subclass maintains a two-state NSConditionLock, where the states are "open for additions" and "closed for additions". The -addOperation: method gates the call to the superclass's -addOperation: by having the calling thread attempt to acquire the lock in the "open for additions" condition. If we're not open for additions, the thread will block. So much for the gate; who is going to open and close it?
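In outline, the subclass and its gated -addOperation: look something like this (the condition names and exact shape of the class are my reconstruction; opening and closing the gate is handled by the KVO observer shown further down):

// Condition values for the gate.
enum {
    OpenForAdditions   = 1,
    ClosedForAdditions = 2
};

@interface BlockingOperationQueue : NSOperationQueue
{
    NSConditionLock *gate;     // two-state gate guarding -addOperation:
    NSUInteger lowWaterMark;   // re-open the gate at or below this many queued operations
    NSUInteger highWaterMark;  // close the gate at or above this many queued operations
}
@end

// In the @implementation of BlockingOperationQueue:
- (void)addOperation:(NSOperation *)op
{
    // Block the calling thread until the gate is open, then release the lock
    // without changing its condition; the KVO observer flips the condition
    // when a water mark is crossed.
    [gate lockWhenCondition:OpenForAdditions];
    [gate unlock];
    [super addOperation:op];
}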

We need to control the state of our gate (condition lock) according to how many operations are currently queued, and some notion of what is 'too much' and 'too little' in the queue. The latter data can be introduced in the form of a 'high water mark' and 'low water mark' number of items. These become properties of the BlockingOperationQueue. A suitable default for these was chosen to be 2 times the maximum number of concurrent operations (available from the superclass) for the low water mark, and 4 times the maximum number of concurrent operations for the high water mark.
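The initialiser below (again, a sketch of the same class) sets up the gate and these defaults; the fallback when the superclass reports the 'default' (-1) concurrency limit is my own addition, since there is then no concrete number to multiply:

- (id)init
{
    if ((self = [super init])) {
        gate = [[NSConditionLock alloc] initWithCondition:OpenForAdditions];

        // Water marks default to 2x and 4x the queue's concurrency limit. If the
        // limit is the 'let the system decide' sentinel (-1), fall back to an
        // assumed small constant purely for the sake of this sketch.
        NSInteger maxOps = [self maxConcurrentOperationCount];
        if (maxOps <= 0) maxOps = 4;
        lowWaterMark  = (NSUInteger)(2 * maxOps);
        highWaterMark = (NSUInteger)(4 * maxOps);

        // Watch the KVO-compliant 'operations' key (see below) so the gate can
        // be opened and closed as the queue length crosses the water marks.
        [self addObserver:self forKeyPath:@"operations" options:0 context:NULL];
    }
    return self;
}

- (void)dealloc
{
    [self removeObserver:self forKeyPath:@"operations"];
    [gate release];
    [super dealloc];
}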

Within BlockingOperationQueue, we now need to ascertain when the queue length is changing, and to compare this with the set water marks (setting the condition of the 'gate' when these thresholds are crossed). NSOperationQueue exposes an array of operations currently in the queue, and the documentation informs us that operations is a KVC/KVO compliant key of NSOperationQueue. This means we can safely register for KVO notifications from it to detect when it is changing, which drives the opening and closing of the -addOperation gate.
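The observer callback might then look like this (again a sketch; note the hysteresis - between the two marks the gate is simply left in its current state):

- (void)observeValueForKeyPath:(NSString *)keyPath ofObject:(id)object
                        change:(NSDictionary *)change context:(void *)context
{
    if ([keyPath isEqualToString:@"operations"]) {
        NSUInteger pending = [[self operations] count];
        NSInteger newCondition;
        if (pending >= highWaterMark) {
            newCondition = ClosedForAdditions;     // too much queued: close the gate
        } else if (pending <= lowWaterMark) {
            newCondition = OpenForAdditions;       // drained enough: open it again
        } else {
            return;                                // between the marks: leave the gate alone
        }
        [gate lock];
        [gate unlockWithCondition:newCondition];
    } else {
        [super observeValueForKeyPath:keyPath ofObject:object change:change context:context];
    }
}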

In summary, the new NSOperation classes in Leopard look to be a very useful addition to the developer's kit. A throttling or demand-based interface for operation providers is a nice addition. Having a parallel map/collect method isn't appropriate for every occasion, but if the transformations are relatively compute-intensive, then it can be very useful. As in languages like CAL, this is also a nice way to go about launching parallel activities in general - describing them as a list of items to be worked on, and having parallel computation launched for those items. The downsides in the case of the simple implementation described here are the restrictions in using selectors (one argument and returning an 'id'), and the lack of error handling (though NSErrors could be deposited as results).
