11 Aug 2020 3 min read Dev and Pipe Tips

Gaffer 0.58.0 : We're going on a bug hunt

Gaffer 0.58 contains a caching optimisation that can significantly improve interactive performance in many cases. But this optimisation can also reveal subtle bugs in Nodes that may have gone unnoticed until now. In this post we'll explore why that is the case, and outline some useful strategies for catching and eradicating such bugs from your custom nodes.

Gaffer's compute framework

Let's start with a quick recap of Gaffer's compute framework. The primary workhorse of this framework is the ComputeNode::compute() method, which must be implemented to generate a value for an output plug, using the values provided by the input plugs and the current Context. This is where all the heavy lifting of processing objects and generating images is performed.

When the node graph is modified interactively, Gaffer needs to know which node outputs require a recompute. This is achieved by calling ComputeNode::affects(), which must be implemented to list all the output plugs whose values are affected by a specific input plug. Using affects(), the relevant output plugs are dirtied, and the UI uses this dirtying to trigger an update.

Because computes can be time consuming, Gaffer caches the results of compute() so that they can be reused later. The cache is accessed using keys generated via ComputeNode::hash(), which must be implemented to quickly generate a unique "signature" which identifies where in the cache the result is stored. hash() should typically be very fast, but some nodes such as FilterResults and Set have hash implementations which themselves have significant overhead. So Gaffer also keeps a separate, smaller cache of hash() results indexed by Context.

What has changed

Before Gaffer 0.58, entries in the hash cache were very short lived. Any time anything at all was dirtied in the node graph, we cleared the entire hash cache. This could be incredibly inefficient, and heavy use of nodes like FilterResults could cause significant interactive lag as they were rehashed unnecessarily. We now use ComputeNode::affects() to invalidate the cache for only the plugs that are affected by the graph edits, and we reuse the rest of the cache. In several cases this can improve interactive performance significantly.

Every silver lining has a cloud

As a result of this change, it is now imperative that all affects() implementations are 100% accurate in declaring dependencies between plugs. In the past, an inaccuracy would often go unnoticed; as long as something was dirtied, the cache would be invalidated and the UI would update. But now, all relevant plugs must be dirtied; if they are not, they will read invalid values from the cache. At best, this will look like the UI is not updating, and at worst, crashes could result.

The safety net

Having performed some lengthy bug hunts through Gaffer's native nodes, we know that it can take some time to fix everything. For this reason, Gaffer 0.58 allows you to revert to the legacy cache behaviour, providing a grace period during which you can get things fixed up without worrying about production grinding to a halt. This can be enabled either using an environment variable :

export GAFFER_HASHCACHE_MODE=Legacy

Pinpointing bugs

Since Gaffer retains the ability to run in Legacy mode, we can detect bugs automatically by comparing the results we get in the Legacy and Standard modes. This is can be enabled with a value of Checked for the environment variable :

export GAFFER_HASHCACHE_MODE=Checked

In this mode, any time the Legacy and Standard caches disagree, you will see a message like this printed in the shell :

ProcessException: Sphere.__source : Detected undeclared dependency. Fix DependencyNode::affects() implementation.

This allows you to pinpoint the exact source of the bug - in this case we would need to edit the implementation of Sphere::affects() to make sure that all dependencies of the internal __source plug were accounted for. We strongly recommend running your unit tests in this mode as soon as you have Gaffer 0.58 installed, as a first step in tracking down any potential problems.

Enabling checking at runtime

Of course, the most rewarding bugs to hunt aren't the sitting ducks that pop up reliably in your unit tests. For real sport we want to be hunting bugs that only appear sporadically in the wild. The ones you only hear of as distant rumours from the production floor, and which are described differently by anyone who encounters them. For these, it can be useful to turn on checking at runtime, so that you can debug directly in the current Gaffer session as soon as the bug is spotted. This is achieved with the following scripting command which can be run via the PythonEditor :

Gaffer.ValuePlug.setHashCacheMode( Gaffer.ValuePlug.HashCacheMode.Checked )

As before, this will report any undeclared dependencies it detects to the shell, hopefully allowing you to add one more squashed bug to your trophy cabinet.

Useful links

Gaffer 0.58 release notes
The GitHub PR introducing the change (demonstrates common fixes and Filter API changes)