Erlang inspired error handling with superv.async.

March 10, 2016

Clojure error handling

Proper error handling is hard. I have struggled with it often without recognizing it. You might think that you can add try-catch mechanisms to the sections you find prone to error, typically sections doing IO with remote systems and services. While you are aware of such sections when you write them yourself, this context is lost when you use libraries in an impure language. Clojure goes a long way in providing a pure state management concept which shields you from errors. As long as you properly separate your IO functions, there is only a limited interface through which exceptions can propagate. But is this assurance good enough? In my experience, once your state becomes more and more linked to external systems, the code covering these systems also tends to spread, yielding possibly distributed error states which need to be addressed at higher levels.

Clojure has nice concepts for dealing with errors beyond try-catch mechanisms with exceptions. There are libraries providing the Common Lisp condition system for Clojure and, as Chris Houser has pointed out, even simple dynamic scoping will do (and will compose properly). This approach has many benefits; importantly, it allows robustness (restarts), which is not possible once an exception has unwound the stack.
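To make the dynamic scoping idea concrete, here is a minimal sketch in plain Clojure. The names `*on-parse-error*`, `parse-int` and `parse-all` are made up for illustration and do not come from any library. The point is that the handler runs at the place of failure, with the stack still intact, so the caller can decide how to recover.

```clojure
;; Hypothetical handler var: the default behaviour is to throw.
(def ^:dynamic *on-parse-error*
  (fn [s] (throw (ex-info "not a number" {:input s}))))

(defn parse-int [s]
  (if (re-matches #"-?\d+" s)
    (Long/parseLong s)
    ;; No exception has been thrown yet: whoever is bound further up
    ;; the stack decides what value to continue with.
    (*on-parse-error* s)))

(defn parse-all [strs]
  ;; mapv is eager, so the work happens inside the caller's binding.
  (mapv parse-int strs))

;; The caller picks a recovery strategy without parse-all knowing about it.
(binding [*on-parse-error* (fn [_] 0)]
  (parse-all ["1" "x" "3"]))
;; => [1 0 3]
```

Without the `binding`, the default handler throws as usual, so code that does not care about recovery is unaffected.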

Erlang error handling

So does this cut it? I have faced a more profound problem with Clojure error handling in core.async. But this only led me to investigate the design space a bit, most prominently the Erlang "Let it crash" principle.

[Image: Hindenburg crash]

So before diving into error handling with core.async, I will briefly describe the Erlang error handling concept. It is a distinctive feature of Erlang, which makes it a profoundly solid language for building distributed systems that are inherently error-prone yet achieve very high reliability, despite lacking a static type system. Clojure with CSP-style core.async is very similar in this regard, but by default it lacks error handling of the same quality.

Erlang uses links and monitors to track different processes (actors) automatically and, using these mechanisms, preemptively terminates processes linked to a dying process. Importantly, this makes it easy to restart whole subsystems if something goes wrong. You do not handle the error locally if you are not sure what to do about it, but let your process crash, knowing that somebody might be able to catch the error on their monitor and deal with it appropriately. Error handling is generally done at higher levels and added incrementally as the programmer experiences errors. But despite this lazy approach, the restarting concept keeps the system available as long as some higher system entity can properly restart. Technically there are many Erlang-specific concepts, like the linking of processes at runtime by referring to their name. But this is not essential for the concept; it only allows easier mutable adaptation of a running system and fits Erlang’s actor-style design.
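The control flow behind "let it crash" can be sketched in a few lines of plain Clojure. This is only an illustration of the restart idea, not of Erlang processes, links or monitors, and `supervise`, `flaky-worker` and `attempts` are hypothetical names. The worker contains no error handling at all; the entity above it decides whether to restart.

```clojure
(defn supervise
  "Runs (start-fn) and restarts it on failure, up to `retries` times.
  If the retries are used up, the last error is rethrown to the next level."
  [start-fn retries]
  (loop [left retries]
    (let [result (try
                   {:ok (start-fn)}
                   (catch Exception e {:error e}))]
      (cond
        (contains? result :ok) (:ok result)
        (pos? left)            (recur (dec left))
        :else                  (throw (:error result))))))

;; A worker that simply crashes on its first two runs.
(def attempts (atom 0))

(defn flaky-worker []
  (when (< (swap! attempts inc) 3)
    (throw (ex-info "transient failure" {:attempt @attempts})))
  :done)

(supervise flaky-worker 3)
;; => :done, after three attempts
```

What this sketch leaves out is exactly the hard part discussed in the rest of this post: with concurrent go-routines, the supervisor also has to make sure that everything belonging to the failed run has terminated and released its resources before it restarts.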

We can hence summarize the requirements for robust error handling in two qualities (these are only the core requirements to recover robustness):

  1. All errors must be caught and dealt with somewhere
  2. Resources must be properly freed to allow restarts

So how does this relate to Clojure? core.async uses the concept of go-routines, which are not the same as Erlang processes, as they lack addressing by name (like PIDs). Instead they are passed around as values, expanding a call tree at construction. They can be bound to vars or managed in STM primitives, but are typically used as in Go, as dedicated routines which are called. core.async hence also does not have the send/receive mechanism of Erlang. This is not critical though, as we can recover the error handling requirements above without it.

Erlang inspired error handling in core.async

Coming back to error handling in core.async: it does not provide any error handling by default, and throwing exceptions is broken. While it is easy to fix this by installing a default handler for all errors somewhere, that does not recover the Erlang qualities, but only installs one global handler. One might wonder whether Go itself has good error handling, but I have found that it has no comparable concepts (it is not really seen as a problem) and advocates dealing with errors locally when possible. This does not recover the robustness requirements mentioned above and can leave subsystems in a broken state by wrongly handling errors in too local a scope. So what we need to do is ensure that we catch all exceptions and propagate them, and also that once we decide to deal with an exception, we are able to restart the subsystem affected by it. This requires bidirectional communication, as we need to signal our restart intent. Importantly, it is hence not enough to put exceptions on an error channel with a go-try macro construct.
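The following JVM-only sketch illustrates both points, assuming core.async is on the classpath. The names `go-try*` and `<??*` are hypothetical stand-ins for the pattern and are not the actual full.async or superv.async macros.

```clojure
(require '[clojure.core.async :refer [go <!!]])

;; In a plain go block the exception never reaches the consumer: the
;; result channel is just closed, so a take yields nil, and the error
;; ends up in the thread's uncaught exception handler.
(def lost (go (throw (ex-info "boom" {}))))
(<!! lost)

;; go-try pattern: catch everything and hand the exception over as a value.
(defmacro go-try* [& body]
  `(go (try ~@body (catch Throwable e# e#))))

;; Blocking take that rethrows exceptions received from the channel.
(defn <??* [ch]
  (let [v (<!! ch)]
    (if (instance? Throwable v) (throw v) v)))

(try
  (<??* (go-try* (throw (ex-info "boom" {}))))
  (catch clojure.lang.ExceptionInfo e
    (.getMessage e)))
```

This makes the error visible to whoever takes from the channel, but the information only flows in one direction. Sibling go-routines started in the same context are not told that something failed, and nothing waits for them to release their resources before a restart.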

To recover Erlang-like subsystem modelling of errors, we need to assume that an entity which tracks and receives all the errors has been passed into our scope. We will call this entity our supervisor, in terminology similar to Erlang's. This entity needs to track all errors (1.) and all go-routines (2.). The latter is necessary so that it can wait for their termination if it decides to restart the subsystem. There is only one problem left here, which is nicely solved in Erlang by its preemptive scheduling and error-handling support in the runtime: how do we terminate the go-routines to enforce a restart? The supervisor could easily end up waiting forever for a go-routine (e.g. a go-loop), effectively defeating the whole purpose of robustness and availability. Since Clojure cannot preemptively reschedule code in go-routines, we have to interrupt them differently.

We can exploit a simple observation here: almost all go-routines are written around asynchronous operations on channels, which typically cover most of their wall-clock time. We can control these blocking operations, since we already intercept them with our error-handling mechanism, and unblock them by throwing an abort exception in their context, recursively unwinding all parallel go-routines. While this behaviour goes a long way, the programmer needs to be aware that go-routines whose code can run for a long time without asynchronous operations (e.g. blocking IO or long computations) are problematic. It is cooperative concurrency after all.
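A minimal sketch of this interception, assuming core.async on the JVM, could look as follows. The macro `<?*` and the vars `abort`, `never` and `worker` are hypothetical and only illustrate the technique; this is not the superv.async implementation. Every take also listens on an abort channel, which the supervisor closes when it wants the subsystem to unwind.

```clojure
(require '[clojure.core.async :as async :refer [go chan close! alts! <!!]])

;; Take from ch inside a go block, but throw as soon as the abort
;; channel is closed. :priority true makes the abort check win.
(defmacro <?* [abort ch]
  `(let [abort# ~abort
         [v# port#] (alts! [abort# ~ch] :priority true)]
     (if (= port# abort#)
       (throw (ex-info "aborted by supervisor" {:type :aborted}))
       v#)))

(def abort (chan)) ; closed by the supervisor to signal a restart
(def never (chan)) ; nothing is ever put on this channel

(def worker
  (go (try
        (<?* abort never) ; would park forever without the abort
        (catch clojure.lang.ExceptionInfo e
          ;; resources would be freed here before terminating
          (:type (ex-data e))))))

(close! abort)
(<!! worker)
;; => :aborted
```

A go-routine that never reaches such a channel operation never sees the abort, which is the cooperative limitation described above.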

I haven’t found this to be a problem yet, but there is little we can do here, except for doing some deep code transformation with auto-injection of code in the go macro, which still would not help with code outside the JVM. Pulsar uses bytecode instrumentation on the JVM to achieve something similar, but they have also found that preemption is not too important in practice. We cannot use this low-level approach with core.async, as we also need to target JavaScript. (One could do different things on the two hosts though.)

So my first attempt at robust error handling was to adapt full.async. It would have been desirable to use dynamic binding for the supervisor, as it is a typical example of a good use of dynamic binding and the already available go-try and <? constructs would keep working transparently. While this dynamic binding approach worked well on the JVM, it proved incompatible with ClojureScript's dynamic binding.

Somewhat reluctantly, I then decided to fork full.async as superv.async and add a lexical argument for the supervisor to all corresponding full.async operations. While this adds some inconvenience, the benefits outweigh the costs for me. Not only am I now able to track all exceptions in parallel go-routines appropriately, I can also do things like robust connection handling and provide error handling by functional composition to the whole replication system.

An example invocation of a supervisor with the corresponding async primitives and operations.

(let [try-fn (fn [S] (go-try S (throw (ex-info "stale" {}))))
      start-fn (fn [S] ;; will be called again on retries
                 (go-try S
                   ;; you must ensure the freeing of resources:
                   (on-abort S 
                     "do cleanup here")
                   (try-fn S) ;; triggers restart after stale-timeout
                   42))]
  (restarting-supervisor start-fn :retries 3 :stale-timeout 1000))

While the implementation definitely needs more testing to be really fleshed out, I am comfortable saying that it considerably improves on error handling in core.async today. It also communicates normal exceptions to an embedding environment: the supervisor just returns the exception in case it cannot deal with it. I hope you find it useful and report back in the gitter chat or on the Clojure mailing list :).

Erlang inspired error handling with superv.async. - March 10, 2016 - christian weilbach