Erlang inspired error handling with superv.async.
March 10, 2016
Clojure error handling
Proper error handling is hard. I have often struggled with it without
recognizing it. You might think that you can add try-catch mechanisms
to sections you find prone to error, typically sections doing IO with remote
systems and services. While you recognize such sections when you write
them yourself, this context is lost when you use libraries in an impure
language. Clojure goes a long way in providing a pure state management concept
which shields you from errors. As long as you properly separate your IO
functions, there is a limited interface on which exceptions can propagate. But
is this assurance good enough? I have experienced that once your state becomes
more and more linked to external systems, code covering these systems also tends
to spread, yielding possibly distributed error states which need to be addressed
on higher levels.
Clojure has nice concepts for dealing with errors beyond try-catch
mechanisms with exceptions. There are libraries providing the Common Lisp
condition system for Clojure and, as Chris Houser has pointed out, even
simple dynamic scoping will do (and will compose properly). This approach
has many benefits; importantly, it can allow robustness through restarts,
which are not possible once an exception has unwound the stack.
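The dynamic scoping idea can be sketched in a few lines of plain Clojure. This is only an illustration under my own assumptions: the names `*on-parse-error*`, `parse-number` and `parse-all` are made up for this example and do not come from any of the libraries mentioned above.

```clojure
;; Hypothetical handler var; by default it simply rethrows the exception.
(def ^:dynamic *on-parse-error*
  (fn [input ex] (throw ex)))

(defn parse-number
  "Parses s as a long. On failure the dynamically bound handler decides
  what to do while parse-number's frame is still on the stack."
  [s]
  (try
    (Long/parseLong s)
    (catch NumberFormatException e
      (*on-parse-error* s e))))

(defn parse-all
  "Eagerly parses all strings (mapv), so the binding is still in effect."
  [strs]
  (mapv parse-number strs))

;; A caller higher up chooses the recovery strategy: substitute 0 and go on.
(binding [*on-parse-error* (fn [_input _ex] 0)]
  (parse-all ["1" "x" "3"]))
;; => [1 0 3]
```

The low-level function does not decide how to recover; whoever establishes the binding does, and the computation continues instead of being unwound to the caller. Note that this only works as long as the work happens eagerly inside the dynamic scope of `binding`.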
Erlang error handling
So does this cut it? I have faced a more profound problem with Clojure error handling in core.async. This led me to investigate the design space a bit, most prominently the Erlang "Let it crash" principle.
So before diving into error handling with core.async, I will briefly describe
the Erlang error handling concept. It is a defining feature of Erlang, which
makes it a profoundly solid language for building distributed systems, which
are inherently error-prone, with very high reliability despite lacking a
static type system. Clojure with core.async (CSP) is very similar in this
regard, but by default it lacks error handling of the same quality.
Erlang uses links and monitors to track different processes (actors) automatically and, using these mechanisms, preemptively terminates processes linked to a dying process. Importantly, this makes it easy to restart whole subsystems if something goes wrong. You do not handle the error locally if you are not sure what to do about it, but let your process crash, knowing that somebody might be able to catch the error on their monitor and deal with it appropriately. Error handling is generally done at higher levels and added incrementally as the programmer experiences errors. But despite this lazy approach, the restarting concept keeps the system available as long as some higher system entity can properly restart. Technically there are many Erlang-specific concepts, like the linking of processes at runtime by referring to their name. But this is not essential for the concept; it only allows easier mutable adaptation of a running system and fits Erlang's actor design.
We can hence summarize the requirements for robust error handling in two qualities (these are only the core requirements to recover robustness):
- All errors must be caught and dealt with somewhere
- Resources must be properly freed to allow restarts
So how does this relate to Clojure? core.async uses the concept of go-routines, which are not the same as Erlang processes as they lack the addressing by names (like PIDs). Instead they are passed around as values, expanding a call tree at construction. They can be bound to vars or managed in STM primitives, but typically are rather used as in go-lang as dedicated routines which are called. core.async hence also does not have the send/receive mechanism of Erlang. This is not critical though, as we can recover the error handling requirements above without them.
Erlang inspired error handling in core.async
Coming back to error handling in core.async: it does not provide any error
handling by default, and throwing exceptions is broken. While it is easy to
fix this by installing a default handler for all errors somewhere, that does
not recover the Erlang qualities, but only installs one global handler. One
might wonder whether go-lang itself has good error handling, but I have found
that it has no comparable concepts (it is not really seen as a problem there)
and advocates dealing with errors locally when possible. This does not
recover the robustness requirements mentioned above and can leave subsystems
in a broken state by wrongly handling errors in too local a scope. So what we
need to do is ensure that we catch all exceptions and propagate them, and
also that once we decide to deal with an exception, we are able to restart
the subsystem affected by it. This requires bidirectional communication, as
we need to signal our restart intent. Importantly, it is hence not enough to
put exceptions on an error channel with a go-try macro construct.
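For readers who have not seen the go-try pattern, here is a minimal sketch of it. It is my simplified reconstruction of the idea, not the actual source of full.async or superv.async; the helper `throw-if-throwable` and the example function `failing` are assumptions made for this illustration, and catching `Throwable` makes it JVM-only.

```clojure
(require '[clojure.core.async :as async :refer [go <!!]])

(defmacro go-try
  "Like go, but an exception thrown in the body becomes the channel's value."
  [& body]
  `(go (try ~@body
            (catch Throwable e# e#))))

(defn throw-if-throwable
  "Rethrows v if it is an exception, otherwise returns it unchanged."
  [v]
  (if (instance? Throwable v) (throw v) v))

(defmacro <?
  "Like <!, but rethrows exceptions taken from the channel. Use inside go-try."
  [ch]
  `(throw-if-throwable (async/<! ~ch)))

;; An inner go-routine fails ...
(defn failing []
  (go-try (throw (ex-info "boom" {}))))

;; ... and the exception travels one way, up to whoever takes from the result.
(def result
  (<!! (go-try (<? (failing))
               :never-reached)))

(instance? clojure.lang.ExceptionInfo result)
;; => true
```

The error only flows in one direction, from the failing go-routine to its consumer. Nothing here can tell sibling go-routines to stop or wait for them to free their resources, which is why this pattern alone does not satisfy the two requirements above.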
To recover Erlang-like subsystem modelling of errors, we need to assume that
an entity has been passed into our scope which tracks and receives all the
errors. We will call this entity our supervisor, in similar terminology to
Erlang. This entity needs to track all errors (1.) and all go-routines (2.).
The latter is necessary so it can wait for their termination if it decides to
restart the subsystem. There is only one problem left here, which is nicely
solved in Erlang by its preemptive scheduling and error-handling support in
the runtime: how do we terminate the go-routines to enforce a restart? The
supervisor could easily wait for a go-routine (e.g. a go-loop) forever,
effectively defeating the whole purpose of robustness and availability. Since
Clojure cannot preemptively reschedule code in go-routines, we have to
interrupt them differently.
We can exploit a simple observation here: almost all go-routines are written to perform asynchronous operations on channels, which typically cover most of their wall-clock time. We can control these blocking operations, since we already intercept them with our error-handling mechanism, and unblock them by throwing an abort exception in their context, recursively unwinding all parallel go-routines. While this behaviour goes a long way, the programmer needs to be aware that go-routines with code that can take a long time without asynchronous operations (e.g. blocking IO, long computations etc.) are problematic. It is cooperative concurrency after all.
I haven't found this to be a problem yet, but there is little we can do here,
except for doing some deep code transformation with autoinjection of code in
the go-macro, which still would not help with code outside the JVM. Pulsar
uses bytecode instrumentation on the JVM to achieve something similar, but
they have also found that preemption is not too important in practice. We
cannot use this low-level approach with core.async, as we also need to target
JavaScript. (One could do different things on the two hosts though.)
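The idea of unblocking a parked channel operation by throwing an abort exception can be sketched as follows. This is a minimal illustration under my own assumptions, not how superv.async is implemented: the macro `<?abortable`, the helper `abortable-take-result` and the convention that closing `abort-ch` signals an abort are all hypothetical.

```clojure
(require '[clojure.core.async :as async :refer [go chan close! <!! alts!]])

(defn abortable-take-result
  "Interprets the [value port] pair returned by alts!. Throws if the
  abort channel fired, otherwise returns the value that was taken."
  [[v port] abort-ch]
  (if (= port abort-ch)
    (throw (ex-info "aborted by supervisor" {:type :aborted}))
    v))

(defmacro <?abortable
  "Takes from ch inside a go block, but wakes up with an exception as soon
  as abort-ch is closed. :priority true checks abort-ch first."
  [abort-ch ch]
  `(let [abort# ~abort-ch]
     (abortable-take-result (alts! [abort# ~ch] :priority true) abort#)))

(def abort-ch (chan)) ; closed by a (hypothetical) supervisor to abort
(def never (chan))    ; nothing is ever put here, so a plain <! would hang

(def worker
  (go (try
        (<?abortable abort-ch never)
        :finished
        (catch clojure.lang.ExceptionInfo e
          ;; this is where resources would be freed before terminating
          (:type (ex-data e))))))

(close! abort-ch)          ; signal the abort
(def outcome (<!! worker)) ; the worker terminates instead of waiting forever
;; outcome => :aborted
```

A go-routine that is busy with a long computation or blocking IO never reaches such a take, so it cannot be interrupted this way, which is exactly the cooperative limitation described above.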
So my first attempt at robust error handling was to adapt full.async. It
would have been desirable to use dynamic binding for the supervisor, as it is
a typical example of a good use of dynamic binding, and the already available
go-try and <? constructs would keep working transparently. While this dynamic
binding approach worked well on the JVM, it proved incompatible with
ClojureScript's dynamic binding.
Somewhat reluctantly I decided to fork full.async as superv.async then and add a lexical argument for the supervisor to all corresponding full.async operations. While this adds some inconvenience, the benefits outweigh the costs for me. Not only am I now able to track all exceptions in parallel go-routines appropriately, I can also do things like robust connection handling and provide error handling by functional composition to the whole replication system.
Here is an example invocation of a supervisor with the corresponding async primitives and operations:
(let [try-fn (fn [S] (go-try S (throw (ex-info "stale" {}))))
      start-fn (fn [S] ;; will be called again on retries
                 (go-try S
                   ;; you must ensure the freeing of resources:
                   (on-abort S
                     "do cleanup here")
                   (try-fn S) ;; triggers restart after stale-timeout
                   42))]
  (restarting-supervisor start-fn :retries 3 :stale-timeout 1000))
While the implementation definitely needs more testing to be really fleshed out, I am comfortable saying that it considerably improves on error handling in core.async today. It also communicates normal exceptions to an embedding environment: the supervisor just returns them if it cannot deal with them. I hope you find it useful and report back in the gitter chat or on the Clojure mailing list :).