Corda Enterprise high availability design doc


(Mike Hearn)

This design document focuses on how to make an individual, ordinary Corda node highly available. It does not focus on high availability of network services such as notaries, oracles, etc, which have their own strategies.

Goals

  • Be able to construct a Corda node that is resistant to individual machine failures or restarts.
  • Be able to fail over between a hot and cold backup datacenter, automatically or at will.
  • Be able to tolerate extended outages of network peers.
  • Be able to tolerate extended outages of the Corda network operator (e.g. R3).
  • Be able to do upgrades of a node without taking it completely offline, at least some of the time.
  • Be able to scale from low-cost, low availability nodes up to high-cost, high availability nodes.

Non-goals

  • Be able to distribute a node over more than two datacenters. This may come later.
  • Be able to distribute a node between datacenters that are very far apart latency-wise (unless you don’t care about performance).
  • Be able to tolerate arbitrary byzantine failures within a node cluster.

Background

DLT users come in all shapes and sizes. Some of them will be small and will wish to use the ledger only occasionally; others will be large, systemically important institutions that cannot tolerate any outages at all, e.g. because they would be unable to process or settle trades, or handle payments from their customers.

R3 plans to offer a paid-for Corda Enterprise product that supports advanced high availability features. The design is based on the team’s experience from both the financial industry and engineering HA consumer services at Google and elsewhere. Whilst small operations that don’t need/want the complexity and cost of HA may be satisfied with the open source node, larger operations will benefit from both more advanced technology and commercial support.

As always with Corda, the HA design tries to rely heavily on existing financial industry tools, infrastructures and experience.

Overview

The design principle of Corda is that there are a series of overlapping restart points in the progression of flows, so that data is never lost and eventual flow progression is guaranteed. Initially the intent to send a message during a flow is captured as part of the flow checkpoint along with a unique message ID. The message is then submitted to a local durable Artemis queue that acts as a staging area. An Artemis core bridge is then used as a relay to the target peer node. The bridge consumes the local message copy only when it has been successfully saved to a durable inbox on the target node. From there the message should only be acknowledged when the receiving flow has recorded receipt and updated its own checkpoint. Anywhere along this chain of responsibility, a restart or process failure should lead to retries; although this may lead to message duplication, Corda tracks the IDs of fully consumed messages in order to eliminate duplicates. In the worst-case scenario a flow might need manual restarting from a checkpoint via the Flow Hospital to retrigger message sending, but this should be a very rare occurrence.
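
To make the chain of responsibility concrete, here is a heavily simplified sketch of the checkpoint-then-send and deduplicate-on-receive pattern described above. All of the types and method names (SendRecordingFlowEngine, CheckpointStore, StagingQueue and so on) are invented for illustration and are not the node’s real internals:

```kotlin
// Illustrative sketch only: names and signatures are hypothetical, not the real node APIs.
// It shows the ordering described above: checkpoint the intent to send (with a unique
// message ID) before staging the message, and deduplicate on the receive side by recording
// consumed IDs inside the same transaction that updates the flow checkpoint.

import java.util.UUID

class SendRecordingFlowEngine(
    private val checkpointStore: CheckpointStore,   // durable, backed by the node database
    private val stagingQueue: StagingQueue          // local durable Artemis queue (staging area)
) {
    fun send(flowId: UUID, payload: ByteArray) {
        val messageId = UUID.randomUUID()
        // 1. Record the intent to send as part of the flow checkpoint.
        checkpointStore.recordPendingSend(flowId, messageId, payload)
        // 2. Stage the message locally; the Artemis bridge relays it to the peer's inbox
        //    and only consumes the local copy once the remote broker has stored it.
        stagingQueue.enqueue(messageId, payload)
    }

    fun receive(flowId: UUID, messageId: UUID, payload: ByteArray): Boolean {
        // 3. On the receive side, the consumed-ID check and the checkpoint update happen
        //    in one database transaction, so retries after a crash are harmless.
        return checkpointStore.transaction {
            if (alreadyConsumed(messageId)) {
                false                          // duplicate delivery: acknowledge and drop
            } else {
                markConsumed(messageId)
                updateCheckpoint(flowId, payload)
                true
            }
        }
    }
}

interface CheckpointStore {
    fun recordPendingSend(flowId: UUID, messageId: UUID, payload: ByteArray)
    fun <T> transaction(body: TxScope.() -> T): T
}

interface TxScope {
    fun alreadyConsumed(messageId: UUID): Boolean
    fun markConsumed(messageId: UUID)
    fun updateCheckpoint(flowId: UUID, payload: ByteArray)
}

interface StagingQueue {
    fun enqueue(messageId: UUID, payload: ByteArray)
}
```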

When we move to a High Availability model this process gets more complicated: by necessity the information must be duplicated, and more coordination is needed to ensure no flow step is acted upon twice and that nodes can take over when one dies. It also requires us to investigate how the various HA features interact, and there will be performance trade-offs in return for the increased reliability.

Sitting at the center of every Corda node is an Apache Artemis message queue broker. Artemis is a continuation of the JBoss HornetQ codebase, now developed under the aegis of the Apache ActiveMQ project (but it is not the same product as ActiveMQ itself). Artemis supports clustering and HA out of the box (see the user guide for more info).

The Enterprise node will be structured as a series of micro-services. The open source code is designed with this in mind but isn’t itself structured this way. These micro-services will connect to the clustered/HA message broker and take on the node’s tasks as they are distributed over the message queues. MQ load balancing and work reallocation systems are used to handle the case when a micro-service is killed or restarted whilst work is in flight.

We anticipate at least the following types of microservice being present in a production distributed node and quite possibly more:

  • Flow workers. The flow framework is a central piece of Corda’s design, and each interaction between a node and other nodes is structured into a flow. The bulk of the work of a node takes place in the context of working on a coordinated change to the ledger, thus, flow workers are the engine room of the node. Adding more flow workers provides one way to scale a node up. Recall that “flows” are somewhat long-lived network protocols, written as ordinary blocking code, with the contents of their state machines (stacks and reachable heap objects) being checkpointed into the database. Thus flow workers can be restarted but the pseudo-threads that they’re running can simply be picked up by a different worker and resumed from the database … all through the magic of JVM engineering.
  • Database. Corda is designed to use ordinary relational databases. High end database engines like Oracle, MS SQL Server and Postgres offer support for replicated, HA database setups. Corda has nothing more to say on the topic, beyond attempting to keep as much data as possible in the database.
  • Transaction verification workers. The act of verifying a transaction involves signature checks and running smart contracts. It may also benefit from special hardware. Thus you may wish to sandbox, firewall or otherwise treat these workers specially. Whilst it’s possible we’ll end up deciding to only have flow workers, right now, we’re heading towards a design where the two can be split.
  • RPC frontends. Tools and programs that interact with a Corda node do so via a message queue based RPC protocol of our own design. The work of handling these inbound RPCs (to do database queries, attachment management, start up new flows, etc) would be distributed via MQ load balancing to RPC workers. We will need to take special care of RPCs that return long-lived observables, i.e. streams of events, because if an RPC frontend fails or is restarted, the task of continuing to produce the events and send them to the RPC caller must be handed off to another worker atomically. Note that RPC workers may also act as administrator SSH frontends, which provide the SSH-based node console we’re integrating for system administrators who prefer the command line to the GUI. For example, installing a new CorDapp would be done via such a frontend.
  • Firewall floats. See the discussion on network engineering. Firewall floats run outside the corporate or cloud firewall. They accept inbound messages from the internet and relay outbound messages. They are especially useful for distributed nodes because you typically will not want your microservices directly exposed to the internet! You may want more than one if you have unusually heavy message traffic.
  • HSMs. These would not connect to the MQ broker as they usually speak their own protocol, but we can think of them as being a kind of “service”.
  • (Possibly) Distributed lock service. It’s not clear to us yet whether we’ll need one of these or whether relational database locks will suffice. If we find that workers need to atomically claim or release “work”, e.g. an observable RPC result or some other long-running stateful job, we may need a Raft lock service with machines spread between the hot and failover sites. This would play a similar role to the one Chubby plays in the Google datacenter architecture. Workers would compete to take locks in the lock service as part of doing master elections. Failure of a machine results in a new master being selected, which can then take over work from the failed machine. A sketch of this claim/election pattern follows the list.
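
As referenced in the lock service bullet, here is a hypothetical sketch of master election built on relational database locks, one of the two options under consideration. The table, columns and DatabaseLeaderElector class are invented; a Raft-based lock service would expose a similar claim/release shape:

```kotlin
// Hypothetical sketch: master election via a relational database row lock. It assumes one
// pre-populated row per resource in an invented work_ownership table, and ignores lease
// expiry/heartbeating, which a real design would need for automatic takeover.

import javax.sql.DataSource

class DatabaseLeaderElector(private val dataSource: DataSource, private val workerId: String) {

    /** Tries to become master for [resource]; returns true if this worker now owns it. */
    fun tryClaim(resource: String): Boolean {
        dataSource.connection.use { conn ->
            conn.autoCommit = false
            // SELECT ... FOR UPDATE serialises competing claims on the same row.
            conn.prepareStatement(
                "SELECT owner FROM work_ownership WHERE resource = ? FOR UPDATE"
            ).use { stmt ->
                stmt.setString(1, resource)
                val rs = stmt.executeQuery()
                if (rs.next() && rs.getString("owner") != null && rs.getString("owner") != workerId) {
                    conn.rollback()
                    return false          // someone else is already master
                }
            }
            conn.prepareStatement(
                "UPDATE work_ownership SET owner = ?, claimed_at = CURRENT_TIMESTAMP WHERE resource = ?"
            ).use { stmt ->
                stmt.setString(1, workerId)
                stmt.setString(2, resource)
                stmt.executeUpdate()
            }
            conn.commit()
            return true
        }
    }
}
```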

Note that inter-site replication is assumed to be handled by the database layer, for instance, by using a SAN. This approach stands in contrast to other platforms:

  • Ethereum does not have any kind of distributed node and relies on the global ledger for all data replication (useless for e.g. private transaction notes, keys, etc)
  • Fabric uses a database called “CouchDB”, which isn’t anything like a normal database and doesn’t seem to have any notion of strong data integrity in a multi-site setup. There is a commercial version from IBM that offers some more features.

HA network connectivity will be provided by support for IP multi-homing in the network map.
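
To make the multi-homing idea concrete, here is a trimmed-down illustration; real network map entries carry certificates, platform versions and so on, and the types below (HostAndPort, AdvertisedNode) are simplified stand-ins rather than the actual NodeInfo structure:

```kotlin
// Trimmed-down illustration of multi-homing in the network map: a single node identity
// advertises several reachable endpoints, one per site.

data class HostAndPort(val host: String, val port: Int)

data class AdvertisedNode(
    val legalName: String,
    val addresses: List<HostAndPort>     // all sites advertised simultaneously
)

fun main() {
    val node = AdvertisedNode(
        legalName = "O=Bank A, L=London, C=GB",
        addresses = listOf(
            HostAndPort("dc1.banka.example.com", 10002),   // primary site
            HostAndPort("dc2.banka.example.com", 10002)    // failover site
        )
    )
    // Peers try every advertised address and spread new outbound flows across whichever
    // connections succeed; closing a site's port steers traffic away from that site.
    node.addresses.forEach { println("advertised endpoint: ${it.host}:${it.port}") }
}
```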

Introducing parallel flow workers requires that messages are routed to the right workers and that there is no possibility of the same flow being executed in parallel (as an individual flow is single threaded). This is probably best done by using MQ features to shard session IDs over several queues, but the exact strategy will be explored in a future design document; one possible scheme is sketched below. We have already done some work on distributing work across multiple in-process flow worker threads.
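
One possible sharding scheme (an assumption for illustration, not a settled design) is to hash the flow session ID onto a fixed set of per-worker queues, so that every message for a given session is consumed by the same single-threaded worker:

```kotlin
// Illustrative sketch: deterministic session-ID sharding over a fixed set of worker queues.
// Queue names and the SessionSharder class are invented for this example.

import kotlin.math.abs

class SessionSharder(private val queueNames: List<String>) {

    /** Maps a session ID to one of the per-worker queues deterministically. */
    fun queueFor(sessionId: Long): String {
        val index = abs(sessionId % queueNames.size).toInt()
        return queueNames[index]
    }
}

fun main() {
    val sharder = SessionSharder(listOf("flow.worker.0", "flow.worker.1", "flow.worker.2"))
    // Every message carrying session 123456789L lands on the same queue, so that flow
    // can never be executed by two workers in parallel.
    println(sharder.queueFor(123456789L))
}
```

A fixed modulo scheme like this breaks down if the worker pool is resized while flows are in flight; a production design would likely need consistent hashing or explicit queue reassignment, which is part of what the future design document would have to settle.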

Hot/warm replication and failover strategies

Terminology:

  • Hot/hot replication - two physically separated clusters which are both live simultaneously. This requires synchronously replicated, multi-master database writes (or something like Google Spanner, but we are not assuming this is available to users).
  • Hot/warm replication - two physically separated clusters, one of which is live and one of which is asynchronously replicating from the other. This does not require synchronised database writes, so it can go a lot faster, but at the cost that if the master cluster suffers a sudden failure, writes may be trapped in it, preventing clean failover. This approach is useful when you know you can do planned failovers, e.g. if you trust your uninterruptible power supply. Failover may be triggered automatically by a master election protocol.

The same basic concepts apply even if you have more than two clusters.

We’d like to support both modes for a node. The bulk of the difference is in how the underlying database is configured. From Corda’s perspective, the difference is primarily about when to open listening ports to other peers. In both setups the IP addresses of all sites are advertised simultaneously. Peers will attempt to connect to all advertised IP addresses and distribute new outbound flows across all available connections. Therefore you can control whether traffic arrives at a site by simply opening and closing the TCP/IP ports at each site. This functionality would be offered via RPC and the shell.

Doing a failover in a hot/warm setup means closing the network port on the live datacenter (draining it), waiting for in-flight flows to reach checkpoints so that database traffic quiesces and replication can catch up, flipping the direction of replication on the database system, and then opening the network port on the new master site (undraining it).
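
As a concrete illustration of this sequence, here is a minimal orchestration sketch. The SiteControl and ReplicatedDatabase interfaces, the timeouts and the plannedFailover function are all hypothetical; in practice these steps would be driven through the node’s RPC/shell interface and the database vendor’s own tooling.

```kotlin
// Hedged sketch of the hot/warm planned failover sequence described above.

import java.time.Duration

interface SiteControl {
    fun closeP2pPort()                     // "drain": stop accepting new peer traffic
    fun openP2pPort()                      // "undrain"
    fun awaitFlowsAtCheckpoints(timeout: Duration): Boolean
}

interface ReplicatedDatabase {
    fun awaitReplicationCaughtUp(timeout: Duration)
    fun reverseReplicationDirection()      // old master becomes the replica, and vice versa
}

fun plannedFailover(live: SiteControl, standby: SiteControl, db: ReplicatedDatabase) {
    live.closeP2pPort()
    check(live.awaitFlowsAtCheckpoints(Duration.ofMinutes(5))) {
        "In-flight flows did not quiesce; aborting failover"
    }
    db.awaitReplicationCaughtUp(Duration.ofMinutes(1))
    db.reverseReplicationDirection()
    standby.openP2pPort()
}
```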

In a hot/warm setup, care must be taken that the node in the backup cluster is aware that it’s in read-only mode, so that it will not try to start scheduled flows or allow mutating RPCs.

It should be possible to failover between clusters of different worker pool sizes.

In hot/hot setups, peers are expected to assign individual flows to particular advertised IP:port combinations, that is, peers assist with ensuring that flow processing doesn’t bounce pointlessly between sites.

There are some open questions for hot/hot replication:

  • To what extent will users want a multi-site hot/hot setup to appear to be a single unified node? That is, should we be treating a hot/hot setup as if it were a single node, e.g. for shell and RPC purposes? Or is it OK to expose the notion of there being two nodes that happen to be operating on the same database?
  • If there’s any singleton state left, e.g. the scheduler, it will still need a notion of cross-site mastership managed via a lock service.
  • Is there a smooth transition path to let us reach different levels of HA in pieces?

MQ Replication

To begin with, in-flight RPCs will not fail over. The client library is expected to retry. If RPC workers are split out from the node into microservices, we could use MQ features to configure retries for idempotent RPCs automatically.
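
A minimal sketch of what such client-side retry could look like, assuming the caller knows the wrapped RPC is idempotent; the retryRpc helper, the backoff numbers and the exception handling policy are illustrative rather than part of any real client library:

```kotlin
// Illustrative client-side retry wrapper for idempotent RPCs.

fun <T> retryRpc(maxAttempts: Int = 3, initialBackoffMs: Long = 500, call: () -> T): T {
    var backoff = initialBackoffMs
    var lastError: Exception? = null
    repeat(maxAttempts) {
        try {
            return call()
        } catch (e: Exception) {            // in practice: only connection/transport failures
            lastError = e
            Thread.sleep(backoff)
            backoff *= 2                    // simple exponential backoff between attempts
        }
    }
    throw IllegalStateException("RPC failed after $maxAttempts attempts", lastError)
}

// Usage: wrap an idempotent call such as a network map snapshot query; do NOT wrap
// non-idempotent calls (e.g. starting a flow) unless deduplication is handled elsewhere.
// val peers = retryRpc { proxy.networkMapSnapshot() }
```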

Artemis supports MQ replication. However, for handling flows, RPC and P2P traffic, it should not be necessary to do MQ replication between sites as Corda keeps all state in the database. Care must be taken to ensure message dedupe works correctly during failover.

Artemis HA may be used to provide redundancy within a cluster. In that case the broker can be scaled up by adding more machines and may survive outages of single machines.

Low availability nodes

Node outages result in other nodes buffering up messages on disk and retrying delivery until either the target node comes back or the network’s event horizon is reached. A typical event horizon might be 30 days, leaving plenty of time for even small, low availability nodes to return from a severe outage. This situation may occur in the case of, e.g., small shipping ports in third world countries which experience a severe storm or power failure.

For many deployments, this ability to tolerate transient outages may mean that the best way to handle upgrades, IP movements, hardware upgrades etc is simply to take the node offline for a while, safe in the knowledge that any in-flight flows have been checkpointed into the databases of the peer nodes and can be resumed whenever wanted.

Network operator infrastructure

Inspired by Bitcoin, Corda nodes attempt to minimise their dependence on centralised network infrastructure. A Corda network operator must provide at least the following services:

  • A live network map showing node IP addresses, services offered, etc.
  • A permissioning (doorman) server through which new nodes request access.

Network operators may also operate additional infrastructure on behalf of the network members, such as notaries, oracles, certificate revocation servers and so on, but this is not mandatory.

Neither the network map nor the doorman is critical to the operation of the network. In case of an outage, old cached network map data will be used. New nodes will not be able to join the network, but this is not usually a time-sensitive operation.

Rolling upgrades

It may be desirable to upgrade a node, or a CorDapp, without taking any node downtime. This sort of thing is tricky but can be done if planned for carefully, for example by using multi-phase rollout strategies in which support for new protocol features is introduced before they start being used. We plan to support upgrades of apps without having to restart the node. Being able to upgrade the node itself without downtime requires a fully distributed node so restarts can occur in pieces, with automatic work rebalancing along the way. This is planned for later. Until then, simply restarting the node would yield minimal downtime. From the perspective of other peers on the network there is no outage from doing this, just a transient latency increase.

Corda node components

For rolling upgrades of the MQ broker or database machines, we delegate to the providers of those pieces of software.

For rolling upgrades of flow workers, it is sufficient to simply kill and restart the worker processes with the new version of the software. To avoid wasted work, i.e. re-execution of code since the last checkpoints, a flow worker can be asked to drain itself before shutdown. “Draining” means the worker rejects new inbound requests to start or resume flows, but finishes work on the in-flight requests. It is likely that flow workers would take only a few seconds at most to completely drain.
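
A minimal sketch of the drain behaviour described above; the FlowWorker class and its methods are invented for illustration and are not the real worker API:

```kotlin
// Illustrative drain state machine: a draining worker stops accepting new work but lets
// in-flight flows run on to their next checkpoint, so a restart wastes no work.

import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.atomic.AtomicInteger

class FlowWorker {
    private val draining = AtomicBoolean(false)
    private val inFlight = AtomicInteger(0)

    /** Called by the work dispatcher before handing this worker a new flow or resume request. */
    fun tryAccept(): Boolean {
        if (draining.get()) return false      // reject new work while draining
        inFlight.incrementAndGet()
        return true
    }

    /** Called when a flow reaches its next checkpoint or completes. */
    fun onWorkFinished() {
        inFlight.decrementAndGet()
    }

    /** Asks the worker to drain; returns once nothing is in flight and it is safe to kill. */
    fun drainAndAwait() {
        draining.set(true)
        while (inFlight.get() > 0) {
            Thread.sleep(100)                 // typically only a few seconds in total
        }
    }
}
```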

The same strategy applies to other stateless workers, such as RPC frontends and tx verification workers.

CorDapps

For rolling upgrade of CorDapps between (backwards compatible) versions, the following strategy can be used:

If the CorDapp provides a relational mapping of states and changes the database schemas it uses, the database administrator would need to apply a schema change in an outage-free way at this point. Good database engines can usually do many kinds of schema change without holding long table locks. CorDapps are expected to be designed with knowledge of what they can reasonably expect databases to do.

The new version of the CorDapp is then “installed” by inserting it into the database via an RPC or administrator frontend.

The node configuration (which may eventually also be stored in the database) is adjusted to point new inbound flow requests at the new version. Note that in-flight flows will continue to be handled by the old apps.
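
The sketch below illustrates, under heavy assumptions, how that routing decision could look: new sessions are dispatched to the newest installed version while checkpointed flows stay pinned to the version they started with. None of these names (InstalledCordapp, CordappRouter) are real node APIs.

```kotlin
// Hypothetical version-routing sketch for rolling CorDapp upgrades.

data class InstalledCordapp(val name: String, val version: Int)

class CordappRouter(installed: List<InstalledCordapp>) {
    private val newestByName = installed.groupBy { it.name }
        .mapValues { (_, versions) -> versions.maxByOrNull { it.version }!! }

    /** New inbound sessions go to the newest installed version... */
    fun forNewSession(cordappName: String): InstalledCordapp =
        newestByName.getValue(cordappName)

    /** ...but an in-flight flow stays pinned to the version recorded in its checkpoint. */
    fun forExistingCheckpoint(cordappName: String, checkpointedVersion: Int): InstalledCordapp =
        InstalledCordapp(cordappName, checkpointedVersion)
}
```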

Tools will be provided to monitor the drain of old flows, and to perform controlled upgrades of the ledger states in case it’s necessary (please see the Contract Upgrades discussion).

Note that unlike draining worker processes, draining old versions of CorDapps may take quite some time, as in some cases flows cannot be upgraded in place and thus the old versions must complete until only the new version is in use. Likewise, upgrading the data structures held on the ledger itself requires coordination with other interested parties who may have their own timelines.

Tradeoffs

Is it reasonable to rely on the user to provide a synchronously replicated relational database? This has been a topic of extensive debate within the team.

An alternative design would be to handle replication entirely at the application level by splitting message streams. This isn’t the currently preferred approach because realistic Corda applications will rarely stand alone. There will usually be private, user-specific data stored in other database tables … things like customer lists, private transaction notes, tables required to implement internal business workflows and reporting requirements and so on. Indeed, in future CorDapps may be ports of existing standalone apps to the DLT context, with much of the business logic being reused. These apps will probably depend on a highly available database anyway, and because CorDapps have direct SQL access to the underlying database (exactly for this reason), it’s unlikely that we can ever represent all application state as a stream of idempotent messages.

The exact role of MQ replication and Artemis’ HA features is also something we’ve debated. Many organisations already run a core MQ platform and would like to reuse it. We’re open to this, and there is already a design doc on the topic. Obviously, in the case where the MQ broker becomes pluggable, Corda HA becomes partly about how that component is configured.

Do we need a cross-site lock service to do automatic master elections? Possibly. However in the async replication setting you can’t tolerate unexpected cluster outages anyway … you’d need at least a few seconds to cleanly finish replication. In the synchronously replicated setup, sudden outages can be tolerated but it’s possible that the database layer itself provides sufficient locking primitives already and a Raft cluster on top would be unnecessary.

Staging of work

It would be good to develop this feature incrementally so it can be launched to Enterprise users sooner rather than later.

The obvious place to start is with the current unified/monolithic server model, with a third party database instead of H2. We can then add the notion of drains, also useful for smooth upgrades that don’t waste work by replaying from checkpoint, which can be triggered by admins using SSH access to the shell, e.g. from scripts. We then just need to ensure that flow sessions have IP:port affinity, which I think is already the case as it’s needed for the distributed notary.

We should then explore a rapid proof of concept for hot/hot operation between two nodes that are joined together with a synchronously replicated database. RPC client libraries would need to be taught how to load balance (this is easy because Artemis already supports such features) and singleton state like flow scheduling should be modified to ensure that only one of the two node replicas takes ownership.

Microservices can then be split out of the node incrementally, in separate releases, until the original node process is left with trivial matters that do not need to be HA.