Alfonso de la Rocha

Jan 25, 2018

Who is to blame in a distributed world?

(Disclaimer: None of the opinions in this post represent the views of Alastria as an organization, nor those of any of the institutions I work or collaborate with. This is a personal opinion.)

Probably one of the most used images in Medium posts talking about CS :)

If something fails in a distributed system owned by everyone but managed by no one, who would you blame? This is the million-dollar question, and it raises an issue that we will have to face (and that we have actually been facing this week) in Alastria. As mentioned in my previous posts, Alastria uses IBFT as its consensus algorithm. In IBFT we distinguish between two types of nodes:

  • General nodes: Standard participants of the system. They deploy applications on the blockchain infrastructure, send transactions, deploy smart contracts and perform any other task a user of a blockchain infrastructure would want to perform.

  • Validator nodes: The selected nodes that take part in the consensus algorithm and validate the transactions sent by general nodes.

This is the current status of Alastria’s test-net: general nodes use the blockchain, and their transactions are validated by a set of selected validator nodes. Owners of general nodes do not necessarily need to deploy a validator node (at least it is not mandatory). The validator nodes are managed by their owners, following the guidelines and recommendations of Alastria’s core technical team (to be clear, the people developing the infrastructure). This was “kind of working” until this week.

The problem appeared when, out of the blue, some of the validator nodes started failing or misbehaving and went down. When this happened, all new transactions in the network were queued and the system stopped mining, i.e. the consensus algorithm stopped making progress. In IBFT, “the system can tolerate at most of F faulty nodes in a N validator nodes network, where N=3F + 1”. So it seemed that more than F validator nodes had gone down at the same time, killing our beloved network.
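To make the arithmetic behind that quote concrete, here is a minimal Python sketch. The function names are made up for illustration, and the numbers in the example are hypothetical, not Alastria's real validator count. It also assumes that when N is not exactly of the form 3F + 1, F is the largest integer with 3F + 1 ≤ N, and that 2F + 1 live validators are needed for the network to keep committing blocks.

```python
def max_faulty(n_validators: int) -> int:
    """Largest F such that 3F + 1 <= N (assumed rounding rule)."""
    if n_validators < 1:
        raise ValueError("need at least one validator")
    return (n_validators - 1) // 3


def quorum_size(n_validators: int) -> int:
    """Validators that must agree on a block: 2F + 1."""
    return 2 * max_faulty(n_validators) + 1


def can_make_progress(n_validators: int, alive: int) -> bool:
    """True if enough validators are alive to reach the 2F + 1 quorum."""
    return alive >= quorum_size(n_validators)


if __name__ == "__main__":
    # Hypothetical network: 10 validators, only 4 still alive.
    n, alive = 10, 4
    print(max_faulty(n), quorum_size(n), can_make_progress(n, alive))
    # prints: 3 7 False
```

With these hypothetical numbers the network tolerates 3 faulty validators and needs 7 to agree, so 4 survivors are not enough and block production stalls.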

When we managed to contact the nodes’ owners and revive some validator nodes, the network started mining again… but it was mining empty blocks, without validating any of the queued transactions or any of the new transactions being sent. “But WHY?”, we asked ourselves desperately.

Apparently, only three of the partners’ validators stayed alive along with Alastria’s validator node when the network failed. The number of failed validators was definitely higher than the F the protocol tolerates. Moreover, we had an even number of validator nodes, and we saw that regular nodes were at two different stages of synchronization, i.e. they reported different blockNumber values. This made us think that an unintentional fork had happened in the blockchain, but then we read the following in IBFT’s description: “Blocks in Istanbul BFT protocol are final, which means that there are no forks and any valid block must be somewhere in the main chain. To prevent a faulty node from generating a totally different chain from the main chain, each validator appends 2F + 1 received COMMIT signatures to extraData field in the header before inserting it into the chain.” So a fork did not seem to be the cause of the desynchronization…
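The 2F + 1 rule in that quote can be illustrated with a small Python sketch. This is a simplification, not how IBFT itself is implemented: the real protocol recovers signer addresses from the signatures stored in extraData, whereas here the signers are assumed to be already recovered and are represented by plain strings. The function name and the validator identifiers are hypothetical.

```python
def has_commit_quorum(commit_signers, validators) -> bool:
    """Check that at least 2F + 1 distinct known validators signed COMMIT.

    commit_signers: identifiers recovered from a block's COMMIT signatures
                    (assumed to be recovered elsewhere).
    validators:     identifiers of the current validator set.
    """
    validator_set = set(validators)
    if not validator_set:
        raise ValueError("validator set must not be empty")
    f = (len(validator_set) - 1) // 3
    # Ignore duplicate signatures and signers outside the validator set.
    valid_signers = set(commit_signers) & validator_set
    return len(valid_signers) >= 2 * f + 1


if __name__ == "__main__":
    validators = ["v1", "v2", "v3", "v4"]  # hypothetical set, so F = 1
    print(has_commit_quorum(["v1", "v2", "v3"], validators))
    print(has_commit_quorum(["v1", "v1", "v2"], validators))
```

Because any two groups of 2F + 1 validators overlap in at least one honest node when N = 3F + 1, two conflicting blocks at the same height cannot both collect enough commits, which is why the description rules out forks.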

We haven’t fixed the problem yet, and we are trying to reproduce the issue in our test scenario so that we can technically avoid or minimize this “dead validator problem”. However, this issue raises a much more metaphysical question that we should address in the design of Alastria’s infrastructure. Who is to blame when a validator node that is owned by a partner, but belongs to Alastria (as a computing infrastructure, not as an organization), fails? Moreover, who should be in charge of managing these nodes: their owners or a dedicated Alastria task team? It is a hard question. For regular nodes it seems clear to me: each partner manages its own. However, validator nodes are trickier, as they are critical for the operation of the network.

As far as I know, all the distributed consortium initiatives built on blockchain are facing this problem of “who is responsible for what”. So let’s fight to be the first ones to solve it efficiently, both technically and from a regulatory standpoint. For now, in Alastria’s test-net we are building a lightweight monitoring system with limited access to nodes, which enables Alastria’s core technical team to perform simple management tasks on validator nodes (restart, update version, get operational statistics) so that we can keep advancing the development of the infrastructure… but more about this piece of software in my next posts.
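As a rough idea of the kind of check such a monitor could run (this is not Alastria's actual tool, just a sketch under assumptions), the Python snippet below groups nodes by the block height they report and flags the ones that lag behind. The heights are assumed to have been collected beforehand, for example through the standard eth_blockNumber JSON-RPC call. The node names, function names and the max_lag parameter are hypothetical.

```python
from collections import defaultdict


def group_by_height(heights: dict) -> dict:
    """Map each reported block height to the sorted list of nodes at it."""
    groups = defaultdict(list)
    for node, height in heights.items():
        groups[height].append(node)
    # Highest block first, so the head of the chain comes first.
    return {h: sorted(nodes) for h, nodes in sorted(groups.items(), reverse=True)}


def lagging_nodes(heights: dict, max_lag: int = 0) -> list:
    """Nodes that are more than max_lag blocks behind the highest block seen."""
    if not heights:
        return []
    head = max(heights.values())
    return sorted(node for node, h in heights.items() if head - h > max_lag)


if __name__ == "__main__":
    # Hypothetical snapshot of heights reported by three nodes.
    snapshot = {"node-a": 120, "node-b": 120, "node-c": 97}
    print(group_by_height(snapshot))
    print(lagging_nodes(snapshot))
```

More than one group in the output would match what we saw: regular nodes stuck at two different stages of synchronization.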

Is this “monitoring system” the best solution to the distributed responsibility problem? Probably not, as we are centralizing it again… for now, but you can help us define a “best solution”. Let’s keep in touch ;)