Critical Bug Report - 21st March 2019

On the 21st of March 2019, approximately 5 days before the scheduled Summer Sigyn hardfork to enable Infinite Staking, the Loki team identified a situation that would cause a consensus divergence in the Service Node lists between version 3.0.0 nodes and all version 2.0.x nodes. At the time, most of the network was still running v2.0.x, but a significant portion had already updated to 3.0.0.

The version breakdown of the Service Node network was:

SN versions: 3.0.0 [209], 2.0.4 [36], 2.0.3 [109], 2.0.2 [43], 2.0.1 [57], 2.0.0 [11], unknown [1]

The Problem

As a part of the Infinite Staking release, the staking requirement curve was changed to allow nodes to be staked indefinitely. Under the old curve, the staking requirement would fall to its minimum of 10,000 Loki and then rise back up towards 15,000 Loki. With Infinite Staking, this would have allowed nodes to stay staked indefinitely at the lower staking requirement, even as the requirement rose. The curve was therefore modified so that the minimum was set at 15,000 Loki and never increased from there. The modification also decreased the rate of decline to offset the new, higher minimum.

In order to do this, a date had to be hardcoded into the software at which nodes would begin to uphold the new curve. This was originally set for the 20th of March, in line with the original Summer Sigyn hardfork date. However, due to concerns about constrained public testing times, the hardfork was pushed back 6 days to the 26th of March, and the height at which the staking requirement curve change was to take place was not adjusted in line with the new hardfork date. This was a mistake made by the development team. It was caught by prominent community contributor Jagerman and confirmed with the rest of the team on the (Australian) morning of the 21st of March.

The problem with this was that as the curves diverged, 3.0.0 nodes would start to enforce a higher staking requirement than the 2.0.x nodes. If a 2.0.x node staked at the minimum staking requirement, the 3.0.0 nodes would not recognise it as being ‘full’ and thus would not add it to their Service Node lists. This would then cause a divergence in the consensus about the state of the Service Node lists.
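To make the divergence concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the two curve functions are toy stand-ins rather than the real Loki formulas, and the heights are made-up numbers chosen only so that the stale switch height falls before the hardfork height. It is meant to illustrate the mechanism, not to reproduce the daemon's actual logic.

```python
from dataclasses import dataclass
from typing import List, Optional


def old_requirement(height: int) -> int:
    """Toy stand-in for the 2.0.x curve (NOT the real formula)."""
    return max(10_000, 20_000 - height)


def new_requirement(height: int) -> int:
    """Toy stand-in for the new curve: slower decline, 15,000 floor."""
    return max(15_000, 20_000 - height // 2)


@dataclass(frozen=True)
class NodeVersion:
    name: str
    # Height at which this version switches curves; None means it never does.
    switch_height: Optional[int]

    def requirement(self, height: int) -> int:
        if self.switch_height is not None and height >= self.switch_height:
            return new_requirement(height)
        return old_requirement(height)

    def is_full(self, staked: int, height: int) -> bool:
        # A node only joins this version's Service Node list once it is "full".
        return staked >= self.requirement(height)


# Hypothetical heights: the stale one was never moved with the hardfork.
STALE_SWITCH_HEIGHT = 5_000
HARDFORK_HEIGHT = 8_000

V2 = NodeVersion("2.0.x", None)
V300 = NodeVersion("3.0.0", STALE_SWITCH_HEIGHT)
V301 = NodeVersion("3.0.1", HARDFORK_HEIGHT)


def versions_accepting(staked: int, height: int) -> List[str]:
    """Names of the versions that would add this stake to their list."""
    return [v.name for v in (V2, V300, V301) if v.is_full(staked, height)]


if __name__ == "__main__":
    height = 6_000  # after the stale switch height, before the hardfork
    stake = old_requirement(height)  # exactly the minimum as 2.0.x sees it
    print(versions_accepting(stake, height))
```

In this toy model, a stake that exactly meets the old requirement between the two heights is accepted by the 2.0.x and 3.0.1 rules but rejected by the 3.0.0 rule, which is the kind of disagreement described above.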

The Fix

A new release was quickly created which brought the staking requirement curve back into line with the 2.0.x nodes until the actual hardfork date. This release, 3.0.1, was immediately distributed to prevent Service Nodes from being deregistered due to the divergence. However, we quickly realised that this alone would be insufficient: if a divergence had already occurred, the 3.0.1 nodes would still be using a database that contained the old staking requirement. In order to fix this, users were asked to utilise the loki-blockchain-import utility to force a recalculation of the Service Node list.

In parallel, a second release, 3.0.2, was quickly created to do this automatically for users. An additional dummy field was added in the Service Node list code. When users deployed 3.0.2 for the first time and the daemon rescanned its stored Service Node list on boot, this dummy field would not be present, and its absence would force the daemon to recalculate the Service Node list.
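The dummy-field approach used for 3.0.2 can be sketched roughly as follows. This is only an illustration of the general technique: the real daemon uses its own serialization format, whereas this sketch assumes a JSON blob, and the names `FORMAT_MARKER`, `load_service_node_list` and `rebuild_from_chain` are all hypothetical.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical name for the dummy field; state stored by older releases lacks it.
FORMAT_MARKER = "format_marker"


def load_service_node_list(
    cached_blob: str,
    rebuild_from_chain: Callable[[], Dict],
) -> Tuple[Dict, bool]:
    """Return (state, was_rebuilt).

    If the cached state was written by a release that did not know about
    the marker field, discard it and recalculate from the blockchain.
    """
    cached = json.loads(cached_blob)
    if FORMAT_MARKER in cached:
        # Written by a release that already had the marker: trust the cache.
        return cached, False

    # Marker missing: the cache may hold stale data, so rebuild and tag it.
    fresh = dict(rebuild_from_chain())
    fresh[FORMAT_MARKER] = True
    return fresh, True
```

The useful property of this pattern is that no action is needed from the operator: the first boot after upgrading sees a cache without the marker and rebuilds, and every later boot finds the marker and loads the cache as normal.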

Once the Service Node list was recalculated using 3.0.1 or 3.0.2, the node would be in line with 2.0.x nodes and would continue onto the hardfork height as planned.

The Outcome

The window between releasing the first fix and the time at which we expected a 2.0.4 node to first diverge from the 3.0.0 node’s staking requirement was a mere 6 hours. In that time, the vast majority of operators running 3.0.0 nodes upgraded. At the time of writing (10:30am AEDT 22 Mar 2019), the current version status of the Service Node network is:

3.0.2 [215], 3.0.1 [11], 2.0.4 [34], 2.0.3 [84], 2.0.2 [31], 2.0.1 [55], 2.0.0 [8], unknown [1]

3.0.0 deregistrations have been occurring slowly overnight, but all things considered, the number of operators who upgraded their nodes in time was truly impressive. Of the ~462 Service Nodes that were active before this event, only around 30 deregistrations occurred, some of which would have been routine. This accounts for approximately 6% of the Service Nodes. This was obviously not a good outcome for those operators, but overall the result is nothing short of incredible.

I am extremely happy that the community was so active in performing this upgrade, and would like to thank Jagerman and several other community members for participating in rolling out this fix. I’m also extremely proud to work with a team that can pull out all the stops when things go wrong and quickly and effectively deliver solutions, and communicate with the Loki user base to ensure everyone has the best possible experience.

Conclusion

Yesterday was a rather stressful day for most of the Loki team, and I’m sure it was for many Service Node operators, too. However, as it stands, the situation has now been resolved, with the last of the 3.0.0 nodes having been removed from the network. Considering the upgrade window was so short, I’m amazed by the speed at which operators were able to upgrade their nodes and keep their stakes alive.

We will be closely analysing what we can change on our end to prevent further incidents like this from occurring, and examining strategies we can implement to deal with situations like this in the future.

As per usual, you can find us on Discord, GitHub and Telegram if you have any thoughts, concerns, or ideas on this matter. Thanks for your patience and quick responses.

Simon Harman

Loki Project Lead