Pulse launch: What went wrong and what went right

Without doubt, the Loki Pulse upgrade is our biggest ever blockchain update. Pulse (our proof-of-stake transition) is the most visible feature in this release, but there are also a ton of other under-the-hood implementations:

We completely replaced the RPC layer
We added Lokinet LNS registrations
We cleaned up massive chunks of the code
We upgraded to (and have made good use of) C++17
We imported lots of changes from Monero (such as transitioning to smaller transactions)
We also made tons of smaller fixes in nearly every part of the code.

But of course, the biggest feature is Pulse, and it really is a big change!

Quick overview of Pulse

Pulse is our protocol to replace PoW (proof-of-work), where miners spent lots of CPU cycles computing RandomX (Loki variant) hashes to find a block with a hash difficult enough to be accepted by the network. (Generally this is done by updating a single “nonce” value in the block, which generates a completely different proof-of-work hash, and then checking whether this hash is acceptable to the network; the vast majority are not.) That mined block rewarded the miner that found a valid nonce, but also included a predefined reward that was paid to Service Nodes for the services they provide. While PoW has served us well in getting to this point, we now have a robust Service Node network that is more than capable of doing the job of securing the Loki blockchain.

Enter Pulse. Under Pulse we add more responsibilities to the Service Node network: they now take turns creating the block, and instead of spending CPU cycles to find a magic nonce value, they build a quorum of 11 validators who receive the block from the block creator, contribute a random value (for chain entropy), and sign off on this block. Once they have enough validator signatures (at least 7 of 11 are required), they distribute the signed block to the network. Network acceptance then depends on having enough valid signatures rather than a hash with enough proof of work.
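As a rough illustration of the acceptance rule (not the actual Loki code), here is a minimal Python sketch. The 11-validator quorum and 7-signature threshold come from the description above; everything else is an assumption made for the example, in particular the `toy_sign` helper, which uses an HMAC as a stand-in for a real public-key signature.

```python
import hashlib
import hmac

QUORUM_SIZE = 11     # validators per Pulse quorum (from the post)
MIN_SIGNATURES = 7   # signatures required for acceptance (from the post)


def toy_sign(secret: bytes, block_hash: bytes) -> bytes:
    """Stand-in for a real signature scheme: an HMAC, for illustration only."""
    return hmac.new(secret, block_hash, hashlib.sha256).digest()


def block_is_acceptable(block_hash: bytes, signatures: dict, quorum_secrets: list) -> bool:
    """Count signatures that verify against the validator at each claimed index.

    `signatures` maps a validator index (0..10) to that validator's signature.
    `quorum_secrets` is this node's own view of who holds each quorum position.
    """
    valid = 0
    for index, signature in signatures.items():
        if not 0 <= index < len(quorum_secrets):
            continue  # index outside the quorum: ignore it
        expected = toy_sign(quorum_secrets[index], block_hash)
        if hmac.compare_digest(signature, expected):
            valid += 1
    return valid >= MIN_SIGNATURES


block_hash = hashlib.sha256(b"example block").digest()
quorum = [bytes([i]) * 32 for i in range(QUORUM_SIZE)]
seven_signatures = {i: toy_sign(quorum[i], block_hash) for i in range(7)}

print(block_is_acceptable(block_hash, seven_signatures, quorum))        # True
print(block_is_acceptable(block_hash, seven_signatures, quorum[::-1]))  # False
```

The second call checks the same signatures against a different idea of who sits at each quorum position: almost nothing verifies, so the block is rejected.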

It’s a bit more involved than that in the details: for instance, the random values are revealed in multiple steps so that all participants have to generate and commit to a random value before they see the random values of anyone else. There is also a fair amount of complexity in how the quorums get chosen, how we select backup quorums in case a quorum fails to produce a block in time, and so on. If you love this sort of thing, there are a lot of details in the pull request descriptions for the Pulse implementations that you can read on GitHub.
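The commit-then-reveal idea can be sketched in a few lines of Python. This is a generic commit-reveal scheme under assumed choices (SHA-256 commitments, 32-byte values, hashing the reveals together), not the exact message flow or hashing that Pulse uses; the function names are hypothetical.

```python
import hashlib
import secrets


def commit(value: bytes) -> bytes:
    """Commitment published first: a hash that hides the random value."""
    return hashlib.sha256(value).digest()


def combine_reveals(commitments: list, reveals: list) -> bytes:
    """Check every reveal against its earlier commitment, then mix them together."""
    if len(commitments) != len(reveals):
        raise ValueError("need exactly one reveal per commitment")
    mixer = hashlib.sha256()
    for commitment, value in zip(commitments, reveals):
        if commit(value) != commitment:
            raise ValueError("reveal does not match its commitment")
        mixer.update(value)
    return mixer.digest()


# Step 1: each of the 11 validators picks a random value and publishes only its hash.
values = [secrets.token_bytes(32) for _ in range(11)]
commitments = [commit(value) for value in values]

# Step 2: once all commitments are in, the values are revealed and combined.
entropy = combine_reveals(commitments, values)
print(entropy.hex())
```

Because every validator is bound to its value before seeing anyone else's, nobody can pick a value after the fact to steer the combined result.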

Pulse launch

Aside from being a big release, this is also our most tested release: we created a “devnet” network for early, rapid development long before we merged Pulse onto our regular testnet, and then tested on the testnet for several weeks to nail down all the weird corner cases that can come up in a transition like this. We have “fakechain” test code that creates a new network from scratch, registers Service Nodes, and generates Pulse blocks. We were pretty confident that it worked well, given what we’d thrown at it over the past months of development.

We rolled out the update, and nervously waited for the network upgrade block (block 641111) to appear, COVID-style virtual champagne emojis at the ready.

2 minutes go by. No block.

3 minutes. 5 minutes. 10 minutes. No block.

At this point the conclusion was inevitable: something went wrong. Over the next hour we analyzed nodes, debug logs, etc. to figure out where things went wrong. What we figured out is that network nodes were disagreeing about which nodes were supposed to be in each of those Pulse quorums. This meant that a Service Node would produce a signature and try sending it to its quorum, saying “I am Pulse validator #7, here is my signature.” Receiving Service Nodes would completely reject it: the signature wouldn’t validate because they would have a completely different idea of who validator #7 is supposed to be — and if the network can’t agree on who is supposed to sign the blocks, then Pulse doesn’t work at all.

Some more investigation led us to the cause. One of the implementation details of Pulse is that we pseudorandomly select validators using randomness from past blocks on the chain, so that everyone agrees (or at least, is supposed to agree) on which validators get selected. However, we also don’t want the same validator to get used repeatedly, so that it is more difficult for any single Service Node to have a meaningful influence on the chain. Thus we sort everyone according to when they last participated in a Pulse quorum, and select random members from the first half of this list. As long as everyone agrees on the order of previous validators, this all works out: everyone ends up selecting exactly the same 11 random Pulse quorum validators, and thus they know which Service Node public keys they need to validate signatures against.
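A minimal Python sketch of this kind of selection follows. It is an illustration under stated assumptions rather than the real implementation: the list is sorted with the least recently used nodes first, ties are broken by public key, the seed is a SHA-256 hash of some chain-derived bytes, and Python's `random.Random` stands in for whatever deterministic generator the real code uses.

```python
import hashlib
import random

QUORUM_SIZE = 11  # validators per quorum (from the post)


def select_quorum(nodes: list, last_participated: dict, chain_entropy: bytes) -> list:
    """Pick validators deterministically from the least recently used half."""
    # Least recently used first; the public key breaks ties so the order is stable.
    ordered = sorted(nodes, key=lambda node: (last_participated.get(node, 0), node))
    # Assumes at least 22 nodes, so that half the list still holds 11 candidates.
    candidates = ordered[: len(ordered) // 2]
    # Everyone derives the same seed from the same past blocks.
    seed = int.from_bytes(hashlib.sha256(chain_entropy).digest(), "big")
    return random.Random(seed).sample(candidates, QUORUM_SIZE)


nodes = [f"sn{i:02d}" for i in range(40)]               # hypothetical public keys
last_participated = {node: height for height, node in enumerate(nodes)}
quorum = select_quorum(nodes, last_participated, b"hashes of past blocks")
print(quorum)
```

Two nodes that feed in the same `last_participated` values and the same chain entropy get the same quorum; change either input on one of them and the results diverge.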

Once we figured out that the problem was here, we nailed it down to a more specific issue: that big sorted list of nodes (from which we select the quorum) was sorting differently on different nodes, because network nodes had different ideas of when other nodes had last participated. Eventually this led us to the bug: when a node upgraded from 7.1.x to 8.1.0, we would assign the last-participated time for peer nodes to the last time that node was rewarded, which seemed reasonable. But that broke quite badly because we only ever update that last-participated value starting at the Pulse fork block, so the initial values depended on when each node happened to upgrade. The end result is that a node upgraded at block 640000 and a node upgraded at block 640100 would end up with completely different “last participated” values, and then when we hit the first Pulse block, they would completely disagree about who should go next.

Our fix was simple enough: when upgrading to 8.1.1, every peer’s “last participated” value got set to 0 so that everyone could agree on the sorted list again.
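To make the failure mode concrete, here is a small Python sketch. The node names and block heights are invented for illustration, and the two `migrate_*` functions are hypothetical stand-ins for the upgrade logic described above, not the actual code.

```python
def sort_order(last_participated: dict) -> list:
    """Order nodes by last participation, with the name as a stable tie-breaker."""
    return sorted(last_participated, key=lambda node: (last_participated[node], node))


def migrate_buggy(last_rewarded: dict) -> dict:
    """8.1.0-style: seed "last participated" from the reward heights seen at upgrade time."""
    return dict(last_rewarded)


def migrate_fixed(last_rewarded: dict) -> dict:
    """8.1.1-style: reset every peer to 0 so all nodes start from the same state."""
    return {node: 0 for node in last_rewarded}


# What two nodes see as "last rewarded" when they upgrade 100 blocks apart.
# Node A was rewarded again in between, so the second view differs.
view_at_640000 = {"A": 639990, "B": 639995, "C": 639999}
view_at_640100 = {"A": 640050, "B": 639995, "C": 639999}

print(sort_order(migrate_buggy(view_at_640000)))  # ['A', 'B', 'C']
print(sort_order(migrate_buggy(view_at_640100)))  # ['B', 'C', 'A']
print(sort_order(migrate_fixed(view_at_640000)))  # ['A', 'B', 'C']
print(sort_order(migrate_fixed(view_at_640100)))  # ['A', 'B', 'C']
```

If both nodes had upgraded at the same height, as our test networks did, the buggy migration would still have produced matching orders, which is why the problem stayed hidden.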

So why didn’t we catch this bug earlier on testnet/devnet/test code? Well, it turns out that the bug was certainly there, but because our testing apparatus upgrades all our nodes at the same time (automatically, as soon as we push new commits to the dev branch), it didn’t have the opportunity to occur: they all had a “wrong” sort order, but they also all agreed on the sort order.

The 8.1.1 release resolved the main issue, but by then we had hit a built-in emergency fallback. The network tries to generate Pulse blocks, and keeps trying once per minute, cycling through different backup quorums until some quorum manages to create a block with the required signatures. If this goes through 254 failed rounds, the network gives up and waits for a proof-of-work block to be mined, restarting Pulse quorum attempts after the next block. This was designed to let us fix a problem in a scenario exactly like this one: pause the chain and get out an emergency fix to get it going again.
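The fallback rule can be sketched as a small Python function. The one-minute round length and the 254-round limit are from the description above; the function name, the return format, and the exact boundary (whether the switch happens at round 254 or just after it) are assumptions for the example.

```python
ROUND_SECONDS = 60        # one quorum attempt per minute (from the post)
MAX_FAILED_ROUNDS = 254   # failed rounds before falling back to mining (from the post)


def next_block_source(seconds_overdue: int) -> tuple:
    """Return ("pulse", round_number) or ("mined", None) for an overdue block."""
    round_number = max(0, seconds_overdue) // ROUND_SECONDS
    if round_number >= MAX_FAILED_ROUNDS:
        return ("mined", None)  # Pulse has given up; wait for a proof-of-work block
    return ("pulse", round_number)


print(next_block_source(0))            # ('pulse', 0)
print(next_block_source(67 * 60 + 5))  # ('pulse', 67)
print(next_block_source(254 * 60))     # ('mined', None)
```

At one round per minute, 254 failed rounds corresponds to 15,240 seconds, or 4 hours and 14 minutes without a block.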

This actually exposed a second problem, given that the stall happened at the very beginning of the Pulse chain: Pulse blocks are all considered to have an artificial difficulty value of 1 million, as if they were on a network with a paltry 8.3 kH/s of hashrate, allowing a mined block to be found within a few minutes by a single desktop-class machine. However, in this case, all the past blocks that are used to determine the difficulty were from a network with a hashrate of several MH/s, so the fallback block would have inherited a far higher difficulty. Thus 8.1.1 has a second fix to ignore all the pre-Pulse blocks when determining difficulty and go directly with the 1M difficulty.
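The arithmetic behind those figures is short enough to check in Python. It assumes the usual relationship that difficulty is roughly the expected number of hashes per block, so hashrate is about difficulty divided by the 120-second block time. The desktop hashrate and the pre-Pulse network hashrate below are hypothetical round numbers, not measurements.

```python
TARGET_BLOCK_SECONDS = 120    # 2-minute block target (from the post)
PULSE_DIFFICULTY = 1_000_000  # artificial difficulty of Pulse blocks (from the post)


def implied_hashrate(difficulty: float, block_seconds: float = TARGET_BLOCK_SECONDS) -> float:
    """Network hashrate (H/s) that would find one block per target interval."""
    return difficulty / block_seconds


def expected_mining_seconds(difficulty: float, hashrate: float) -> float:
    """Average time for a miner with the given hashrate to find a block."""
    return difficulty / hashrate


print(round(implied_hashrate(PULSE_DIFFICULTY)))  # 8333, i.e. about 8.3 kH/s

desktop_hashrate = 5_000  # hypothetical desktop machine, in H/s
print(expected_mining_seconds(PULSE_DIFFICULTY, desktop_hashrate))  # 200.0 seconds

# Hypothetical pre-Pulse network at 3 MH/s: the difficulty is 360 times higher.
pre_pulse_difficulty = 3_000_000 * TARGET_BLOCK_SECONDS
print(expected_mining_seconds(pre_pulse_difficulty, desktop_hashrate) / 3600)  # 20.0 hours
```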

One last change we put in was to disable Service Node testing for a week, since 8.1.0 nodes would be unable to sync the chain once it started again. As a consequence, a few nodes (around 7% at the time of writing) are still on 8.1.0 or 7.1.x; while they would normally have been decommissioned and deregistered by now, they are kept alive by this extra margin to give them time to upgrade.

With the fixes applied, we deliberately broke the testnet in the same way, with different “last participated” orderings for peers, and tested its ability to recover. It did, and so we rolled out the fixes to the community and asked people to update, and meanwhile mined the 641111 block to get the network rolling again: by the time we got a fix deployed, we’d gone past the 254-round stall point.

So what happens now?

Pulse is designed to be fairly robust against nodes that are down. We don’t want the network to grind to a halt because someone decided to install an update or reboot their Service Node at the wrong time. Thus we use backup quorums, as I described above, so that even if a big chunk of the network isn’t available, the backup quorums randomly select different nodes until we find one that works. We call these “Pulse rounds.” Round 0 is the normal case: the next SN reward winner in the queue gets to produce the block (and earns any tx fees on top of the Service Node block reward!), with 11 random validators signing it. If that quorum fails to produce a block within 60 seconds of when it was supposed to, round 1 begins by selecting a new set of 11 random validators and a random block producer. This backup producer still has to pay the queued SN reward winner its base reward, but gets to reward itself with any earned tx fees, splitting the fees among the operator and contributors.
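The difference in payouts between round 0 and a backup round can be sketched as follows. This is a simplified Python illustration: the function name and the amounts are hypothetical, amounts are plain integers, and the further split of each payout between a node's operator and contributors is left out.

```python
def block_payouts(round_number: int, queued_winner: str, producer: str,
                  sn_reward: int, tx_fees: int) -> dict:
    """Return who receives what for a Pulse block produced in the given round."""
    # The queued reward winner always receives the base Service Node reward.
    payouts = {queued_winner: sn_reward}
    # Fees go to whoever actually produced the block: the queued winner in
    # round 0, the randomly chosen backup producer in later rounds.
    fee_recipient = queued_winner if round_number == 0 else producer
    payouts[fee_recipient] = payouts.get(fee_recipient, 0) + tx_fees
    return payouts


print(block_payouts(0, "snA", "snA", sn_reward=1000, tx_fees=50))  # {'snA': 1050}
print(block_payouts(1, "snA", "snB", sn_reward=1000, tx_fees=50))  # {'snA': 1000, 'snB': 50}
```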

You can see this happening in the first few blocks after 641112 (which is the first actual Pulse block; 641111 was supposed to be, but ended up as a fallback mined block): https://lokiblocks.com/range/641111/641121

The network took quite a while after that to find block 641112, which it did at round 67 (you can see this on the block details page on the block explorer: “Block type: Pulse 💓 | 67” is indicating the Pulse round) because we only had Loki Foundation and team members’ nodes updated at this point. As community members started updating, Pulse blocks started coming faster, coming in at rounds 12, 0, 7, 1, 0 over the next five blocks.

Looking at recent blocks with 93% of the network on the 8.1.1 release, the vast majority are now “round 0” blocks with an occasional round 1 block.

We also built Pulse to be able to recover from slow blocks or long delays (as we had here): rather than producing blocks at the regular 2-minute interval, if the chain is behind schedule it speeds up, starting each new block’s Pulse quorum after 90 seconds rather than 120 seconds. You can see this as well in recent blocks on the explorer, with most being about 90s apart. (If you hover your mouse over the “Next Pulse: …” indicator at the top of the main block explorer page, you can see how far behind target we are.) For example, as of writing this post, here are the top 20 blocks:

https://lokiblocks.com/range/642775/642794

Most of these are slightly over 90s apart, though some are still a bit longer: if one or more validators that get selected are 8.1.0 (or 7.x) nodes, it adds about 15-20 seconds before the quorum gives up on getting a response from those nodes and goes ahead without them. Some are longer still (such as block 642789), which generally means a backup quorum had to get involved to make a block.
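The catch-up behaviour can be put into numbers with a short Python sketch. The 120-second and 90-second intervals are from the description above; the function names are hypothetical, and the estimate assumes the idealized case where every catch-up block arrives exactly 90 seconds after the previous one and any delay at all triggers the faster interval.

```python
TARGET_INTERVAL = 120    # normal seconds between blocks (from the post)
CATCH_UP_INTERVAL = 90   # interval used while the chain is behind (from the post)


def next_block_interval(seconds_behind: int) -> int:
    """Seconds to wait before the next block, given how far behind target we are."""
    return CATCH_UP_INTERVAL if seconds_behind > 0 else TARGET_INTERVAL


def blocks_until_caught_up(seconds_behind: int) -> int:
    """Number of 90-second blocks needed to erase the delay (30 s gained per block)."""
    gained_per_block = TARGET_INTERVAL - CATCH_UP_INTERVAL
    return -(-max(0, seconds_behind) // gained_per_block)  # ceiling division


behind = 3600  # hypothetical: the chain is one hour behind target
blocks = blocks_until_caught_up(behind)
print(blocks)                              # 120
print(blocks * CATCH_UP_INTERVAL / 3600)   # 3.0 hours of 90-second blocks
```

In practice blocks that wait on unresponsive validators or need a backup quorum take longer than 90 seconds, so the real catch-up time is somewhat longer than this estimate.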

Since our fixes, things are moving forward, the blocks are rolling (or perhaps you could say, “pulsing” 😜) and Pulse is working exactly as intended, despite the network still being slightly degraded. By next week, testing will turn on again, we’ll kick off the last non-upgraded nodes, and we’ll catch up to the target height and end up with the smooth, consistent 2-minute blocks that Pulse is designed to produce.

So all-in-all, one tiny bug completely ground things to a halt despite all of the testing and effort that went into this release, but with it fixed, things are looking great. As much as I wish our initial 8.1.0 release had done the job, I’m pretty pleased with how things are working since fixing that (tiny, tiny and yet huge) bug.

— Jason