Why Hasn’t The Consensus Problem Been Solved Yet?
We often get asked, “Why is this still a problem?” Why doesn’t consensus-based state management work well without Cachai? And if it’s such an important piece of distributed systems then why isn’t this problem solved already?
Here’s why: in order to perform well, this small but impactful component of distributed systems requires a unique set of network characteristics - characteristics that are very different from what’s needed by everything else happening on that network.
Protocols and Environments
It’s not hard to wrap our minds around the idea that different applications are designed to run on different protocols. For example, think about the difference between the internet data movement patterns you’d want for video streaming versus for sending an important document. For video streaming, we happily sacrifice a bit of detail in exchange for uninterrupted viewing. In technical terms, continuous applications that can tolerate some loss might communicate over UDP, which skips delivery guarantees in exchange for lower overhead and lower latency.
But that protocol would not be as appropriate for transmitting a document. When sending a document, we really want accuracy, and we’re happy for the network to take a few extra seconds to make sure all the information shows up in the right order. When downloading a file, we might sacrifice some speed and take advantage of TCP’s ability to make sure packets get delivered, and delivered in order.
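To make the contrast concrete, here is a minimal sketch using Python’s standard socket module on the local loopback interface. The function names udp_roundtrip and tcp_roundtrip are our own illustrative names, and loopback hides the loss and reordering a real network would show, so treat this as a picture of the two programming models rather than a benchmark.

```python
import socket


def udp_roundtrip(payload: bytes) -> bytes:
    """Send one datagram over loopback UDP: no handshake, no delivery guarantee."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        receiver.settimeout(2.0)         # on a real network the datagram may never arrive
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(65535)
        return data
    finally:
        sender.close()
        receiver.close()


def tcp_roundtrip(payload: bytes) -> bytes:
    """Send bytes over loopback TCP: connection setup first, then ordered, acknowledged delivery."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=2.0)
    conn, _addr = listener.accept()
    try:
        client.sendall(payload)
        client.shutdown(socket.SHUT_WR)  # signal end of data so the reader sees EOF
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        conn.close()
        client.close()
        listener.close()
```

Notice that the UDP path needs a timeout because nothing in the protocol promises the data will arrive, while the TCP path pays for a connection handshake before a single byte of the document moves.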
Most of the data on networks belongs to applications that are happy with a common set of characteristics that have become standard in data centers and over the internet. These include large packet sizes, the option to drop packets whenever that’s convenient, and a shared, somewhat chaotic pooling of resources in which each packet finds its own way to its destination, where the packets are reassembled into meaningful content. You can think of this like a trading floor in a giant convention center, in which there are many paths from booth to booth and everyone is carrying packages with hand trucks. These characteristics are highly efficient when large amounts of information need to be sent from node to node in the network, which is the case for most applications.
But, in order to scale collaborative distributed systems over larger networks, a fast and reliable consensus process needs to keep track of the state of the system. Unfortunately, while consensus can execute in the environment described above, it won’t be easy to scale or secure because that environment lacks the network and protocol characteristics it needs.
The Care and Keeping of Consensus
Consensus is essentially an ongoing conversation about reality among three, five, or another odd number of nodes. These nodes are constantly sending tiny packets to each other, sharing their status and voting on whether they agree about what they’re seeing. They act as a gatekeeper, and potentially a bottleneck, for the applications that depend on them, so the most important thing is for them to stay in constant contact and maintain that mutual trust and agreement as reality unfolds. If one member stops responding, the rest have to cut that member loose and find a replacement. Someone maliciously impersonating a member could disagree endlessly to sabotage the process, or even introduce untrue information. If some members of the group lose track of the others, that’s a big problem because now there are two realities… Which one is correct?
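The usual answer to “which one is correct?” is a majority rule: only a group that can still reach more than half of the members is allowed to keep making decisions. Here is a minimal sketch of that arithmetic. The function names are our own, and real protocols layer leader election and log replication on top of this rule.

```python
from typing import List


def quorum_size(cluster_size: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many members can be lost while a majority can still be formed."""
    return cluster_size - quorum_size(cluster_size)


def sides_with_quorum(cluster_size: int, partition_sizes: List[int]) -> List[bool]:
    """For each side of a network split, report whether it may keep deciding."""
    needed = quorum_size(cluster_size)
    return [size >= needed for size in partition_sizes]


# A five-member cluster split 3/2: only the side with three members continues.
print(sides_with_quorum(5, [3, 2]))
# A four-member cluster split 2/2: neither side has a majority, so nothing proceeds.
print(sides_with_quorum(4, [2, 2]))
```

This is also why odd numbers are popular. Going from three members to four raises the quorum from two to three without letting the cluster survive any additional failures, and it opens the door to an even split in which nobody can decide anything.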
Can you imagine what it must be like to have that job in a crowded, noisy conference hall surrounded by people running in every direction rolling stacks of heavy boxes? If the architect is savvy, at least the consensus nodes ought to be physically near each other - but that’s not necessarily a given. This works okay for hobbyists designing Kubernetes and Kafka clusters in their home labs, but an enterprise deployment might need nine rather than three members. It might need more frequent and granular updates. It might need those members to cover a much larger area. At some point, metaphorically speaking, the frequency with which these consensus members trip and fall, lose track of each other, and can’t be heard over the noise starts to become a real problem.
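A little back-of-the-envelope arithmetic shows why the crowd gets harder to shout over as the group grows. The sketch below assumes every member sends a heartbeat directly to every other member at a fixed interval. That all-to-all pattern and the interval values are our own simplifying assumptions, and leader-based protocols send fewer messages than this.

```python
def pairwise_links(members: int) -> int:
    """Number of distinct member-to-member relationships to keep healthy."""
    return members * (members - 1) // 2


def heartbeats_per_second(members: int, interval_ms: float) -> float:
    """Total heartbeat messages per second if every member pings every other member."""
    directed_pairs = members * (members - 1)  # A->B and B->A are separate messages
    return directed_pairs * (1000.0 / interval_ms)


for members, interval_ms in [(3, 100), (9, 100), (9, 50)]:
    print(
        members,
        "members:",
        pairwise_links(members),
        "links,",
        heartbeats_per_second(members, interval_ms),
        "heartbeats/s at",
        interval_ms,
        "ms",
    )
```

Under these assumptions, tripling the membership from three to nine multiplies the number of relationships by twelve, and halving the update interval doubles the traffic again. Every one of those tiny messages is competing with the hand trucks on the convention floor.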
Now, you can perhaps imagine what kind of network environment would be optimal for a consensus cluster. They should be easily accessible to each other, but separated from everyone else - perhaps in a secure skybox with a view of the convention floor. We may want them to be physically separated into three different skyboxes, for redundancy’s sake, but we give them a direct audiovisual line to each other, and a well-understood shorthand language for agreeing on what they see. With no interruptions and no opportunity for deception or sabotage, you can imagine that team working like a well-oiled machine and taking on greater responsibilities with ease.
Why Don’t People Do This Already?
Distributed systems used to be much less important than they are now, so until recently consensus simply didn’t matter much, and both infrastructure and training reflect that. The world of IT and networking infrastructure in the 1990s and early 2000s could not have reasonably predicted what would happen: that Google, Amazon, Facebook, and Microsoft would make cloud and other distributed systems such a core part of our everyday life, and such an important competitive advantage for the products and business foundations of just about every industry.
If IT leaders had somehow predicted our reliance on distributed systems, they would have included in every data center a small enclave dedicated to consensus. But, that simply wasn’t necessary until very recently. Now, the easiest way to do that is to install Cachai.
As distributed systems grew into the foundation of our daily lives, the expertise required to tune and configure consensus clusters into the kind of environment they thrive in remained quite niche. Consensus does run well enough on ordinary networks for folks to get started, and cloud-based managed services are available when the going gets tough. So, even IT professionals with lots of experience running distributed systems like Kubernetes and Kafka have usually never learned much about the state management and consensus processes underlying them.
As enterprises scale their distributed systems, they find that relying on the cloud provider’s managed distributed system services grows unreasonably expensive. But to build and maintain their own infrastructure, they would need to hire (and retain) some of those scarce engineers who know how to build enterprise distributed systems from scratch, consensus enclave and all. So, “why don’t people do this already?” They do, just very slowly, imperfectly, and one enterprise system at a time. There are not enough talented engineers to keep up with the demand for enterprise-grade distributed systems.
That is one major problem Cachai solves, by automating best practices and elite expertise from across the fields of enterprise distributed systems, network architecture, Web3 security, and software engineering. By adding Cachai to the network, an enterprise can empower their existing IT team to easily build and run resilient, secure, performant distributed systems independent of any particular cloud.