This is the third post in our blog series about the design, implementation and usage of caching in datacenters.
The most important design decision we adopted in building the server is to separate performance-sensitive processing from the rest, and separate different types of performance-sensitive processing from each other. This is key to ensure deterministic runtime behavior facing a wide range of environments and workloads. As Dean et al have pointed out in the Tail at Scale, performance variability is amplified by scale, and the key to reduce (meaningful) variability is differentiation.
While operating Twemcache and Redis in production, we have seen:
- logging to disk impeding request processing on a particular thread;
- a flood of new connections, common after some network “blip”, virtually prevent a server from processing any requests for extended period of time;
- probing a server for either health or stats becomes unreliable when it is heavily loaded.
Such unpredictable behavior has real consequences in production:
- we cannot enable detailed logging by default due to performance concerns;
- network glitches can bring down a server unnecessarily, even when the network failure itself is rather brief and recoverable;
- health of a server can be misjudged, resulting in the restart of an loaded server, leading to even more load when the server comes back or additional load elsewhere.
Most of such uncertainties can be avoided, if we carefully separate operations that may interfere with each other undesirably. And of course, we are not the first to apply this principle. Look no further than right under our feet for an example – computer networks, being one of the earliest and most important type of large-scale distributed systems, have given this problem plenty of thoughts.
Data Plane, Control Plane
Networking provides the substructure to most distributed systems. To maximize
throughput, minimize latencies and jitter, networking technologies in recent
decades adhere to a divide between control plane and data plane1. Data
plane is in charge of actually forwarding individual packets, and its
performance is directly measurable by the end users, such as when you send a
ping and wait for the response. Control plane deals with uncommon events,
such as a stray packet that cannot be routed, or recomputing routing table upon
topology changes. Unsurprisingly, a trip through the data plane is called the
“fast path” while landing on the control plane puts a packet through the “slow
path”. Data plane optimizes for latency and throughput, often implementing a
relatively fixed pipeline using ASIC or simple, fast processors. Control plane
emphasizes more on flexibility, and is often equipped with general-purpose
processor(s) capable of running software that can be easily updated. There is
often a several orders of magnitude difference between these two planes in terms
The computer networking community demonstrated that by recognizing difference in priorities for different parts of the system, they can make packet processing fast while keeping state of the system well-managed and flexible. We take that lesson and apply the same analogy to high-performance caching systems.
Processing Pipelines in a Cache Server
The core idea of the networking model is minimizing work and interference on performance-critical paths, and allow processing to flow unobstructed. The first step in applying such a strategy is to recognize all the possible processing pipelines in a cache server, and label the performance critical ones.
The most common case in a cache server is the
request→response pipeline. A
request is sent over an established channel such as a TCP connection, and the
server processes the request and sends back a response, usually over the same
channel. This accounts for the vast majority of the load under normal
operations, and is definitely performance-sensitive.
Since a cache server in a datacenter almost always runs in a different process,
and usually on a different host from the clients, communication channels need to
be established before handling requests. This pipeline has less stringent
requirements on throughput and latency, since most channels are kept around
through many requests-responses cycles, so the cost is amortized. Still, it adds
to the perceived latency of the first request, and concurrency can rise
unexpectedly through synchronized client behavior after deploy and network
glitches. This pipeline is therefore still performance-sensitive, but should
give precedence to the
Monitoring and Administration
To use a (cache) server with any seriousness requires some amount of administrative capabilities, which include but are not limited by: querying server metadata, monitoring its health and condition, logging and rotating logs, updating certain configuration options without restarting. These are important features but quite tolerant performance-wise. For example, having debug logs synced to disk every second instead of every 100 millisecond is unlikely to be noticeable by debuggers or operators; having metrics exported with a 200 millisecond delay instead of 20 makes little difference in monitoring quality.
Because the velocity of these functionalities are much lower and their latency requirements lenient, they usually can be processed using time-shared resources. The only exception would be background tasks that generate sufficient load but are otherwise latency insensitive, such as snapshot/backup.
Avoid Things That Could Be Slow
To keep slow things out of fast paths, we need to identify what those things are. Some are more obvious– for example, one quickly learns not to use blocking sockets in a single-threaded server. Other operations are better at hiding their performance woes.
Many operations are usually fast but not always so, these operations are much harder to avoid upfront because tests or even benchmarks may give the illusion that everything is running as smoothly as needed. Even when running in production, the server may behave mostly fine, but showing occasional slow-downs that can be quite “mysterious” and hard to reproduce. Such performance issues tend to become bigger and more constant headaches as one scales up their operations, but remains hard to debug.
We certainly went through some of these headaches over the years, as Twitter cache went from running a few dozen to tens of thousands of instances. Each time significant time and resources were put into tracking down the root cause. Each time the debugging process was both fascinating and demanding. Unfortunately, not every problem had a simple fix within the existing architecture, and that was an important motivation for us to redo the design through Pelikan.
Because many such operations are deep in the heart of service implementation, I believe it is worthwhile to call out these subtleties. Hopefully, future developers will be aware of them upfront and avoid the same pitfalls.
Write to file
File I/O these days are almost always buffered, unless one explicitly calls
fsync. As a result, a call to
write (and its siblings) almost always returns
immediately, since all that the kernel does is moving some bytes in memory.
Furthermore, if the file descriptor corresponds to a non-blocking I/O object,
such as a socket, the call always returns immediately, deepening the
write is fast.
However, the latency of
write is not guaranteed. If the buffer is full and the
file is backed by disk,
write can implicitly trigger sync to flush data from
buffer(s) to disk while users have no direct control over this mechanism. At
Twitter, we only realized this was a problem for Twemcache after observing
sporadic spikes in tail latencies on some of our busiest cache servers. Since
disk activities are not logged by default, we had to sift through a much wider
range of application logs to find correlation. Eventually a pattern emerged,
where we notice latency spikes were observed only when certain I/O-intensive
application were activated through cron. Still, presumably Twemcache stored
everything in memory and swap was disabled on these hosts, so when would we ever
go to disk? The culprit turned out to be a small change in the rather innocent
looking log utility, a standard part of most production services. Shortly before
the symptoms appeared, we had increased our log level slightly to study
connection activities. After the incident, we had to turn the log level back
down to avoid performance hiccups in cache and further upstream.
Most developers are aware of the effect of lock contention on performance, but
not necessarily the extent of it. Again, this is because contention is low
most of the time, where performance is largely predictable. However, tail
latencies tend to spin out of control when a server using locking is
especially when the locking mechanism also leads to heavy context switching,
such as those using
When we looked at the impact of locking to performance in Twemcache2, it was
evident that it hinders scalability and tail latencies dramatically.
One thing that is somewhat unique to cache server is how little work the server needs to do to fulfill a request. And most of the heavy-lifting is done by syscalls – when we profiled Twemcache2, we noticed almost 80% of CPU time went to syscalls and is spent in kernel space.
Without pipelining, a simple read request over a socket involves the following steps:
- an event syscall to notify data arrival on the socket (cost of this call is amortized over all the events returned at the same time);
- a syscall to read from the socket;
- processing request in user space;
- a syscall to write response to the socket.
That amounts to 2 to 3 syscalls plus application logic processing the request, which often is a simple hash lookup followed by by probing a small memory region. If requests are pipelined, the cost of read / write can be further amortized over all multiple requests.
In comparison, connection establishment is more syscall-intensive:
- it also starts with an event syscall that returns activity on the listening socket (cost of this call is amortized over concurrent connection requests);
- a syscall to accept the connection;
- one or more syscalls to set socket as nonblocking unless a more efficient API
accept4is available, other attributes such as
keepalivestill need to be set separately;
- another syscall to add the socket to the right event loop, and if the right event loop runs on a different thread, inter-thread communication is required.
That amounts to at least 2 syscalls per connection, but often quite a few more. It cannot resort to pipelining to use syscalls more economically.
The effect of relatively expensive connections establishment is that when hosting a large number of clients, the clients can easily DDoS the server by synchronizing their connecting attempts (the TCP handshake also puts a lot of pressure inside the kernel stack, which we will not go into here). The situation is greatly exacerbated if the same thread is responsible for both connection establishment and request handling.
A Performance-Oriented Architecture
One thread per processing pipeline
The first and most important decision in such an architecture is to assign
functionalities to either data plane or control plane– request-response and
connection establishment belong to data plane and need to be fast, while the
rest should go to control plane. Furthermore, we give each major processing
pipeline its own thread. For a simple in-memory cache implementation like
pelikan_slimcache, we use three threads:
- Worker thread: worker thread handles all latency-sensitive data requests,
set, but is not responsible for those related to administrative tasks, such as
stats. Worker thread is also off the hook from accepting connections, but still needs to register connections for event notifications.
- Server thread: server thread listens on the advertised data port and accepts (or rejects, when necessary) incoming connection requests. It should be mostly idling when connections are stable and reused, but can handle big spikes of new connection requests.
- Admin thread: admin thread does all the housekeeping: it listens on a separate control plane port (“admin port”) to avoid mixing data plane traffic with control plane, accepts connections, answers requests regarding service status, and periodically aggregates metrics, flushes logs, etc.
Performance-sensitive threads should not block
Knowing what operations could be slow, the next thing is to make sure we do not invoke them inside processing pipelines that are performance-sensitive:
- No explicit or implicit use of syscalls that may block: worker and server
thread should not use logging implementations that implicitly calls
write. Instead, an in-memory buffer is used for writes and admin thread is responsible for reading from the buffer and writing its content to persistent storage. Memory allocation should also be used judiciously, as
malloccould have unbounded worst-case latencies.
- Minimal communication between threads with lightweight synchronization:
core data structures should have a clear owner thread for each operation
type (read, write), and avoid synchronization as much as possible. When
communication is necessary, such as connection handover between server/worker
threads, prefer asynchronous data structures and mechanisms such as pipes and
futex-based primitives should be used sparingly, in favor of lighter weight alternatives, such as atomic instructions. For example, metric operations can be entirely carried out with atomic instructions, so worker thread can update them and admin thread can read them without locking.
We are going to talk about Pelikan’s memory management strategy, another core design decision, in the next post.