A Revisit of Disruptor
Costs
The Cost of Locks
The Costs of CAS
Memory Barriers and Cache Coherency
Store Buffer
Invalidate Queue
Cache Lines
Locks are incredibly expensive because they require arbitration when contended. This arbitration is achieved by a context switch to the operating system kernel, which suspends threads waiting on a lock until it is released.
During such a context switch, control passes to the operating system, which may decide to perform other housekeeping tasks while it has it, and the execution context can lose previously cached data and instructions. This has a serious performance impact on modern processors.
Fast user-mode locks can be employed, but these are of real benefit only when uncontended.
A more efficient alternative to the use of locks can be employed for updating memory when the target of the update is a single word.
A CAS operation is a special machine-code instruction that allows a word in memory to be conditionally set as an atomic operation.
If the critical section of the program is more complex than a simple increment of a counter it may take a complex state machine using multiple CAS operations to orchestrate the contention.
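As a minimal sketch of the CAS approach (class and method names here are illustrative, not from the Disruptor itself), a lock-free counter increment looks like this in Java:

```java
import java.util.concurrent.atomic.AtomicLong;

// A lock-free counter using CAS. AtomicLong.compareAndSet maps to a
// single machine CAS instruction (e.g. LOCK CMPXCHG on x86).
public class CasCounter {
    private final AtomicLong value = new AtomicLong();

    public long increment() {
        long current, next;
        do {
            current = value.get();  // read the current word
            next = current + 1;     // compute the new value
            // retry if another thread changed the word in between
        } while (!value.compareAndSet(current, next));
        return next;
    }

    public long get() {
        return value.get();
    }
}
```

Under contention each failed CAS forces a retry loop, which is exactly the cost that grows into a "complex state machine" when the critical section is more than a simple increment.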
The ideal algorithm would be one with only a single thread owning all writes to a single resource with other threads reading the results. To read the results in a multi-processor environment requires memory barriers to make the changes visible to threads running on other processors.
Modern processors perform out-of-order execution of instructions and out-of-order loads and stores of data between memory and execution units for performance reasons.
The processors need only guarantee that program logic produces the same results regardless of execution order.
A read memory barrier orders load instructions on the CPU that executes it by marking a point in the invalidate queue for changes coming into its cache. This gives it a consistent view of the world for write operations ordered before the read barrier.
A write barrier orders store instructions on the CPU that executes it by marking a point in the store buffer, thus flushing writes out via its cache. This barrier gives an ordered view to the world of what store operations happen before the write barrier.
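In Java, these barriers surface through `volatile` semantics: a volatile store acts as a write barrier (draining the store buffer before the publish) and a volatile load acts as a read barrier (processing the invalidate queue before the load). A simplified publish/read sketch (this is an illustrative assumption of how one might use the idiom, not code from the Disruptor):

```java
// Sketch: publishing data from a producer to consumers using a
// volatile sequence field as the ordering point.
public class Publisher {
    private long payload;            // plain field: stores may be reordered
    private volatile long sequence;  // volatile: hardware/software barrier point

    // Producer: write the data first, then publish the sequence.
    public void publish(long seq, long data) {
        payload = data;   // plain store
        sequence = seq;   // volatile store: write barrier, payload becomes visible
    }

    // Consumer: read the sequence first, then the data.
    public long readIfReady(long expectedSeq) {
        if (sequence == expectedSeq) {  // volatile load: read barrier
            return payload;             // guaranteed to see the matching payload
        }
        return -1;                      // not yet published
    }
}
```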
Hardware pre-fetching only works if the processor can detect a pattern of access, such as walking memory with a predictable "stride". When iterating over the contents of an array the stride is predictable, so memory will be pre-fetched in cache lines, maximizing the efficiency of the access. Strides typically have to be less than 2048 bytes in either direction to be noticed by the processor.
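The stride effect can be illustrated with two traversals of the same flat array (a sketch; the sizes are arbitrary). Both compute the same result, but the sequential walk has a stride of one element, which the pre-fetcher can stream, while the column-wise walk jumps a whole row at a time (8 KB here, well beyond the pre-fetcher's window):

```java
// Sketch: identical work, different strides over a flat row-major array.
public class StrideDemo {
    static final int WIDTH = 1024, HEIGHT = 1024;

    static long rowMajorSum(long[] flat) {
        long sum = 0;
        for (int row = 0; row < HEIGHT; row++)
            for (int col = 0; col < WIDTH; col++)
                sum += flat[row * WIDTH + col];   // stride: 1 element (8 bytes)
        return sum;
    }

    static long columnMajorSum(long[] flat) {
        long sum = 0;
        for (int col = 0; col < WIDTH; col++)
            for (int row = 0; row < HEIGHT; row++)
                sum += flat[row * WIDTH + col];   // stride: WIDTH elements (8 KB)
        return sum;
    }
}
```

On typical hardware the row-major version runs several times faster, purely because its access pattern is pre-fetch friendly.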
Queue implementations tend to have write contention on the head, tail, and size variables. When in use, queues are typically always close to full or close to empty due to the differences in pace between consumers and producers.
It would be ideal if the graph of dependencies could be expressed without incurring the cost of putting the queues between stages.
The limitations of the Java language mean that entries are associated with the ring-buffer as pointers to objects.
Garbage collectors work at their best when objects are either very short-lived or effectively immortal.
Under heavy load queue-based systems can back up, which can lead to a reduction in the rate of processing and results in the allocated objects surviving longer than they should, thus being promoted beyond the young generation with generational garbage collectors. This has two implications: first, the objects have to be copied between generations, which causes latency jitter; second, these objects have to be collected from the old generation, which is typically a much more expensive operation and increases the likelihood of the "stop the world" pauses that result when the fragmented memory space requires compaction.
On most processors there is a very high cost for the remainder calculation on the sequence number, which determines the slot in the ring. This cost can be greatly reduced by making the ring size a power of 2. A bit mask of size minus one can be used to perform the remainder operation efficiently.
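The masking trick can be sketched as follows (names are illustrative):

```java
// Sketch: with a power-of-2 ring size, slot = sequence % SIZE reduces to
// a single AND with (SIZE - 1), avoiding the expensive division that a
// remainder on an arbitrary size would need on most processors.
public class RingIndex {
    static final int SIZE = 1024;       // must be a power of 2
    static final int MASK = SIZE - 1;   // 0x3FF: low 10 bits set

    static int slot(long sequence) {
        return (int) (sequence & MASK); // equivalent to sequence % SIZE
    }
}
```

The equivalence holds only for power-of-2 sizes, which is why the ring size is constrained rather than left arbitrary.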
In more unusual usages where there are multiple producers, producers will race one another to claim the next entry in the ring-buffer. Contention on claiming the next available entry can be managed with a simple CAS operation on the sequence number for that slot.
Memory barriers are used by processors to indicate sections of code where the ordering of memory updates is important. They are the means by which hardware ordering and visibility of change is achieved between threads. Compilers can put in place complementary software barriers to ensure the ordering of compiled code; such software memory barriers are in addition to the hardware barriers used by the processors themselves.
Once a producer has copied the relevant data into the claimed entry, it can make the entry public to consumers by committing the sequence. This can be done without CAS by a simple busy spin until the other producers have reached this sequence in their own commit. Then this producer can advance the cursor signifying the next available entry for consumption. Producers can avoid wrapping the ring by tracking the sequence of consumers as a simple read operation before they write to the ring buffer.
1. Producer 2, at sequence 13, checks whether the cursor is at 12; because producer 1, at sequence 12, has not yet committed, the cursor is still at 11.
2. Producer 1 commits and advances the cursor from 11 to 12.
3. Producer 2's spin loop now finds the cursor at 12, so it commits, advances the cursor to 13, and notifies the consumers.
Lock-free multi-producer, multi-consumer (MPMC) queues do exist, but they require multiple CAS operations on the head, tail, and size counters. The Disruptor does not suffer this CAS contention.
MPMC lock-free queues also typically use a ring buffer.
Affinity
`taskset -p <PID>`: display the CPU affinity of a specific process.
`/sys/devices/system/cpu/isolated`: lists the isolated CPUs, e.g. `2-3,5`. To prepare isolated CPUs in Linux, you need to configure the kernel to isolate specific CPUs. This can be done by adding the `isolcpus` parameter to the kernel boot options.
`/sys/devices/system/cpu/cpuX/topology/thread_siblings_list`: contains a list of CPUs that share the same physical core (i.e., sibling threads in a Simultaneous Multithreading (SMT) setup). This information is crucial for understanding how to optimize thread placement and performance in multithreaded applications, especially when considering CPU affinity and load balancing.
Processors are often data-starved, waiting on I/O requests from slower storage. When this happens the OS context switches, but that costs a few thousand CPU cycles. Hyperthreading addresses the context-switch problem by having a second set of these super-fast registers already loaded and ready to go.