2024
Last updated
Was this helpful?
Last updated
Was this helpful?
31:
20:
Open Addressing, aka., Closed Hashing
Closed Addressing, aka., Open Hashing
19:
18:
17:
The first process in the process group becomes the process group leader and the first process in the session becomes the session leader. Every session can have one TTY associated with it. Only a session leader can take control of a TTY. For a process to be truly daemonized (ran in the background) we should ensure that the session leader is killed so that there is no possibility of the session ever taking control of the TTY.
6: Some Python knowledge reviewed
3:
There's no difference between shm and tmpfs (actually, tmpfs is only the new name of former shmfs). hugetlbfs is a tmpfs-based filesystem that allocates its space from kernel huge pages and needs some additional configuration afford.
Block devices use SCSI (Small Computer System Interface) to talk to processes on one side and hardware on the other side. Hence, block devices have device names (since everything on Linux is a file, so device files) starting with sd*, where s = SCSI and d = device. During boot time, the kernel detects block devices one by one and name them in the form sd[a-z]. So in my case, my vmware disk got detected first and hence, it was named sda. Next, after boot, I attached my USB stick and the kernel named it sdb. As I mentioned earlier, all disks have one or more partitions. In the above output, sdb1, sda1, etc. are names of partitions.
Remember that file systems are the managers of every partition? When we created a partition on mylv1, it's size was 1 GB. Just increasing the volume size doesn't mean that the filesystem would also be synchronized.
1:
functools.wraps, an easy-to-use interface for functools.update_wrapper, is a decorator that automatically transfers the key metadata from a callable (generally a function or class) to its wrapper. Typically, this wrapper is another function, but it can be any callable object such as a class.
It is equivalent but not identical. Simply said, if a lambda expression does not capture values, it will be a singleton that is re-used on every invocation.
Decorator Functions with Decorator Arguments
I hope I know it earlier... Note there is a
bash.exe
and use it instead ofgit-bash.ext
. Use.bashrc
instead of.profile
or.bash_profile
to setup env var because by default thegit-bash
won't login likebash -l
automatically.
The token bucket limits the average inflow rate and allows sudden increase in traffic. The leaky bucket limits the constant outflow rate, which is set to a fixed value.
Some other implementations,
ring buffer based solution
queue based solution
Implementing a Java Timer uses native Java API.
Comparing it with a Hashed Timer Wheel Solution.
from Doug Lea, the famous Java Concurrency Module author.
"Container types, including collections, maps, streams, arrays, and optionals should not be wrapped in optionals."
"The ProcessHandle class does have the arguments method, which returns Optional<String[]>, but this method should be regarded as an anomaly that is not to be emulated."
Using String as a key in a Java HashMap is generally preferred over CharSequence. This is because String is immutable, ensuring that its hash code remains constant throughout its lifetime, which is crucial for maintaining the integrity of the HashMap. In contrast, CharSequence is an interface that includes both mutable (like StringBuilder) and immutable implementations, leading to potential inconsistencies in equality and hash code behavior across different implementations. Thus, to avoid unexpected behavior, it’s advisable to use String as the key.
The hashCode() implementation for StringBuilder in Java is not explicitly defined in the same way it is for String. Instead, StringBuilder inherits the hashCode() method from the Object class, which returns a hash code based on the object's memory address. This means that each instance of StringBuilder will have a unique hash code that does not take into account the contents of the builder.
String length method for Java and other languages.
due to the nature of threaded execution and system timing, this test could potentially fail if the system is under heavy load or experiencing other issues that cause significant delays.
Alternatives to Hashed Wheel Timers:
Heap-based timers
List-based timers
The default implementation has better performance.
Lesson: treat the switch statement as if they were the data.
Note that Java does not provide unsigned byte. If we need to represent a number as unsigned byte (1 byte -> 4 bytes), we must cast byte to int and mask (&) the new int with a &0xff (get the last 8 bits). It gives the last 8-bits or prevents sign extension.
Java 8 provides the built-in method toUnsignedInt() that is defined in the Byte class. It supports unsigned operations. The method converts a signed byte into an unsigned integer.
Many external systems (e.g., databases, network protocols) utilize unsigned types. The lack of native support for unsigned bytes in Java complicates integration with these systems, requiring additional conversion logic or the use of larger data types (like int or long) to represent values that should fit in an unsigned byte24. This can lead to performance overhead and potential bugs if developers do not handle these conversions carefully.
Frequent conversions between signed and unsigned representations can impact performance, especially in applications that require high throughput or low latency, such as video processing or real-time data analysis.
the lack of an unsigned byte type in Java complicates data handling and interoperability while increasing the risk of errors in code. Developers must implement additional logic to work around these limitations, which can lead to more complex and error-prone applications.
git bisect run ./test_for_bug.sh
If another process attempts to load the same file (while it is still resident in the cache) the kernel detects this and doesn't need to reload the file. If the page cache gets full, pages will get evicted - dirty ones being written back out to the disk.
By contrast, with IPC implemented using shared memory, there are no read and write syscalls, and no extra copy step. Each "channel" can simply use a separate area of the mapped buffer. A thread in one process writes data into the shared memory and it is almost immediately visible to the second process.
if shared memory IPC can be implemented without memory mapped files?
A practical way would be to create a memory-mapped file for a file that lives in a memory-only file system; e.g. a "tmpfs" in Linux.
You could in theory implement a shared segment between two processes
Note that both Aeron IPC and CQ support tmpfs to further improve the performance
When setting up Aeron for IPC, the media driver can be configured to operate with a term buffer located on a tmpfs mount point. This setup minimizes disk I/O latency since all operations occur in memory. The configuration involves specifying the directory for the Aeron media driver to point to a tmpfs mount, ensuring that all IPC messages are handled in-memory
For even lower latencies, Chronicle Queue can be backed by tmpfs, a temporary filesystem that resides in RAM. This configuration significantly reduces delays caused by disk operations, provided that the queue size is managed appropriately.
Transportation Media: multicast, IPC, InfiniBand, RDMA, PCI-e 3.0
OSI Layer 4 (Transport) Services
Connection Oriented Communication
Reliability
Flow Control: counters are the key to flow control and monitoring; pluggable in Aeron
Congestion Avoidance/Control: TCP is not suitable for HFT partially because of it; pluggable in Aeron
Multiplexing: HOL Blocking
Design Principles
from
clear segregation of control
garbage free in steady state running
lock-free, wait-free and copy-free in data structure in the messaging path
respect the Single Writer Principle
major data structures are not shared
don't burden the main paths with exceptional cases
non-blocking in the message path
...
into 3 basic things
system architecture
data structure
protocol of interactions
Data Structure
Maps: dealing with primitives
IPC Ring/Broadcast Buffer: between Conductors
ITC Queues: between Sender/Receiver and Conductors
Dynamic Arrays
Log Buffer: IPC for messaging, creates a replicated persistent log of messages
mmap
tail is being moved atomically
No big file: page fault; page cache churn; VM pressure; clean/dirty/active
receiver side: High Water Mark + Completed; point chasing is really bad (In the context of messaging systems, point chasing refers to a method where a sender strategically prioritizes and sends messages to maximize engagement or response rates.)
Monitoring and Debugging should be designed on day 1
Loss, throughput and buffer size are strongly related
Java
Bad:
No Unsigned Type
NIO - Locks, off-heap, PAUSE, Signals, etc
String Encoding - 3 buffer copy
External Resources
Selectors - GC
converting bytes into int
Good:
Tooling: IDEs, Gradle, HdrHistogram
Bytecode Instrumentation: good to debugging
Unsafe
The Optimizer
Garbage Collectors
Kernel Module and FPGAs possible
The video features Martin Thompson discussing the evolution of financial exchanges, focusing on advancements in design, resilience, performance and deployment over the past decade.
Design
State Machine -> Replicated State Machine: ordered input + deterministic execution
Distributed Event Log: event sourcing
Rich Domain Model (DDD) and specific data structure designed from scratch
Time & Timers: atomic clock + gps synchronizer; how a timer cancels an order
Resilience
Fairness: multiple gateways -> 1
Gateway: classification of customers
Matching Engine: sharding by symbol/fungible...
Primary Secondary vs Consensus: Raft
Code Quality and Model Fidelity: Model fidelity refers to the degree to which a model accurately represents the real-world system or phenomenon it is intended to simulate or predict. High fidelity means that the model closely matches the actual behavior or characteristics of the system, capturing important details and dynamics. Low fidelity indicates a more simplified or abstract representation that may overlook critical factors.
Performance: Transaction throughput has increased significantly, with some exchanges reaching millions of transactions per second and achieving latencies below 100 microseconds.
Latency: average latency is misleading, we need percentile
Throughput: burst scenario
JVM:
CMS full GC
G1
Azul C4: Continuously Concurrent Compacting Collector, high allocation rate without nasty gc pauses with Amdahl's law
ZGC: not generational (but we can turn it on now?)
Shanadoah: better at smaller heaps
Memory Access Patterns: Java is still catching up with that, c can get the close to the machine about the memory layout so that's why the fastest matching engine is written is c
Data Structure: check all kinds of libs or even implement your own one; prevent cache misses;
Binary Codecs: SBE; the FIX protocol is encoded in ASCII
Preventing Costs: system calls; disk calls; page fault is going to interrupt the kernel - A page fault is an exception raised by the memory management unit (MMU) when a program attempts to access a memory page that is not currently mapped to its virtual address space. This situation typically arises when the required page is not loaded into physical memory (RAM), which make the mmap file horrendously more expensive all of a sudden. Setup huge pages to fix it; context switching
Hardware
Disks: from milliseconds to tens/hundreds of microseconds
Network: financial organization is good at that
CPU: not too much improvement, throughput is abundant, but the latency is not getting better
IO: socket is not good; new API for IO and please use asynchronous API; DPDK
Language: polyglot
Deployment
CI/CD
Flexible Scaling: dev env in your local machine; using IPC if the machine has 100 cores
The Epochalypse problem.
QUIC’s faster connection set-up with 0-RTT is really more of a micro-optimization than a revolutionary new feature. Compared to a state-of-the art TCP + TLS 1.3 set-up, it would save a maximum of one round trip. The amount of data that can actually be sent in the first round trip is additionally limited by a number of security considerations.
Levered ETFs are trading tools that are not suitable for investing. They do a good job of matching the levered return of an underlying index intraday. The sum of all the negative gamma trading is expensive as the mechanical re-balancing gets front-run and “arbed” by traders. This creates significant drag on the levered ETF’s assets. In fact, if the borrowing costs to short levered ETFs were not punitive, a popular strategy would be to short both the long and short versions of the same ETF, allowing the neutral arbitrageur to harvest both the expense ratios and negative gamma costs from tracking the index!
If there is an arbitrage possibility, then there is no EMM.
If there are no arbitrage possibilities, then there is at least one EMM.
If every payoff is replicable, then there is exactly one EMM.
窗口大小 = min(cwnd, rwnd)
In Java, the length method of String objects is not the length of that String in characters. Instead, it only gives the number of 16-bit code units used to encode a string. This is not (always) the number of Unicode characters (code points) in the string.
TCP Fast Open (TFO) 是在传统的三次握手基础上进行优化,允许在握手过程中发送数据,从而减少首次发送数据的延迟,提升网络应用性能。
TODO
This repo using RAFT to ensure the Availability, and based on CAP Theory,
Consistency: no, only eventual consistency
Availability: yes
Partition Tolerance: yes
TODO
I wish I can understand more when I watch it again later.
back to the basics.
The goal of VarHandle is to define a standard for invoking equivalents of java.util.concurrent.atomic and sun.misc.Unsafe operations on fields and array elements.
In the worst case we evaluated, non-generational ZGC caused 36% more CPU utilization than G1 for the same workload. That became a nearly 10% improvement with generational ZGC.
When we don’t declare a hashCode() method for a class, Java will use the identity hash code for it.
JVM raises a flag in the monitor object that some thread acquires the lock, so reacquiring and releasing the lock by the same thread is lightweight. But the lock must be revoked when another thread tries to acquire the bias lock. And the revocation is a costly operation.
Use the keyword final when you want the compiler to prevent a variable from being re-assigned to a different object.
The default type of reference in Java is a strong reference.
If JVM sees an object that has only a weak reference during GC, it will be collected.
A soft reference is garbage-collected only if there is not enough memory left in the heap.
Unlike weak and soft references whereby we can control how objects garbage-collected, a phantom reference is used for pre-mortem clean-up actions before the GC removes the object.
Simple but useful.
Put simply, the Contended annotation adds some paddings around each annotated field to isolate each field on its own cache line. Consequently, this will impact the memory layout. Please note that the Contended annotation is JDK internal, therefore we should avoid using it. Also, we should run our code with the -XX:-RestrictContended tuning flag; otherwise, the annotation wouldn’t take effect.
the JVM adds padding to the objects so that their size is a multiple of 8 bytes. With these paddings, the last three bits in oops are always zero. This is because numbers that are a multiple of 8 always end in 000 in binary.
Since ZGC needs to use 64-bit colored pointers, it does not support compressed references. So, using an ultra-low latency GC like ZGC has to be weighed against using more memory.
On modern x86 CPUs, the typical cache line size is 64 bytes. However, some CPUs may have larger cache lines, such as 128 bytes.
To ensure compatibility with various CPU architectures and avoid false sharing on CPUs with larger cache lines, the
ConcurrentCircularArrayQueueL0Pad
class is padded with 128 bytes.
We can make “message processors” are faster
We can make deserialize and serialize process faster
We can make the network transfer faster
Usually when optimising algorithms, you wanna find redundant or unnecessary work. Then find a way to consolidate that redundancy.
The sweep-and-prune algorithm is also known as sort-and-sweep.
Pairs flagged in all dimensions would be considered intersecting.
Insertion sort has a running time of O(n) at best when the list is already sorted or nearly-sorted, and O(n2) at worst when the list is in reverse.
FAST: Primarily designed to reduce the amount of data sent over the network and improve the speed of message delivery, particularly for market data. It utilizes compression techniques to minimize the size of messages, which can include both binary and character-based data.
FAST: Focuses on the transmission layer and may be used in conjunction with other encoding methods, including SBE.
Mark you move constructors and move assignment operators with noexcept
Further optimizations and stronger exception safety with copy-and-swap idiom
Perfect forwarding
A possible mitigation strategy to this is, if for example 95% of the callers of the function will include 3 or less parameters you can create three functions that take 1, 2, and 3 parameters of the argument type to handle the 95% and a fourth that takes three parameters and a vararg parameter to handle the rest of the 5%.
Utility for preventing primitive parameter values from being auto-boxed. Auto-boxing creates temporary objects which contribute to pressure on the garbage collector. With this utility users can convert primitive values directly into text without allocating temporary objects.
private final ThreadLocal<int[]> current = new ThreadLocal<>();
: By using ThreadLocal<int[]>, you can store multiple integer values in a single thread-local variable without the overhead of boxing.
TODO
You will always get the same bean from the context if calling a @Bean annotated method in Spring java configuration. Same for @Component.
TODO
Anonymous subclass of TypeToken
The annotation of the
getGenericSuperClass()
method of theClass
class is: Returns the Type representing the direct superclass of the entity (class, interface, primitive type or void) represented by thisClass. If the superclass is a parameterized type, the Type object returned must accurately reflect the actual type parameters used in the source code. The parameterized type representing the superclass is created if it had not been created before. See the declaration of ParameterizedType for the semantics of the creation process for parameterized types. If thisClass represents either theObject class, an interface, a primitive type, or void, then null is returned. If this object represents an array class then theClass object representing theObject class is returned.
VISUAL BLOCK: I - insert before; A - append after; c - replace
macro
substitute
Exchanger vs SynchronousQueue vs LinkedTransferQueue
public interface RunnableFuture<V> extends Runnable, Future<V>
public class FutureTask<V> implements RunnableFuture<V>
There are four types of references in Java: Strong, Weak, Soft, and Phantom.
Unlike weak and soft references whereby we can control how objects garbage-collected, a phantom reference is used for pre-mortem clean-up actions before the GC removes the object.
It explains the origin of complex number and the geometrical meaning of complex domain.
How big can a certificate be? For curl to work is 100kB and for a web browser like Firefox is ~60kB though Firefox TLS library works different so results may vary.
How long can it last? From Jan 1, 1950 to Dec 31, 9999. About 8050 years.
Blocking I/O model (Blocking / Synchronous): recvfrom
Signal-driven I/O (Blocking / Synchronous): SIGIO
Non-Blocking I/O (Non-Blocking / Synchronous): recvfrom (O_NONBLOCK)
I/O Multiplexing (Blocking / Asynchronous): select, poll, epoll
Asynchronous I/O (Non-Blocking / Asynchronous): AIO
Power Set of X is the set of all possible subsets of X.
Definition of sigma algebra
what is a trade and how can it be modelled
what you need before booking a trade
what happens when a trade is booked
“Bootstrapping is a statistical procedure that resamples a single data set to create many simulated samples.”
30 lines.
CRDT stands for “Conflict-free Replicated Data Type”.
It’s a kind of data structure that can be stored on different computers (peers). Each peer can update its own state instantly, without a network request to check with other peers.
Peers may have different states at different points in time, but are guaranteed to eventually converge on a single agreed-upon state. That makes CRDTs great for building rich collaborative apps, like Google Docs and Figma — without requiring a central server to sync changes.
Acquire semantics -> LoadLoad + LoadStore
Release semantics -> LoadStore + StoreStore
Types of Memory Barrier/Reordering: LoadLoad, StoreStore, LoadStore, StoreLoad
Mintomic (short for “minimal atomic”) is an API for low-level lock-free programming in C and C++.
Sequential consistency means that all threads agree on the order in which memory operations occurred, and that order is consistent with the order of operations in the program source code.
Kiss Linux™ is a meta-distribution for the x86_64 architecture with a focus on simplicity, sustainability and user freedom.
The truth is that pairs of memory barriers provide conditional ordering guarantees.
This is a shell script template generator (i.e. a script that writes scripts).
Maybe the
thread.yield()
should be used instead of theLockSupport.parkNanos(1);
Java supports to use “memory-mapped file” we can see this feature has been implemented in some popular frameworks and are successful by using this capability of Java
Chronicle queue
QuestDB
KDB
kafka -> sendfile
TODO
lock-free using volatile / AtomicLong
padding to prevent false sharing
batching for disk/network writing (e.g., the size of a block is 4k)
enable the ability to zero garbage route using byte array or using the immutable object Reference,
Latency numbers every programmer should know,
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
...
27:
26:
check the
25:
20:
19:
13:
worth a reading.
7: /
The algorithm behind the check is based on Determine if a word has a zero byte:
6:
5:
4:
3:
2:
1:
31:
30:
When we need to represent signed numbers in Java, we find .In 2's complements the left most bit represent the sign (+ ive or - ive). The bit 0 denotes positive and 1 denotes negative. The rest of the bits denotes the value from -128 to 127. Therefore, it is called 8-bit byte only have 7 bits to store the values. other extra values range from 128 to 255 are unable to fit into a single byte. So, we can cast it to 32-bit unsigned integer for more spaces (bits).
29:
28:
27:
26:
25:
The video discusses Aeron, a messaging system focused on high performance and reliability, particularly in scenarios where traditional protocols like TCP and UDP may fall short. The speaker, Martin Thompson, emphasizes the need for consistent latency and the challenges of reliable message delivery over UDP.
Lambda &
24:
23:
Dagger, a directed, acyclic graph of financial instruments. Also refer to for this concept.
22:
21:
20:
19:
18:
and his [Yet another full-node guide](https://becomesovran.com/blog/yet-another-full-node-guide.html) is quite good too. And here is another blog about the mentioned .
15:
13:
12:
11:
7:
2:
30:
29:
28:
27:
26:
25:
24:
23:
13:
12:
11:
10:
9:
7:
: Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space–time tradeoff.
: Loop unswitching is a compiler optimization. It moves a conditional inside a loop outside of it by duplicating the loop's body, and placing a version of it inside each of the if and else clauses of the conditional.[1] This can improve the parallelization of the loop. Since modern processors can operate quickly on vectors, this improvement increases the speed of the program.
6: -> , which means Apple chips are not supported yet.
29:
28:
27:
26:
24:
23:
22:
to inspect the memory layout of objects in the JVM.
21:
20:
19:
18:
17:
16:
15:
: The body length is the character count starting at tag 35 (included) all the way to tag 10 (excluded), including trailing SOH delimiters.
Simple Binary Encoding (SBE) is a codec format specified by the Financial Information eXchange (FIX) Trading Community for use in high-performance trading systems. ()
14:
13:
more,
12:
11:
10:
9:
8:
7:
6:
5:
TODO
4:
3:
2:
1:
31:
30:
29:
28:
27:
26:
24:
25:
: Mark-Sweep, Copying, Mark-Compact, Generation-Collection
24:
: TreeMap (Red-Black Tree) vs Priority Queue (Heap)
18:
15:
12:
11:
08:
04:
02: and
28:
27:
26:
24:
23:
21:
20:
"For antirez, programming was a way to express himself, a form of art. Every character and line break had to be meticulously crafted, akin to the art form of writing. Software development was like writing a book — it had to be beautiful, elegant, and easy to comprehend. If that software happened to be useful to others, that was just a side effect."
19:
18:
17:
16:
15:
14:
13:
12:
11:
10:
9:
8:
7:
6:
5:
4:
3: