Damon's Blog
26:
JEP 318 explains that “[Epsilon] … handles memory allocation but does not implement any actual memory reclamation mechanism. Once the available Java heap is exhausted, the JVM will shut down.”
25:
Every input into the system is assigned a globally unique monotonic sequence number and timestamp by a central component known as a sequencer. This sequenced stream of events is disseminated to all nodes/applications in the system, which only operate on these sequenced inputs, and never on any other external inputs that have not been sequenced. Any outputs from the applications must also first be sequenced before they can be consumed by other applications or the external world. Since all nodes in the distributed system are presented with the exact same sequence of events, it is relatively straightforward for them to arrive at the same logical state after each event, without incurring any overhead or issues related to inter-node communication.
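The sequencing idea above can be sketched in a few lines. This is a minimal illustration, not the quoted system's implementation: the class names, the single-threaded design, and the rule that stamped timestamps never go backwards are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class SequencerSketch {
    /** One input after the (hypothetical) sequencer has stamped it. */
    static final class Sequenced {
        final long seq;
        final long timestamp;
        final String payload;

        Sequenced(long seq, long timestamp, String payload) {
            this.seq = seq;
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    /** Hypothetical single-threaded sequencer: the only place where order and time are decided. */
    static final class Sequencer {
        private long nextSeq = 1;
        private long lastTimestamp = 0;

        Sequenced stamp(String payload, long clockNanos) {
            // Assumption: stamped time never goes backwards, even if the clock does.
            lastTimestamp = Math.max(lastTimestamp, clockNanos);
            return new Sequenced(nextSeq++, lastTimestamp, payload);
        }
    }

    /** A deterministic "application": its state depends only on the sequenced stream. */
    static long replay(List<Sequenced> stream) {
        long state = 17;
        for (Sequenced event : stream) {
            state = 31 * state + event.seq;
            state = 31 * state + event.timestamp;
            state = 31 * state + event.payload.hashCode();
        }
        return state;
    }

    static List<Sequenced> sampleStream() {
        Sequencer sequencer = new Sequencer();
        List<Sequenced> stream = new ArrayList<>();
        stream.add(sequencer.stamp("NEW_ORDER", 100));
        stream.add(sequencer.stamp("CANCEL", 90)); // clock stepped back; the stamp stays at 100
        stream.add(sequencer.stamp("NEW_ORDER", 250));
        return stream;
    }
}
```

Two nodes that call `replay` on the same stream end up with the same state without talking to each other, which is the property the paragraph describes.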
Binary Encoding
Hardware Efficiency / Kernel Bypassing
20:
TODO
19:
The following 4-step approach proved to be most efficient to detect memory issues:
Get an overview of the heap dump. See: Overview
Find big memory chunks (single objects or groups of objects).
Inspect the content of this memory chunk.
If the content of the memory chunk is too big, check who keeps this memory chunk alive. This sequence of actions is automated in Memory Analyzer by the Leak Suspects Report.
18:
First make it possible. Then make it beautiful. Then make it fast.
17:
The best way to avoid GC is to not create garbage in the first place. This topic could fill a book, but the primary ways to do that are: (a) Do not create new objects in the critical path of processing. Create all the objects you’ll need upfront and cache them in object pools. (b) Do not use Java strings. Java strings are immutable objects that are a common source of garbage. We use pooled custom strings that are based on java.lang.StringBuilder. (c) Do not use standard Java collections. More on this below. (d) Be careful about boxing/unboxing of primitive types, which can happen when using standard collections or during logging. (e) Consider using off-heap memory buffers where appropriate (we use some of the utilities available in chronicle-core).
Avoid standard Java collections. Most standard Java collections use a companion Entry or Node object, that is created and destroyed as items are added/removed. Also, every iteration through these collections creates a new Iterator object, which contributes to garbage. Lastly, when used with primitive data types (e.g. a map of long → Object), garbage will be produced with almost every operation due to boxing/unboxing. When possible, we use collections from agrona and fastutil (and rarely, guava).
Write deterministic code. We’ve alluded to determinism above, but it deserves elaboration, as this is key to making the system work. By deterministic code, we mean that the code should produce the exact same output each time it is presented with a given sequenced stream, down to even the timestamps. This is easier said than done, because it means that the code may not use constructs such as external threads, or timers, or even the local system clock. The very passage of time must be derived from timestamps seen on the sequenced stream. And it gets weirder from there — like, did you know that the iteration order of some collections (e.g. java.util.HashMap) is non-deterministic because it relies on the hashCode of the entry keys?!
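The HashMap remark can be made concrete with a small sketch. It assumes a hypothetical key class, `OrderId`, that keeps the identity-based `hashCode` inherited from `Object`, so a `HashMap` of such keys may iterate in a different order on each run. One common remedy, which the quoted text does not itself name, is a `LinkedHashMap`, which iterates in insertion order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DeterministicIteration {
    /** Hypothetical key that keeps Object's identity-based hashCode, which can differ between runs. */
    static final class OrderId {
        final String text;

        OrderId(String text) {
            this.text = text;
        }
    }

    /** Returns the keys in LinkedHashMap iteration order, which is insertion order. */
    static List<String> iterationOrder(String... ids) {
        Map<OrderId, Integer> orders = new LinkedHashMap<>();
        for (String id : ids) {
            orders.put(new OrderId(id), id.length());
        }
        List<String> seen = new ArrayList<>();
        for (OrderId key : orders.keySet()) {
            seen.add(key.text);
        }
        return seen;
    }
}
```

Swapping `LinkedHashMap` for `HashMap` here compiles and runs, but the returned order is then tied to hash values that are not stable across JVM runs.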
but our changes enable us to integrate QuickFIX/J with the sequenced stream architecture in such a way that we no longer rely on disk logs for recovery (which is how most FIX sessions recover).
Our FIX spec is available in either the PDF format or the ATDL format (Algorithmic Trading Definition Language).
13:
Escape Analysis works, at least for some trivial cases. It is not as powerful as we'd like it, and code that is not hot enough will not enjoy it, but for hot code it will happen. I'd be happier if the flags for tracking when it happens were not debug only.
12:
This socket option tells the kernel that even if this port is busy (in the TIME_WAIT state), go ahead and reuse it anyway. If it is busy, but with another state, you will still get an address already in use error. It is useful if your server has been shut down, and then restarted right away while sockets are still active on its port.
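In Java the same option is exposed as `ServerSocket.setReuseAddress`. The sketch below is one way to apply it; the class and method names are made up for the example. The option has to be set on an unbound socket, before `bind`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

final class ReuseAddressSketch {
    /** Opens a listening socket with SO_REUSEADDR enabled; port 0 asks the OS for any free port. */
    static ServerSocket listen(int port) {
        try {
            ServerSocket server = new ServerSocket(); // created unbound on purpose
            server.setReuseAddress(true);             // must be set before bind to take effect
            server.bind(new InetSocketAddress(port));
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Small self-check: binds to an ephemeral port and reports whether the option is on. */
    static boolean demo() {
        try (ServerSocket server = listen(0)) {
            return server.isBound() && server.getReuseAddress();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```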
11:
If a system is decomposed into components that keep their own relevant state model, without a central shared model, and all communication is achieved via message passing, then you have a system without contention naturally. This type of system obeys the single writer principle if the message-passing sub-system is not implemented as queues. If you cannot move straight to a model like this, but are finding scalability issues related to contention, then start by asking the question, “How do I change this code to preserve the Single Writer Principle and thus avoid the contention?”
Why is a queue not a good data structure for low latency? The head and the tail compete with each other quite often, since the queue is normally either full or empty, and when it is empty they normally point to the same cache line.
Contention & Locking Overhead: locks / cache coherence traffic
Memory Allocation & Garbage Collection (GC): LMAX avoids this by using pre-allocated, garbage-free data structures.
Pointer Chasing & Cache Misses: LMAX uses a pre-allocated ring buffer (Disruptor) that is cache-friendly (sequential memory access).
Batching & False Sharing: Queues often process items one at a time, missing opportunities for batching (which improves throughput).
10:
Efficient pattern for single writer and single reader case. To ensure thread-safety, ReadWriteLock / Semaphore could be used.
9:
Performance Ninja Class is a FREE self-paced online course for developers who want to master software performance tuning. -> this is the author's amazing blog.
8:
Linux systems allow easily switching between programs of similar functionality or goal. So we can set a given version of a utility program or development tool for all users. Moreover, the change applies not only to the program itself but to its configuration or documentation as well.
7:
All writes that occur before a volatile store are visible to any other thread, provided that the other thread loads the newly stored value. However, writes that occur before a volatile store may or may not be seen by other threads if they do not load the new value.
In Java, the semantics of volatile are defined to ensure visibility and ordering of variables across threads.
A volatile write in Java means that a StoreStore barrier and a LoadStore barrier are inserted before the write. This ensures that:
All previous writes (stores) are visible before the volatile write.
All previous reads (loads) complete before the volatile write.
A volatile read in Java means that a LoadLoad barrier and a LoadStore barrier are inserted after the read. This ensures that:
The volatile read completes before any subsequent reads (loads).
The volatile read completes before any subsequent writes (stores).
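A minimal sketch of what these rules buy in practice: a plain field published through a volatile flag. The class and field names are made up for the example. The reader is guaranteed to see the payload only because it first observes the volatile write.

```java
final class VolatilePublication {
    private int payload;            // plain, non-volatile field
    private volatile boolean ready; // volatile flag used to publish the payload

    void publish(int value) {
        payload = value; // plain write, ordered before the volatile write below
        ready = true;    // volatile write
    }

    int awaitPayload() {
        while (!ready) {    // volatile read; loops until the writer's store is observed
            Thread.yield();
        }
        return payload;     // sees the value written before ready was set to true
    }

    /** Runs one writer and one reader and returns what the reader saw. */
    static int demo() {
        VolatilePublication box = new VolatilePublication();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = box.awaitPayload());
        reader.start();
        box.publish(42);
        try {
            reader.join(); // join makes the reader's write to seen[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

If `ready` were a plain field, the reader could spin forever or return a stale `payload`; nothing in the memory model would forbid either outcome.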
6: Linux Default Route
4:
InheritableThreadLocal provides exactly this: the class lets a child thread inherit the ThreadLocal values that have already been set in its parent thread.
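A short sketch of the difference, with made-up names and values. The child thread is created after the parent has set both variables, and only the inheritable one carries over.

```java
final class InheritedContext {
    private static final ThreadLocal<String> PLAIN = new ThreadLocal<>();
    private static final InheritableThreadLocal<String> INHERITED = new InheritableThreadLocal<>();

    /** Returns what a freshly created child thread sees: [plain value, inherited value]. */
    static String[] valuesSeenByChild() {
        PLAIN.set("request-42");
        INHERITED.set("request-42");
        String[] seen = new String[2];
        Thread child = new Thread(() -> {
            seen[0] = PLAIN.get();     // null: plain ThreadLocal values are per thread
            seen[1] = INHERITED.get(); // copied from the parent when the thread was created
        });
        try {
            child.start();
            child.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            PLAIN.remove();
            INHERITED.remove();
        }
        return seen;
    }
}
```

The copy happens when the child thread is constructed, so threads that already exist in a pool do not pick up values set afterwards.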
3:
Why are shutdown hooks run concurrently? Wouldn't it make more sense to run them in reverse order of registration?
Invoking shutdown hooks in their reverse order of registration is certainly intuitive, and is in fact how the C runtime library's atexit procedure works. This technique really only makes sense, however, in a single-threaded system. In a multi-threaded system such as Java platform the order in which hooks are registered is in general undetermined and therefore implies nothing about which hooks ought to be run before which other hooks. Invoking hooks in any particular sequential order also increases the possibility of deadlocks. Note that if a particular subsystem needs to invoke shutdown actions in a particular order then it is free to synchronize them internally.
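The last sentence of that answer suggests ordering shutdown actions inside the subsystem. One way to do that, sketched here with hypothetical names, is to register a single hook that runs its own actions in reverse order of registration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class OrderedShutdown {
    private final Deque<Runnable> actions = new ArrayDeque<>();

    /** Later registrations run first, like atexit. */
    synchronized void register(Runnable action) {
        actions.push(action);
    }

    /** Runs every registered action once, most recently registered first. */
    synchronized void runAll() {
        while (!actions.isEmpty()) {
            actions.pop().run();
        }
    }

    /** Installs one JVM shutdown hook for the whole subsystem. */
    void install() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::runAll, "ordered-shutdown"));
    }

    /** Calls runAll directly so the order can be observed without exiting the JVM. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        OrderedShutdown shutdown = new OrderedShutdown();
        shutdown.register(() -> log.add("close database"));
        shutdown.register(() -> log.add("stop accepting requests"));
        shutdown.runAll();
        return log;
    }
}
```

Because the JVM sees only one hook, the concurrency between hooks described above no longer affects the order of these actions.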
2:
In computer programming, the exclusive or swap (sometimes shortened to XOR swap) is an algorithm that uses the exclusive or bitwise operation to swap the values of two variables without using the temporary variable which is normally required.
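A small sketch of the algorithm on two array slots. The guard for `i == j` is needed because XOR-ing a slot with itself yields zero, a well-known pitfall of this trick.

```java
final class XorSwap {
    /** Swaps values[i] and values[j] without a temporary variable. */
    static void swap(int[] values, int i, int j) {
        if (i == j) {
            return; // x ^ x == 0, so swapping a slot with itself would erase it
        }
        values[i] ^= values[j]; // values[i] now holds a ^ b
        values[j] ^= values[i]; // b ^ (a ^ b) == a
        values[i] ^= values[j]; // (a ^ b) ^ a == b
    }
}
```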
1:
sleep requests of ~1ms and ~1us reduce CPU usage to ~1% and ~10% respectively compared with busy waiting (100%)
Here again, there is no single answer as to how the system will behave. The key is to bias the situation as much as possible to avoid the thread being switched from a core, and the use of thread affinity (to avoid the thread being moved to another core) and CPU isolation (to avoid another process/thread contending with the thread) can be very effective in this case [1]. Careful use of affinity, isolation, and short sleep periods can result in responsive, low-jitter environments, which use considerably fewer CPU resources compared with busy waiting.
[1] Other options include running with real-time priorities; however, we want to keep the focus of this document on standard setups as much as possible.
Lock Free
Testing and Evolving Requirements
Shared Mutable State
CPU Isolation and Thread Affinity
Event Driven Architecture
Challenges in Low-Latency Trading
Threading and Core Utilisation
Serialisation and Deserialisation
Message Passing and Data Persistence
Addressing Low-Latency Trading Pain Points
Thread Affinity and Event Loop Optimisation
Efficient Message Passing
Minimising Garbage Collection
Performance Tuning for High-Throughput Trading
Real-World Example: A High-Performance Trading Engine in Action
Accepting Market Data
Making Trading Decisions
Chronicle Queue Enterprise for Communication
Keeping Latency Stable
show-busy-java-threads; how to find the thread that uses the most CPU
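Use the top command to find the Java process with high CPU usage and its thread ids:
Turn on thread display mode (top -H, or press H after starting top)
Sort by CPU usage (top sorts by CPU usage in descending order by default, which is already what we want; pressing P inside top selects descending CPU order explicitly)
Note the Java process id and the ids of its high-CPU threads
Inspect the stacks of the high-CPU threads:
Convert the thread id to hexadecimal by hand (for example with printf %x 1234)
Search the jstack output for the hexadecimal thread id (with vim's search, /0x1234, or with grep 0x1234 -A 20)
Read the matching thread stack and analyze the problem; while investigating, the steps above usually have to be repeated several times to pin the problem down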
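The conversion step can also be done in code. This is a tiny sketch with a made-up helper name, assuming jstack prints the native thread id as lowercase hex with a 0x prefix, as the steps above describe.

```java
final class ThreadIdHex {
    /** Formats a decimal thread id from top -H as a 0x-prefixed lowercase hex string. */
    static String toNid(long tid) {
        return "0x" + Long.toHexString(tid);
    }
}
```

For the thread id 1234 used in the printf example, this gives 0x4d2, which is the string to search for in the jstack output.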
tcp-connection-state-counter
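To achieve this, a program must be able to pause and resume, and for a program to pause and resume, the CPU context has to be saved.
Design a new abstraction that isolates running programs from each other and gives each one its own memory space. You decide to use segmented memory management, so every segment of every running program has its own memory region. You have now designed struct context and struct memory_map, and both clearly belong to one particular running program. "A running program" is a new concept, and you give it a name: process. The process context and the memory map can now both live in the process struct.
Each thread is an independent unit of execution inside a process. Threads:
Share the process's address space, which means all threads can directly access the same memory regions
Share open file descriptors, which avoids the cost of repeatedly opening and closing files
Share other system resources, such as signal handlers and the process working directory
Keep only their own execution stack and register state, so that each thread can run independently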
An important thing to note is the limitation of the annotation processing API — it can only be used to generate new files, not to change existing ones. If you use Maven to build this jar and try to put this file directly into the src/main/resources/META-INF/services directory, you’ll encounter the following error:
This is because the compiler tries to use this file during the source-processing stage of the module itself when the BuilderProcessor file is not yet compiled. The file has to be either put inside another resource directory and copied to the META-INF/services directory during the resource copying stage of the Maven build, or (even better) generated during the build. The Google auto-service library, discussed in the following section, allows generating this file using a simple annotation.
This means that accept blocks the calling thread until a new connection is available from the OS, but the reverse is not true. The underlying OS will establish TCP connections for the application even if the program is not currently blocked at accept. In other words, accept asks the OS for the first ready-to-use connection, but the OS does not wait for the application to accept connections in order to establish new ones. It might establish many more.
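This behaviour can be observed with a small loopback experiment. The sketch uses made-up names and assumes a local TCP stack with a working loopback interface. The client's connect completes before the server ever calls accept, because the OS finishes the handshake and parks the connection in the listen backlog.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class AcceptBacklogSketch {
    /** Returns true if the client was connected before the server called accept. */
    static boolean connectsBeforeAccept() {
        try (ServerSocket server = new ServerSocket(0, 16, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {
            // No accept() has been called yet; the OS completes the handshake on its own.
            client.connect(new InetSocketAddress(server.getInetAddress(), server.getLocalPort()), 2000);
            boolean establishedEarly = client.isConnected();
            // accept() now only hands over the connection that is already waiting.
            try (Socket accepted = server.accept()) {
                return establishedEarly && accepted.isConnected();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```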
Hatch is a modern, extensible Python project manager.
Learn how a hardware engineer works.
Enum may be a better option.
Explicit Inheritance vs Implicit Inheritance
LLM and content-type: text/event-stream
Bottom line: toArray(new T[0]) seems faster, safer, and contractually cleaner, and therefore should be the default choice now.
JDK-based AOP relies on reflection, which carries a performance cost, while CGLIB uses ASM to generate a subclass of the original class at runtime and intercepts method calls there.
Learn how to use CMake properly, and note that CMake is a build-system generator, not a build system itself.
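The key difference between CompletableFuture and Future is chaining: a CompletableFuture can attach dependent stages that run when the result is ready, while a plain Future can only be polled or blocked on with get().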
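A short sketch of such a chain. The lookup and the label format are invented for the example; only the `CompletableFuture` methods are real API.

```java
import java.util.concurrent.CompletableFuture;

final class FutureChaining {
    /** Builds a pipeline of dependent stages without blocking the caller. */
    static CompletableFuture<String> priceLabel(String symbol) {
        return CompletableFuture
                .supplyAsync(() -> symbol.length() * 10) // hypothetical async price lookup
                .thenApply(price -> price + 1)           // transform the result when it arrives
                .thenCompose(price ->                    // chain a second async step
                        CompletableFuture.supplyAsync(() -> symbol + "=" + price))
                .exceptionally(error -> symbol + "=unavailable"); // fallback if any stage fails
    }
}
```

With a plain `Future`, each of these steps would need its own blocking `get()` call before the next one could start.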
Compiler problems really are everywhere.
These can also be a template to teach you how to write a book
read each commit one by one
Concepts are an extension for templates.
They can be used to perform compile-time validation of template arguments through boolean predicates.
They can also be used to perform function dispatch based on properties of types.
Mental model for understanding inodes, hard links, and soft links in Linux.
A soft link links a link file to a target file. This is in contrast to a hard link, which links a pathname to an inode.
The content of a soft link is the pathname of the target file it points to.
A hard link exists as a directory entry that links a pathname to an inode, while a soft link exists as a file that links its own pathname to another pathname.
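The distinction shows up as soon as the original pathname is deleted. The sketch below uses made-up file names and assumes a filesystem that supports both link types, as typical Linux filesystems do. It creates one link of each kind and then removes the original name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class LinkSketch {
    /** Returns [hard link still readable, soft link still resolves] after the original is deleted. */
    static boolean[] survivesDelete() {
        try {
            Path dir = Files.createTempDirectory("links");
            Path original = dir.resolve("original.txt");
            Path hard = dir.resolve("hard.txt");
            Path soft = dir.resolve("soft.txt");
            Files.write(original, "data".getBytes(StandardCharsets.UTF_8));
            Files.createLink(hard, original);         // a second pathname for the same inode
            Files.createSymbolicLink(soft, original); // a file whose content is the target pathname
            Files.delete(original);                   // removes one pathname, not the inode
            boolean hardOk = "data".equals(
                    new String(Files.readAllBytes(hard), StandardCharsets.UTF_8));
            boolean softOk = Files.exists(soft);      // follows the link; the target name is gone
            Files.delete(hard);
            Files.delete(soft);                       // deletes the link itself
            Files.delete(dir);
            return new boolean[] {hardOk, softOk};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The hard link keeps the data reachable because the inode still has a pathname, while the soft link dangles because it only stored a pathname that no longer exists.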
Running Gradle Builds
Authoring Gradle Builds
Optimizing Gradle Builds
Dependency Management
Bottom line: toArray(new T[0]) seems faster, safer, and contractually cleaner, and therefore should be the default choice now. Future VM optimizations may close this performance gap for toArray(new T[size]), rendering the current "believed to be optimal" usages on par with an actually optimal one. Further improvements in toArray APIs would follow the same logic as toArray(new T[0]): the collection itself should create the appropriate storage.
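For reference, the recommended form looks like this in code. This is a minimal sketch with a made-up helper name; the zero-length array only supplies the element type, and the collection allocates the array of the right size.

```java
import java.util.List;

final class ToArraySketch {
    /** Copies the list into a new String[] using the zero-length-array idiom. */
    static String[] copy(List<String> names) {
        return names.toArray(new String[0]);
    }
}
```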
In GDB you can add a watchpoint to observe a region of memory. The program stops when that memory is modified, so you can tell exactly which line of code changed it.
This blog post provides an overview of common cross-architecture portability problems encountered in software development, particularly focusing on the challenges when targeting 32-bit systems. It discusses issues related to integer type sizes, address space limitations, large file support, the Y2K38 problem, byte order (endianness), and char signedness. While many of these issues are often discussed in the context of C programming, the author highlights that some, like address space limitations, can affect programs written in higher-level languages such as Python. The post emphasizes that achieving true cross-architecture portability requires careful consideration of these low-level details and can be challenging, especially when dealing with legacy or proprietary software.
Learn how to think like a master from DeepSeek.
isd (interactive systemd) – a better way to work with systemd units
The best and simplest free open source web page change detection, website watcher, restock monitor and notification service.
"Writing a Compiler in Go"
Still I was surprised by the result, and if you're splitting lots of Strings and performance is an issue, it might be worth considering switching back to Commons StringUtils.
long gamma
23: CPP RVO (Return Value Optimization)
21: Python Chained Expression
when run with source, it should be exec /bin/bash -c 'source ...'
useful starting from "So what is new compared to the last config?"
JNI bindings for Zstd native library that provides fast and high compression lossless algorithm for Android, Java and all JVM languages
mm for minutes, MM for months, and in most cases use yyyy for years
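The effect of mixing up these letters is easy to see with `java.time`. The sketch assumes the note refers to Java date patterns (the letters mean the same in `SimpleDateFormat`), and the sample date is made up.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class DatePatternSketch {
    /** Sample instant: 5 March 2024, 14:07. */
    static final LocalDateTime SAMPLE = LocalDateTime.of(2024, 3, 5, 14, 7);

    /** Formats the sample with the given pattern. */
    static String format(String pattern) {
        return SAMPLE.format(DateTimeFormatter.ofPattern(pattern));
    }
}
```

`format("yyyy-MM-dd HH:mm")` gives 2024-03-05 14:07, while the mistyped `format("yyyy-mm-dd")` silently puts the minutes where the month should be and gives 2024-07-05. Uppercase YYYY is the week-based year, which can differ from the calendar year around New Year, hence the advice to prefer yyyy.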
31:
Below are some of the key points to consider when choosing to use event loops:
Resource Utilization
30:
Run jstack with the process id as its argument to dump the problematic Java process.
29:
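When an interrupt occurs, the CPU uses the interrupt number as an index into the interrupt vector table, looks up the corresponding entry, and obtains the entry address of the interrupt handler from it.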
24:
Note that a file can be held open by a process while all hard links are subsequently unlinked, leaving the data accessible until the file is closed. The main use for files with multiple hard links is to create efficient backups.
11:
easier to start