Damon's Blog

The magic you are looking for is in the work you're avoiding.

Gems

2025

December

November

  • 15: History of users modifying a file in Linux

    1. Use the stat command

    2. Find the Modify time

    3. Use the last command to see the login history

    4. Compare the log-in/log-out times with the file's Modify timestamp
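    A minimal sketch of steps 2 and 4 in Python. The session list here is synthetic; in practice you would parse it from `last` output:

```python
import os

def file_mtime(path):
    """Step 2: return the file's Modify timestamp as epoch seconds."""
    return os.stat(path).st_mtime

def users_logged_in_at(ts, sessions):
    """Step 4: return users whose login/logout window covers the
    Modify timestamp. `sessions` is a list of (user, login, logout)
    epoch-second tuples, e.g. parsed from `last` output."""
    return [u for (u, start, end) in sessions if start <= ts <= end]

# Synthetic example: two sessions, only bob's covers t=250.
sessions = [("alice", 100, 200), ("bob", 150, 300)]
print(users_logged_in_at(250, sessions))  # ['bob']
```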

  • 14: Aeron spy subscription

    The NOT_CONNECTED isn't an error; it is an indication that there are no subscribers. If you want to discard data when there are no subscribers, then don't offer the message again, but drop it instead.

  • 13: The Write-Ahead Log: The Underrated Reliability Foundation for Databases and Distributed Systems

    1. PostgreSQL: WAL for ACID Transactions and Replication

    2. Kafka: Logs as the System

    3. MongoDB: The Oplog for Replication

  • 11: Using Python with C++

  • 10: Booting my Raspberry Pi over my network made a huge performance difference

    One of the most noticeable improvements was speed. Even though the Pi boots over Ethernet, read and write performance is far better than I’ve ever seen with SD cards. System updates apply faster, logs write more consistently, and services start without delay. It feels smoother, especially when working with heavier tasks like database-driven apps or media servers.

  • 9: A Short Survey of Compiler Targets

    • Most modern compilers don’t actually emit machine code or assembly directly. They first lower the source code to a language-agnostic intermediate representation (IR), and then generate machine code for the major architectures (x86-64, ARM64, etc.) from it.

    • Sometimes you are okay with letting other compilers/runtimes take care of the heavy lifting. You can transpile your code to another established high-level language and leverage that language’s existing compiler/runtime and toolchain.

    • Meta-tracing and metacompilation frameworks are a more complex category. These are not targets for your compiler backend; instead, you use them to build a custom JIT compiler for your language by specifying an interpreter for it.
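    As a toy illustration of the transpilation route, here is a sketch (the AST shape is invented) that lowers a tiny arithmetic AST to Python source and reuses Python's own parser and runtime as the target toolchain:

```python
import ast  # only used to sanity-check the generated source

def emit(node):
    """Recursively turn ('add'|'mul', lhs, rhs) tuples into Python source."""
    if isinstance(node, int):
        return str(node)
    op, lhs, rhs = node
    sym = {"add": "+", "mul": "*"}[op]
    return f"({emit(lhs)} {sym} {emit(rhs)})"

tree = ("add", 2, ("mul", 3, 4))   # represents 2 + 3 * 4
src = emit(tree)                   # "(2 + (3 * 4))"
ast.parse(src)                     # the target language's parser accepts it
print(src, "=", eval(src))         # reuse the host runtime: (2 + (3 * 4)) = 14
```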

  • 7: The 36 NVIDIA people who report to Jensen Huang

    Second place is just the first loser.

  • 6: Asyncio Event Loops Tutorial

    • A Conceptual Overview of asyncio

      • The terms “coroutine function” and “coroutine object” are often conflated as coroutine. That can be confusing!

      • Similar to a coroutine function, calling a generator function does not run it. Instead, it creates a generator object

      • In practice, it’s recommended to use (and common to see) asyncio.run(), which takes care of managing the event loop and ensuring the provided coroutine finishes before advancing.

      • It’s important to be aware that the task itself is not added to the event loop, only a callback to the task is. This matters if the task object you created is garbage collected before it’s called by the event loop.

      • When the coroutine exits, local variables go out of scope and may be subject to garbage collection.

      • Unlike tasks, awaiting a coroutine does not hand control back to the event loop! The behavior of await coroutine is effectively the same as invoking a regular, synchronous Python function.

      • Each time a task is awaited, control needs to be passed all the way up the call stack to the event loop. That might sound minor, but in a large program with many await statements and a deep call stack, that overhead can add up to a meaningful performance drag.

      • The only way to yield (or effectively cede control) from a coroutine is to await an object that yields in its __await__ method.
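    The distinction drawn above between awaiting a bare coroutine and yielding to the event loop can be demonstrated directly:

```python
import asyncio

order = []

async def record_task():
    order.append("task")

async def child():
    order.append("child")

async def main():
    task = asyncio.create_task(record_task())  # scheduled, not yet run
    await child()           # behaves like a plain call; does NOT cede control
    order.append("after-await")
    await asyncio.sleep(0)  # this DOES yield, letting the pending task run
    order.append("end")

asyncio.run(main())
print(order)  # ['child', 'after-await', 'task', 'end']
```

    Note that `record_task` was created first but only runs once the loop regains control at `asyncio.sleep(0)`.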

  • 3: TCP Socket Listen: A Tale of Two Queues

    • SYN Queue: tcp_max_syn_backlog

    • Accept Queue: backlog <= somaxconn

    • If a SYN+ACK is lost, the server is responsible for retransmitting it: net.ipv4.tcp_synack_retries = 5

    • We can indirectly get the status by counting the number of sockets in the SYN_RECV state for a listening socket:

      • sudo netstat -antp | grep SYN_RECV | wc -l

      • ss -n state syn-recv sport :80 | wc -l

      • netstat -s | grep -i listen

        • 701 times the listen queue of a socket overflowed # accept queue overflow

        • 1246 SYNs to LISTEN sockets dropped # SYN queue overflow

      • SYN cookies can be used to alleviate the attack: net.ipv4.tcp_syncookies = 1

      • TCP Fast Open: sysctl net.ipv4.tcp_fastopen

        • In computer networking, TCP Fast Open (TFO) is an extension to speed up the opening of successive Transmission Control Protocol (TCP) connections between two endpoints. It works by using a TFO cookie (a TCP option), which is a cryptographic cookie stored on the client and set upon the initial connection with the server. When the client later reconnects, it sends the initial SYN packet along with the TFO cookie data to authenticate itself. If successful, the server may start sending data to the client even before the reception of the final ACK packet of the three-way handshake, thus skipping a round-trip delay and lowering the latency in the start of data transmission.

        • With TFO enabled, clients use sendto() instead of connect(); SYN packets carry data directly.
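    A minimal sketch of the accept queue in action, using standard Python sockets (the backlog value is illustrative): a completed handshake parks the connection in the accept queue until the application calls accept().

```python
import socket

# The backlog passed to listen() sizes the accept queue; on Linux the
# kernel caps it at net.core.somaxconn.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # ephemeral port
server.listen(2)                # small accept queue, for illustration
host, port = server.getsockname()

client = socket.create_connection((host, port))  # handshake completes here,
conn, _ = server.accept()       # ...then accept() dequeues the connection
client.sendall(b"ping")
received = conn.recv(4)
print(received)                 # b'ping'
for s in (conn, client, server):
    s.close()
```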

October

  • 30: Microservices in the Chronicle world

    • Microservices in the Chronicle world - Part 1

      • Microservices in the Chronicle world are designed around:

        • Simplicity - simple is fast, flexible and easier to maintain.

        • Transparency - you can’t control what you don’t understand.

        • Reproducibility - this must be in your design to ensure a quality solution.

      • An asynchronous method call is one which:

        • doesn't return anything

        • doesn't alter its arguments

        • doesn't throw any exceptions (although the underlying transport could)

    • Microservices in the Chronicle world - Part 2

      • In this part we look at turning a component into a service.

    • Microservices in the Chronicle World - Part 3

      • One of the problems with using microservices is performance. Latencies can be higher due to the cost of serialization, messaging and deserialization, and this reduces throughput.

      • JMH Benchmark on microservices

    • Microservices in the Chronicle world - Part 4

      • A common issue we cover in our workshops is how to restart a queue reader after a failure.

    • Microservices in the Chronicle World - Part 5

      • In this part we look at putting a micro service together as a collection of services, and consider how we can evaluate the performance of these services. We introduce JLBH (Java Latency Benchmark Harness) to test these services.

  • 29: Lock-free Algorithms: Introduction

    • With lock-free algorithms, a thread that can make forward progress is always one of the currently running threads, and thus it actually makes forward progress. With mutex-based algorithms there is also usually a thread that can make forward progress, however it may be a currently non-running thread, and thus no actual forward progress happens (at least until, for example, a page is loaded from disk and/or several context switches happen and/or some amount of active spinning happens).

    • For example, it's generally unsafe to use locks in signal handlers, because the lock can be currently acquired by the preempted thread, and it instantly leads to a deadlock.

      • The thread cannot proceed because the signal handler is executed in the context of the thread that was interrupted by the signal. If the signal handler tries to acquire a lock that the thread already holds, the signal handler will block, waiting for the lock to be released. However the thread itself cannot release the lock because it is effectively paused while the signal handler is running. This creates a deadlock situation where neither the thread nor the signal handler can make progress.

    • Lock-free Algorithms: First things first

      • First, if there is write sharing, the system degrades ungracefully: the more threads we add, the slower it becomes.

      • Second, if there is no write sharing, the system scales linearly. Yes, atomic RMW operations are slower than plain stores and loads, but they do scale linearly in themselves.

      • Third, loads are always scalable. Several threads are able to read a memory location simultaneously. Read-only accesses are your best friends in a concurrent environment.

    • Lock-free Algorithms: Your Arsenal

      • Compare-And-Swap

      • Fetch-And-Add

      • Exchange

      • Atomic loads and stores

      • Mutexes and the company

        • Why not? The most foolish thing one can do is try to implement everything in a non-blocking style (unless, of course, you are writing an infantile research paper, or betting money on it). Generally it's perfectly OK to use mutexes/condition variables/semaphores/etc. on cold paths. For example, during process or thread startup/shutdown, mutexes and condition variables are the way to go.
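    The signal-handler self-deadlock described earlier in this entry can be simulated without a real signal by letting the same thread try to re-acquire a non-reentrant lock (a timeout is added so the sketch terminates instead of hanging):

```python
import threading

# The handler runs in the context of the interrupted thread, so if that
# thread already holds the lock, the handler can never acquire it.
lock = threading.Lock()

def handler_body():
    # Stands in for the signal handler's critical section; without the
    # timeout this acquire would block forever.
    return lock.acquire(timeout=0.1)

lock.acquire()              # the "interrupted thread" holds the lock...
got_it = handler_body()     # ...so the "handler" cannot acquire it
lock.release()
print(got_it)  # False: the classic self-deadlock condition
```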

  • 8: How to Leverage Method Chaining to Add Smart Message Routing

    This article has shown how it is possible to use method chaining to route messages, but this is not the only use case for method chaining. The technique can also associate other types of metadata with business events, such as setting a message priority for a priority queue or recording access history. Dispatching events with associated metadata over an event-driven architecture (EDA) framework then allows custom lightweight microservices to read and act upon that metadata.
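    A hedged sketch of the chaining idea (class and method names are invented, not from the article): each method records metadata and returns the builder, so routing hints accumulate fluently before dispatch.

```python
class EventBuilder:
    def __init__(self, payload):
        self.payload = payload
        self.meta = {}

    def priority(self, level):
        self.meta["priority"] = level
        return self  # returning self is what enables chaining

    def route_to(self, destination):
        self.meta["route"] = destination
        return self

    def dispatch(self):
        # A real EDA framework would publish here; we just return the
        # event so a downstream service can act on the metadata.
        return {"payload": self.payload, **self.meta}

event = EventBuilder("order-created").priority(1).route_to("billing").dispatch()
print(event)  # {'payload': 'order-created', 'priority': 1, 'route': 'billing'}
```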

  • 7: tunnel using websockets using cranker-connector

    Another idea other than ssh -R <remote_port>:localhost:<local_port> user@remote_host

  • 5: TEXMAKER

    • Texmaker is a free, modern and cross-platform LaTeX editor for Linux, macOS and Windows systems that integrates many tools needed to develop documents with LaTeX, in just one application.

  • 4: Simple Binary Encoding

  • 3: Why are Aeron's log buffers divided into three sections?

    The main points are that it enables an algorithm which is wait-free for concurrent publication and supports retransmits on the network in the event of loss. "Aeron: Open-source high-performance messaging" by Martin Thompson

    • Composable Design

    • OSI layer 4 Transport for message oriented streams

      • Connection Oriented Communication

      • Reliability

      • Flow Control

      • Congestion Avoidance/Control

      • Multiplexing

        • avoid head of line blocking

    • Design Principles

      • Clear separation of concerns

      • Garbage free in steady state running

      • Lock-free, wait-free, and copy-free in data structures in the message path

      • Respect the Single Writer Principle

      • Major data structures are not shared

      • Don't burden the main path with exceptional cases

      • Non-blocking in the message path

    • Putting a Disruptor in front of the network is not necessary as there is Zero Copy from the application to the network.

    • How the skip list is used to build the messaging system from the point of view of contiguity of streaming data? TODO

  • 2: Function Pointer to Member Function in C++

    Dereferencing the member function pointer from the class for the current object/pointer.

    // A minimal MyClass so the example is self-contained
    struct MyClass {
        int add(int a, int b) { return a + b; }
    };
    MyClass obj;

    // Declare a pointer to the member function 'add' of MyClass
    int (MyClass::*ptrToMemberFunc)(int, int) = &MyClass::add;

    // Call the member function 'add' through the pointer on 'obj'
    int result = (obj.*ptrToMemberFunc)(20, 30);  // result == 50
  • 1: simple-binary-encoding design principles

    1. Copy-Free: The principle of copy-free is to not employ any intermediate buffers for the encoding or decoding of messages.

    2. Native Type Mapping: For example, a 64-bit integer can be encoded directly to the underlying buffer as a single x86_64 MOV assembly instruction.

    3. Allocate-Free: The SBE codecs are allocation-free by design, employing the flyweight pattern. The flyweight is a window over the underlying buffer for direct encoding and decoding of messages.

      • Flyweight Pattern in Java: Here the flyweight pattern is used to minimize memory usage or computational expenses by sharing as much as possible with similar objects, which is different from the SBE flyweight pattern.

    4. Streaming Access: It is possible to backtrack to a degree within messages but this is highly discouraged from a performance and latency perspective.

      • Memory Access Patterns Are Important

        • Basically three major bets are taken on memory access patterns:

          • Temporal: Memory accessed recently will likely be required again soon.

          • Spatial: Adjacent memory is likely to be required soon.

            • For an Intel processor these cache-lines are typically 64-bytes, that is 8 words on a 64-bit machine. This plays to the spatial bet that adjacent memory is likely to be required soon, which is typically the case if we think of arrays or fields of an object.

          • Striding: Memory access is likely to follow a predictable pattern.

            • Hardware will try to predict the next memory access our programs will make and speculatively load that memory into fill buffers. At its simplest level this is done by pre-loading adjacent cache-lines for the spatial bet, or by recognising regular stride-based access patterns, typically less than 2KB in stride length.

        • By moving to larger pages, a TLB cache can cover a larger address range for the same number of entries.

        • Cache-Oblivious Algorithms and Cache-oblivious algorithm wiki

          • The idea behind cache-oblivious algorithms is efficient usage of processor caches and reduction of memory bandwidth requirements.

          • Cache-oblivious algorithms work by recursively dividing a problem's dataset into smaller parts and then doing as much computation on each part as possible. Eventually a subproblem's dataset fits into the cache, and we can do a significant amount of computation on it without accessing memory.

        • When designing algorithms and data structures, it is now vitally important to consider cache-misses, probably even more so than counting steps in the algorithm.

        • The last decade has seen some fundamental changes in technology. For me the two most significant are the rise of multi-core, and now big-memory systems with 64-bit address spaces.

    5. Word Aligned Access: It is assumed the messages are encapsulated within a framing protocol on 8 byte boundaries. To achieve compact and efficient messages the fields should be sorted in order by type and descending size.

    6. Backward Compatibility: An extension mechanism is designed into SBE which allows for the introduction of new optional fields within a message that the new systems can use while the older systems ignore them until upgrade.

September

  • 21: Aerospace "double five zeroing" (双五归零)

    Find it, understand it, reproduce it, fix it, and finally wipe out everything of its kind.

  • 20: Agents & Idle Strategies

    A typical duty cycle will poll the doWork function of an agent until it returns zero. Once zero is returned, the idle strategy will be called.
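    The duty cycle described above can be sketched as follows (names are invented; a real agent framework would loop forever and use a back-off idle strategy):

```python
import time

class CountingAgent:
    """A toy agent whose do_work() drains a fixed amount of work."""
    def __init__(self, work_items):
        self.pending = work_items

    def do_work(self):
        if self.pending:
            self.pending -= 1
            return 1   # one item of work done
        return 0       # nothing to do -> caller should idle

def duty_cycle(agent, idle=lambda: time.sleep(0), max_idles=3):
    idles = 0
    while idles < max_idles:          # bounded here so the sketch ends
        if agent.do_work() == 0:
            idles += 1
            idle()                    # idle strategy kicks in on zero work
        else:
            idles = 0
    return agent.pending

remaining = duty_cycle(CountingAgent(5))
print(remaining)  # 0: all work drained before the loop went idle
```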

  • 19: The Problem with Threads

    Threads, as a model of computation, are wildly nondeterministic, and the job of the programmer becomes one of pruning that nondeterminism. Although many research techniques improve the model by offering more effective pruning, I argue that this is approaching the problem backwards. Rather than pruning nondeterminism, we should build from essentially deterministic, composable components. Nondeterminism should be explicitly and judiciously introduced where needed, rather than removed where not needed.

  • 18: Billions of Messages Per Minute Over TCP/IP

  • 17: The Unix Philosophy for Low Latency

    Much of Unix’s success can be attributed to the “Unix Philosophy” which can be very briefly summarised as:

    • Write programs that do one thing and do it well

    • Write programs to work together

    • Write programs to handle text streams, because that is a universal interface

  • 16: Base85 encoding

    • Like Base64, the goal of Base85 encoding is to encode binary data as printable ASCII characters. But it uses a larger set of characters, and so it can be a little more efficient. Specifically, it can encode 4 bytes (32 bits) in 5 characters.

    • Base 32 and base 64 encoding

      • There are around 100 possible characters on a keyboard, and 64 is the largest power of 2 less than 100, and so base 64 is the most dense encoding using common characters in a base that is a power of 2.

    • Base 58 encoding and Bitcoin addresses

      • Base58 is nearly as efficient as base64, but more concerned about confusing letters and numbers. The number 1, the lower case letter l, and the upper case letter I all look similar, so base58 retains the digit 1 and does not use the lower case letter l or the capital letter I.

      • It may take up to 35 characters to represent a Bitcoin address in base58. Using base64 would have taken up to 34 characters, so base58 pays a very small price for preventing a class of errors relative to base64.

    • How UTF-8 works

      • Since the first bit of ASCII characters is set to zero, bytes with the first bit set to 1 are unused and can be used specially.

      • Unicode initially wanted to use two bytes instead of one byte to represent characters, which would allow for 2^16 = 65,536 possibilities, enough to capture a lot of the world’s writing systems. But not all, and so Unicode expanded to four bytes.

      • Although a Unicode character is ostensibly a 32-bit number, it actually takes at most 21 bits to encode a Unicode character for reasons explained here. How many possible Unicode characters there are and why

      • UTF-8 lets you take an ordinary ASCII file and consider it a Unicode file encoded with UTF-8. So UTF-8 is as efficient as ASCII in terms of space. But not in terms of time. If software knows that a file is in fact ASCII, it can take each byte at face value, not having to check whether it is the first byte of a multibyte sequence.

      • And while plain ASCII is legal UTF-8, extended ASCII is not. So extended ASCII characters would now take two bytes where they used to take one.
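    The size claims above are easy to verify with Python's standard library:

```python
import base64

# Base85 packs 4 bytes into 5 ASCII characters (vs. 3-into-4 for Base64).
raw = b"\x00\x01\x02\x03"
print(len(base64.b85encode(raw)))   # 5 characters for 4 bytes
print(len(base64.b64encode(raw)))   # 8 characters (two 3-byte groups, padded)

# UTF-8: plain ASCII stays 1 byte; other code points grow.
print(len("A".encode("utf-8")))     # 1 byte, identical to ASCII
print(len("é".encode("utf-8")))     # 2 bytes: extended ASCII doubles
print(len("€".encode("utf-8")))     # 3 bytes
```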

  • 15: Liquidity Models

    • Check the informative Coding Example – Liquidity Model in Trading

  • 14: Creating Mappers Without Creating Underlying Objects in Java

    A HashMap with int keys and long values might, for each entry, create a wrapped Integer, a wrapped Long object, and a Node that holds the former values together with a hash value and a link to other potential Node objects sharing the same hash bucket. Perhaps even more tantalizing is that a wrapped Integer might be created each time the Map is queried! For example, using the Map::get operation.

  • 13: Java Memory Management

    • Phantom Reference: Used to schedule post-mortem cleanup actions, since we know for sure that objects are no longer alive. Used only with a reference queue, since the .get() method of such references will always return null. These types of references are considered preferable to finalizers.

    • -XX:+HeapDumpOnOutOfMemoryError

    • -verbose:gc

    • -Xms512m -Xmx1024m -Xss1m -Xmn256m

    • -Xlog:gc*:file=gc.log:time,uptime,level,tags -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime

  • 12: Java: Creating Terabyte Sized Queues with Low-Latency

    • The ConcurrentLinkedQueue will create a wrapping Node for each element added to the queue. This will effectively double the number of objects created.

    • Objects are placed on the Java heap, contributing to heap memory pressure and garbage collection problems. On my machine, this led to my entire JVM becoming unresponsive and the only way forward was to kill it forcibly using “kill -9”.

    • The queue cannot be read from other processes (i.e. other JVMs).

    • Once the JVM terminates, the content of the queue is lost. Hence, the queue is not durable.

    • A single MarketData instance can be reused over and over again because Chronicle Queue will flatten out the content of the current object onto the memory-mapped file, allowing object reuse.

  • 11: Java: How Object Reuse Can Reduce Latency and Improve Performance

    • Hence, contrary to many beliefs, creating a POJO, setting some values in one thread, and handing that POJO off to another thread will simply not work. The receiving thread might see no updates, might see partial updates (such as the lower four bits of a long being updated but not the upper ones), or all updates. To make things worse, the changes might be seen 100 nanoseconds later, one second later, or they might never be seen at all. There is simply no way to know.

      • One way to avoid the POJO problem is to declare primitive fields (such as int and long fields) volatile and use atomic variants for reference fields. Declaring an array as volatile means only the reference itself is volatile and does not provide volatile semantics to the elements.

      • Another way to reuse objects is by means of ThreadLocal variables which will provide distinct and time-invariant instances for each thread.

      • It should be noted that there are other ways to ensure memory consistency. For example, using the perhaps less known Java class Exchanger.

      • Yet another way is to use open-source Chronicle Queue which provides an efficient, thread-safe, object creation-free means of exchanging messages between threads.

    • jmap -histo 8536

    • As can be seen, Chronicle Queue spends most of its time accessing field values in the POJO to be written to the queue using Java reflection. Even though it is a good indicator that the intended action (i.e. copying values from a POJO to a Queue) appears somewhere near the top, there are ways to improve performance even more by providing hand-crafted methods for serialization substantially reducing execution time. (instead of SelfDescribingMarshallable)

  • 10: Chronicle JLBH

    Java Latency Benchmark Harness is a tool that allows you to benchmark your code running in context, rather than in a microbenchmark.

  • 9: Chronicle Wire: Object Marshalling

    • Chronicle Wire is able to find a middle ground between compacting data formatting (storing more data into the same space) versus compressing data (reducing the amount of storage required).

    • Typically, when we talk about a byte, a byte can represent one of 256 different characters. Yet, rather than being able to represent one of 256 characters, because we used Base64LongConverter we are saying that the 8-bit byte can only represent one of 64 characters. By limiting the number of characters that can be represented in a byte, we are able to compress more characters into a long.

    • Chronicle-Wire: Acts as a serialization library that abstracts over various wire formats (e.g., YAML, JSON, binary). It handles marshalling (serialization) and unmarshalling (deserialization) of Java objects into/from these formats, emphasizing performance, schema evolution, and cross-platform compatibility.

    • Chronicle-Bytes: Focuses on low-level memory management and byte manipulation. It provides wrappers around byte arrays, ByteBuffers, and off-heap memory, offering thread-safe operations, elastic resizing, and deterministic resource release. It is similar to Java NIO's ByteBuffer but with extended features.

    • Did You Know the Fastest Way of Serializing a Java Field Is Not Serializing It at All?

      • Many JVMs will sort primitive class fields in descending field size order and lay them out in succession. This has the advantage that read and write operations can be performed on even primitive type boundaries.

      • Well, as it turns out, it is possible to access an object’s field memory region directly via Unsafe and use memcpy to directly copy the fields in one single sweep to memory or to a memory-mapped file.

    • High-Performance Java Serialization to Different Formats

      • The encoding will affect the number of bytes used to store the data: the more compact the format, the fewer bytes used. Chronicle Wire balances the compactness of the format without going to the extreme of compressing the data, which would use valuable CPU time. Chronicle Wire aims to be flexible and backwards compatible, but also very performant.

      • Some encodings are more performant, for example by not encoding the field names in order to reduce the size of the encoded data; this can be achieved using Chronicle Wire's Field Less Binary format. However, this is a trade-off: sometimes it is better to sacrifice a little performance and include the field names, since that gives us both forwards and backwards compatibility.

  • 8: Chronicle-Map

    When deciding between on-heap and off-heap you are trading the extra memory required by the on-heap implementation against the extra latency of fetching an item from the map in the off-heap implementation. The general rule is to favour on-heap unless you have very large maps. Another consideration is that on-heap maps will update faster than off-heap maps, as there is no serialisation.

    • Java: ChronicleMap, Part 1: Go Off-Heap

      • jmap -histo 34366 | head to check the number of objects created.

      • -XX:NativeMemoryTracking=summary: we can retrieve the amount of off-heap memory being used by issuing the following command: jcmd 34413 VM.native_memory | grep Internal

      • Many Garbage Collection (GC) algorithms complete in a time that is proportional to the square of the number of objects that exist on the heap.

      • The mediator between heap and off-heap memory is often called a serializer.

        • Memory Layout of Objects in Java

          • For normal objects in Java, represented as instanceOop, the object header consists of mark and klass words plus possible alignment paddings. After the object header, there may be zero or more references to instance fields. So, that’s at least 16 bytes in 64-bit architectures because of 8 bytes of the mark, 4 bytes of klass, and another 4 bytes for padding.

          • For arrays, represented as arrayOop, the object header contains a 4-byte array length in addition to mark, klass, and paddings. Again, that would be at least 16 bytes because of 8 bytes of the mark, 4 bytes of klass, and another 4 bytes for the array length.

        • When you want to store a Java object (from the heap) into off-heap memory, the serializer's job is to convert that complex, structured object into a simple, flat sequence of bytes.

    • Java: ChronicleMap, Part 2: Super RAM Maps

      • Needless to say, you should make sure that the file you are mapping to is located on a file system with high random access performance. For example, a filesystem located on a local SSD.

  • 7: Improving Putty settings on Windows

    Make PuTTY more developer-friendly.

  • 6: log4j2: Garbage-free logging

    How to configure garbage-free logging with Log4j2.

  • 5: How to keep the trading system from being "crushed"?

    • The queue must be bounded, and that bound is your congestion window.

    • Maximum latency ≈ (per-request processing time / concurrency) × window size -> Little's Law: W = (1/λ) × L

    • The core of congestion control is a feedback loop: sense congestion, then adjust the window.

      • Window occupancy; TCP ECN

      • Per-request processing time; monitoring the P99 of single-transaction latency is just as important as monitoring queue depth.

      • Network routers drop packets under heavy load

    • With a congestion window and congestion signals, you can build a control algorithm. This is much the same idea as TCP's AIMD (additive increase, multiplicative decrease).

      • Reject outright at the gateway layer.

      • Detect matching-engine congestion at the gateway layer.

      • Let backpressure block naturally inside the service.
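    A worked example of the Little's Law bound above (all numbers are illustrative):

```python
# Little's Law: W = (1/λ) × L, with λ = concurrency / per-request time.
service_time_s = 50e-6        # assumed: 50 µs to process one request
concurrency = 1               # assumed: a single matching thread
window = 1000                 # max in-flight requests (the bounded queue)

throughput = concurrency / service_time_s   # λ = 20,000 requests/second
max_latency = window / throughput           # W = L / λ
print(max_latency)  # 0.05 s: a full window adds 50 ms of queueing delay
```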

  • 3: How Can AI ID a Cat? An Illustrated Guide.

    A neuron with two inputs has three parameters. Two of them, called weights, determine how much each input affects the output. The third parameter, called the bias, determines the neuron’s overall preference for putting out 0 or 1.
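    A minimal sketch of such a neuron; the weights, bias, and sigmoid squashing here are illustrative choices, not taken from the article:

```python
import math

def neuron(x1, x2, w1=2.0, w2=-1.0, bias=-0.5):
    # The two weights scale each input's influence; the bias shifts the
    # neuron's overall preference for putting out 0 or 1.
    z = w1 * x1 + w2 * x2 + bias
    return 1 / (1 + math.exp(-z))   # squash toward 0 or 1

print(neuron(1.0, 0.0))  # ≈ 0.82: the positive weight dominates
print(neuron(0.0, 1.0))  # ≈ 0.18: the negative weight dominates
```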

  • 2: PerfectScramble

    This searches all possible arrangements of a 3x3 Rubik's Cube to find a scramble that is very difficult to solve.

August

July

  • 6: Bloom Filters by Example

    Your false positive rate will be approximately (1 - e^(-kn/m))^k, so you can just plug in the number n of elements you expect to insert and try various values of k and m to configure your filter for your application. So, to choose the size of a bloom filter, we:

    • Check the value range of n.

    • Choose the number of bits m.

    • Calculate the optimal value of the number of hash functions k = (m/n)ln(2).

    • Calculate the error rate, if it's unacceptable, return to step 2 and try again.
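    The recipe above in code, using the stated formulas (the chosen n and bits-per-element are illustrative):

```python
import math

def optimal_k(m, n):
    """Step 3: k = (m/n) ln 2, rounded to a whole number of hashes."""
    return max(1, round((m / n) * math.log(2)))

def false_positive_rate(m, n, k):
    """Step 4: the approximate error rate (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

n = 1_000_000          # expected insertions
m = 10 * n             # step 2: try 10 bits per element
k = optimal_k(m, n)    # 7 hash functions
print(k, false_positive_rate(m, n, k))  # ≈ 0.008, i.e. under 1%
```

    If the rate were unacceptable, you would return to step 2 with a larger m and recompute.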

  • 4: Memory Consistency Models: A Tutorial

    One common ordering challenge is memory consistency, which is the problem of defining how parallel threads can observe their shared memory state.

    • Low-latency trading design: huge-page memory

      • L1 data cache: usually small, around 32 KB, with 64-byte cache lines.

      • L2 cache: larger, typically 256 KB to 1 MB; cache lines are also 64 bytes.

      • L3 cache (LLC): larger still, typically several MB to tens of MB, shared by all cores.

      • D_critical = L2_Size / N ≈ 128 bytes; when D exceeds 128 bytes, the N cache lines you access still total 8192 × 64 B = 512 KB, but their spread in memory exceeds the size of the L2 cache.

        • Although the useful data is only 512 KB, the total span of the array is 8 MB. The CPU's prefetcher cannot effectively predict such an extremely sparse access pattern.

      • When D = 256, the array size is exactly 8K (N) × 256 (D) × 4 B (int) = 8 MB, which is precisely the limit of the L2 TLB. Past this threshold, every memory access has to walk through multiple levels of page tables: like hunting for an address page by page in a phone book instead of glancing at a shorthand notebook.
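    The arithmetic in the stride example above checks out (units in bytes):

```python
N = 8192          # number of elements touched
line = 64         # cache line size in bytes

touched = N * line             # cache-line bytes actually pulled in
print(touched // 1024)         # 512 KB: an L2-sized working set

D = 256                        # stride, in 4-byte ints
span = N * D * 4               # total array footprint
print(span // 2**20)           # 8 MB: beyond the L2 TLB's reach
```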

  • 3: Include Guards and their Optimizations

    This article discusses the purpose and importance of include guards in C/C++ projects. It also explores the optimizations that compilers perform around include guards to improve build times, and how easy it is to unintentionally disable these optimizations!

  • 2: What is "stdafx.h" used for in Visual Studio?

    The trick consists of designating a special header file as the starting point of all compilation chains, the so called 'precompiled header' file, which is commonly a file named stdafx.h simply for historical reasons.

    Simply list all your big huge headers for your APIs in your stdafx.h file, in the appropriate order, and then start each of your CPP files at the very top with an #include "stdafx.h", before any meaningful content (just about the only thing allowed before is comments).

    Under those conditions, instead of starting from scratch, the compiler starts compiling from the already saved results of compiling everything in stdafx.h.

  • 1: Java with ANTLR

    ANTLR is a powerful parser generator that can be used to read, process, execute, or translate structured text or binary files. It is widely used for building languages, tools, and frameworks.

June

May

  • 30: Templating Maven Plugin

    The templating maven plugin handles copying files from a source to a given output directory, while filtering them. This plugin is useful to filter Java Source Code if you need for example to have things in that code replaced with some properties values.

  • 29: Beginner’s Guide To Bash getopts

    A beginner's guide to using getopts in bash scripts for parsing command-line options and arguments. Also How to Use Bash Getopts With Examples.

  • 24: Plain Vanilla

    An explainer for doing web development using only vanilla techniques. No tools, no frameworks — just HTML, CSS, and JavaScript. TODO

  • 23: A concise guide to Jane Street-style anti-jitter tuning

    How to avoid jitter (System Jitter and Where to Find It: A Whack-a-Mole Experience, and magic-trace on GitHub)

    • Round 1: get rid of the virtual machines!

    • Round 2: stop the "harassment" from interrupts

    • Round 3: isolate the CPUs

    • Round 4: quiet down the timer tick too

    • Round 5: turn off the CPU's "automatic transmission"

    • Round 6: give the CPU a "pause" hint

    • Round 7: throw money at it

  • 22: Gall's Law

    Gall's Law is often quoted: "A complex system that works has invariably evolved from a simple system that worked."

    But its corollary is rarely quoted: "A complex system designed from scratch never works; you must start from a simple system that runs."

    There's More To That Nugget of Wisdom

  • 16: How Core Git Developers Configure Git

    What git config settings should be defaults by now? Here are some settings that even the core developers change.

    Why is Git Autocorrect too fast for Formula One drivers?

    • It's based on a fairly simple, modified Levenshtein distance algorithm - which is basically a way to figure out how expensive it is to change one string into a second string given single-character edits, with some operations being more expensive than others.

    Experiment on your code freely with Git worktree

  • 15: The Unreasonable Effectiveness of an LLM Agent Loop with Tool Use

    With just that one very general purpose tool, the current models (we use Claude 3.7 Sonnet extensively) can nail many problems, some of them in "one shot."

  • 14: Ports that are blocked by browsers

    A list of the ports blocked by Firefox.

  • 13: Pick the right clock

    • Choosing which timer to use is very simple and depends on how long the thing is that you want to measure. If you measure something over a very small time period, the TSC will give you better accuracy. Conversely, it's pointless to use the TSC to measure a program that runs for hours. Unless you really need cycle accuracy, the system timer should be enough for a large proportion of cases. It's important to keep in mind that accessing the system timer usually has higher latency than accessing the TSC. Making a clock_gettime system call can easily be ten times slower than executing the RDTSC instruction, which takes 20+ CPU cycles. This may become important for minimizing measurement overhead, especially in a production environment. A performance comparison of different APIs for accessing timers on various platforms is available on the wiki page of the CppPerformanceBenchmarks repository. "Performance Analysis and Tuning on Modern CPUs"

    • Check /sys/devices/system/clocksource/clocksource0/current_clocksource to see whether tsc is the active clock source

    • the clock_gettime() function from <time.h> can use the TSC (Time Stamp Counter), but it depends on:

      • The clock source (e.g., CLOCK_MONOTONIC, CLOCK_REALTIME).

      • The underlying system configuration (VDSO acceleration, TSC stability).

  • 12: Templating Maven Plugin

    The templating maven plugin handles copying files from a source to a given output directory, filtering them along the way. This plugin is useful for filtering Java source code when you need, for example, to have placeholders in that code replaced with property values.

  • 10: Concatenating kdb Columns

    • Suppose in a query you need to concatenate two kdb columns into one; for example, to join date and time into one field - kdb has nifty features to do it easily.

  • 9: vTable And vPtr in C++ and Understanding Virtual Tables in C++

    • how to design C++ classes similar to interfaces in Java

      1. runtime polymorphism vs compile time generics / templates

      2. runtime polymorphism with virtual methods and always with override keyword

      3. pure virtual function

      4. a base contract class should always have a virtual destructor to prevent memory leaks

    • Whenever a class contains a virtual function, the compiler creates a Vtable for that class. Each object of the class is then provided with a hidden pointer to this table, known as Vptr.

    • It's important to note that vptr is created only if a class has or inherits a virtual function.

    • For non-virtual functions, the compiler knows which routine to execute during compilation. This process is known as static dispatch or early binding.

    • given that virtual functions can be redefined in subclasses, calls via pointers (or references) to a base type can not be dispatched at compile time. The compiler has to find the right function definition (i.e. the most specific one) at runtime. This process is called dynamic dispatch or late method binding.

    • Since derived classes are often handled via base class references, a non-virtual destructor will be dispatched statically, so the destructor of the derived class is never invoked.

  • 8: Latency percentiles are not additive

    Latency percentiles are simply not additive. Adding latency percentiles from multiple requests is indicative but not conclusive, and their sum is often too pessimistic and may trigger unnecessary overreaction.
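
    A tiny numeric illustration (the latencies are made up): when the tail spikes of two sequential stages hit different requests, the sum of the per-stage p90s is far more pessimistic than the true end-to-end p90.

```java
import java.util.Arrays;

public class PercentileDemo {
    // nearest-rank percentile: value at index ceil(p/100 * n) - 1 of the sorted data
    static long percentile(long[] xs, double p) {
        long[] s = xs.clone();
        Arrays.sort(s);
        int idx = (int) Math.ceil(p / 100.0 * s.length) - 1;
        return s[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        // the same 10 requests pass through two stages; each stage spikes on
        // DIFFERENT requests
        long[] stageA = {100, 100, 1, 1, 1, 1, 1, 1, 1, 1};
        long[] stageB = {1, 1, 1, 1, 1, 1, 1, 1, 100, 100};
        long[] total = new long[stageA.length];
        for (int i = 0; i < total.length; i++) total[i] = stageA[i] + stageB[i];

        // p90(A) + p90(B) = 100 + 100 = 200, but the real end-to-end p90 is 101
        System.out.println(percentile(stageA, 90) + percentile(stageB, 90)); // 200
        System.out.println(percentile(total, 90));                           // 101
    }
}
```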

  • 7: C++: C-Style arrays vs. std::array vs. std::vector and std::vector versus std::array in C++

    • std::array is a very thin wrapper around C-style arrays that live on the stack (to put it simply, it does not use operator new). Like arrays that go on the stack, its size must be known at compile time

    • You should use std::array when the array size is known at compile time, and std::vector when it is not, or when the array can grow.

  • 6: Beej's Guide to Network Programming

    A good site for all kinds of guides, including network programming.

  • 5: Solve a Hard Problem (Tinder). Chapter 8 of my upcoming book, The Cold Start Problem

    • What people are doing on their nights and weekends represents all the underutilized time and energy in the world that if put to good use, can become the basis of the hard side of an atomic network.

    • If there is no network in your product, add it by building an atomic network.

  • 4: A Candidate For the “Most Important const”

    The "const" is important. The first line is an error and the code won’t compile portably with this reference to non-const, because f() returns a temporary object (i.e., rvalue) and only lvalues can be bound to references to non-const.

April

  • 26: An Introduction to Epsilon GC: A No-Op Experimental Garbage Collector

    JEP 318 explains that “[Epsilon] … handles memory allocation but does not implement any actual memory reclamation mechanism. Once the available Java heap is exhausted, the JVM will shut down.”

  • 25: Proof Engineering: The Message Bus

    Every input into the system is assigned a globally unique monotonic sequence number and timestamp by a central component known as a sequencer. This sequenced stream of events is disseminated to all nodes/applications in the system, which only operate on these sequenced inputs, and never on any other external inputs that have not been sequenced. Any outputs from the applications must also first be sequenced before they can be consumed by other applications or the external world. Since all nodes in the distributed system are presented with the exact same sequence of events, it is relatively straightforward for them to arrive at the same logical state after each event, without incurring any overhead or issues related to inter-node communication.

  • 19: Finding Memory Leak through MAT

    The following 4-step approach proved to be most efficient to detect memory issues:

    1. Get an overview of the heap dump. See: Overview

    2. Find big memory chunks (single objects or groups of objects).

    3. Inspect the content of this memory chunk.

    4. If the content of the memory chunk is too big, check who keeps this memory chunk alive. This sequence of actions is automated in Memory Analyzer by the Leak Suspects Report.

  • 18: Suffering-oriented programming

    First make it possible. Then make it beautiful. Then make it fast.

  • 17: Proof Engineering: The Algorithmic Trading Platform

    • The best way to avoid GC is to not create garbage in the first place. This topic could fill a book, but the primary ways to do that are: (a) Do not create new objects in the critical path of processing. Create all the objects you’ll need upfront and cache them in object pools. (b) Do not use Java strings. Java strings are immutable objects that are a common source of garbage. We use pooled custom strings that are based on java.lang.StringBuilder (c) Do not use standard Java collections. More on this below (d) Careful about boxing/unboxing of primitive types, which can happen when using standard collections or during logging. (e) Consider using off-heap memory buffers where appropriate (we use some of the utilities available in chronicle-core).

    • Avoid standard Java collections. Most standard Java collections use a companion Entry or Node object, that is created and destroyed as items are added/removed. Also, every iteration through these collections creates a new Iterator object, which contributes to garbage. Lastly, when used with primitive data types (e.g. a map of long → Object), garbage will be produced with almost every operation due to boxing/unboxing. When possible, we use collections from agrona and fastutil (and rarely, guava).

    • Write deterministic code. We’ve alluded to determinism above, but it deserves elaboration, as this is key to making the system work. By deterministic code, we mean that the code should produce the exact same output each time it is presented with a given sequenced stream, down to even the timestamps. This is easier said than done, because it means that the code may not use constructs such as external threads, or timers, or even the local system clock. The very passage of time must be derived from timestamps seen on the sequenced stream. And it gets weirder from there — like, did you know that the iteration order of some collections (e.g. java.util.HashMap) is non-deterministic because it relies on the hashCode of the entry keys?!

    • but our changes enable us to integrate QuickFIX/J with the sequenced stream architecture in such a way that we no longer rely on disk logs for recovery (which is how most FIX sessions recover).

    • Our FIX spec is available in either the PDF format or the ATDL format (Algorithmic Trading Definition Language).
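
    The pooling idea in point (a) above can be sketched as follows. This is a minimal single-threaded sketch with hypothetical names (Order, OrderPool), not the article's actual code: all objects are created upfront, so the acquire/release cycle on the hot path allocates nothing once the pool is warm.

```java
import java.util.ArrayDeque;

public class ObjectPoolDemo {
    // A mutable, reusable message object; stands in for the kinds of objects
    // a trading system would pre-allocate.
    static final class Order {
        long price;
        long quantity;
        Order reset() { price = 0; quantity = 0; return this; }
    }

    // Minimal single-threaded pool: acquire from the free list, release back.
    static final class OrderPool {
        private final ArrayDeque<Order> free = new ArrayDeque<>();

        OrderPool(int size) {
            for (int i = 0; i < size; i++) free.push(new Order()); // upfront allocation
        }

        Order acquire() {
            Order o = free.poll();
            // fall back to allocation here; a real system might fail fast instead
            return (o != null) ? o.reset() : new Order();
        }

        void release(Order o) { free.push(o); }

        int available() { return free.size(); }
    }

    public static void main(String[] args) {
        OrderPool pool = new OrderPool(4);
        Order o = pool.acquire();
        o.price = 101;
        o.quantity = 5;
        pool.release(o);
        System.out.println(pool.available()); // back to 4
    }
}
```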

  • 13: The Escape of ArrayList.iterator()

    Escape Analysis works, at least for some trivial cases. It is not as powerful as we'd like it, and code that is not hot enough will not enjoy it, but for hot code it will happen. I'd be happier if the flags for tracking when it happens were not debug only.

  • 12: What is the meaning of SO_REUSEADDR (setsockopt option) - Linux?

    This socket option tells the kernel to go ahead and reuse the port even if it is busy in the TIME_WAIT state. If it is busy in any other state, you will still get an "address already in use" error. It is useful if your server has been shut down and then restarted right away while sockets are still active on its port.

  • 11: Single Writer Principle

    If a system is decomposed into components that keep their own relevant state model, without a central shared model, and all communication is achieved via message passing, then you naturally have a system without contention. Such a system obeys the single writer principle, provided the message passing sub-system is not implemented as queues. If you cannot move straight to a model like this, but are finding scalability issues related to contention, then start by asking the question: "How do I change this code to preserve the Single Writer Principle and thus avoid the contention?"

    LMAX - How to Do 100K TPS at Less than 1ms Latency: the head and the tail compete with each other quite often, since the queue is normally either full or empty, and when it's empty they tend to point at the same cache line.

    Why is a queue not a good data structure for low latency?

    • Contention & Locking Overhead: locks / cache coherence traffic

    • Memory Allocation & Garbage Collection (GC): LMAX avoids this by using pre-allocated, garbage-free data structures.

    • Pointer Chasing & Cache Misses: LMAX uses a pre-allocated ring buffer (Disruptor) that is cache-friendly (sequential memory access).

    • Batching & False Sharing: Queues often process items one at a time, missing opportunities for batching (which improves throughput). Little's law

  • 10: Double Buffer

    Efficient pattern for single writer and single reader case. To ensure thread-safety, ReadWriteLock / Semaphore could be used. Parallel C++: Double Buffering
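
    A minimal sketch of the pattern in Java (an assumed layout, not the article's code): the writer fills the back buffer without holding any lock, and the ReadWriteLock guards only the brief pointer swap, so readers always see a complete snapshot.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Double buffering: the single writer mutates the back buffer while readers
// see a stable front buffer; a quick swap publishes the new data.
public class DoubleBuffer {
    private long[] front = new long[4];
    private long[] back = new long[4];
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // single writer: fill the back buffer freely (unlocked), then swap
    public void publish(long[] values) {
        System.arraycopy(values, 0, back, 0, values.length);
        lock.writeLock().lock();
        try {
            long[] tmp = front;
            front = back;
            back = tmp;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // readers always observe a complete, consistent snapshot
    public long read(int i) {
        lock.readLock().lock();
        try {
            return front[i];
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        DoubleBuffer db = new DoubleBuffer();
        db.publish(new long[]{1, 2, 3, 4});
        System.out.println(db.read(0)); // 1
    }
}
```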

  • 9: PERFORMANCE NINJA CLASS

    Performance Ninja Class is a FREE self-paced online course for developers who want to master software performance tuning. easyperf is the author's amazing blog.

  • 8: The update-alternatives Command in Linux

    Linux systems allow easily switching between programs of similar functionality or goal. So we can set a given version of a utility program or development tool for all users. Moreover, the change applies not only to the program itself but to its configuration or documentation as well.

  • 7: Is a write to a volatile a memory-barrier in Java

    All writes that occur before a volatile store are visible to any other thread, provided that the other thread loads this new store. However, writes that occur before a volatile load may or may not be seen by other threads if they do not load the new value.

    In Java, the semantics of volatile are defined to ensure visibility and ordering of variables across threads.

    • A volatile write in Java means that a StoreStore barrier and a LoadStore barrier are inserted. This ensures that

      1. All previous writes (stores) are visible before the volatile write.

      2. The volatile write is visible before any subsequent writes (stores).

    • A volatile read in Java means that a LoadLoad barrier and a LoadStore barrier are inserted. This ensures that

      1. The volatile read is visible before any subsequent reads (loads).

      2. The volatile read is visible before any subsequent writes (stores).
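
    The visibility rules above enable the classic safe-publication idiom. A minimal sketch: the volatile store to ready makes the preceding plain store to data visible to any thread that subsequently observes ready == true via a volatile load.

```java
// Safe publication via a volatile flag: the Java Memory Model guarantees the
// reader observes data == 42 once it sees ready == true.
public class SafePublication {
    static int data;               // plain field
    static volatile boolean ready; // volatile flag

    public static int demo() {
        data = 0;
        ready = false;
        Thread writer = new Thread(() -> {
            data = 42;    // plain store
            ready = true; // volatile store: happens-before the reader's volatile load
        });
        writer.start();
        while (!ready) {
            Thread.onSpinWait(); // volatile load in the loop condition
        }
        int seen = data; // guaranteed to be 42
        try {
            writer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 42
    }
}
```

    Without the volatile modifier on ready, the reader could spin forever or observe a stale data value.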

  • 6: Linux Default Route

    Commands:

    • list routes: ip route or ip route list

    • show interfaces: ifconfig

    • add a route: ip route add 192.168.1.0/24 via 10.217.245.129 dev bond1

    • show gateways: route -n

    • check the interfaces assigned to a bonded interface: ip link show bond0 or cat /proc/net/bonding/bond0

    Linux setup default gateway with route command Route internet traffic through a specific interface in Linux Servers – CentOS / RHEL

  • 4: InheritableThreadLocal explained in detail

    InheritableThreadLocal provides exactly this capability: it lets a child thread inherit the ThreadLocal values already set in the parent thread.
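
    A minimal demonstration: the child thread sees the value the parent had set at the moment the child was created, which a plain ThreadLocal would not provide.

```java
// InheritableThreadLocal: the child thread's initial value is copied from the
// parent at Thread construction time.
public class InheritDemo {
    static final InheritableThreadLocal<String> CTX = new InheritableThreadLocal<>();

    public static String childSees() {
        CTX.set("parent-value");
        final String[] seen = new String[1];
        Thread child = new Thread(() -> seen[0] = CTX.get());
        child.start();
        try {
            child.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(childSees()); // parent-value
    }
}
```

    Note the copy happens once, at thread creation; later changes in the parent are not reflected in an already-running child.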

  • 3: Design of the Shutdown Hooks API

    Why are shutdown hooks run concurrently? Wouldn't it make more sense to run them in reverse order of registration?

    Invoking shutdown hooks in their reverse order of registration is certainly intuitive, and is in fact how the C runtime library's atexit procedure works. This technique really only makes sense, however, in a single-threaded system. In a multi-threaded system such as Java platform the order in which hooks are registered is in general undetermined and therefore implies nothing about which hooks ought to be run before which other hooks. Invoking hooks in any particular sequential order also increases the possibility of deadlocks. Note that if a particular subsystem needs to invoke shutdown actions in a particular order then it is free to synchronize them internally.

  • 2: XOR swap algorithm

    In computer programming, the exclusive or swap (sometimes shortened to XOR swap) is an algorithm that uses the exclusive or bitwise operation to swap the values of two variables without using the temporary variable which is normally required.
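
    A minimal sketch in Java (passing the values in and out, since Java has no pointers). The classic caveat applies: applied to aliased storage, x ^ x is 0, so real implementations must guard against swapping a variable with itself.

```java
public class XorSwap {
    // Returns {b, a}: the two values exchanged without a temporary variable.
    static int[] swap(int a, int b) {
        a = a ^ b;
        b = a ^ b; // now holds the original a
        a = a ^ b; // now holds the original b
        return new int[]{a, b};
    }

    public static void main(String[] args) {
        int[] r = swap(3, 7);
        System.out.println(r[0] + " " + r[1]); // 7 3
    }
}
```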

March

  • 31: Using Pausers in Event Loops

    • sleep requests of ~1ms and ~1us reduce CPU usage to ~1% and ~10% respectively compared with busy waiting (100%)

    • Here again, there is no single answer as to how the system will behave. The key is to bias the situation as much as possible to avoid the thread being switched from a core; the use of thread affinity (to avoid the thread being moved to another core) and CPU isolation (to avoid another process/thread contending with the thread) can be very effective in this case [1]. Careful use of affinity, isolation, and short sleep periods can result in responsive, low-jitter environments, which use considerably fewer CPU resources compared with busy waiting.

    • [1] Other options include running with real-time priorities; however, we want to keep the focus of this document on standard setups as much as possible.

    • Why the Cool Kids Use Event Loops Below are some of the key points to consider when choosing to use event Loops:

      1. Lock Free

      2. Testing and Evolving Requirements

      3. Shared Mutable State

      4. CPU Isolation and Thread Affinity

      5. Event Driven Architecture

    • Building Fast Trading Engines: Chronicle’s Approach to Low-Latency Trading

      • Challenges in Low-Latency Trading

        1. Threading and Core Utilisation

        2. Serialisation and Deserialisation

        3. Message Passing and Data Persistence

      • Addressing Low-Latency Trading Pain Points

        1. Thread Affinity and Event Loop Optimisation

        2. Efficient Message Passing

        3. Minimising Garbage Collection

        4. Performance Tuning for High-Throughput Trading

      • Real-World Example: A High-Performance Trading Engine in Action

        1. Accepting Market Data

        2. Making Trading Decisions

        3. Chronicle Queue Enterprise for Communication

        4. Keeping Latency Stable

  • 30: github useful scripts

    • show-busy-java-threads; how to find the thread that uses the most CPU

      1. Use top to find the Java process and the threads consuming the most CPU

        1. Enable thread display mode (top -H, or press H inside top)

        2. Sort by CPU usage (top already sorts by descending CPU usage by default, which is what we want; press P inside top to request that ordering explicitly)

        3. Note the Java process id and the ids of the high-CPU threads

      2. Inspect the stacks of the high-CPU threads:

        1. Run jstack on the problematic Java process id; see the jstack command explained

        2. Convert the thread id to hexadecimal by hand (e.g. printf %x 1234)

        3. Search the jstack output for the hexadecimal thread id (with vim's search /0x1234, or grep 0x1234 -A 20)

      3. Examine the corresponding thread stacks and analyze the problem; you will usually repeat the steps above a few times to pin it down

    • tcp-connection-state-counter

  • 29: How operating systems invented the interrupt mechanism, step by step

    When an interrupt occurs, the CPU uses the interrupt number as an index into the interrupt vector table to fetch the entry address of the corresponding interrupt handler. How operating systems invented processes and threads, step by step:

    1. To support this, a program must be able to pause and later resume execution, and to give programs that ability you must save the CPU context.

    2. Design a new abstraction that isolates running programs from one another, giving each one its own memory space. With segmented memory management, every segment of a running program gets its own memory region. You have now designed struct context and struct memory_map, and both clearly belong to some particular running program. A "running program" is a new concept, so you give it a name: process. The process context and the memory map can now both live in the process structure.

    Each thread is an independent unit of execution within a process. Threads:

    1. Share the process's address space, which means all threads can directly access the same memory regions

    2. Share open file descriptors, avoiding the overhead of repeatedly opening and closing files

    3. Share other system resources, such as signal handlers and the working directory

    4. Maintain only their own execution stack and register state, so each thread can execute independently

  • 28: Java Annotation Processing and Creating a Builder

    An important thing to note is the limitation of the annotation processing API — it can only be used to generate new files, not to change existing ones. If you use Maven to build this jar and try to put this file directly into the src/main/resources/META-INF/services directory, you’ll encounter the following error:

    [ERROR] Bad service configuration file, or exception thrown while 
    constructing Processor object: javax.annotation.processing.Processor: 
    Provider com.baeldung.annotation.processor.BuilderProcessor not found

    This is because the compiler tries to use this file during the source-processing stage of the module itself when the BuilderProcessor file is not yet compiled. The file has to be either put inside another resource directory and copied to the META-INF/services directory during the resource copying stage of the Maven build, or (even better) generated during the build. The Google auto-service library, discussed in the following section, allows generating this file using a simple annotation.

  • 27: Blocking Sockets

    This means that accept blocks the calling thread until a new connection is available from the OS, but the reverse is not true. The underlying OS will establish TCP connections for the application even if the program is not currently blocked at accept. In other words, accept asks the OS for the first ready-to-use connection, but the OS does not wait for the application to accept connections in order to establish new ones. It might establish many more.

  • 26: hatch

    Hatch is a modern, extensible Python project manager.

  • 24: Building a (T1D) Smartwatch from Scratch

    Learn how a hardware engineer works.

  • 23: Booleans Are a Trap

    Enum may be a better option.

  • 22: On inheritance and subtyping

    Explicit Inheritance vs Implicit Inheritance

  • 21: Server-Sent Events (SSE) Are Underrated

    LLM and content-type: text/event-stream

  • 19: toArray with pre sized array

    • Bottom line: toArray(new T[0]) seems faster, safer, and contractually cleaner, and therefore should be the default choice now.

  • 18: AOP in JDK、CGLIB

    JDK-based AOP leverages dynamic proxies and reflection, which brings a performance cost, while CGLIB uses ASM to generate a subclass of the original class at runtime and intercepts method calls there.
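
    The JDK-proxy half of the comparison can be sketched with only the standard library (the Greeter interface and logged helper below are hypothetical names). The proxy implements the interface and routes every call through invoke(), whose reflective method.invoke is where the overhead comes from; it also shows why JDK proxies require an interface:

```java
import java.lang.reflect.Proxy;

public class JdkProxyDemo {
    interface Greeter {
        String greet(String name);
    }

    static class GreeterImpl implements Greeter {
        public String greet(String name) { return "hello " + name; }
    }

    // Wrap a target in a logging proxy: the "advice" runs around the
    // reflective call to the real method.
    static Greeter logged(Greeter target, StringBuilder log) {
        return (Greeter) Proxy.newProxyInstance(
                Greeter.class.getClassLoader(),
                new Class<?>[]{Greeter.class},
                (proxy, method, args) -> {
                    log.append("before ").append(method.getName()).append("; ");
                    Object result = method.invoke(target, args); // reflective dispatch
                    log.append("after; ");
                    return result;
                });
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        Greeter g = logged(new GreeterImpl(), log);
        System.out.println(g.greet("world")); // hello world
        System.out.println(log);              // before greet; after;
    }
}
```

    CGLIB sidesteps the interface requirement by subclassing the target class itself, at the cost of pulling in bytecode generation.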

  • 17: A minimal CMake project template

    Learn how to use CMake properly, and note that CMake is a generator for build systems; it is not itself a build system.

  • 16: A Guide to CompletableFuture

    The key difference between CompletableFuture and Future is chaining: CompletableFuture lets you compose dependent asynchronous steps without blocking between them.
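
    A small sketch of what chaining looks like in practice; the only blocking call is the final join():

```java
import java.util.concurrent.CompletableFuture;

public class ChainDemo {
    public static String pipeline() {
        return CompletableFuture.supplyAsync(() -> "42")
                .thenApply(Integer::parseInt)           // transform the result
                .thenApply(n -> n * 2)                  // chain another transform
                .thenCombine(CompletableFuture.completedFuture(" answers"),
                             (n, suffix) -> n + suffix) // combine two futures
                .join();                                // block only at the very end
    }

    public static void main(String[] args) {
        System.out.println(pipeline()); // 84 answers
    }
}
```

    With a plain Future, each of these steps would require a blocking get() before the next could start.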

  • 15: Writing Compilers

February

  • 27: The concept behind C++ concepts

    Concepts are an extension for templates.

    • They can be used to perform compile-time validation of template arguments through boolean predicates.

    • They can also be used to perform function dispatch based on properties of types.

  • 24: A mental model for Linux file, hard and soft links

    • A mental model for understanding inodes, hard links, and soft links in Linux.

    • a soft link links a link file to a target file. This is in contrast to a hard link, which links a pathname to an inode.

    • The content of a soft link is the pathname of the target file it points to.

    • a hard link exists as a directory entry that links a pathname to an inode, while a soft link exists as a file that links its own pathname to another pathname.

    • symlinks, hardlinks and reflinks explained: Note a file can be held open by a process while all hardlinks are subsequently unlinked, leaving the data accessible until the file is closed. The main use for multiply hardlinked files is to create efficient backups.

  • 23: Gradle Tutorial

    • Running Gradle Builds

    • Authoring Gradle Builds

    • Optimizing Gradle Builds

    • Dependency Management

  • 22: Stackoverflow: toArray with pre sized array

    Bottom line: toArray(new T[0]) seems faster, safer, and contractually cleaner, and therefore should be the default choice now. Future VM optimizations may close this performance gap for toArray(new T[size]), rendering the current "believed to be optimal" usages on par with an actually optimal one. Further improvements in toArray APIs would follow the same logic as toArray(new T[0]): the collection itself should create the appropriate storage.
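
    The two forms side by side; both are contractually equivalent, but the empty-array form lets the collection allocate the correctly sized array itself:

```java
import java.util.Arrays;
import java.util.List;

public class ToArrayDemo {
    public static String[] viaEmpty(List<String> list) {
        return list.toArray(new String[0]);           // recommended default
    }

    public static String[] viaPresized(List<String> list) {
        return list.toArray(new String[list.size()]); // "believed to be optimal", measurably slower
    }

    public static void main(String[] args) {
        List<String> xs = Arrays.asList("a", "b", "c");
        System.out.println(Arrays.toString(viaEmpty(xs))); // [a, b, c]
        System.out.println(viaEmpty(xs).length);           // 3, sized exactly
    }
}
```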

  • 21: 1000x more efficient than printf! How to precisely catch C/C++ wild pointers

    In GDB you can add a watchpoint to observe a region of memory; when that memory is modified, the program stops, and at that point we know exactly which line of code modified it.

  • 20: Overview of cross-architecture portability problems

    This blog post provides an overview of common cross-architecture portability problems encountered in software development, particularly focusing on the challenges when targeting 32-bit systems. It discusses issues related to integer type sizes, address space limitations, large file support, the Y2K38 problem, byte order (endianness), and char signedness. While many of these issues are often discussed in the context of C programming, the author highlights that some, like address space limitations, can affect programs written in higher-level languages such as Python. The post emphasizes that achieving true cross-architecture portability requires careful consideration of these low-level details and can be challenging, especially when dealing with legacy or proprietary software.

  • 6: The Impact of 25% Tariffs on Canadian GDP

    Learn how to think like a master from DeepSeek.

  • 5: isd – interactive systemd

    isd (interactive systemd) – a better way to work with systemd units

  • 4: changedetection.io

    The best and simplest free open source web page change detection, website watcher, restock monitor and notification service.

  • 2: Writing Compilers

    "Writing a Compiler in Go"

  • 1: Guava Splitter vs StringUtils

    Still I was surprised by the result, and if you're splitting lots of Strings and performance is an issue, it might be worth considering switching back to Commons StringUtils.

January

Contacts

LinkedIn
