Misaligned memory access can lead to performance degradation due to additional CPU cycles required to access data. To improve performance, ensure that your data structures are properly aligned in memory. This can be achieved by reordering the members of a struct or using compiler-specific attributes to control alignment. Basically reorder the members of a struct to ensure that larger data types are aligned on their natural boundaries, and smaller data types are placed together to minimize padding. This can reduce the overall size of the struct and improve cache performance.
Avoid mispredicted branches by using arithmetic and bitwise operations instead of conditional statements. This can improve performance, especially in tight loops or performance-critical code because modern CPUs can execute instructions in parallel and avoid pipeline stalls caused by branch mispredictions.
and branchless binary search
3. Composition over Inheritance
Composition can often lead to better performance than inheritance because it allows for more flexible and efficient code. Inheritance can introduce additional overhead due to virtual function calls and the need to maintain a vtable, which can lead to slower execution. Composition, on the other hand, allows you to create complex objects by combining simpler ones without the overhead of inheritance. This can lead to faster execution and better memory usage, especially in cases where the behavior of the composed objects is not tightly coupled.
4. CRTP (Curiously Recurring Template Pattern)
CRTP is a C++ programming technique where a class template takes a derived class as a template parameter. This allows for static polymorphism, which can be more efficient than dynamic polymorphism (using virtual functions) because it avoids the overhead of virtual function calls. CRTP can be used to implement features like static interfaces, mixins, and compile-time polymorphism, which can lead to faster code execution and better optimization opportunities for the compiler.
5. Induction Variable
Induction variable is a variable that is incremented or decremented in a loop. It can be used to optimize loops by reducing the number of instructions executed in each iteration. For example, instead of calculating the value of i * 10 + 12 in each iteration of the loop, we can use an induction variable to keep track of the current value and update it in each iteration. This can lead to faster execution and better performance, especially in cases where the loop is executed many times.
6. Loop Unrolling
Loop unrolling is a technique where the body of a loop is replicated multiple times to reduce the overhead of loop control and increase instruction-level parallelism. By unrolling a loop, you can decrease the number of iterations and the number of branch instructions, which can lead to improved performance. However, it can also increase the code size, so it should be used judiciously.
7. Pointer Aliasing and Strict Alias
Pointer aliasing occurs when two or more pointers refer to the same memory location.
This is because, while it may be obvious to the user, at compile time, the compiler cannot guarantee that two pointer variables in the code will never point to the same memory address. Under those theoretical possible cases of pointer aliasing, some compiler optimizations would change the outcome of code; hence, those optimizations are disabled. For instance, the following code would prevent the compiler from applying loop-invariant code motion.
This can lead to performance degradation because the compiler may not be able to optimize code effectively due to potential side effects. To avoid pointer aliasing, you can use the restrict qualifier in C++ (available in C++11 and later) to indicate that a pointer is the only reference to a particular memory location. This allows the compiler to optimize code more aggressively, leading to improved performance.
An example of type-casting a pointer to be a different object is presented here. This example uses a conversion from double * into uint64_t * and modifies the sign bit using the uint64_t pointer.
Type-casting a pointer to an object to a pointer to a different object violates the strict aliasing rule stating that two pointers of different types cannot point to the same memory location, which really means that it is possible the compiler might not use the same register to store the two different pointers, even though they point to the same address. Remember that the CPU registers are the fastest form of storage available to the processor but are extremely limited in storage capacity. So, when an extra register gets used to store the same variable, it is an inefficient use of the registers and negatively impacts performance overall.
This is nothing more than a convoluted but more efficient method of achieving x = -std::abs(x), but it is to demonstrates how this violates the strict aliasing rule.
8. RVO (Return Value Optimization)
If a function returns an object by value, the compiler can optimize the code by eliminating the need to create a temporary object and directly constructing the return value in the caller's memory. This optimization is known as Return Value Optimization (RVO). By using RVO, you can avoid unnecessary copy or move operations, which can lead to improved performance, especially when returning large objects.
And if a local variable is returned, if
the type of the local object is the same as that returned by the function and
the local object is what’s being returned.
RVO will be applied here too. Check "Item 25: Use std::move on rvalue references, std::forward on universal references." for more details.
9. Strength Reduction
Strength reduction is a term used to describe compiler optimizations where complex operations that are quite expensive are replaced by instructions that are simpler and cheaper to improve performance. A classic example is one in which the compiler replaces operations involving division by some value with multiplication by the reciprocal of that value. Another example would be replacing multiplication by a loop index with an addition operation.
10. Tail Call Optimization
Tail call optimization refers to compiler optimization techniques in which recursive function calls are replaced by loops.
#include <cstdio>
#include <cstdint>
#include <cstdlib>
enum class Side : int16_t { BUY = 1, SELL = -1 };
int main() {
const auto fill_side = (rand() % 2 ? Side::BUY : Side::SELL);
const int fill_qty = 10;
printf("fill_side:%s fill_qty:%d.\n", (fill_side == Side::BUY ? "BUY" : (fill_side == Side::SELL ? "SELL" : "INVALID")), fill_qty);
{ // with branching
int last_buy_qty = 0, last_sell_qty = 0, position = 0;
if (fill_side == Side::BUY) { position += fill_qty; last_buy_qty = fill_qty;
} else if (fill_side == Side::SELL) { position -= fill_qty; last_sell_qty = fill_qty; }
printf("With branching - position:%d last-buy:%d last-sell:%d.\n", position, last_buy_qty, last_sell_qty);
}
{ // without branching
int last_qty[3] = {0, 0, 0}, position = 0;
auto sideToInt = [](Side side) noexcept { return static_cast<int16_t>(side); };
const auto int_fill_side = sideToInt(fill_side);
position += int_fill_side * fill_qty;
last_qty[int_fill_side + 1] = fill_qty;
printf("Without branching - position:%d last-buy:%d last-sell:%d.\n", position, last_qty[sideToInt(Side::BUY) + 1], last_qty[sideToInt(Side::SELL) + 1]);
}
}
template <classForwardIt, classT, classCompare>
ForwardIt branchless_lower_bound(ForwardIt first, ForwardIt last, const T& value, Compare comp)
{
auto length = last - first;
while (length > 0)
{
auto half = length / 2;
// 关键:用计算代替分支
// comp(...) 返回 0 或 1,结果要么是 first += 0,要么 first += (length-half)
first += comp(first[half], value) * (length - half);
length = half;
}
return first;
}
#include <cstdio>
#include <vector>
struct Order {
int id;
double price;
};
class InheritanceOrderBook : public std::vector<Order> {
};
class CompositionOrderBook {
std::vector<Order> orders_;
public:
auto size() const noexcept {
return orders_.size();
}
};
int main() {
InheritanceOrderBook i_book;
CompositionOrderBook c_book;
printf("InheritanceOrderBook::size():%lu CompositionOrderBook:%lu\n", i_book.size(), c_book.size());
}
#include <cstdio>
class RuntimeExample {
public:
virtual void placeOrder() {
printf("RuntimeExample::placeOrder()\n");
}
};
class SpecificRuntimeExample : public RuntimeExample {
public:
void placeOrder() override {
printf("SpecificRuntimeExample::placeOrder()\n");
}
};
template<typename actual_type>
class CRTPExample {
public:
void placeOrder() {
static_cast<actual_type *>(this)->actualPlaceOrder();
}
void actualPlaceOrder() {
printf("CRTPExample::actualPlaceOrder()\n");
}
};
class SpecificCRTPExample : public CRTPExample<SpecificCRTPExample> {
public:
void actualPlaceOrder() {
printf("SpecificCRTPExample::actualPlaceOrder()\n");
}
};
int main(int, char **) {
RuntimeExample *runtime_example = new SpecificRuntimeExample();
runtime_example->placeOrder();
CRTPExample <SpecificCRTPExample> crtp_example;
crtp_example.placeOrder();
return 0;
}
int main() {
[[maybe_unused]] int a[100];
// original
for(auto i = 0; i < 100; ++i)
a[i] = i * 10 + 12;
// optimized
int temp = 12;
for(auto i = 0; i < 100; ++i) {
a[i] = temp;
temp += 10;
}
}
int main() {
{ // original code.
int a[5];
a[0] = 0;
for(int i = 1; i < 5; ++i)
a[i] = a[i-1] + 1;
}
{ // loop unrolled code.
int a[5];
a[0] = 0;
a[1] = a[0] + 1;
a[2] = a[1] + 1;
a[3] = a[2] + 1;
a[4] = a[3] + 1;
}
}
#include <cstddef>
int main() {
const size_t size = 1024;
[[maybe_unused]] float x[size], a[size], b[size];
// no vectorization
for (size_t i = 0; i < size; ++i) {
x[i] = a[i] + b[i];
}
// vectorization
for (size_t i = 0; i < size; i += 4) {
x[i] = a[i] + b[i];
x[i + 1] = a[i + 1] + b[i + 1];
x[i + 2] = a[i + 2] + b[i + 2];
x[i + 3] = a[i + 3] + b[i + 3];
}
}
void func(int *a, int *b, int n) {
for (int i = 0; i < n; ++i) {
a[i] = *b;
}
}
void func_restrict(int *__restrict a, int *__restrict b, int n) {
for (int i = 0; i < n; ++i) {
a[i] = *b;
}
}
// then it is valid to apply loop invariant code motion like below
void func(int *__restrict a, int *__restrict b, int n) {
int tmp = *b; // load once
for (int i = 0; i < n; ++i) {
a[i] = tmp; // use the cached value
}
}
int main() {
int a[10], b;
func(a, &b, 10);
func_restrict(a, &b, 10);
}
#include <cstdio>
#include <cstdint>
int main() {
double x = 100;
const auto orig_x = x;
auto x_as_ui = (uint64_t *) (&x);
*x_as_ui |= 0x8000000000000000;
printf("orig_x:%0.2f x:%0.2f &x:%p &x_as_ui:%p\n", orig_x, x, &x, x_as_ui);
}
#include <iostream>
struct LargeClass {
int i;
char c;
double d;
};
auto rvoExample(int i, char c, double d) {
return LargeClass{i, c, d};
}
int main() {
LargeClass lc_obj = rvoExample(10, 'c', 3.14);
}
#include <cstdint>
int main() {
const auto price = 10.125; // prices are like: 10.125, 10.130, 10.135...
constexpr auto min_price_increment = 0.005;
[[maybe_unused]] int64_t int_price = 0;
// no strength reduction
int_price = price / min_price_increment;
// strength reduction, replace the division with multiplication by the reciprocal of min_price_increment, which is cheaper to compute.
constexpr auto inv_min_price_increment = 1 / min_price_increment;
int_price = price * inv_min_price_increment;
}
auto __attribute__ ((noinline)) factorial(unsigned n) -> unsigned {
return (n ? n * factorial(n - 1) : 1);
}
int main() {
[[maybe_unused]] volatile auto res = factorial(100);
}