1 Introduction
When it comes to memory usage, there are (typically) two types of programs. The first type allocates memory in large blocks. These programs usually know precisely how much memory they will need and can allocate it in advance. They typically hold their memory in a fixed-size monolith (e.g. arrays or vectors) and access it mostly linearly. Such programs might use dynamic memory, but when they do, they commonly call `malloc` only a few times during the program's lifetime. In other words, memory allocation and memory organization are not typically limiting factors for them.
The second type of program, which is more familiar to most of us, uses memory differently. For reasons such as scalability or security, it is not straightforward to calculate how many resources such a program will need. It might rely on data structures or message passing that require allocating a mass of small memory chunks (using `malloc`). Accordingly, the number of allocation/deallocation requests is much higher than in the former type.
Programs inevitably spend time on memory allocation. For performance-critical programs, however, this can become a bottleneck, so it is worth investigating whether allocation time is notably long.
Please note that there are two things to consider when talking about the performance of dynamic memory usage:

- Memory Allocation Speed: depends mostly on how good the implementations of `malloc` and `free` are.
- Memory Access Speed: depends on the hardware memory subsystems. Some system allocators are better at laying out memory in a way that benefits those subsystems.
This post focuses on allocation speed: it demonstrates the causes of memory allocation/deallocation sluggishness and suggests techniques to speed up critical places. Access speed will be the topic of a follow-up post.
2 Drawbacks of the `malloc` and `free` design
Because `malloc` and `free` are part of the C standard[^1], C/C++ compiler vendors implement them according to The C Standard Library guidelines. Undoubtedly, these functions meet all the requirements for general purposes. However, in performance-critical software they are usually the first thing individuals want to replace.
2.1 Memory fragmentation
In most cases, the allocator demands large blocks of memory from the OS and splits them into smaller chunks to serve the requests made by programs. There are many approaches to managing these chunks within a large block, and each algorithm differs in speed (how fast the allocation is) and memory usage.

Given the already-allocated green chunks, when the allocator needs to allocate a new chunk, it traverses the list from Free Blocks Start. When optimizing for speed, the allocator can return the first chunk of an appropriate size (the first-fit algorithm). When optimizing for memory consumption, it can return the chunk whose size most closely matches the request (the best-fit algorithm), as the toy sketch below illustrates.
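A toy illustration of the two selection policies; the `FreeChunk` free-list representation here is made up for the example:

```cpp
#include <cstddef>
#include <vector>

struct FreeChunk { std::size_t offset, size; };

// First-fit: return the index of the first chunk large enough.
int first_fit(const std::vector<FreeChunk>& free_list, std::size_t request) {
    for (std::size_t i = 0; i < free_list.size(); ++i)
        if (free_list[i].size >= request) return static_cast<int>(i);
    return -1;  // no suitable chunk
}

// Best-fit: return the index of the chunk whose size matches most closely.
int best_fit(const std::vector<FreeChunk>& free_list, std::size_t request) {
    int best = -1;
    for (std::size_t i = 0; i < free_list.size(); ++i)
        if (free_list[i].size >= request &&
            (best == -1 || free_list[i].size < free_list[best].size))
            best = static_cast<int>(i);
    return best;  // -1 if no suitable chunk
}
```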
However, the allocator still needs an algorithm, and time, to find an appropriate memory block of a given size. Moreover, finding such a block gets more challenging over time. The reason for this is memory fragmentation.
Consider the following scenario: the heap consists of 5 chunks. The program allocates all of them and, after a while, returns chunks 1, 3 and 4.

Memory Fragmentation Illustration. Each block is 16 bytes in size. Green blocks are marked as allocated, and red blocks are available.
In the above example, when the program requests a chunk of 32 bytes (2 consecutive 16-byte chunks), the allocator needs to traverse the list to find a block of that size, which is available at chunks 3 and 4. Suppose the program instead requests 48 bytes. In that case, the allocation fails because there is no run of contiguous free chunks of that size, even though 48 bytes are available in total (chunks 1, 3 and 4).
Therefore, as time passes, `malloc` and `free` get slower because blocks of the appropriate size are harder to find. Moreover, allocation may fail for large requests if a contiguous chunk of the requested size is unavailable (even though there is enough memory in total).
Memory fragmentation is a severe problem for long-running systems. It causes programs to become slower or run out of memory. For example, consider a TV box whose user changes the channel every 5 seconds for 48 hours. In the beginning, the video starts 1 second after changing the channel; after 48 hours, the same action takes 7 seconds. That is memory fragmentation at work.
2.2 Thread Synchronization
For multi-threaded programs (typical nowadays), `malloc` and `free` are required to be thread-safe[^2]. The simplest way for a C/C++ library vendor to make them thread-safe is to introduce mutexes protecting the critical sections of those functions. However, this comes at a cost: mutex locking/unlocking are expensive operations on multi-processor systems[^3]. Synchronization can cancel out the speed advantage, no matter how quickly the allocator can find a memory block of appropriate size.
Several allocators guarantee good performance in multi-threaded/multi-processor systems to resolve this issue. In general, the main idea is to make allocation synchronization-free when a memory block is accessed from only a single thread, for example by reserving per-thread memory or maintaining a per-thread cache of recently used memory chunks.
The above strategy works well for most use cases. Nonetheless, runtime performance can be awful in the worst case, when the program allocates memory in one thread and deallocates it in another. Furthermore, this strategy generally increases the program's memory usage.
3 Program slowness caused by allocators
In overview, there are three main causes, originating in the allocator, that slow down a program:

- Massive demand on the allocator: an enormous number of memory chunk requests will limit the program's performance.
- Memory fragmentation: the more fragmented memory is, the slower `malloc` and `free` are.
- Ineffective allocator implementation: the allocators of the C/C++ Standard Library are built for general purposes and are not ideal for performance-critical programs.
Resolving any of the above causes should make the program faster (in principle). These causes are not entirely independent of each other; for example, decreasing the pressure on the allocator also reduces memory fragmentation.
4 Optimization Strategies
According to Donald Knuth [5]:
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%”
One should profile and benchmark the program to find the root cause of slowness before optimizing. The following sections present effective techniques to reduce dynamic memory consumption and improve runtime performance. They are suggestions for fine-tuning an application; they are not techniques for rewriting the program to be more efficient.
4.1 Contiguous containers of pointers
In C++, polymorphism can be achieved by using vectors of pointers (e.g., `vector<T*>` or `vector<unique_ptr<T>>`). However, this solution puts massive pressure on the system allocator: creating or releasing an element (pointer) in the vector results in a call to `new` (`malloc`).
Large vectors of pointers lead to degraded speed as the data set grows, because of the demand pressure on the allocator and memory fragmentation.
An approach to this issue is using a pointer to a vector of objects (e.g., `vector<T>*` or `unique_ptr<vector<T>>`). This keeps all the elements in one contiguous block, improves data locality, and drastically decreases the number of calls to the system allocator, as the following sketch illustrates.
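A toy comparison; the element type `Point` and the loop counts are made up for the example:

```cpp
#include <memory>
#include <vector>

struct Point { double x, y; };

int main() {
    // One allocation per element: N calls to the system allocator,
    // and the Point objects may end up scattered across the heap.
    std::vector<std::unique_ptr<Point>> scattered;
    for (int i = 0; i < 1000; ++i)
        scattered.push_back(std::make_unique<Point>());

    // One contiguous buffer: the vector reallocates only a handful of
    // times as it grows, and all Point objects sit next to each other.
    std::vector<Point> contiguous;
    for (int i = 0; i < 1000; ++i)
        contiguous.push_back(Point{});
}
```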
However, this solution is restrictive since it only works for a single type. Fortunately, several C++ polymorphism libraries have been built to resolve this real-world problem. A fitting example is proxy from Microsoft: an open-source, cross-platform, single-header library that makes runtime polymorphism easier to implement and faster. Wang [6] has posted a detailed introduction on how to use the library.
4.2 Custom STL Allocator
STL data structures like trees (e.g. `set` and `map`), hash maps (e.g. `unordered_set` and `unordered_map`), and vectors of pointers make many requests for memory chunks from the allocator, increasing memory fragmentation and, as a consequence, decreasing performance.
Fortunately, STL data structures accept a user-specified `Allocator` as one of their template arguments. These structures call `allocate` to request memory from the `Allocator` and `deallocate` to release unneeded memory back to it. Users can specify a custom `Allocator` for their own needs by implementing these functions and satisfying the STL `Allocator` requirements[^4], as sketched below.
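For reference, a minimal sketch of the interface such a custom `Allocator` must provide; the forwarding to `operator new`/`operator delete` is just a placeholder for a real allocation strategy:

```cpp
#include <cstddef>
#include <new>

template <typename T>
struct MyAllocator {
    using value_type = T;

    MyAllocator() = default;
    template <typename U>
    MyAllocator(const MyAllocator<U>&) {}  // required for rebinding

    T* allocate(std::size_t n) {
        // A real allocator would carve n * sizeof(T) bytes out of its own pool.
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};

// Stateless allocators always compare equal.
template <typename T, typename U>
bool operator==(const MyAllocator<T>&, const MyAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const MyAllocator<T>&, const MyAllocator<U>&) { return false; }

// Usage:
//   std::map<int, int, std::less<int>,
//            MyAllocator<std::pair<const int, int>>> m;
```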
4.2.1 STL Allocators - Per-Type Allocator
According to cppreference (retrieved 2023-02-01), the C++20 Standard [7], [8] states:

"Every `Allocator` type is either stateful or stateless. Generally, a stateful allocator type can have unequal values which denote distinct memory resources, while a stateless allocator type denotes a single memory resource."
Suppose `std::allocator` and `CustomStatelessAllocator` are stateless allocators, while `CustomStatefulAllocator` is a stateful one. Consider the following instances and the allocator "group" (i.e. distinct memory resource) each one belongs to:
| instance | allocator | group |
|---|---|---|
| vector_a | `std::allocator<char>` (defaulted) | 1 |
| vector_b | `std::allocator<char>` | 1 |
| vector_c | `CustomStatelessAllocator<char>` | 2 |
| vector_d | `CustomStatefulAllocator<char>` | 3 |
| set_a | `std::allocator<char>` (defaulted) | 1 |
| set_b | `std::allocator<char>` | 1 |
| set_c | `CustomStatelessAllocator<char>` | 2 |
| set_d | `CustomStatefulAllocator<char>` | 3 |
| map_a | `std::allocator<std::pair<const char, bool>>` (defaulted) | 4 |
| map_b | `std::allocator<std::pair<const char, bool>>` | 4 |
| map_c | `CustomStatelessAllocator<std::pair<const char, bool>>` | 5 |
| map_d | `CustomStatefulAllocator<std::pair<const char, bool>>` | 6 |
From the groups in the table, it is evident that `std::allocator` is a per-type allocator: all the data structures of the same type share one instance of the allocator.
4.2.1.1 Design Overview
With a regular allocator, memory chunks belonging to different data structures may end up next to one another in memory. With a custom STL allocator, data from different domains is guaranteed to end up in separate memory blocks. This design leads to several improvements:
- Since data from the same data structure lives in the same memory block, memory fragmentation decreases while data locality and access speed increase:
  - The custom allocator gets its memory chunks back when the data structure is destroyed. Additionally, when all data structures of the same type are destroyed, it can return the free memory to the system.
  - Consecutive calls to `allocate` and `deallocate` are guaranteed to return neighboring chunks in memory.
- A simple implementation opens up speed opportunities:
  - The allocator returns only chunks of one constant size (or multiples of it), which makes it memory-fragmentation-free: any free element is a perfect fit for any request. Also, since the size is constant and known in advance, there is no need for per-chunk metadata recording the chunk's size.
  - There are no thread-safety requirements unless the data structure is allocated from several threads.
4.2.1.2 Example of customizing STL allocator
Arena Allocator[^5] is commonly used for data structures that change little after creation. Since deallocation is a no-op, destruction can be significantly fast.
This section provides a naive Arena Allocator implementation for `std::map`.
Consider the definition of C++20 `std::map` [8] (retrieved 2023-02-01 from cppreference):
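```cpp
template<
    class Key,
    class T,
    class Compare = std::less<Key>,
    class Allocator = std::allocator<std::pair<const Key, T>>
> class map;
```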
The fourth template parameter, `Allocator`, defaults to `std::allocator` (the fifth line of the declaration). One can replace the default allocator by writing a class with a few methods that can stand in for `std::allocator`.
Suppose a context uses `std::map` to look up a considerable number of elements and then destroys the map soon afterwards. One can provide a custom allocator that allocates from one dedicated block, as in the example below.
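A minimal sketch of such an allocator, assuming a shared arena obtained from the OS with `mmap`; the pool size is arbitrary, and alignment handling is omitted for brevity:

```cpp
#include <sys/mman.h>  // mmap, munmap
#include <cstddef>
#include <memory>
#include <new>

// Shared arena state: one big block obtained from the OS.
struct arena {
    static constexpr std::size_t kPoolBytes = 64 * 1024 * 1024;
    char* base;
    std::size_t offset = 0;

    arena() {
        // Reserve the block; RAM is committed only when pages are touched.
        void* p = mmap(nullptr, kPoolBytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) throw std::bad_alloc{};
        base = static_cast<char*>(p);
    }
    ~arena() { munmap(base, kPoolBytes); }  // whole block returned at once
};

template <typename T>
struct arena_allocator {
    using value_type = T;
    std::shared_ptr<arena> a_ = std::make_shared<arena>();

    arena_allocator() = default;
    template <typename U>
    arena_allocator(const arena_allocator<U>& o) : a_(o.a_) {}  // rebind

    T* allocate(std::size_t n) {
        // Bump-pointer allocation: hand out the first available chunk.
        // Naive: no alignment handling, no reuse of freed chunks.
        std::size_t bytes = n * sizeof(T);
        if (a_->offset + bytes > arena::kPoolBytes) throw std::bad_alloc{};
        T* p = reinterpret_cast<T*>(a_->base + a_->offset);
        a_->offset += bytes;
        return p;
    }
    void deallocate(T*, std::size_t) {}  // deliberately a no-op
};

template <typename T, typename U>
bool operator==(const arena_allocator<T>& x, const arena_allocator<U>& y) {
    return x.a_ == y.a_;
}
template <typename T, typename U>
bool operator!=(const arena_allocator<T>& x, const arena_allocator<U>& y) {
    return !(x == y);
}

// Usage:
//   std::map<int, int, std::less<int>,
//            arena_allocator<std::pair<const int, int>>> lookup;
```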
In the sketch, `mmap` requests a large memory block from the OS (the RAM is not actually consumed unless the program touches it). When the last `arena_allocator` sharing the arena is destroyed, the block is returned to the system. The `allocate` method hands out the block's first available chunk, and `deallocate` does nothing. Therefore, allocation is fast, and deallocation is a no-op.
This approach is ideal for the scenario of building a complete `std::map` at the beginning and then performing no removal operations. The tree (`std::map` is usually implemented as a red-black tree) will be compact in memory, which is good for the cache-hit rate and performance. It also causes no memory fragmentation since the whole block is separate from other memory blocks.
When elements are removed from the `std::map`, even though `std::map` calls `deallocate`, no memory is released, and the program's memory consumption goes up. If that is a problem, the `deallocate` method needs a real implementation, but allocation then also becomes more complex.
4.2.2 Per-Instance Allocator
A per-instance allocator is ideal when there are several large data structures of the same type. With it, the memory for a particular data structure instance is allocated from a dedicated block, instead of separating memory blocks per domain (one memory pool per type).
Take `std::map<student_id, student>` as an example, with two instances: one for undergraduate students and one for graduate students. With standard STL allocators, both maps share the same allocator. With a per-instance allocator, when an instance is destroyed, the whole memory block used for it becomes empty and can be returned directly to the OS. This reduces memory fragmentation and improves data locality (faster map traversal).
Unfortunately, STL data structures do not support per-instance allocators, so the data structure needs to be rewritten with per-instance-allocator support.
4.2.3 Tuning the custom allocator
A custom allocator can be adjusted to a specific environment for maximum performance. The following strategies are worth considering when choosing the right design for one's custom allocator:
- Static allocator: preloading the allocator with enough space to store a typical number of elements is a perfect fit for small data structures; if the context requires extra memory, the allocator can request it from the system allocator (see the sketch after this list). This approach can substantially decrease the number of calls to the system allocator.
- Arena allocator: releases memory only when the allocator is destroyed. It is helpful for large data structures that are primarily static after creation.
- Cache allocator: when deallocating, keeps chunks in a cache for future allocation requests instead of returning the memory to the system allocator. Despite not eliminating memory fragmentation, this approach can slow down its progression.
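For the first strategy, a minimal sketch of a static allocator; the inline capacity `N` is arbitrary, and the copy and equality requirements a production STL allocator must also satisfy are ignored here:

```cpp
#include <cstddef>
#include <new>

template <typename T, std::size_t N = 64>
struct static_allocator {
    using value_type = T;

    T* allocate(std::size_t n) {
        // Serve small requests from the preloaded inline buffer...
        if (!used_ && n <= N) {
            used_ = true;
            return reinterpret_cast<T*>(buffer_);
        }
        // ...and fall back to the system allocator otherwise.
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) {
        if (p == reinterpret_cast<T*>(buffer_)) { used_ = false; return; }
        ::operator delete(p);
    }

    alignas(T) unsigned char buffer_[N * sizeof(T)];
    bool used_ = false;
};
```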
4.3 Memory chunk caching for producer-consumer
Take an example where a producer thread allocates an object and sends it to a consumer thread. After processing, the consumer thread destroys the object and releases the memory back to the OS. This pattern puts significant pressure on the system allocator, and one approach is to allocate all the objects from a dedicated memory pool through a custom allocator.
In C++, one could overload `operator new`/`operator delete` to use the new allocator. However, synchronization must be kept in mind since the object's memory lifetime is controlled from separate threads. One idea is to use a memory cache on both the allocation and the deallocation side. Consider the following sketch:
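Below is a simplified reconstruction of such a `memory_pool` for fixed-size chunks; the capacities, the use of `std::vector` for the chunk lists, and `malloc`/`free` as the system-allocator fallback are illustrative choices, not the original code:

```cpp
#include <cstdlib>
#include <mutex>
#include <vector>

// A pool for fixed-size chunks, shared by exactly two threads:
// one that only calls allocate(), one that only calls deallocate().
class memory_pool {
public:
    explicit memory_pool(std::size_t chunk_size) : chunk_size_(chunk_size) {}

    void* allocate() {
        if (allocation_cache_.empty()) {
            // Refill from the shared cache; this is the only locked path.
            std::lock_guard<std::mutex> lock(mutex_);
            while (!common_cache_.empty() &&
                   allocation_cache_.size() < kCacheCapacity) {
                allocation_cache_.push_back(common_cache_.back());
                common_cache_.pop_back();
            }
        }
        if (allocation_cache_.empty())
            return std::malloc(chunk_size_);  // fall back to the system allocator
        void* chunk = allocation_cache_.back();
        allocation_cache_.pop_back();
        return chunk;
    }

    void deallocate(void* chunk) {
        if (deallocation_cache_.size() == kCacheCapacity) {
            // Flush to the shared cache; again the only locked path.
            std::lock_guard<std::mutex> lock(mutex_);
            while (!deallocation_cache_.empty() &&
                   common_cache_.size() < kCommonCapacity) {
                common_cache_.push_back(deallocation_cache_.back());
                deallocation_cache_.pop_back();
            }
            // Whatever did not fit goes back to the system allocator.
            for (void* leftover : deallocation_cache_) std::free(leftover);
            deallocation_cache_.clear();
        }
        deallocation_cache_.push_back(chunk);  // cached, not freed
    }

private:
    static constexpr std::size_t kCacheCapacity = 64;
    static constexpr std::size_t kCommonCapacity = 2 * kCacheCapacity;

    std::size_t chunk_size_;
    std::vector<void*> allocation_cache_;    // touched only by the allocating thread
    std::vector<void*> deallocation_cache_;  // touched only by the deallocating thread
    std::vector<void*> common_cache_;        // shared, mutex-protected
    std::mutex mutex_;
};
```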
Inside the custom allocator `memory_pool` are three lists of cached memory chunks: `allocation_cache_`, `common_cache_` and `deallocation_cache_`.
- When `deallocate` is called, the memory chunk is not released back to the system allocator; instead, it is cached in `deallocation_cache_`.
  - If `deallocation_cache_` is full, its chunks are moved to `common_cache_`. When `common_cache_` becomes full, the remaining chunks are released back to the system allocator; `deallocation_cache_` is empty after this operation.
  - Access to `common_cache_` has to be protected with a mutex.
  - Access to `deallocation_cache_` is the typical case: `memory_pool` only needs to touch `common_cache_` when `deallocation_cache_` is full.
- When `allocate` is called, a memory chunk is taken from `allocation_cache_` if one is available.
  - If `allocation_cache_` is empty, chunks are moved from `common_cache_` into `allocation_cache_` (as many as possible) until `allocation_cache_` is full.
  - If `allocation_cache_` is still not filled after this operation, additional memory chunks are requested from the system allocator.
  - Access to `common_cache_` has to be protected with a mutex.
  - Access to `allocation_cache_` is the regular case: `memory_pool` only needs to touch `common_cache_` when `allocation_cache_` is empty.
Note that this solution works only with exactly two threads: one that allocates and one that deallocates objects.
For optimal results, the sizes of `allocation_cache_`, `deallocation_cache_` and `common_cache_` need to be chosen carefully:

- Small values force the `memory_pool` to go to the system allocator more often than to the cached chunk lists.
- Large values cause the `memory_pool` to consume more memory.
- An ideal size for `common_cache_` is about twice the capacity of the other two.
4.4 Small Size Optimizations
Consider a class `small_vector` for storing integers, with the following (naive) implementation:
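The original listing is not preserved here; a plausible reconstruction for illustration:

```cpp
#include <cstddef>

class small_vector {
public:
    explicit small_vector(std::size_t capacity)
        : size_(0), capacity_(capacity),
          data_(new int[capacity]) {}  // every instance calls the system allocator
    ~small_vector() { delete[] data_; }

    // Element access and growth are omitted for brevity.

private:
    std::size_t size_;
    std::size_t capacity_;
    int* data_;
};
// 3 * 8 = 24 bytes on a typical 64-bit system.
```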
Suppose a profiling report shows that, in most runtime cases, these objects have small sizes of up to 4 elements. A solution is pre-allocating four integers inside the object itself, entirely removing calls to the system allocator for those cases:
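A sketch of that change, reconstructed for illustration; `kPreAllocatedSize` and `pre_allocated_storage_` follow the names used in the text:

```cpp
#include <cstddef>

class small_vector {
public:
    static constexpr std::size_t kPreAllocatedSize = 4;

    explicit small_vector(std::size_t capacity)
        : size_(0), capacity_(capacity) {
        if (capacity <= kPreAllocatedSize)
            data_ = pre_allocated_storage_;  // ordinary case: no allocator call
        else
            data_ = new int[capacity];       // rare case: system allocator
    }
    ~small_vector() {
        if (data_ != pre_allocated_storage_) delete[] data_;
    }

private:
    std::size_t size_;
    std::size_t capacity_;
    int* data_;
    int pre_allocated_storage_[kPreAllocatedSize];
};
// 8 + 8 + 8 + 16 = 40 bytes on a typical 64-bit system.
```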
The above code creates a `small_vector` backed by `pre_allocated_storage_` whenever the capacity is less than or equal to `kPreAllocatedSize`, covering the ordinary use cases. Otherwise, it calls the system allocator just like the original implementation.
However, the drawback is the increased class size: on a 64-bit system, the original class was 24 bytes, while the new one is 40 bytes.
A common solution is using a C union to overlay the data of the pre-allocated case and the heap-allocated case. The most significant bit of `size_` can be used to discriminate between the two cases (`0` for pre-allocated and `1` for heap-allocated).
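A sketch of the union-based layout; the exact bit encoding shown is one of several possibilities:

```cpp
#include <cstddef>
#include <cstdint>

class small_vector {
public:
    static constexpr std::size_t kPreAllocatedSize = 4;
    static constexpr std::uint64_t kHeapBit = 1ull << 63;  // MSB of size_

    explicit small_vector(std::size_t capacity) : size_(0) {
        if (capacity > kPreAllocatedSize) {
            heap_.data = new int[capacity];  // rare case: heap allocation
            heap_.capacity = capacity;
            size_ |= kHeapBit;               // mark as heap-allocated
        }
        // Otherwise the pre-allocated case: the MSB stays 0 and
        // pre_allocated_storage_ is used directly.
    }
    ~small_vector() {
        if (size_ & kHeapBit) delete[] heap_.data;
    }

    int* data() {
        return (size_ & kHeapBit) ? heap_.data : pre_allocated_storage_;
    }

private:
    std::uint64_t size_;  // element count; MSB discriminates the two cases
    union {
        int pre_allocated_storage_[kPreAllocatedSize];  // 16 bytes
        struct {
            int* data;             // 8 bytes
            std::size_t capacity;  // 8 bytes
        } heap_;
    };
};
// 8 (size_) + 16 (union) = 24 bytes on a typical 64-bit system.
```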
This approach is used in several places, for example in the libc++ `std::string` implementation[^6].
4.5 Fighting memory fragmentation
It is always challenging to reduce memory fragmentation or prevent it from happening at all; sometimes it is caused by the system itself. A few approaches can help in these situations:
Restart: some systems restart occasionally to avoid memory fragmentation. However, creating a restartable or state-restorable program or system can be challenging.
Pre-allocate memory upfront: some programs pre-allocate all the needed memory at the start and then completely dispense with dynamic memory allocation. In the automotive industry, the MISRA coding guidelines even forbid the usage of dynamic memory:
- MISRA C++ 2008 [11], Rule 18-4-1: Dynamic heap memory allocation shall not be used.
- MISRA C:2004 [12], Rule 20.4: Dynamic heap memory allocation shall not be used.
- MISRA C:2012 [13], Rule 21.3: The memory allocation and deallocation functions of <stdlib.h> shall not be used.
However, such guidelines are impossible to follow in industries where the needed memory is unknown when the program starts.
Changing the system allocator: several system allocators can be used instead of the built-in one, promising faster allocation speed, better memory usage, less fragmentation, or better data locality.
5 System Allocators
This section describes how to speed up a program by choosing a system allocator that better fits individual needs. Several open-source allocators try to achieve efficient allocation and deallocation; so far, none of them has taken the holy grail.
Typically, each allocator makes a compromise across four qualities:
- Allocation Speed: the speed of both `malloc` and `free` (`new` and `delete` in C++) is essential.
- Memory Consumption: the percentage of memory wasted after allocation. The allocator needs to keep some accounting information for each block, which takes up space. Additionally, an allocator that optimizes for allocation speed can leave some memory unused.
- Memory Fragmentation: some allocators suffer from fragmentation more than others, which can affect the speed of long-running applications.
- Cache/Data Locality: allocators that pack data into smaller blocks and avoid memory losses have better cache/data locality (to be discussed in a follow-up article).
5.1 Allocators on Linux
When no configuration is specified, Linux programs use the GNU Allocator, which is based on ptmalloc. Apart from it, several other open-source allocators are commonly used on Linux: tcmalloc (by Google), jemalloc (by Facebook), mimalloc (by Microsoft) and the hoard allocator.
The GNU Allocator is not among the most efficient allocators, but it has one advantage: its worst-case runtime and memory usage are acceptable.
Other allocators claim to be better in speed, memory usage, or cache/data locality. Before choosing a new system allocator, consider the following questions:
- Single-thread or multi-threaded?
- Maximum allocation speed or minimum memory consumption? What is the acceptable trade-off?
- Is the allocator for the whole program or only the most critical parts?
On Linux, after installing the allocator, one can use the `LD_PRELOAD` environment variable to replace the default allocator with a custom one:
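For example, with jemalloc (the library path varies by distribution and is shown here only as an illustration):

```sh
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so ./my_program
```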
5.2 The Performance Test
Real-world programs differ too much from one another for a comprehensive allocator benchmark to fit in this article's scope. Moreover, an allocator performing well under one load can behave differently under another.
“No Bugs” Hare [14] has compared several allocators on a test load that tries to simulate a real-world load. According to the author’s conclusion:
allocators are similar, and testing your particular application is more important than relying on synthetic benchmarks.
5.3 Notes on using allocators in the program
All the allocators can be fine-tuned to run better on a particular system, but the default configuration should be enough for most use cases. Fine-tuning can be done through environment variables, configuration files, or compilation options.
Usually, an allocator provides implementations of `malloc` and `free` and replaces the functions of the same name provided by the Standard C Library. This design means that all of the program's dynamic allocations go through the new allocator.
However, it is also possible to keep the default `malloc` and `free` implementations alongside the custom implementations provided by the chosen allocator. For this purpose, allocators can provide prefixed versions of `malloc` and `free` (e.g., jemalloc with the prefix `je_`).
In this case, `malloc` and `free` are left unchanged, and one can use `je_malloc` and `je_free` to route only some parts of the program through jemalloc.
6 Conclusion
This article presented several optimization strategies for faster memory allocation.
Ready-made allocators have the benefit of being relatively easy to set up, and one can see the improvements within minutes.
Other battle-tested techniques are also powerful when used correctly. For example, decreasing allocation requests by avoiding pointers removes much of the stress on the system allocator, and using custom allocators can improve allocation speed and decrease memory fragmentation.
Lastly, one needs to know one's domain and profile the program to find the root cause of slowness before optimizing, then try multiple approaches. Repeating this process is what truly improves the program's quality.
7 Appendix
8 References
[1] "Information technology - Programming languages - C," International Organization for Standardization; ISO/IEC 9899:2018, International Standard, Jun. 2018. Available: https://www.iso.org/standard/74528.html
[2] "Information technology - Programming languages - C," International Electrotechnical Commission; ISO/IEC 9899:2018, International Standard, Jun. 2018. Available: https://webstore.iec.ch/publication/63478
[3] Dummy00001, "How efficient is locking and unlocked mutex? What is the cost of a mutex?" stackoverflow, 2010 [Online]. Available: https://stackoverflow.com/a/3652428
[4] C. Wood, "How efficient is locking and unlocked mutex? What is the cost of a mutex?" stackoverflow, 2019 [Online]. Available: https://stackoverflow.com/a/49712993
[5] D. E. Knuth, "Structured programming with go to statements," ACM Comput. Surv., vol. 6, no. 4, pp. 261–301, Dec. 1974, doi: 10.1145/356635.356640.
[6] M. Wang, "Proxy: Runtime polymorphism made easier than ever." Microsoft, 2022 [Online]. Available: https://devblogs.microsoft.com/cppblog/proxy-runtime-polymorphism-made-easier-than-ever/
[7] "Programming languages - C++," International Organization for Standardization; ISO/IEC 14882:2020, International Standard, Dec. 2020. Available: https://www.iso.org/standard/79358.html
[8] "Programming languages - C++," International Electrotechnical Commission; ISO/IEC 14882:2020, International Standard, Dec. 2020. Available: https://webstore.iec.ch/publication/68285
[9] M. Tofte and J.-P. Talpin, "Region-based memory management," Information and Computation, vol. 132, no. 2, pp. 109–176, 1997, doi: 10.1006/inco.1996.2613.
[10] J. Laity, "libc++'s implementation of std::string." joellaity, blog, 2020 [Online]. Available: https://joellaity.com/2020/01/31/string.html
[11] Motor Industry Software Reliability Association, MISRA-C++:2008: Guidelines for the Use of the C++ Language in Critical Systems. MIRA Limited, 2008. Available: https://books.google.com.vn/books?id=bNUqPQAACAAJ
[12] Motor Industry Software Reliability Association, MISRA-C:2004: Guidelines for the Use of the C Language in Critical Systems. MIRA, 2004. Available: https://books.google.com.vn/books?id=j6oXAAAACAAJ
[13] Motor Industry Software Reliability Association and Motor Industry Software Reliability Association Staff and HORIBA MIRA Ltd and HORIBA MIRA Ltd. Staff, MISRA C:2012: Guidelines for the Use of the C Language in Critical Systems. Unknown Publisher, 2019. Available: https://books.google.com.vn/books?id=daApxQEACAAJ
[14] "No Bugs" Hare, "Testing Memory Allocators: ptmalloc2 vs tcmalloc vs hoard vs jemalloc While Trying to Simulate Real-World Loads." IT Hare on Soft.ware, blog, 2018 [Online]. Available: http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/
[^1]: At the time of writing, the C17 standard [1], [2] states the definitions of the `malloc` and `free` functions in:
    - 7.22.3.4 The malloc function (p. 254)
    - 7.22.3.3 The free function (p. 254)
[^2]: According to the C17 standard [1], [2]:

    "`free` is thread-safe: it behaves as though only accessing the memory locations visible through its argument, and not any static storage. A call to `free` that deallocates a region of memory synchronizes-with a call to any subsequent allocation function that allocates the same or a part of the same region of memory. This synchronization occurs after any access to the memory by the deallocating function and before any access to the memory by the allocation function. There is a single total order of all allocation and deallocation functions operating on each particular region of memory."

    "`malloc` is thread-safe: it behaves as though only accessing the memory locations visible through its argument, and not any static storage. A previous call to `free` or `realloc` that deallocates a region of memory synchronizes-with a call to `malloc` that allocates the same or a part of the same region of memory. This synchronization occurs after any access to the memory by the deallocating function and before any access to the memory by `malloc`. There is a single total order of all allocation and deallocation functions operating on each particular region of memory."

[^3]: Stack Overflow user "Dummy00001" has explained that the primary overhead of mutexes comes from the memory/cache coherency guarantees [3]. In the same thread, user "Carlo Wood" benchmarked the number of clock cycles it takes to lock/unlock a mutex on a decent multi-core machine, with results that agree with "Dummy00001"'s statement [4].
[^4]: A summary of the C++20 Standard `Allocator` requirements [7], [8] can be found at cppreference.

[^5]: Region-based memory management [9] is a type of memory management in which each allocated object is assigned to a region. The region (also known as the arena) is a collection of allocated objects that can be efficiently reallocated or deallocated all at once. Like stack allocation, regions facilitate the allocation and deallocation of memory with low overhead. However, they are more flexible, allowing objects to live longer than the stack frame in which they were allocated. In typical implementations, all objects in a region are allocated in a single contiguous range of memory addresses, similar to how stack frames are typically allocated.
[^6]: The implementation of libc++ `std::string` can be found in their repository. Unfortunately, the source code is hard to read and undocumented. Laity [10] has posted an overview of its design and optimisation.