Cache Awareness in DPDK Mempool

June 5, 2024

Author: Kamalakshitha Aligeri – Senior Software Engineer at Arm

The objective of DPDK is to accelerate packet processing by transferring packets from the NIC directly to the application, bypassing the kernel. The performance of DPDK relies on various factors such as memory access latency, I/O throughput and CPU performance.

Efficient packet processing relies on ensuring that packets are readily accessible in the hardware cache. Additionally, since the memory access latency of the cache is small, packet processing performance increases if more packets can fit into the hardware cache. Therefore, it is important to know how packet buffers are allocated in the hardware cache and how the cache can be utilized to get the maximum performance.

With the default buffer size in DPDK, the hardware cache is utilized to its full capacity, but it is not clear whether this is intentional. This blog therefore explains how the buffer size can impact performance and what to keep in mind when changing the default buffer size in DPDK in the future.

In this blog, I will describe:

1. Problem with contiguous buffers 

2. Allocation of buffers with cache awareness 

3. Cache awareness in DPDK mempool 

4. l3fwd performance results with and without cache awareness 

Problem with contiguous buffers 

The mempool in DPDK is created from a large chunk of contiguous memory. Packets from the network are stored in packet buffers of fixed size (objects in the mempool). The problem with contiguous buffers arises when the CPU accesses only a portion of each buffer, as in DPDK’s L3 forwarding application, where only the metadata and packet headers are accessed. The rest of the buffer is not brought into the cache, which results in inefficient cache utilization. To gain a better understanding of this problem, it is essential to understand how the buffers are allocated in the hardware cache.

How are buffers mapped in Hardware Cache? 

Consider a 1KB, 4-way set-associative cache with a 64-byte cache line size. The total number of cache lines is 1KB/64B = 16. In a 4-way cache, each set has 4 cache lines, so there are 16/4 = 4 sets in total.
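As a quick sanity check, this geometry can be reproduced with a few lines of C. The constants below describe only the 1KB, 4-way, 64-byte-line example cache used in this section, not any particular hardware:

    #include <stdio.h>

    int main(void)
    {
        /* Example cache from the text: 1KB, 4-way set-associative, 64B lines */
        unsigned cache_size = 1024;
        unsigned line_size  = 64;
        unsigned ways       = 4;

        unsigned num_lines = cache_size / line_size;  /* 1024 / 64 = 16 */
        unsigned num_sets  = num_lines / ways;        /* 16 / 4 = 4     */

        printf("cache lines: %u, sets: %u\n", num_lines, num_sets);
        return 0;
    }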

As shown in Figure 1, each memory address is divided into three parts: tag, set and offset.

• The offset bits specify the position of a byte within a cache line (since each cache line is 64 bytes, 6 bits are needed to select a byte in a single cache line).

• The set bits determine which set the cache line belongs to (2 bits are needed to identify one of the 4 sets).

• The tag bits uniquely identify the memory block. Once the set is identified with the set bits, the tag bits of the 4 ways in that set are compared against the tag bits of the memory address to check whether the address is already present in the cache.

Figure 1 Memory Address 
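The decomposition in Figure 1 can also be written out in code. Below is a minimal sketch for the example cache (6 offset bits for the 64-byte line, 2 set bits for the 4 sets, and the remaining bits forming the tag); the helper name decode() is just for illustration:

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 6   /* 64-byte cache line -> log2(64) = 6 bits */
    #define SET_BITS    2   /* 4 sets             -> log2(4)  = 2 bits */

    static void decode(uintptr_t addr)
    {
        uintptr_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uintptr_t set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1);
        uintptr_t tag    = addr >> (OFFSET_BITS + SET_BITS);

        printf("addr 0x%lx -> tag 0x%lx, set %lu, offset %lu\n",
               (unsigned long)addr, (unsigned long)tag,
               (unsigned long)set, (unsigned long)offset);
    }

    int main(void)
    {
        decode(0x0);    /* maps to set 0 */
        decode(0x40);   /* maps to set 1 */
        return 0;
    }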

In Figure 2, each square represents a cache line of 64 bytes. Each row represents a set. Since it is a 4-way cache, each set contains 4 cache lines – C0 to C3.

Figure 2 Hardware Cache 

Let’s consider a memory area that can be used to create a pool of buffers. Each buffer is 128 bytes and hence occupies 2 cache lines. Assuming the first buffer’s address starts at 0x0, the addresses of the buffers are as shown below.

Figure 3 Contiguous buffers in memory

In the above figure, the offset bits are highlighted in orange, the set bits in green and the tag bits in blue. Consider buffer 1’s address, where the set bits “00” mean the buffer maps to set 0. Assuming all the sets are initially empty, buffer 1 occupies the first cache line of 2 contiguous sets.

Since buffer 1’s address is 0x0 and the cache line size is 64 bytes, the first 64 bytes of the buffer occupy a cache line in set 0. For the next 64 bytes, the address becomes 0x40 (0b01000000), indicating set 1 because the set bits are “01”. As a result, the last 64 bytes of the buffer occupy a cache line in set 1. Thus, the buffer is mapped into cache lines (S0, C0) and (S1, C0).

Figure 4 Hardware cache with buffer 1 

Similarly, buffer 2 will occupy the first cache line of the next two sets, (S2, C0) and (S3, C0).

Figure 5 Hardware cache with 2 buffers 

The set bits “00” in buffer 3’s address show that buffer 3 maps to set 0 again. Since the first cache line of set 0 and set 1 is occupied, buffer 3 occupies the second cache line of sets 0 and 1, (S0, C1) and (S1, C1).

Figure 6 Hardware cache with 3 buffers 

Similarly, buffer 4 occupies the second cache line of sets 2 and 3, and so on. Each buffer is represented with a different color, and a total of 8 buffers can occupy the hardware cache without any evictions.

Figure 7 Allocation of buffers in hardware cache 

Although the buffer size is 128 bytes, the CPU might not access all of those bytes. For example, for 64-byte packets, only the first 64 bytes of the buffer are consumed by the CPU (i.e. one cache line worth of data).

Since the buffers are two cache lines long and contiguous, and only the first 64 bytes of each buffer are accessed, only sets 0 and 2 are populated with data. Sets 1 and 3 go unused (unused sets are shown with a pattern in Figure 8).

Figure 8 Unused sets in hardware cache 

When buffer 9 needs to be cached, it maps to set 0 since its set bits are “00”. Assuming an LRU replacement policy, the least recently used of the 4 cache lines in set 0 (holding buffers 1, 3, 5 and 7) will be evicted to accommodate buffer 9, even though sets 1 and 3 are empty.

This is highly inefficient, as we are not utilizing the cache capacity to the full.  
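The whole walkthrough above can be reproduced with a short simulation. The sketch below uses the same 1KB, 4-way, 64-byte-line example cache and 128-byte contiguous buffers, and computes which set the first (accessed) cache line of each buffer maps to: buffers 1 to 8 only ever land in sets 0 and 2, and buffer 9 maps back to set 0, triggering an eviction even though sets 1 and 3 hold no useful data.

    #include <stdio.h>

    #define LINE_SIZE 64    /* cache line size in bytes    */
    #define NUM_SETS  4     /* 1KB / 64B per line / 4 ways */
    #define BUF_SIZE  128   /* contiguous 128-byte buffers */

    int main(void)
    {
        /* Only the first 64 bytes of each buffer are accessed
         * (metadata and headers of a 64-byte packet). */
        for (unsigned i = 0; i < 9; i++) {
            unsigned long addr = (unsigned long)i * BUF_SIZE;
            unsigned set = (addr / LINE_SIZE) % NUM_SETS;
            printf("buffer %u: addr 0x%lx -> set %u\n", i + 1, addr, set);
        }
        return 0;
    }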

Solution – Allocation of buffers with Cache awareness 

In the above example, if the unused cache sets could be utilized to allocate the subsequent buffers (buffers 9 to 16), we would utilize the cache in a more efficient manner.

To accomplish this, the memory addresses of the buffers can be manipulated during the creation of the mempool. This can be achieved by inserting one cache line of padding after every 8 buffers, effectively aligning the buffer addresses in a way that utilizes the cache more efficiently. Let’s take the above example of contiguous buffer addresses and compare it with the same buffers but with cache line padding.

Figure 9 Without cache line padding

Figure 10 With cache line padding

From Figures 9 and 10, we can see that buffer 9’s address has changed from 0x400 to 0x440. With address 0x440, buffer 9 maps to set 1, so there is no need to evict any cache line from set 0, and the previously unused set 1 is now utilized.

Similarly, buffer 10 maps to set 3 instead of set 2, and so on. This way, buffers 9 to 16 can occupy sets 1 and 3, which are unused by buffers 1 to 8.

Figure 11 Hardware cache with cache awareness 

This approach distributes the buffers so that the hardware cache is utilized better. Since, for 64-byte packets, only the first cache line of each buffer contains useful data, the hardware cache now accommodates useful packet data from 16 buffers instead of 8. This doubles the cache utilization, enhancing the overall performance of the system.

Padding of cache lines is necessary primarily when the cache size is exactly divisible by the buffer size (which means the buffer size is a power of 2). In cases where the buffer size does not divide evenly into the cache size, the residual portion of each buffer effectively introduces an offset like the one achieved through padding.
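Continuing the same example, the padding idea can be sketched as follows: inserting one 64-byte cache line of padding after every 8 buffers shifts buffer 9 from 0x400 to 0x440 and buffer 10 from 0x480 to 0x4C0, so their accessed cache lines now fall into the previously unused sets 1 and 3.

    #include <stdio.h>

    #define LINE_SIZE    64    /* cache line size in bytes           */
    #define NUM_SETS     4     /* 1KB / 64B per line / 4 ways        */
    #define BUF_SIZE     128   /* buffer size in bytes               */
    #define BUFS_PER_PAD 8     /* pad one cache line every 8 buffers */

    int main(void)
    {
        for (unsigned i = 0; i < 16; i++) {
            unsigned long pad  = (unsigned long)(i / BUFS_PER_PAD) * LINE_SIZE;
            unsigned long addr = (unsigned long)i * BUF_SIZE + pad;
            unsigned set = (addr / LINE_SIZE) % NUM_SETS;
            printf("buffer %2u: addr 0x%lx -> set %u\n", i + 1, addr, set);
        }
        return 0;
    }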

Cache Awareness in DPDK Mempool 

In the DPDK mempool, each buffer typically has a size of 2368 bytes and consists of several distinct fields: header, object and trailer. Let’s look at each one of them.

Figure 13 Mempool buffer fields 

Header: This portion of the buffer contains metadata and control information needed by DPDK to manage the buffer efficiently. It includes information such as the buffer length and the buffer state or type, and it helps to iterate over mempool objects. The size of the object header is 64 bytes.

Object: This section contains the actual payload or data. Within the object section there are additional fields such as the mbuf, headroom and packet data. The mbuf of 128 bytes contains metadata such as the message type, the offset to the start of the packet data and pointers to additional mbuf structures. Then there is a headroom of 128 bytes. The packet data is 2048 bytes and contains the packet headers and payload.

Trailer: The object trailer is 0 bytes, but an 8-byte cookie is added in debug mode. This cookie acts as a marker to detect buffer corruption.
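Putting the field sizes above together gives the 2368-byte element size. The struct below is purely illustrative; it mirrors the sizes from the text and is not the actual DPDK structure definition:

    #include <stdio.h>

    /* Illustrative layout of one mempool element, using the sizes
     * listed above; not the real DPDK definitions. */
    struct element_layout {
        unsigned char header[64];      /* mempool object header           */
        unsigned char mbuf[128];       /* struct rte_mbuf metadata        */
        unsigned char headroom[128];   /* headroom before the packet data */
        unsigned char data[2048];      /* packet headers and payload      */
        /* trailer: 0 bytes (an 8-byte cookie exists only in debug mode)  */
    };

    int main(void)
    {
        /* 64 + 128 + 128 + 2048 = 2368 bytes, which is not a power of 2 */
        printf("element size: %zu bytes\n", sizeof(struct element_layout));
        return 0;
    }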

With a buffer size of 2368 bytes (not a power of 2), the buffers are inherently aligned with cache awareness, without the need for cache line padding. In other words, the buffer size is such that it optimizes cache utilization without additional padding.

The buffer size of 2368 bytes does not include the padding added to distribute buffers across  memory channels. 
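For reference, a packet mempool with these default sizes is typically created with rte_pktmbuf_pool_create(). The sketch below is a minimal example rather than the exact setup used for the measurements in this blog; the pool name and per-lcore cache size are illustrative, and the buffer count simply matches the 8192 buffers used in the experiment described next.

    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    /* With the default data room (RTE_MBUF_DEFAULT_BUF_SIZE = 128B headroom
     * + 2048B packet data), each mempool element works out to the 2368 bytes
     * described above. */
    static struct rte_mempool *
    create_pktmbuf_pool(void)
    {
        return rte_pktmbuf_pool_create(
            "pktmbuf_pool",              /* pool name (illustrative)            */
            8192,                        /* number of buffers                   */
            256,                         /* per-lcore mempool cache (software,
                                            not the hardware cache)             */
            0,                           /* private data size                   */
            RTE_MBUF_DEFAULT_BUF_SIZE,   /* headroom + packet data              */
            rte_socket_id());
    }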

To show how performance can vary with a buffer size that is a power of 2, I ran an experiment with a 2048-byte buffer size and compared it against the default buffer size of the DPDK mempool. In the experiment, 8192 buffers are allocated in the mempool and a histogram of cache sets for all the buffers is plotted. The histogram illustrates the number of buffers allocated in each cache set.
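A simplified version of this histogram can be computed from the buffer start addresses alone. The sketch below assumes the 64KB, 4-way set-associative cache mentioned in the l3fwd section, with 64-byte cache lines (an assumption), giving 256 sets; it counts only the first cache line of each of the 8192 buffers and compares the two element sizes. A 2048-byte stride hits only every 32nd set, while a 2368-byte stride cycles through all 256 sets.

    #include <stdio.h>

    #define LINE_SIZE 64     /* cache line size in bytes     */
    #define NUM_SETS  256    /* 64KB / 64B per line / 4 ways */
    #define NUM_BUFS  8192   /* buffers in the mempool       */

    static void histogram(unsigned long buf_size)
    {
        unsigned hist[NUM_SETS] = {0};
        unsigned used = 0;

        /* Tally the set hit by the first cache line of every buffer. */
        for (unsigned i = 0; i < NUM_BUFS; i++) {
            unsigned long addr = (unsigned long)i * buf_size;
            hist[(addr / LINE_SIZE) % NUM_SETS]++;
        }
        for (unsigned s = 0; s < NUM_SETS; s++)
            if (hist[s] != 0)
                used++;

        printf("buffer size %lu: %u of %u sets used\n", buf_size, used, NUM_SETS);
    }

    int main(void)
    {
        histogram(2048);   /* power of 2: only 8 of 256 sets are ever hit */
        histogram(2368);   /* DPDK default element size: all 256 sets hit */
        return 0;
    }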

Figure 14 Histogram of buffers – 2048 bytes 

With a buffer size of 2048 bytes, the same sets in the hardware cache are hit repeatedly, whereas other sets are not utilized (we can see this from the gaps in the histogram).

Figure 15 Histogram of buffers – 2368 bytes

With a buffer size of 2368 bytes, each set is being accessed only around 400 times. There are no  gaps in the above histogram, indicating that the cache is being utilized efficiently. 

DPDK l3fwd Performance 

The improved cache utilization observed in the histogram, attributed to cache awareness, is further corroborated by the throughput numbers of the l3fwd application. The application is run on a system with a 64KB, 4-way set-associative cache.

The chart below shows the throughput in MPPS for a single-core l3fwd test with 2048-byte and 2368-byte buffer sizes.

Figure 16 l3fwd throughput comparison

There is a 17% performance increase with the 2368-byte buffer size.

Conclusion 

Contiguous buffer allocation in memory with cache awareness enhances performance by minimizing cache evictions and maximizing hardware cache utilization. In scenarios where the cache size is exactly divisible by the buffer size (e.g. a 2048-byte buffer), padding cache lines creates an offset in the memory addresses and a better distribution of buffers in the cache. This led to a 17% increase in performance for the DPDK l3fwd application.

However, with buffer sizes that do not divide evenly into the cache size, as is the case with the DPDK default, the offset in the buffer addresses already provides the effect of cache line padding, resulting in improved performance.

For more information, visit the DPDK Programmer's Guide.