Introduction

In a world where data centers consume nearly 1% of global electricity, optimizing the energy consumption of computer systems has become a critical priority. In 2022, the electricity consumption of data centers reached 460 TWh and could exceed 620 TWh by 2026 (source).

One low-hanging fruit is software improvement: can we reduce the carbon footprint of our code?

This article will guide you through various methods of measuring the energy consumption of your code, enabling you to better understand and optimize your programs.

Objectives

  • Introduce energy consumption measurement: Understand the impact of your programs on overall energy consumption.
  • Present tools and methods to analyze consumption: Learn to use measurement tools to identify areas for improvement.
  • Compare energy consumption between optimized and non-optimized program versions: Visualize the tangible energy savings through optimizations.

Building the Test Environment

Let’s start by setting up a basic example that will help us explore the different tools at our disposal. We want a simple program in which improvements are easy to observe, so we will write a simple array summation and then implement a SIMD version of it to make it more efficient. This example is heavily inspired by a blog post by Prof. Lemire.

Attention: In all our tests we disabled compiler optimisations. The curious reader can reproduce the experiments and study the effect of enabling them.

Here is the non-vectorized sum program that takes an array and its size as parameters, sums all the elements, and returns the total.

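#include <stddef.h>

// noinline keeps the call visible to profilers; no-tree-vectorize (GCC) stops
// the compiler from auto-vectorizing the loop.
__attribute__((noinline)) __attribute__((optimize("no-tree-vectorize"))) float sum(float *data, size_t N) {
    float counter = 0;

    for (size_t i = 0; i < N; i++) {
        counter += data[i];
    }

    return counter;
}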

And its SIMD version:

#include <stddef.h>
#include <immintrin.h>  // _mm_hadd_ps needs SSE3 (for example, compile with -msse3)

__attribute__((noinline)) float sumvec(float *data, size_t N){
    float sum = 0.f;
    __m128 counter = _mm_setzero_ps();

    size_t i = 0;
    // Process four floats per iteration; stop before reading past the end.
    for (; i + 4 <= N; i += 4) {
        __m128 vec = _mm_loadu_ps(&data[i]);
        counter = _mm_add_ps(counter, vec);
    }

    // Two horizontal adds fold the four lanes into lane 0.
    counter = _mm_hadd_ps(counter, _mm_setzero_ps());
    counter = _mm_hadd_ps(counter, _mm_setzero_ps());

    sum = _mm_cvtss_f32(counter);

    // Scalar tail for the last N % 4 elements.
    for (; i < N; ++i) {
        sum += data[i];
    }
    return sum;
}

The test arrays are always 16 MB in size, a rather arbitrary choice.
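
To make the setup concrete, here is a minimal sketch of how such a test array could be allocated and summed. It assumes that "16 MB" means 16 * 1024 * 1024 bytes of float data, and the names make_test_array and sum_scalar, as well as the constant fill value, are hypothetical choices for illustration rather than the code used for the measurements.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumption: 16 MB means 16 * 1024 * 1024 bytes of float data. */
#define ARRAY_BYTES (16u * 1024u * 1024u)
#define ARRAY_LEN   (ARRAY_BYTES / sizeof(float))

/* Allocate n floats and fill them with a constant.
   The fill value is an assumption; the article does not state the contents. */
static float *make_test_array(size_t n, float value) {
    float *data = malloc(n * sizeof *data);
    if (data == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        data[i] = value;
    }
    return data;
}

/* Plain scalar reference sum, same logic as the non-vectorized version. */
static float sum_scalar(const float *data, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Filling the array with 1.0f keeps every partial sum exactly representable in a float at this size, which makes it easy to check that a vectorized variant returns the same total as the scalar one.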

Measuring Terminology

When performing consumption analyses, several terms often recur, and it is important to understand them before getting started.

  • RAPL: Running Average Power Limit, is a technology developed by Intel and integrated into some of their processors, allowing real-time monitoring and regulation of CPU energy consumption.
  • Package Zone: Represents the entire CPU package, i.e., all computing cores of a physical processor.
  • Core Zone: Represents individual computing units inside the CPU.
  • Uncore Zone: Represents the components of the processor that are not computing cores, such as the cache and memory controller.
  • GPU Zone: Sometimes certain tools allow us to measure the consumption of the graphics processor.
  • DRAM Zone: Represents the memory of the system, i.e., the RAM.
  • System (pSys) or Platform Zone: Represents the total energy consumption of the system, including all the components mentioned above.

Setting Up the Analyses

To ensure a stable test environment, we have minimized frequency variations of the CPU. We used the cpupower tool, specifically setting the governor mode to performance. This ensures the CPU frequency remains at its maximum, avoiding any unexpected power-saving measures.

The command to achieve this:

sudo cpupower frequency-set --governor performance

Additionally, to simplify all analyses, every command in this document is run through taskset -c 2 ./{my_app}, which pins the program to a single CPU, core 2, and prevents the workload from being spread across several cores. Note that the plain form taskset 2 is read as the bitmask 0x2, which selects CPU 1. We avoid CPUs 0 and 1 because the OS keeps several tasks running continuously on them.

Initial Analysis and Understanding

We will use perf for this initial analysis. Perf is a performance profiling and tracing tool integrated into the Linux kernel, allowing users to collect and analyze performance data to identify bottlenecks and optimize system operation. It also allows us to analyze the energy consumption of our programs easily.

For our two test functions, we have the following measurements:

Version          Cores [J]   GPU [J]   Package [J]   DRAM [J]   PSYS [J]
Non vectorized   1.51        0.01      1.68          0.11       2.87
Vectorized       1.21        0.01      1.33          0.07       2.14

Command used: sudo perf stat -e power/energy-cores/ -e power/energy-gpu/ -e power/energy-pkg/ -e power/energy-ram/ -e power/energy-psys/ {app}

Perf shows that the vectorized version consumed less overall energy (2.14 J) than the non-vectorized version (2.87 J). The sum of the specific measurements differs from the overall consumption by 22% for the vectorized code and by 15% for the non-vectorized code. Compiling with -O3 gives the same overall consumption for the vectorized version, with the same gap to the specific measurements. For the non-vectorized version, also removing the no-tree-vectorize attribute gives the same results.

Another tool, likwid, and its extension Likwid-powermeter, allows us to make similar measurements. This tool can measure the package, DRAM, and pSys zones, and adds two other zones: PP0 and PP1 (Power plane 0/1). PP0 corresponds to our previously seen cores zone, measuring only CPU core consumption, while PP1 corresponds to the uncore zone.

Version          PP0 [J]   PP1 [J]   Package [J]   DRAM [J]   PSYS [J]
Non vectorized   1.52      0.01      1.67          0.08       2.70
Vectorized       1.19      0.01      1.32          0.07       2.16

Command used: sudo likwid-powermeter {app}

The measurements from likwid are very similar to those obtained with perf. The overall consumption for the vectorized code was 2.16 J, compared to 2.70 J for the non-vectorized code. The differences between the sum of specific measurements and overall consumption were 19% for the vectorized code and 21% for the non-vectorized code.

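Our simple analysis shows that the vectorized program saves between 20% and 25% of the overall (PSYS) energy, depending on the measurement tool. Put the other way around, the non-vectorized program consumes between 25% and 34% more.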
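
Percentages like these depend on which run is taken as the reference, so it helps to spell out the arithmetic. The sketch below uses two hypothetical helper names, relative_saving and relative_overhead, and is only meant to illustrate the calculation on the PSYS values from the tables.

```c
/* Fraction of energy saved by the optimized run, relative to the baseline. */
static double relative_saving(double baseline_j, double optimized_j) {
    return (baseline_j - optimized_j) / baseline_j;
}

/* Fraction of extra energy used by the baseline, relative to the optimized run. */
static double relative_overhead(double baseline_j, double optimized_j) {
    return (baseline_j - optimized_j) / optimized_j;
}
```

With the PSYS values above, relative_saving(2.87, 2.14) is about 0.25 for perf and relative_saving(2.70, 2.16) is 0.20 for likwid, while relative_overhead gives about 0.34 and 0.25 for the same pairs.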

Targeted Analysis with Likwid

The methods discussed earlier are not targeted to the specific functions of interest and measure the entire program, including the creation and population of the array. However, we can be more precise and analyze only the summation algorithm.

First, let’s examine this with Likwid and its extension Likwid-perfctr. With this extension, we can place markers inside our code to target specific regions and, using precise performance groups (which can be created manually), measure the events of interest.

To use Likwid-perfctr, you need a few prerequisites:

  • Link the Likwid library to your project: liblikwid
  • Add the -DLIKWID_PERFMON option to the compilation flags
  • Include likwid-marker.h in your file

Once these prerequisites are met, you can use the markers. At the beginning of your program, or at least before the targeted code region, initialize the Likwid markers with LIKWID_MARKER_INIT; and be sure to stop them properly with LIKWID_MARKER_CLOSE; when no longer needed. Then, you can target your code with:

LIKWID_MARKER_START("my_region");
// Code to measure
LIKWID_MARKER_STOP("my_region");
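
Putting the pieces together, a minimal sketch of an instrumented summation could look like the following. The no-op fallback macros are an addition for illustration, so that the file still builds when LIKWID_PERFMON is not defined or the header is not installed; the region name "sum" and the function names are hypothetical.

```c
#include <stddef.h>

#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
/* Fallbacks (an assumption of this sketch): markers compile to nothing. */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_START(region)
#define LIKWID_MARKER_STOP(region)
#define LIKWID_MARKER_CLOSE
#endif

/* Only the loop between START and STOP is attributed to the "sum" region. */
static float measured_sum(const float *data, size_t n) {
    float total = 0.0f;
    LIKWID_MARKER_START("sum");
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    LIKWID_MARKER_STOP("sum");
    return total;
}

/* Initialize the marker API once, run the measured region, then close it. */
static float run_with_markers(const float *data, size_t n) {
    LIKWID_MARKER_INIT;
    float total = measured_sum(data, n);
    LIKWID_MARKER_CLOSE;
    return total;
}
```

Keeping array creation and initialization outside the START/STOP pair is what restricts the measurement to the summation itself.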

Likwid provides default performance groups based on your architecture, but it is not difficult to create a custom group that measures only what we want. Here, we are interested in energy consumption measurements. We can list all energy-related counters with the command: likwid-perfctr -e | grep PWR (sometimes it is necessary to run the commands with sudo).

Typically, you should see counters like this, unless your architecture is not supported by Likwid:

$ likwid-perfctr -e | grep PWR

PWR0, Energy/Power counters (RAPL)
PWR1, Energy/Power counters (RAPL)
PWR2, Energy/Power counters (RAPL)
PWR3, Energy/Power counters (RAPL)
PWR4, Energy/Power counters (RAPL)
PWR_PKG_ENERGY, 0x2, 0x0, PWR0
PWR_PP0_ENERGY, 0x1, 0x0, PWR1
PWR_PP1_ENERGY, 0x4, 0x0, PWR2
PWR_DRAM_ENERGY, 0x3, 0x0, PWR3
PWR_PLATFORM_ENERGY, 0x5, 0x0, PWR4

We can see the five counters mentioned previously: Package, PP0, PP1, DRAM, and Platform (pSys).

To simplify the analysis, here’s a performance group file, which specifies the counters and calculations for analyzing only these five counters:

SHORT Power and Energy consumption

EVENTSET
PWR0 PWR_PKG_ENERGY
PWR1 PWR_PP0_ENERGY
PWR2 PWR_PP1_ENERGY
PWR3 PWR_DRAM_ENERGY
PWR4 PWR_PLATFORM_ENERGY

METRICS
Runtime (RDTSC) [s] time
Energy PKG [J] PWR0
Power PKG [W] PWR0/time
Energy PP0 [J] PWR1
Power PP0 [W] PWR1/time
Energy PP1 [J] PWR2
Power PP1 [W] PWR2/time
Energy DRAM [J] PWR3
Power DRAM [W] PWR3/time
Energy Platform [J] PWR4
Power Platform [W] PWR4/time

Save this file as a text file and place it in {whereis likwid}/perfgroups/my_arch/. You can name it anything you like. Once the file is in the correct location and your code is ready, you can analyze the values with the same code as before.

An example of the code

Here are the results:

Version          PP0 [J]   PP1 [J]   Package [J]   DRAM [J]   PSYS [J]
Non vectorized   0.69      0         0.75          0.04       1.26
Vectorized       0.29      0         0.32          0.02       0.54

Command used: sudo likwid-perfctr -C 2 -g CUSTOM_GROUP -m {app}

From this analysis, we can see that we are now ignoring everything in our program except the execution of our function. This results in a significant reduction in energy consumption for both measurements.

Targeted Analysis with Perf

Perf is typically used from the command line, but with some adjustments, it can be integrated into a program to create markers similar to those in Likwid. Our goal is to launch our program without Perf analyzing the code initially and then instruct Perf to start the measurement at a specified point.

To launch Perf in a disabled state, we can use the --delay=-1 option, which tells Perf to start with its events disabled. Next, we need a communication channel with Perf to enable it later. Perf provides the --control option for this purpose, which takes a file descriptor for commands and, optionally, one for acknowledgements. For example, you can use --control fd:${perf_ctl_fd},${perf_ack_fd} to set up this communication, where the two variables hold the control and acknowledgement file descriptors.

This setup enables precise control over when Perf starts collecting performance data, ensuring that measurements are taken only at the desired points in your program. Let’s set up our program!

First, let’s create a data structure and constants to interact with Perf.

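#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct {
    int ctl_fd;   // commands are written here
    int ack_fd;   // acknowledgements are read from here
    bool enable;  // false when the program runs without Perf
} PerfManager;

const char *enable_cmd = "enable";
const char *disable_cmd = "disable";
const char *ack_cmd = "ack\n";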

The structure contains two file descriptors for communication with Perf and a control boolean. These file descriptors will point to FIFOs.

The constants define the commands to send to Perf.

Next, we create a function to send commands to Perf for activation/deactivation.

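void send_command(PerfManager *pm, const char *command) {
    if (pm->enable) {
        // Send the command, then block until Perf acknowledges it.
        ssize_t written = write(pm->ctl_fd, command, strlen(command));
        assert(written == (ssize_t)strlen(command));
        char ack[5] = {0};  // room for "ack\n" plus a terminating NUL
        ssize_t got = read(pm->ack_fd, ack, sizeof ack);
        assert(got > 0);
        ack[sizeof ack - 1] = '\0';  // make sure strcmp sees a terminated string
        assert(strcmp(ack, ack_cmd) == 0);
    }
}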
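
The handshake can be tried out without Perf by standing in for it with two pipes. The sketch below is self-contained, so it repeats the structure definition, and it uses hypothetical names: send_command_checked is a variant that returns an error code instead of asserting, and handshake_with_fake_perf is a test double, not how Perf itself is implemented. It assumes the reply is the 5 bytes "ack\n" plus a terminating NUL, matching the 5-byte read above.

```c
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

typedef struct {
    int ctl_fd;
    int ack_fd;
    bool enable;
} PerfManager;

/* Send one non-empty command and wait for the reply; 0 on success, -1 on error. */
static int send_command_checked(PerfManager *pm, const char *command) {
    if (!pm->enable) {
        return 0;
    }
    size_t len = strlen(command);
    if (len == 0 || write(pm->ctl_fd, command, len) != (ssize_t)len) {
        return -1;
    }
    char ack[5] = {0};
    if (read(pm->ack_fd, ack, sizeof ack) <= 0) {
        return -1;
    }
    ack[sizeof ack - 1] = '\0';
    return strcmp(ack, "ack\n") == 0 ? 0 : -1;
}

/* Test double: queue one reply in a pipe, send `command`, and copy what
   arrived on the control pipe into `seen` (seen_size must be at least 1). */
static int handshake_with_fake_perf(const char *command, char *seen, size_t seen_size) {
    int ctl[2], ack[2];
    seen[0] = '\0';
    if (pipe(ctl) != 0) {
        return -1;
    }
    if (pipe(ack) != 0) {
        close(ctl[0]);
        close(ctl[1]);
        return -1;
    }
    int rc = -1;
    /* The literal holds 5 bytes: 'a', 'c', 'k', '\n' and the NUL terminator. */
    if (write(ack[1], "ack\n", 5) == 5) {
        PerfManager pm = { .ctl_fd = ctl[1], .ack_fd = ack[0], .enable = true };
        rc = send_command_checked(&pm, command);
        if (rc == 0) {
            ssize_t n = read(ctl[0], seen, seen_size - 1);
            if (n > 0) {
                seen[n] = '\0';
            } else {
                rc = -1;
            }
        }
    }
    close(ctl[0]);
    close(ctl[1]);
    close(ack[0]);
    close(ack[1]);
    return rc;
}
```

Because the reply is queued before the command is sent, the whole exchange runs in a single process without blocking.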

Then, we prepare our structure with the necessary information. We’ll return to the environment variables later.

void PerfManager_init(PerfManager *pm) {
    char *ctl_fd_env = getenv("PERF_CTL_FD");
    char *ack_fd_env = getenv("PERF_ACK_FD");
    if (ctl_fd_env && ack_fd_env) {
        pm->enable = true;
        pm->ctl_fd = atoi(ctl_fd_env);
        pm->ack_fd = atoi(ack_fd_env);
    } else {
        pm->enable = false;
        pm->ctl_fd = -1;
        pm->ack_fd = -1;
    }
}

Finally, we prepare the functions to pause and resume Perf:

void PerfManager_pause(PerfManager *pm) {
    send_command(pm, disable_cmd);
}

void PerfManager_resume(PerfManager *pm) {
    send_command(pm, enable_cmd);
}

Now, simply initialize our structure in the main function, resume Perf before the code section we want to analyze, and pause it afterward.

int main(int argc, char** argv) {
    PerfManager pmon;
    PerfManager_init(&pmon);
    
    PerfManager_resume(&pmon);
    // Code to measure
    PerfManager_pause(&pmon);
	
    return 0;
}

To run our application, we need to prepare a few elements for proper communication and control of Perf. We recommend putting everything into a script to avoid many manual commands.

  1. First, create two FIFOs, one for sending control commands and one for receiving responses from Perf.
  2. Associate the two FIFOs with file descriptors.
  3. To facilitate access to these file descriptors from the C code, create environment variables.
  4. Finally, launch Perf, specifying what we want to analyze, that we want it to start disabled, and which controller (file descriptor) it should communicate with.

#!/bin/bash

FIFO_PREFIX="perf_fd"

rm -rf ${FIFO_PREFIX}.*

mkfifo ${FIFO_PREFIX}.ctl
mkfifo ${FIFO_PREFIX}.ack

exec {perf_ctl_fd}<>${FIFO_PREFIX}.ctl
exec {perf_ack_fd}<>${FIFO_PREFIX}.ack

export PERF_CTL_FD=${perf_ctl_fd}
export PERF_ACK_FD=${perf_ack_fd}

sudo perf stat \
    -e power/energy-psys/ \
    -C 2 \
    --delay=-1 \
    --control fd:${perf_ctl_fd},${perf_ack_fd} \
    -- taskset -c 2 ./{my_app}

Finally, we can observe the measurements as before:

Version          Cores [J]   GPU [J]   Package [J]   DRAM [J]   PSYS [J]
Non vectorized   0.88        0.01      0.95          0.05       1.48
Vectorized       0.33        0         0.35          0.02       0.62

As before, we observe a significant reduction in consumption, because we now measure only the summation code and not the rest of the program. There is a slight difference between the two analyses: likwid measured slightly less consumption than perf, but the gap is similar (13-15%) for both the vectorized and non-vectorized versions.

Targeted Analysis with Powercap lib

Powercap is a library that utilizes the Linux kernel’s powercap subsystem to retrieve energy consumption data for various hardware components.
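
For context, the powercap subsystem exposes its counters as plain files under sysfs, so a reading can also be taken without the library. The sketch below is only an illustration: read_energy_uj is a hypothetical helper, and the path /sys/class/powercap/intel-rapl:0/energy_uj is what is typically found on Intel systems, which is an assumption to verify on your machine (reading it may require root).

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Read one unsigned counter (for example an energy_uj file) from `path`.
   Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_energy_uj(const char *path, uint64_t *value) {
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    int parsed = fscanf(f, "%" SCNu64, value);
    fclose(f);
    return parsed == 1 ? 0 : -1;
}
```

A call such as read_energy_uj("/sys/class/powercap/intel-rapl:0/energy_uj", &uj) would then return the package counter in microjoules, provided the file exists and is readable.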

To understand its usage, here are the steps required:

  1. Retrieve the number of available packages on your architecture with uint32_t powercap_rapl_get_num_instances().
  2. Create an instance for each package of type powercap_rapl_pkg.
  3. Initialize all packages with int powercap_rapl_init(uint32_t id, powercap_rapl_pkg* pkg, int read_only).
  4. Define a zone to analyze of type powercap_rapl_zone.
  5. Verify the compatibility of the desired zone with the packages on the architecture using int powercap_rapl_is_zone_supported(const powercap_rapl_pkg* pkg, powercap_rapl_zone zone).
  6. Take an energy consumption measurement before and after the code to be measured using int powercap_rapl_get_energy_uj(const powercap_rapl_pkg* pkg, powercap_rapl_zone zone, uint64_t* val).
  7. Properly destroy the packages using int powercap_rapl_destroy(powercap_rapl_pkg* pkg).

Here is an example:

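#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <powercap-rapl.h>

int main(int argc, char** argv){
    (void)argc;
    (void)argv;

    uint32_t npackages = powercap_rapl_get_num_instances();
    if (npackages == 0) {
        printf("No RAPL package found\n");
        return EXIT_FAILURE;
    }

    powercap_rapl_pkg pkg[npackages];

    for (size_t i = 0; i < npackages; i++) {
        printf("Initializing RAPL package %zu\n", i);
        if (powercap_rapl_init(i, &pkg[i], true)) {
            printf("Error initializing RAPL, you may need privileged access\n");
            // Release this package and the ones initialized before it.
            for (size_t k = 0; k <= i; k++) {
                powercap_rapl_destroy(&pkg[k]);
            }
            return EXIT_FAILURE;
        }
    }

    powercap_rapl_zone zone = POWERCAP_RAPL_ZONE_PSYS;

    bool supported[npackages];
    bool has_one_supported = false;
    for (size_t i = 0; i < npackages; i++) {
        // Treat only a positive return value as "supported".
        supported[i] = powercap_rapl_is_zone_supported(&pkg[i], zone) > 0;

        if (!supported[i]) {
            printf("The requested zone is not supported on package %zu\n", i);
        } else {
            has_one_supported = true;
        }
    }

    if (!has_one_supported) {
        printf("No supported package for the requested zone\n");
        for (size_t i = 0; i < npackages; i++) {
            powercap_rapl_destroy(&pkg[i]);
        }
        return EXIT_FAILURE;
    }

    uint64_t energy_uj1[npackages];
    uint64_t energy_uj2[npackages];

    for (size_t j = 0; j < npackages; ++j) {
        if (supported[j] && powercap_rapl_get_energy_uj(&pkg[j], zone, &energy_uj1[j])) {
            printf("Failed to get energy on package %zu\n", j);
            supported[j] = false;  // skip this package from now on
        }
    }

    // CODE TO MEASURE

    for (size_t j = 0; j < npackages; j++) {
        if (supported[j] && powercap_rapl_get_energy_uj(&pkg[j], zone, &energy_uj2[j])) {
            printf("Failed to get energy on package %zu\n", j);
            supported[j] = false;
        }
    }

    // Report the difference; this assumes the counter did not wrap in between.
    for (size_t j = 0; j < npackages; j++) {
        if (supported[j]) {
            printf("Package %zu: %" PRIu64 " uJ\n", j, energy_uj2[j] - energy_uj1[j]);
        }
    }

    for (size_t i = 0; i < npackages; i++) {
        powercap_rapl_destroy(&pkg[i]);
    }

    return EXIT_SUCCESS;
}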
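
Turning the two readings into a consumption figure is a subtraction, with one caveat: the energy counters wrap around once they reach their maximum range. The sketch below shows one way to handle that, assuming at most one wrap between the two readings. The helper names are hypothetical, and the maximum range would have to be obtained separately (the library appears to offer powercap_rapl_get_max_energy_range_uj for this, which is an assumption to check against your installed header).

```c
#include <stdint.h>

/* Energy consumed between two readings of a counter that wraps around after
   `max_range_uj` microjoules. Assumes the counter wrapped at most once. */
static uint64_t energy_delta_uj(uint64_t before, uint64_t after, uint64_t max_range_uj) {
    if (after >= before) {
        return after - before;
    }
    /* Wrapped: energy up to the maximum, plus energy since the restart. */
    return (max_range_uj - before) + after;
}

/* Convert microjoules to joules for reporting. */
static double uj_to_joules(uint64_t uj) {
    return (double)uj / 1e6;
}
```

For a multi-package machine, the per-package deltas can simply be added up before converting to joules.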

The available zones with Powercap are:

  • POWERCAP_RAPL_ZONE_PACKAGE for the Package zone.
  • POWERCAP_RAPL_ZONE_CORE for the Core zone.
  • POWERCAP_RAPL_ZONE_UNCORE for the Uncore zone.
  • POWERCAP_RAPL_ZONE_DRAM for the DRAM zone.
  • POWERCAP_RAPL_ZONE_PSYS for the PSys zone.

Version          Core [J]   Uncore [J]   Package [J]   DRAM [J]   PSYS [J]
Non vectorized   0.76       0.01         0.89          0.04       None
Vectorized       0.38       0.0          0.39          0.02       None

As noted previously, there are reductions in consumption, and the values are very similar to those observed with Likwid and perf. However, powercap was not able to give a usable value for the global (PSYS) consumption.

Conclusion

The exploration and analysis detailed in this article underscore the critical importance of energy consumption optimization in computing systems, particularly in data centers. Through rigorous measurement and comparison of vectorized and non-vectorized array summation programs, several key insights were obtained:

  • Significant Energy Savings: The vectorized version of the program consistently demonstrated substantial energy savings. With perf and likwid, the SIMD-optimized code reduced overall (PSYS) energy consumption by roughly 20 to 25% when the whole program is measured, and by more than half (about 57 to 58%) when only the summation routine is measured.
  • Measurement Precision: The targeted analysis revealed how precise measurements, focusing solely on the summation algorithm, can provide more accurate insights into the energy efficiency of specific code sections. By using tools such as Likwid-perfctr and powercap, we could isolate and evaluate the exact energy impact of our optimizations, excluding the noise from other parts of the program.
  • Tool Efficacy and Consistency: The consistency across different tools (perf, likwid, and powercap) in reporting energy savings for the vectorized code reinforces the reliability of these measurement methods. This consistency is crucial for developers aiming to make informed decisions about code optimizations and their impact on energy efficiency.

In conclusion, the measurement methods and tools discussed not only provide a comprehensive understanding of energy consumption but also empower developers to make effective optimizations. By embracing these practices, the tech industry can move towards more energy-efficient and sustainable operations, addressing one of the critical challenges of our time.

In terms of usage, the data shows that Likwid and Powercap are more efficient for targeted measurements, while Perf seems to struggle in this area. Additionally, Likwid is very easy to use, although it is not implemented for all CPU architectures.

Sources