It is summer, the semester is over, and with many colleagues and students on holiday, there is finally a little time for experimenting.
After years of impeccable service, I’ve recently replaced my laptop, a Yoga 910-13IKB, with a new Yoga Slim 7i Gen 9 (14″ Intel). With this new laptop comes a new CPU: Intel(R) Core(TM) Ultra 7 155H. Since this is the first time I’ve got my hands on a heterogeneous CPU from Intel (working mostly on embedded systems, big.LITTLE is not exactly new to me), I was curious to see how the three types of cores it has perform.
I’m using Fedora 40 with the latest kernel:
$ uname -srv
Linux 6.9.11-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul 25 18:17:34 UTC 2024
and thanks to
$ lstopo
we can have a look at the cores we have
This architecture has three different types of cores: Performance (P-cores), Efficient (E-cores), and Low Power Efficient (LP-cores). My processor has 6 P-cores (supporting HT), 8 E-cores, and 2 LP-cores.
Looking online, I haven’t found any great source of information about how they are used by the Linux scheduler (some work is in progress and some will probably land with 6.11, but I haven’t found a clear source to get the big picture), and I was wondering if I could set up something as a poor man’s alternative to Intel® Thread Director. With taskset and stress-ng, I tried to measure them.
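To pick CPU numbers for taskset without reading the lstopo picture by eye, one rough cross-check is to group logical CPUs by their advertised maximum frequency in sysfs. This is only a sketch under an assumption: that the core types differ in cpuinfo_max_freq, which is a heuristic and not an official core-type identifier (favoured P-cores may even show up as their own group). The function name and its optional directory argument are hypothetical.

```bash
#!/bin/bash
# Group logical CPUs by advertised maximum frequency (heuristic, see above).
# $1: sysfs CPU directory (hypothetical parameter, defaults to the real one).
group_cpus_by_max_freq() {
    local root=${1:-/sys/devices/system/cpu}
    local dir cpu freq
    for dir in "$root"/cpu[0-9]*; do
        # Skip CPUs that expose no cpufreq information
        [ -r "$dir/cpufreq/cpuinfo_max_freq" ] || continue
        cpu=${dir##*/cpu}
        freq=$(<"$dir/cpufreq/cpuinfo_max_freq")
        printf '%s %s\n' "$freq" "$cpu"
    done | sort -k1,1nr -k2,2n | awk '
        # Start a new output line whenever the frequency changes
        NR == 1 || $1 != prev { if (NR > 1) printf "\n"; printf "%d kHz:", $1; prev = $1 }
        { printf " %s", $2 }
        END { if (NR > 0) printf "\n" }'
}
```

Calling group_cpus_by_max_freq with no argument reads the live system and prints one line per frequency group, highest first, followed by the CPU numbers in that group.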
Here is my script:
#!/bin/bash
stressng=/home/al/tmp/stress-ng/stress-ng
REST=120
#### TEST SCRIPT
# constant time
DURATION=60s
# P core from lstopo
taskset -c 4 $stressng -c 1 -M -t $DURATION --rapl
sleep $REST
# P core multithread
taskset -c 4-5 $stressng -c 2 -M -t $DURATION --rapl
sleep $REST
# E core
taskset -c 12 $stressng -c 1 -M -t $DURATION --rapl
sleep $REST
# LP core
taskset -c 20 $stressng -c 1 -M -t $DURATION --rapl
sleep $REST
# constant amount of work
WORK=200000
# P core from lstopo
taskset -c 4 $stressng -c 1 -M --cpu-ops $WORK --rapl
sleep $REST
# P core multithread
taskset -c 4-5 $stressng -c 2 -M --cpu-ops $WORK --rapl
sleep $REST
# E core
taskset -c 12 $stressng -c 1 -M --cpu-ops $WORK --rapl
sleep $REST
# LP core
taskset -c 20 $stressng -c 1 -M --cpu-ops $WORK --rapl
to be run with sudo to get the privileges needed to pin processes to specific CPUs and read the RAPL power data (BTW, if your distribution’s stress-ng command does not support --rapl, build it from source). It has two parts. In the first, it measures how many bogo operations per second (the throughput metric reported by stress-ng) a core executes at 100% load; then it sets an arbitrary amount of work and measures the time and power needed to complete it.
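For context on where such power figures can come from: on Linux, RAPL counters are exposed through the powercap sysfs interface as cumulative energy in microjoules, so an average power is simply the energy difference divided by the elapsed time. The sketch below illustrates that arithmetic only. The function names are hypothetical, the default path assumes the package domain is intel-rapl:0, counter wrap-around is ignored, and this is not necessarily how stress-ng implements --rapl.

```bash
#!/bin/bash
# Average power (W) from two energy readings in microjoules taken
# `seconds` apart: (e1 - e0) / 1e6 / seconds.
watts_from_energy() {
    awk -v e0="$1" -v e1="$2" -v s="$3" \
        'BEGIN { printf "%.2f\n", (e1 - e0) / 1e6 / s }'
}

# Sample a RAPL energy counter twice and print the average power.
# $1: energy_uj file (assumed default: package domain), $2: interval in seconds.
rapl_avg_watts() {
    local file=${1:-/sys/class/powercap/intel-rapl:0/energy_uj}
    local secs=${2:-5} e0 e1
    e0=$(<"$file") || return 1   # usually needs root
    sleep "$secs"
    e1=$(<"$file") || return 1
    watts_from_energy "$e0" "$e1" "$secs"
}
```

For instance, sudo bash -c 'source ./rapl.sh; rapl_avg_watts' (with rapl.sh being a hypothetical file holding the functions above) would print the package power averaged over five seconds.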
After switching to text mode (init 3) to have a less noisy system and connecting the AC power cord, I ran the above script, named ultra7power.sh:
$ sudo ./ultra7power.sh > ultra7_155h_powerlog
and here is the result:
$ cat ultra7_155h_powerlog
# P core
[10429] setting to a 1 min run per stressor
[10429] dispatching hogs: 1 cpu
[10429] stressor  bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per  RSS Max
[10429]                        (secs)    (secs)    (secs)  (real time)  (usr+sys time)  instance (%)     (KB)
[10429] cpu         148759      60.00     55.68      0.00      2479.31         2671.50         92.81     6516
[10429] cpu:
[10429] core 14.66 W
[10429] pkg-0 16.78 W
[10429] psys 22.25 W
[10429] uncore 0.00 W
[10429] skipped: 0
[10429] passed: 1: cpu (1)
[10429] failed: 0
[10429] metrics untrustworthy: 0
[10429] successful run completed in 1 min
# P core multithread
[10441] setting to a 1 min run per stressor
[10441] dispatching hogs: 2 cpu
[10441] stressor  bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per  RSS Max
[10441]                        (secs)    (secs)    (secs)  (real time)  (usr+sys time)  instance (%)     (KB)
[10441] cpu         305994      60.00    119.57      0.00      5099.69         2559.18         99.64     6692
[10441] cpu:
[10441] core 21.21 W
[10441] pkg-0 23.35 W
[10441] psys 30.13 W
[10441] uncore 0.00 W
[10441] skipped: 0
[10441] passed: 2: cpu (2)
[10441] failed: 0
[10441] metrics untrustworthy: 0
[10441] successful run completed in 1 min
# E core
[10448] setting to a 1 min run per stressor
[10448] dispatching hogs: 1 cpu
[10448] stressor  bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per  RSS Max
[10448]                        (secs)    (secs)    (secs)  (real time)  (usr+sys time)  instance (%)     (KB)
[10448] cpu         106126      60.00     59.71      0.00      1768.76         1777.43         99.51     6516
[10448] cpu:
[10448] core 5.66 W
[10448] pkg-0 7.95 W
[10448] psys 11.32 W
[10448] uncore 0.00 W
[10448] skipped: 0
[10448] passed: 1: cpu (1)
[10448] failed: 0
[10448] metrics untrustworthy: 0
[10448] successful run completed in 1 min
# LP core
[11112] setting to a 1 min run per stressor
[11112] dispatching hogs: 1 cpu
[11112] stressor  bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per  RSS Max
[11112]                        (secs)    (secs)    (secs)  (real time)  (usr+sys time)  instance (%)     (KB)
[11112] cpu          69473      60.00     59.66      0.00      1157.80         1164.53         99.42     6516
[11112] cpu:
[11112] core 5.35 W
[11112] pkg-0 10.42 W
[11112] psys 14.27 W
[11112] uncore 0.00 W
[11112] skipped: 0
[11112] passed: 1: cpu (1)
[11112] failed: 0
[11112] metrics untrustworthy: 0
[11112] successful run completed in 1 min
# P core
[11311] defaulting to a 1 day run per stressor
[11311] dispatching hogs: 1 cpu
[11311] stressor  bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per  RSS Max
[11311]                        (secs)    (secs)    (secs)  (real time)  (usr+sys time)  instance (%)     (KB)
[11311] cpu         200000      81.96     74.76      0.00      2440.14         2675.21         91.21     6516
[11311] cpu:
[11311] core 15.10 W
[11311] pkg-0 17.22 W
[11311] psys 23.09 W
[11311] uncore 0.00 W
[11311] skipped: 0
[11311] passed: 1: cpu (1)
[11311] failed: 0
[11311] metrics untrustworthy: 0
[11311] successful run completed in 1 min, 21.96 secs
# P core multithread
[11333] defaulting to a 1 day run per stressor
[11333] dispatching hogs: 2 cpu
[11333] stressor  bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per  RSS Max
[11333]                        (secs)    (secs)    (secs)  (real time)  (usr+sys time)  instance (%)     (KB)
[11333] cpu         200004      39.10     77.91      0.00      5115.73         2567.09         99.64     6516
[11333] cpu:
[11333] core 21.31 W
[11333] pkg-0 23.44 W
[11333] psys 29.77 W
[11333] skipped: 0
[11333] passed: 2: cpu (2)
[11333] failed: 0
[11333] metrics untrustworthy: 0
[11333] successful run completed in 39.84 secs
# E core
[11343] defaulting to a 1 day run per stressor
[11343] dispatching hogs: 1 cpu
[11343] stressor  bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per  RSS Max
[11343]                        (secs)    (secs)    (secs)  (real time)  (usr+sys time)  instance (%)     (KB)
[11343] cpu         200000     114.52    113.96      0.00      1746.48         1754.93         99.52     6516
[11343] cpu:
[11343] core 5.60 W
[11343] pkg-0 7.83 W
[11343] psys 11.31 W
[11343] uncore 0.00 W
[11343] skipped: 0
[11343] passed: 1: cpu (1)
[11343] failed: 0
[11343] metrics untrustworthy: 0
[11343] successful run completed in 1 min, 54.52 secs
# LP core
[11350] defaulting to a 1 day run per stressor
[11350] dispatching hogs: 1 cpu
[11350] stressor  bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per  RSS Max
[11350]                        (secs)    (secs)    (secs)  (real time)  (usr+sys time)  instance (%)     (KB)
[11350] cpu         200000     213.64    171.98      0.00       936.17         1162.94         80.50     6692
[11350] cpu:
[11350] core 3.66 W
[11350] pkg-0 9.16 W
[11350] psys 12.95 W
[11350] uncore 0.00 W
[11350] skipped: 0
[11350] passed: 1: cpu (1)
[11350] failed: 0
[11350] metrics untrustworthy: 0
[11350] successful run completed in 3 mins, 33.64 secs
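Since the log is verbose, the figures that matter can be pulled into a compact tab-separated table. This is a minimal awk sketch, assuming the log looks exactly like the one above: lines start with the [pid] field, each run is introduced by a "# label" line and ends with a "successful run completed" line. Raw stress-ng output carries an extra "stress-ng: info:" prefix, in which case the field numbers would need shifting. The function name is hypothetical.

```bash
#!/bin/bash
# Summarize a stress-ng log (format as shown above) into one row per run.
# Reads the files given as arguments, or standard input if none.
summarize_log() {
    awk '
        BEGIN { OFS = "\t"
                print "run", "bogo_ops", "real_s", "ops_per_s", "core_W", "pkg_W", "psys_W" }
        # "# P core" style line: remember the run label
        $1 == "#"               { label = $0; sub(/^# */, "", label) }
        # Metrics row: [pid] cpu ops real usr sys ops/s(real) ops/s(cpu) cpu% rss
        $2 == "cpu" && NF >= 10 { ops = $3; secs = $4; rate = $7 }
        # RAPL rows: [pid] domain value W
        $2 == "core"            { core = $3 }
        $2 == "pkg-0"           { pkg = $3 }
        $2 == "psys"            { psys = $3 }
        # End of a run: emit the collected values
        /successful run completed/ { print label, ops, secs, rate, core, pkg, psys }
    ' "$@"
}
```

Running summarize_log ultra7_155h_powerlog should give eight rows: the first four are the constant-time runs and the last four the constant-work runs, since both halves reuse the same labels.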
First, a note: there are many factors impacting these measurements. I’ve removed some of them (Wi-Fi, etc.), but I have in no way controlled them all. Another important point is that modern processors have a large number of power/performance knobs that can heavily impact the obtained performance. Some are software-accessible (I did my best to keep them constant during the measurements), but some are related to the physical design of your machine. All of this to say that your mileage may vary quite considerably.
From the computational point of view, the numbers confirm that P-cores are the best performing ones, followed by E-cores and finally LP-cores. No surprises here. What is harder to explain is power consumption. It is true that P-cores require more power to complete the job, but they are so fast that they end up consuming far less energy than their power draw alone would suggest (in the two-thread run, about as much as an E-core). What is less evident is that LP-cores are the worst performing ones in terms of energy. While the core measurement seems to suggest they require less power, they are slow and, strangely, the pkg-0 and psys measurements rise compared with the E-core runs when I use them. I cannot really explain this behavior, and I haven’t found any useful information online. If you have an idea, feel free to share it; I would really like to understand this point [1].
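To compare energy rather than power, the average power of each constant-work run can be multiplied by its elapsed time (joules = watts × seconds). The sketch below does just that with the real time and RAPL values copied from the second half of the log above; the function names and the short labels (P-HT stands for the two-thread P-core run) are hypothetical, and it treats the reported power as an average over the whole run.

```bash
#!/bin/bash
# Energy in joules = average power (W) x elapsed time (s), rounded.
energy_joules() {
    awk -v w="$1" -v s="$2" 'BEGIN { printf "%.0f\n", w * s }'
}

# Constant-work runs (200000 bogo ops), values copied from the log:
# label, real time (s), core W, pkg-0 W, psys W
energy_table() {
    local label secs core pkg psys
    while read -r label secs core pkg psys; do
        printf '%-5s core=%5s J  pkg-0=%5s J  psys=%5s J\n' "$label" \
            "$(energy_joules "$core" "$secs")" \
            "$(energy_joules "$pkg" "$secs")" \
            "$(energy_joules "$psys" "$secs")"
    done <<'EOF'
P     81.96  15.10 17.22 23.09
P-HT  39.10  21.31 23.44 29.77
E     114.52  5.60  7.83 11.31
LP    213.64  3.66  9.16 12.95
EOF
}
```

On these figures, the LP-core run comes out highest on both pkg-0 and psys (roughly 1957 J and 2767 J), even though its core-domain energy (about 782 J) is lower than that of the single-threaded P-core run (about 1238 J).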
Now that I have numbers, I can start experimenting by pinning the great power offenders on my box (Firefox, Skype, Teams, etc.) to some less energy-hungry cores and see if my battery lasts longer without much impact on perceived performance. We will see when I find some time for that.
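Pinning an already running application could look like the sketch below, which applies taskset to every thread of the matching processes. The helper names and the DRY_RUN switch are hypothetical, and the CPU list is an assumption: 12-19 would be the E-cores on this machine if the numbering continues from the CPU 12 used in the script, which should be checked against lstopo first.

```bash
#!/bin/bash
# Pin every thread (-a) of the given PIDs to a CPU list (-c).
# With DRY_RUN=1 the taskset commands are printed instead of executed.
pin_pids() {
    local cpus=$1 pid
    shift
    for pid in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "taskset -a -c -p $cpus $pid"
        else
            taskset -a -c -p "$cpus" "$pid"
        fi
    done
}

# Pin all processes whose name matches exactly (hypothetical helper).
pin_by_name() {
    local cpus=$1 name=$2
    # Word splitting of the PID list returned by pgrep is intended here
    # shellcheck disable=SC2046
    pin_pids "$cpus" $(pgrep -x "$name")
}
```

For example, DRY_RUN=1 pin_by_name 12-19 firefox only shows what would be done. Threads and child processes created afterwards inherit the affinity, but an application remains free to change its own affinity later.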
That’s all folks.
Have a nice summer,
A.
[1] Maybe I have a clue… looking attentively at the lstopo output, it is clear the LP-cores have no L3 cache. I have to check the workload used here, but if it cannot fit in the L2 cache, the extra load-store pressure on the memory controller may explain my preliminary results. I’ll investigate.
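The cache observation can be cross-checked without lstopo, since the kernel lists the cache levels visible to each logical CPU under sysfs. A minimal sketch, assuming the usual cache/indexN/level layout; the function name and its optional directory argument are hypothetical.

```bash
#!/bin/bash
# Print the logical CPUs that report no level-3 cache in sysfs.
# $1: sysfs CPU directory (hypothetical parameter, defaults to the real one).
cpus_without_l3() {
    local root=${1:-/sys/devices/system/cpu}
    local dir idx has_l3
    for dir in "$root"/cpu[0-9]*; do
        [ -d "$dir/cache" ] || continue
        has_l3=0
        # Each indexN directory describes one cache; "level" holds 1, 2, 3...
        for idx in "$dir"/cache/index[0-9]*; do
            [ -r "$idx/level" ] || continue
            if [ "$(<"$idx/level")" = 3 ]; then
                has_l3=1
            fi
        done
        [ "$has_l3" = 1 ] || echo "${dir##*/cpu}"
    done | sort -n
}
```

If the reading of the lstopo output is right, cpus_without_l3 should print only the CPU numbers of the two LP-cores on this machine.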