It is summer, the semester is over, and with many colleagues and students on holiday, there is finally a little time for experimenting.

After years of impeccable service, I’ve recently replaced my laptop, a Yoga 910-13IKB, with a new Yoga Slim 7i Gen 9 (14″ Intel). With this new laptop comes a new CPU: Intel(R) Core(TM) Ultra 7 155H. Since this is the first time I’ve had my hands on a heterogeneous CPU from Intel (having worked mostly on embedded systems, big.LITTLE is not exactly new to me), I was curious to see how the three types of cores it has perform.

I’m using Fedora 40 with the latest kernel:

$ uname -srv
Linux 6.9.11-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul 25 18:17:34 UTC 2024

and thanks to

$ lstopo

we can have a look at the cores we have

This architecture has three different types of cores: Performance (P-cores), Efficient (E-cores), and Low Power Efficient cores (LP-cores). My processor has 6 P-cores (supporting HT), 8 E-cores, and 2 LP-cores.
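For readers without lstopo, a rough equivalent can be pulled from sysfs. This is a minimal sketch, assuming the kernel exposes the hybrid PMUs as cpu_core and cpu_atom under /sys/devices; that the LP-cores are listed together with the E-cores under cpu_atom is also an assumption. core_type_cpus is a name made up for this example.

```shell
# core_type_cpus [ROOT]: print the logical CPUs of each hybrid core type.
# ROOT is the sysfs mount point (normally /sys).
# Assumption: the kernel exposes the cpu_core and cpu_atom PMU directories.
core_type_cpus() {
    local root="${1:-/sys}" pmu
    for pmu in cpu_core cpu_atom; do
        if [ -r "$root/devices/$pmu/cpus" ]; then
            printf '%s: %s\n' "$pmu" "$(cat "$root/devices/$pmu/cpus")"
        fi
    done
}

core_type_cpus /sys
```

On a machine without hybrid cores the function simply prints nothing.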

Looking online, I haven’t found any great source of information about how they are used by the Linux scheduler (some work is in progress and some will probably land with 6.11, but I haven’t found a clear source to get the big picture), and I was wondering if I could set up something as a poor man’s alternative to Intel® Thread Director. With taskset and stress-ng, I tried to measure them.

Here is my script:

#!/bin/bash

stressng=/home/al/tmp/stress-ng/stress-ng
REST=120

#### TEST SCRIPT

# constant time
DURATION=60s

# P core from lstopo
taskset -c 4 $stressng -c 1 -M -t $DURATION  --rapl
sleep $REST

# P core multithread
taskset -c 4-5 $stressng -c 2 -M -t $DURATION --rapl
sleep $REST

# E core
taskset -c 12 $stressng -c 1 -M -t $DURATION  --rapl
sleep $REST

# LP core
taskset -c 20 $stressng -c 1 -M -t $DURATION  --rapl
sleep $REST

# constant amount of work
WORK=200000

# P core from lstopo
taskset -c 4 $stressng -c 1 -M --cpu-ops $WORK --rapl
sleep $REST

# P core multithread
taskset -c 4-5 $stressng -c 2 -M --cpu-ops $WORK --rapl
sleep $REST

# E core
taskset -c 12 $stressng -c 1 -M --cpu-ops $WORK --rapl
sleep $REST

# LP core
taskset -c 20 $stressng -c 1 -M --cpu-ops $WORK --rapl

The script is to be run with sudo to get the privileges needed to read the RAPL power data (pinning your own processes to specific CPUs with taskset normally works without root). By the way, if your distribution’s stress-ng command does not support --rapl, build it from source. The script has two parts. In the first, it measures how many bogo operations per second each core executes at 100% load over a fixed 60 seconds; in the second, it sets an arbitrary amount of work (200000 bogo ops) and measures the time and power needed to complete it.
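If rebuilding stress-ng is not an option, a comparable energy figure can be approximated from the kernel powercap interface. The sketch below is only that: it assumes the package domain is intel-rapl:0 (check the name file next to it), it handles at most one counter wraparound, and rapl_delta_joules is a helper name invented for this example.

```shell
# rapl_delta_joules BEFORE AFTER MAX: energy in joules between two energy_uj
# readings (microjoules), allowing for one counter wraparound at MAX.
rapl_delta_joules() {
    LC_ALL=C awk -v b="$1" -v a="$2" -v m="$3" 'BEGIN {
        d = a - b
        if (d < 0) d += m          # the counter wrapped once
        printf "%.2f\n", d / 1000000
    }'
}

# Assumption: intel-rapl:0 is the package domain; reading it needs root.
dom=/sys/class/powercap/intel-rapl:0
if [ -r "$dom/energy_uj" ]; then
    before=$(cat "$dom/energy_uj")
    sleep 1                        # replace with the command to measure
    after=$(cat "$dom/energy_uj")
    rapl_delta_joules "$before" "$after" "$(cat "$dom/max_energy_range_uj")"
fi
```

Dividing the result by the elapsed time gives an average power comparable to the pkg-0 line in the stress-ng output.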

After switching to text mode (init 3) to have a less noisy system and connecting the AC power cord, I ran the above script, named ultra7power.sh:

$ sudo ./ultra7power.sh > ultra7_155h_powerlog

and here is the result:

$ cat ultra7_155h_powerlog 
# P core
[10429] setting to a 1 min run per stressor
[10429] dispatching hogs: 1 cpu
[10429] stressor       bogo ops real time  usr time  sys time	bogo ops/s     bogo ops/s CPU used per       RSS Max
[10429] 			  (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)  	(KB)
[10429] cpu		 148759     60.00     55.68	 0.00	   2479.31	  2671.50	 92.81  	6516
[10429] cpu:
[10429]  core			 14.66 W
[10429]  pkg-0  		 16.78 W
[10429]  psys			 22.25 W
[10429]  uncore 		  0.00 W
[10429] skipped: 0
[10429] passed: 1: cpu (1)
[10429] failed: 0
[10429] metrics untrustworthy: 0
[10429] successful run completed in 1 min

# P core multithread
[10441] setting to a 1 min run per stressor
[10441] dispatching hogs: 2 cpu
[10441] stressor       bogo ops real time  usr time  sys time	bogo ops/s     bogo ops/s CPU used per       RSS Max
[10441] 			  (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)  	(KB)
[10441] cpu		 305994     60.00    119.57	 0.00	   5099.69	  2559.18	 99.64  	6692
[10441] cpu:
[10441]  core			 21.21 W
[10441]  pkg-0  		 23.35 W
[10441]  psys			 30.13 W
[10441]  uncore 		  0.00 W
[10441] skipped: 0
[10441] passed: 2: cpu (2)
[10441] failed: 0
[10441] metrics untrustworthy: 0
[10441] successful run completed in 1 min

# E core
[10448] setting to a 1 min run per stressor
[10448] dispatching hogs: 1 cpu
[10448] stressor       bogo ops real time  usr time  sys time	bogo ops/s     bogo ops/s CPU used per       RSS Max
[10448] 			  (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)  	(KB)
[10448] cpu		 106126     60.00     59.71	 0.00	   1768.76	  1777.43	 99.51  	6516
[10448] cpu:
[10448]  core			  5.66 W
[10448]  pkg-0  		  7.95 W
[10448]  psys			 11.32 W
[10448]  uncore 		  0.00 W
[10448] skipped: 0
[10448] passed: 1: cpu (1)
[10448] failed: 0
[10448] metrics untrustworthy: 0
[10448] successful run completed in 1 min

# LP core
[11112] setting to a 1 min run per stressor
[11112] dispatching hogs: 1 cpu
[11112] stressor       bogo ops real time  usr time  sys time	bogo ops/s     bogo ops/s CPU used per       RSS Max
[11112] 			  (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)  	(KB)
[11112] cpu		  69473     60.00     59.66	 0.00	   1157.80	  1164.53	 99.42  	6516
[11112] cpu:
[11112]  core			  5.35 W
[11112]  pkg-0  		 10.42 W
[11112]  psys			 14.27 W
[11112]  uncore 		  0.00 W
[11112] skipped: 0
[11112] passed: 1: cpu (1)
[11112] failed: 0
[11112] metrics untrustworthy: 0
[11112] successful run completed in 1 min

# P core
[11311] defaulting to a 1 day run per stressor
[11311] dispatching hogs: 1 cpu
[11311] stressor       bogo ops real time  usr time  sys time	bogo ops/s     bogo ops/s CPU used per       RSS Max
[11311] 			  (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)  	(KB)
[11311] cpu		 200000     81.96     74.76	 0.00	   2440.14	  2675.21	 91.21  	6516
[11311] cpu:
[11311]  core			 15.10 W
[11311]  pkg-0  		 17.22 W
[11311]  psys			 23.09 W
[11311]  uncore 		  0.00 W
[11311] skipped: 0
[11311] passed: 1: cpu (1)
[11311] failed: 0
[11311] metrics untrustworthy: 0
[11311] successful run completed in 1 min, 21.96 secs

# P core  multithread
[11333] defaulting to a 1 day run per stressor
[11333] dispatching hogs: 2 cpu
[11333] stressor       bogo ops real time  usr time  sys time	bogo ops/s     bogo ops/s CPU used per       RSS Max
[11333] 			  (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)  	(KB)
[11333] cpu		 200004     39.10     77.91	 0.00	   5115.73	  2567.09	 99.64  	6516
[11333] cpu:
[11333]  core			 21.31 W
[11333]  pkg-0  		 23.44 W
[11333]  psys			 29.77 W
[11333] skipped: 0
[11333] passed: 2: cpu (2)
[11333] failed: 0
[11333] metrics untrustworthy: 0
[11333] successful run completed in 39.84 secs

# E core
[11343] defaulting to a 1 day run per stressor
[11343] dispatching hogs: 1 cpu
[11343] stressor       bogo ops real time  usr time  sys time	bogo ops/s     bogo ops/s CPU used per       RSS Max
[11343] 			  (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)  	(KB)
[11343] cpu		 200000    114.52    113.96	 0.00	   1746.48	  1754.93	 99.52  	6516
[11343] cpu:
[11343]  core			  5.60 W
[11343]  pkg-0  		  7.83 W
[11343]  psys			 11.31 W
[11343]  uncore 		  0.00 W
[11343] skipped: 0
[11343] passed: 1: cpu (1)
[11343] failed: 0
[11343] metrics untrustworthy: 0
[11343] successful run completed in 1 min, 54.52 secs

# LP core
[11350] defaulting to a 1 day run per stressor
[11350] dispatching hogs: 1 cpu
[11350] stressor       bogo ops real time  usr time  sys time	bogo ops/s     bogo ops/s CPU used per       RSS Max
[11350] 			  (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)  	(KB)
[11350] cpu		 200000    213.64    171.98	 0.00	    936.17	  1162.94	 80.50  	6692
[11350] cpu:
[11350]  core			  3.66 W
[11350]  pkg-0  		  9.16 W
[11350]  psys			 12.95 W
[11350]  uncore 		  0.00 W
[11350] skipped: 0
[11350] passed: 1: cpu (1)
[11350] failed: 0
[11350] metrics untrustworthy: 0
[11350] successful run completed in 3 mins, 33.64 secs

First, a note: there are many factors impacting these measurements. I’ve removed some of them (wifi, etc.), but I’ve in no way controlled them all. Another important point is that modern processors have a large number of power/performance knobs that can heavily impact the obtained performance. Some are software-accessible (I did my best to keep them constant during the measurements), but some are related to the physical design of your machine. All of this to say that your mileage can vary quite considerably.
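One way to keep track of the software-accessible knobs is to record them before each run. The sketch below assumes the cpufreq sysfs files exposed by the intel_pstate driver; show_knobs is a hypothetical helper, and the CPU numbers are the ones used in my script.

```shell
# show_knobs CPU [ROOT]: print frequency-related settings of one logical CPU.
# ROOT is the sysfs mount point (normally /sys). Missing files are skipped.
show_knobs() {
    local cpu="$1" root="${2:-/sys}" dir f
    dir="$root/devices/system/cpu/cpu$cpu/cpufreq"
    for f in scaling_governor energy_performance_preference scaling_max_freq; do
        if [ -r "$dir/$f" ]; then
            printf 'cpu%s %s=%s\n' "$cpu" "$f" "$(cat "$dir/$f")"
        fi
    done
}

# One P-core, one E-core and one LP-core, as in the test script.
for cpu in 4 12 20; do
    show_knobs "$cpu"
done

# Global turbo switch of intel_pstate (1 means turbo is disabled).
nt=/sys/devices/system/cpu/intel_pstate/no_turbo
if [ -r "$nt" ]; then
    echo "no_turbo=$(cat "$nt")"
fi
```

Saving this output next to each power log makes it easier to tell later whether two runs were really taken under the same settings.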

From the computational point of view, the numbers confirm that P-cores are the best performing, followed by E-cores and finally LP-cores. No surprises here. Power consumption is harder to explain. P-cores draw more power, but they finish sooner: with both hardware threads busy, the P-core completes the fixed workload with the lowest psys energy (average power × run time), although a single P-core thread still uses more energy than the E-core. What is less evident is that LP-cores are the worst performers in terms of energy. While the core figure seems to suggest they require less power, they are slow and, strangely, the pkg-0 and psys measurements rise when I use them. I cannot really explain this behavior, and I haven’t found any useful information online. If you have an idea, feel free to share it; I would really like to understand this point [1].
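Power alone does not settle the comparison: for a fixed amount of work, what matters is energy, that is, average power multiplied by run time. Here is a minimal sketch of that arithmetic, using the psys power and real time values of the constant-work runs in my log. energy_joules is a helper name made up here, and treating the reported power as an average over the whole run is an assumption.

```shell
# energy_joules WATTS SECONDS: average power times elapsed time, in joules.
energy_joules() {
    LC_ALL=C awk -v w="$1" -v t="$2" 'BEGIN { printf "%.1f\n", w * t }'
}

# psys power (W) and real time (s) from the constant-work runs in the log.
printf 'P core:            %s J\n' "$(energy_joules 23.09 81.96)"
printf 'P core, 2 threads: %s J\n' "$(energy_joules 29.77 39.10)"
printf 'E core:            %s J\n' "$(energy_joules 11.31 114.52)"
printf 'LP core:           %s J\n' "$(energy_joules 12.95 213.64)"
```

With these inputs the results come to about 1892 J, 1164 J, 1295 J and 2767 J respectively. The same function can be fed the core or pkg-0 figures to see how the ranking changes with the measurement domain.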

Now that I have numbers, I can start experimenting by pinning the worst power offenders on my box (Firefox, Skype, Teams, etc.) to some less energy-hungry cores and see if my battery lasts longer without much impact on perceived performance. We will see when I find some time for that.
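For already running programs, the pinning could look like the sketch below. It is untested on real workloads: pin_pids and the TASKSET override (handy for a dry run) are names invented for this example, and the CPU list 12-19 is simply where lstopo places the E-cores on my machine.

```shell
# pin_pids CPULIST PID...: restrict all threads of each PID to CPULIST
# (taskset list syntax, e.g. 12-19). Set TASKSET=echo for a dry run that
# only prints the arguments taskset would receive.
pin_pids() {
    local cpus="$1" pid
    shift
    for pid in "$@"; do
        ${TASKSET:-taskset} -a -cp "$cpus" "$pid"
    done
}

# Example (not executed here): keep Firefox on the E-cores.
#   pin_pids 12-19 $(pgrep -x firefox)
```

Children started after the call inherit the new affinity, but helper processes that were already running under a different name have to be pinned separately.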

That’s all folks.

Have a nice summer,

A.

  1. Maybe I have a clue… looking at the lstopo output attentively, it is clear the LP-cores have no L3 cache. I have to check the workload proposed here, but if it cannot fit in the L2 cache, the extra load-store pressure on the memory controller may explain my preliminary results. I’ll investigate.
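The cache observation can be cross-checked without lstopo by looking at the per-CPU cache directories in sysfs. This is a small sketch, assuming the usual cache/index*/level layout; has_l3 is a hypothetical helper, and CPUs 4, 12 and 20 are the ones pinned in my script.

```shell
# has_l3 CPU [ROOT]: succeed if logical CPU number CPU reports a level-3 cache.
# ROOT is the sysfs mount point (normally /sys).
has_l3() {
    local cpu="$1" root="${2:-/sys}" f
    for f in "$root/devices/system/cpu/cpu$cpu/cache"/index*/level; do
        if [ -r "$f" ] && [ "$(cat "$f")" = "3" ]; then
            return 0
        fi
    done
    return 1
}

for cpu in 4 12 20; do
    if has_l3 "$cpu"; then
        echo "cpu$cpu: L3 present"
    else
        echo "cpu$cpu: no L3 reported"
    fi
done
```

The size files in the same directories give the capacity of each level, which is the number to compare against the working set of the stress-ng cpu stressor.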