Prefetcher小练习

这是我的嵌入式笔记第六篇,原文写于2015年。

作業要求 (B)

  • 閱讀 Week #8 效能分析: Prefetching 提到的論文: “When Prefetching Works, When It Doesn’t, and Why”,在 Linux/x86_64 (注意,要用 64-bit 系統,不能透過虛擬機器執行) 上編譯並執行 prefetcher(source)
    • 說明 naive_transpose, sse_transpose, sse_prefetch_transpose 之間的效能差異,以及 prefetcher 對 cache 的影響
  • 在 github 上 fork prefetcher,嘗試用 AVX 進一步提昇效能
    • 修改 Makefile,產生新的執行檔,分別對應於 naive_transpose, sse_transpose, sse_prefetch_transpose (學習 Homework #2 的做法)
    • 用 perf 分析 cache miss/hit
    • 參考 Performance of SSE and AVX Instruction Sets,用 SSE/AVX intrinsic 來改寫程式碼
    • 詳細描述實驗設計,以及你的觀察
  • 建立新的 Hackpad,列於「+作業區」,需要標注「開發紀錄 (B)」

Learn prefetcher

  • 先分析一下原始碼,首先是naive_transpose,其實現最簡單的矩陣轉置想法,從矩陣的左上角第一個元素開始,把舊矩陣中的元素按轉置後的順序存入新的矩陣中
1
2
3
4
5
6
7
8
void naive_transpose(int *src, int *dst, int w, int h)
{
for(int x = 0; x < w; x++){
for(int y = 0; y < h; y++){
*(dst + x*h + y) = *(src + y*w + x);
}
}
}
  • 接著sse_prefetch_transpose,使用了Intel處理器SIMD的技術,在+Week#1有做簡單的整理,簡單說就是一次將4筆資料放入sse暫存器中,執行一條指令就可以完成4筆資料處理。
    具體實現的過程參考了文章Programming trivia: 4x4 integer matrix transpose in SSE2
    • SSE指令的格式:
    • _mm_unpacklo_epi32(I0, I1)讀入兩個128位暫存器後會使用他們的2個低32位值,返回[a0, b0, a1, b1],_mm_unpackhi_epi32同理,而_mm_unpacklo_epi64則是一次取64位
  • 這種方法的效能改進在:
    • 一條指令處理4筆數據,要比4筆數據4條指令處理快
    • loop unrolling:
      • 執行loop循環的組合語言代碼執行次數會變少
      • branch prediction miss機率降低
      • Wikipedia還提到如果數據沒有相依性有機會使用並行處理,在這裡SIMD已經實現
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
void sse_transpose(int *src, int *dst, int w, int h)
{
for(int x = 0; x < w; x+=4){
for(int y = 0; y < h; y+=4){
__m128i I0 = _mm_loadu_si128 ((__m128i*)(src+y*w+x));
__m128i I1 = _mm_loadu_si128 ((__m128i*)(src+(y+1)*w+x));
__m128i I2 = _mm_loadu_si128 ((__m128i*)(src+(y+2)*w+x));
__m128i I3 = _mm_loadu_si128 ((__m128i*)(src+(y+3)*w+x));
__m128i T0 = _mm_unpacklo_epi32(I0, I1);
__m128i T1 = _mm_unpacklo_epi32(I2, I3);
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
__m128i T3 = _mm_unpackhi_epi32(I2, I3);
I0 = _mm_unpacklo_epi64(T0, T1);
I1 = _mm_unpackhi_epi64(T0, T1);
I2 = _mm_unpacklo_epi64(T2, T3);
I3 = _mm_unpackhi_epi64(T2, T3);
_mm_storeu_si128((__m128i*)(dst+(x*h)+y), I0);
_mm_storeu_si128((__m128i*)(dst+((x+1)*h)+y), I1);
_mm_storeu_si128((__m128i*)(dst+((x+2)*h)+y), I2);
_mm_storeu_si128((__m128i*)(dst+((x+3)*h)+y), I3);
}
}
}
  • 最後是sse_prefetch_transpose,相比sse_transpose多使用了4次_mm_prefetch指令
    • void _mm_prefetch(char * p , int i ) 會將地址p的數據加載到cache的一條cache line,int i有_MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2和_MM_HINT_NTA共4種,表示了不同的prefetch方式:
      • T0 - T2對應了L1 - L3 caches,NTA表示加載數據在L1 cache并標記為首先被替換的
      • 實際將T1替換為T0和T2運行程式,運行的時間區別不大,可能是測量的方式不精准
  • 為什麼PFDIST要設為8?實際運行結果是PFDIST=4比PFDIST=8平均慢10000us左右
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
void sse_prefetch_transpose(int *src, int *dst, int w, int h)
{
for(int x = 0; x < w; x+=4){
for(int y = 0; y < h; y+=4){
#define PFDIST 8

_mm_prefetch(src+(y+PFDIST)*w+x, _MM_HINT_T1);
_mm_prefetch(src+(y+PFDIST+1)*w+x, _MM_HINT_T1);
_mm_prefetch(src+(y+PFDIST+2)*w+x, _MM_HINT_T1);
_mm_prefetch(src+(y+PFDIST+3)*w+x, _MM_HINT_T1);

__m128i I0 = _mm_loadu_si128 ((__m128i*)(src+y*w+x));
__m128i I1 = _mm_loadu_si128 ((__m128i*)(src+(y+1)*w+x));
__m128i I2 = _mm_loadu_si128 ((__m128i*)(src+(y+2)*w+x));
__m128i I3 = _mm_loadu_si128 ((__m128i*)(src+(y+3)*w+x));
__m128i T0 = _mm_unpacklo_epi32(I0, I1);
__m128i T1 = _mm_unpacklo_epi32(I2, I3);
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
__m128i T3 = _mm_unpackhi_epi32(I2, I3);
I0 = _mm_unpacklo_epi64(T0, T1);
I1 = _mm_unpackhi_epi64(T0, T1);
I2 = _mm_unpacklo_epi64(T2, T3);
I3 = _mm_unpackhi_epi64(T2, T3);
_mm_storeu_si128((__m128i*)(dst+(x*h)+y), I0);
_mm_storeu_si128((__m128i*)(dst+((x+1)*h)+y), I1);
_mm_storeu_si128((__m128i*)(dst+((x+2)*h)+y), I2);
_mm_storeu_si128((__m128i*)(dst+((x+3)*h)+y), I3);
}
}
}

Reproduce Prefetcher by using AVX

修改Makefile執行檔

  • 在這裡使用了gcc -D來定義了兩個宏,一個”"$@.h"“用在#include時可以找到對應的頭文件,另一個”"$@"“是在printf時輸出對應的版本名稱
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
CFLAGS = -msse2 --std gnu99 -O0 -Wall
EXEC = naive_transpose sse_transpose sse_prefetch_transpose
all: $(EXEC) format
SRCS_common = main.c
naive_transpose:
$(CC) $(CFLAGS) -DIMPL="\"$@.h\"" -DSTR="\"$@\"" -o $@ $(SRCS_common) $@.c
sse_transpose:
$(CC) $(CFLAGS) -DIMPL="\"$@.h\"" -DSTR="\"$@\"" -o $@ $(SRCS_common) $@.c
sse_prefetch_transpose:
$(CC) $(CFLAGS) -DIMPL="\"$@.h\"" -DSTR="\"$@\"" -o $@ $(SRCS_common) $@.c
clean:
$(RM) $(EXEC) perf.*
format:
astyle --style=kr --indent=spaces=4 --indent-switches --suffix=none *.[ch]
run: $(EXEC)
./naive_transpose
./sse_transpose
./sse_prefetch_transpose
make clean

使用Perf分析cache miss/hit

  • 在運行Perf之前將main.c中的test part和打印全部刪掉,排除額外的程式碼以增加cache測量的準確率
  • echo "echo 1 > /proc/sys/vm/drop_caches" | sudo sh
    perf stat -r 100 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_orig
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Performance counter stats for './naive_transpose' (100 runs):
16857556 cache-misses # 93.608 % of all cache refs ( +- 0.06% )
18008720 cache-references ( +- 0.01% )
21069717 L1-dcache-load-misses ( +- 0.00% )
4255304 L1-dcache-store-misses ( +- 0.00% )
0 L1-dcache-prefetch-misses
24305 L1-icache-load-misses ( +- 0.46% )
0.352085215 seconds time elapsed ( +- 0.04% )

Performance counter stats for './sse_transpose' (100 runs):
4334609 cache-misses # 79.424 % of all cache refs ( +- 0.03% )
5457532 cache-references ( +- 0.02% )
8516824 L1-dcache-load-misses ( +- 0.01% )
4292592 L1-dcache-store-misses ( +- 0.03% )
0 L1-dcache-prefetch-misses
27101 L1-icache-load-misses ( +- 0.37% )
0.242525082 seconds time elapsed ( +- 0.04% )

Performance counter stats for './sse_prefetch_transpose' (100 runs):
4346864 cache-misses # 79.615 % of all cache refs ( +- 0.03% )
5459859 cache-references ( +- 0.02% )
8533348 L1-dcache-load-misses ( +- 0.02% )
4308308 L1-dcache-store-misses ( +- 0.05% )
0 L1-dcache-prefetch-misses
24590 L1-icache-load-misses ( +- 0.29% )
0.184224127 seconds time elapsed ( +- 0.08% )

參考 Performance of SSE and AVX Instruction Sets,用 SSE/AVX intrinsic 來改寫程式碼

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
for ( int x = 0; x < w; x += 8 ) {
for ( int y = 0; y < h; y += 8 ) {
__m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
__m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
__m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
__m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
__m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
__m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
__m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
__m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));

__m256i T0 = _mm256_unpacklo_epi32(I0, I1);
__m256i T1 = _mm256_unpackhi_epi32(I0, I1);
__m256i T2 = _mm256_unpacklo_epi32(I2, I3);
__m256i T3 = _mm256_unpackhi_epi32(I2, I3);
__m256i T4 = _mm256_unpacklo_epi32(I4, I5);
__m256i T5 = _mm256_unpackhi_epi32(I4, I5);
__m256i T6 = _mm256_unpacklo_epi32(I6, I7);
__m256i T7 = _mm256_unpackhi_epi32(I6, I7);

I0 = _mm256_unpacklo_epi64(T0, T2);
I1 = _mm256_unpackhi_epi64(T0, T2);
I2 = _mm256_unpacklo_epi64(T1, T3);
I3 = _mm256_unpackhi_epi64(T1, T3);
I4 = _mm256_unpacklo_epi64(T4, T6);
I5 = _mm256_unpackhi_epi64(T4, T6);
I6 = _mm256_unpacklo_epi64(T5, T7);
I7 = _mm256_unpackhi_epi64(T5, T7);

T0 = _mm256_permute2x128_si256(I0, I4, 0x20);
T1 = _mm256_permute2x128_si256(I1, I5, 0x20);
T2 = _mm256_permute2x128_si256(I2, I6, 0x20);
T3 = _mm256_permute2x128_si256(I3, I7, 0x20);
T4 = _mm256_permute2x128_si256(I0, I4, 0x31);
T5 = _mm256_permute2x128_si256(I1, I5, 0x31);
T6 = _mm256_permute2x128_si256(I2, I6, 0x31);
T7 = _mm256_permute2x128_si256(I3, I7, 0x31);

_mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), T0);
_mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), T1);
_mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), T2);
_mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), T3);
_mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), T4);
_mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), T5);
_mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), T6);
_mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), T7);
}
}
  • AVX指令集是256-bit,所以這裡一次處理8個byte,loop一次加8
  • 首先將依次將8行數組的元素載入到暫存器中
  • _mm256_unpacklo/hi_epi32函式讀入兩個256-bit的數,將低/高128-bit以32-bit為單位交錯排列,舉例:
    __m256i A = [ A0, A1, A2, A3, A4, A5, A6, A7 ];
    __m256i B = [ B0, B1, B2, B3, B4, B5, B6, B7 ];
    __m256i C = _mm256_unpacklo_epi32(I0, I1) = [ A0, B0, A1, B1, A2, B2, A3, B3 ];
  • _mm256_unpacklo_epi64同理
  • 以下對比了sse、avx共4個版本
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
Performance counter stats for './sse_transpose' (100 runs):
4329716 cache-misses # 79.312 % of all cache refs ( +- 0.03% )
5459067 cache-references ( +- 0.02% )
8514863 L1-dcache-load-misses ( +- 0.01% )
4290179 L1-dcache-store-misses ( +- 0.03% )
0 L1-dcache-prefetch-misses
26387 L1-icache-load-misses ( +- 0.46% )
0.242315241 seconds time elapsed ( +- 0.04% )

Performance counter stats for './sse_prefetch_transpose' (100 runs):
4345707 cache-misses # 79.577 % of all cache refs ( +- 0.03% )
5460983 cache-references ( +- 0.02% )
8532546 L1-dcache-load-misses ( +- 0.02% )
4307174 L1-dcache-store-misses ( +- 0.04% )
0 L1-dcache-prefetch-misses
24769 L1-icache-load-misses ( +- 0.46% )
0.184028053 seconds time elapsed ( +- 0.06% )

Performance counter stats for './avx_transpose' (100 runs):
3305343 cache-misses # 74.670 % of all cache refs ( +- 0.02% )
4426624 cache-references ( +- 0.01% )
8043457 L1-dcache-load-misses ( +- 0.02% )
4859334 L1-dcache-store-misses ( +- 0.02% )
0 L1-dcache-prefetch-misses
25696 L1-icache-load-misses ( +- 0.52% )
0.188154465 seconds time elapsed ( +- 0.02% )

Performance counter stats for './avx_prefetch_transpose' (100 runs):
3320332 cache-misses # 51.121 % of all cache refs ( +- 0.02% )
6495085 cache-references ( +- 0.02% )
8076035 L1-dcache-load-misses ( +- 0.02% )
4886925 L1-dcache-store-misses ( +- 0.02% )
0 L1-dcache-prefetch-misses
26493 L1-icache-load-misses ( +- 0.35% )
0.187412805 seconds time elapsed ( +- 0.05% )
  • 從cache-miss來看avx版本有25%左右的提升,從執行時間上單純avx版比sse版有明顯提升,但輸sse_prefetch;而avx的prefetch版與avx原版幾乎無差別,還輸sse_prefetch
  • 關於prefetch版,PFDIST取了8/16/32/64,目前暫取16。看來需要去When Prefetching Works, When It Doesn’t, and Why找答案了

Reference