Cache and Related Part3: Coherent

mwish

2020-11-01

（图来自 CMU 15-418 Spring 2020 和 UCB CS152）

当我们涉及多核的时候，我们的多个核可能会共享一份内存：

8A003FAC-7726-4EFA-8F98-721EAE514AC4

单个变量的访问可能会有问题：

BB44A00B-A60B-4CB8-B063-728410EB5DE4

我们访问 memory, 可能在直觉上会是：

读到最近一级别存储 x 的上一个写入的值

但是，memory 会遇到的问题是

有 global 的 storage 和 local cache，而且这两个可能是不一致的
1. 所以我们有 write back 和 write through 的策略，来 trade 性能和 consitent
有多个 local cache 的话，可能也会导致这种视图上的不一致

但是说到底，“读到上一个写入的值”，或者，“写入时间最近的值”，究竟是什么语义呢：

018B5523-101F-4E00-AF0D-B3C0EE915E7F

事实上，我们之前介绍过 CPU Cache, 现在给出一个 L1 L2 L3 的视图

15293AC1-A830-4AB5-96BE-0ACEE96B962D

显然，现在我们对同一个变量的访问，如上上张图所示，我各个处理器本身是顺序处理的，而它们在一起，会得到一个诡异的偏序，而 P1 P2 P3 的 x 值和内存中的 x 值是 inconsistent 的。这是问题所在：我们无法定义“上一个”逻辑意义上是什么。

要在保证性能的同时，也作出各个处理器对 x 的值读写的偏序保证，我们需要 coherence

Single CPU System: I/O

BF8647F2-AC2B-4ADD-BC99-353EEF618C16

单核 CPU IO 的时候，假设要读/写数据，可能会：

processor 写到了 write-buffer 里，而网卡可能没有读到 buffer 中的数据，而是读到了 memory 中的 stale data，造成写 stale data
网络中数据写到了 memory 中，通知 processor, processor lw 读到了 cache，造成读 stale data

某种意义上说，这也是一种不一致导致的，为了解决这个问题，需要下列的支持

CPU 写的时候可以写到和设备的 shared buffer 中，而不是本身的 write buffer。让设备能够读到正确信息
OS 可以把读写的 page 设置成不可 cache 的，同时，IO 完成的时候 flush cache page
在生产中，DMA transfer 的频率远比 lw sw 少，所以可以接受代价较高。

Coherence

终于到了 conherence 了，那么我们来讲讲，memory coherence，我必须要说，下面这张图理的绝对很清晰了：

A850F71C-E08B-4D09-A735-5E01922E811D

单个 Processor 对单个变量 x 的读写是顺序的
write propagation: P 对一个变量的写入必须最终对其他 Process 可见
write serialization : 写有全局的顺序

42AA0894-4FD5-4BF0-9A22-98F1A12F5B3C

以上便是违背 write serialization 的例子，对于 x 写入 “a" "b", P3 P4 观察到的顺序不一致。

那么，在这种假设下，假设有三个线程 t0, t1, t2:

// t0
for (i = 0; i < N; i += 2) {
	X = i;
}

// t1
for (j = 1; j < N; j += 2) {
	X = j;
}

// t3
for (k = 0; k < ..; k++) {
	s[k] = X;
}

那么，数组 s 中，可以看到我写的脑瘫程序可以输出：

26980 626437 1226236 1806376 2449353 3083919 3382384 3734980 4165338 5994059 
4872578 7435291 5586348 8893777 6262932 6539670 6875646 7205346 7535204 7873218 
8256550 8658202 14268993 14634249 9901488 10347038 15693957 16119785 16529365 17135341 
17836189 12867162 19251979 19962671 13929908 14261510 14598770 22698103 15331624 24051107

这里奇数/偶数都是强 order 的，但是他们之间并没有什么保证，只能保证全局有一致的 order

(题外话：但是我看到这有个问题，就是 atomic + relaxed 和 non-atomic 有什么区别，瞅了眼： https://stackoverflow.com/questions/63810298/what-is-the-difference-between-load-store-relaxed-atomic-and-normal-variable 这似乎保证单个操作的原子性，比如我们 RV32I 的 auipc 和 add…)

关于实现 Cache Coherent，是下一节的内容，我们再次先看 Memory Model 吧。

Memory Consistency

最实用、最有挑战性、最令人困惑的部分来了。我们现在有了变量 x 的 coherence，我们还需要什么呢？

Cache Coherence 保证了单个 cache block 的 loads and stores 是按照定义运行的，但是我们假设有一个 produce-consume:

// produce
value <- 5
flag <- ok

// consume
wait until flag is ok
-- 操作 value

那么，很尴尬的是，你会发现 cache coherent 没有提供可靠的保障。

“Memory Consistency Model” (sometimes called “Memory Ordering”): do all loads and stores, even to separate cache blocks, behave correctly?

实际上，你会发现问题的来源很多，而且来源不是上层应用，而是我们讲了2节的 cache write-buffer multi-level cache 这些给我们带来利好的东西，和 OoO 执行、编译器乱序等正常对程序员透明，现在全冒出来了的东西。

Uniprocess Memory Model

单个程序按 program order 的语义执行、读写

5E768853-5AB5-4B1A-B940-F2E960E9C93A

这图相当于我们 Part2 说的。

那么，在并行的时候，也就是我们说的 producer-consumer 中，可能会有下图的问题：

97CF1A96-F3F9-4976-B35F-D30F6CD258BA

write buffer 和 OoO 等乱序不能保证写入会被顺序的传播，没有 happens-before 的语义。

更要命的是，cache 也会在其中添乱：（当然不止是 cache，所以回头来，你要理解到，volatile 这个词在 C/C++ 中是危险的，不能认为它们读写内存就万事大吉了，不过我也不太清楚搞 NVM 的大佬们会不会用到这个）

065D74C5-5B97-40DD-8087-1A31D2CD4A5C

由于 P3 上有 A 的 cache, 它并不会从内存中读取应读的值，cache coherent 也没被违背，但是这会儿程序的语义就没法保证了！

更要命的是，还有编译器的乱序，关于编译器乱序可以参考没有使用 atomic 的 LevelDB 的 SkipList, 它只是写了一个编译器 barrier，而没有用 CPU 的 barrier 和指令。

Sequential Consistency: SC

这个是老熟人了：

Formalized by Lamport (1979)

accesses of each processor in program order
all accesses appear in sequential order

我理解就是简单说所有操作有一个全序，就所有操作偏序变全序。你可以用 seq_cst 来体验一下（

下面是它的硬件模型，描述的非常详细：

8D07B6F6-9B06-45FB-A3DB-5BB1DDBE47AF

这个时候，写入要求：

For each processor, delay start of memory access until previous one completes: each processor has only one outstanding memory access at a time

读取要求

a read completes when its return value is bound

7A92B093-6B60-49FE-BA37-B121D2B1B686

上述会带来严格的限制和可靠性，但是性能…

5BC3991D-5012-489B-B26E-64169EAE16E7

顺便可以提一下，对 x86 而言，一些 LOCK 指令会获取 Lock 相关的信息，而 MFENCE 会清空 Store Buffer。这里指令会按照发送的顺序来提交。

Relaxed Memory Model

于是我们有 relaxed 一些的模型，例如 TSO/PSO 等：

71298E26-F19C-4084-B6D9-3D0F971475C5

注：x86的 memory model 类似 TSO，只允许 store-load 乱序。

这在优化中获得了权衡：

166633FA-6A66-4370-A075-3FB0735136E8

ARM & IBM Power 的内存模型

在这里，我们的硬件和指令可能有下面的语义：

70B65496-6C17-4448-9BD0-F14C42DD2CC4

（上图实际上是一个 non-multi-copy 的模型）

下面是指令，它的 commit 可能是乱序的，而且指令可能会在分支预测阶段：

读取的时候，可能会直接读取
写的时候，要等待之前的内容 Commit

C33C7C13-990F-440F-BC8F-95465B2E2B5C

摘录一下原文：

For a read instruction, as soon as an address for the read is known, the read might be satisﬁed, binding its value to one received from the local memory (or in some cases forwarded from earlier in the thread). That value could immediately be used by later instructions in the thread that depend on it, but it and they are subject to being restarted or (if this is a speculative path) aborted until the read is committed.

For a write instruction, the key points are when the address and value become determined. After that (subject to other conditions) the write can be committed, sent to the local memory; this is not subject to restart or abort. After that, the write might propagate to other threads, becoming readable by them.

同时，指令发送的顺序和收到的顺序可能是不一样的，这里有一些分类：

Weak, multi-copy-atomic memory models
1. all processors see writes by another processor in same order
2. RISC-V RVWMO, baseline weak memory model for RISC-V
Weak, non-multi-copy-atomic memory models
1. processors can see another’s writes in different orders
2. ARM v7, original ARM v8, IBM POWER

IBM Power 和 ARM 中，可能会有 Load Buffer，读起来的顺序和写的顺序是不一样的。

可能出现的问题和同步

但是，这个时候，我们仍然要面对最初的那个问题：

// produce
value <- 5
flag <- ok

// consume
wait until flag is ok
-- 操作 value

Data Race:

two conflicting accesses on different processors
not ordered by intervening accesses

显然，我们对 flag 的操作本身没有经过同步，这个在仅允许 store-load 乱序的系统中，甚至没啥问题（不考虑编译器乱序）。但是，在 RISC-V 这种系统，或者 ARM 里，你就等死吧。我们需要一些机制来同步：

Properly Synchronized Programs:

all synchronizations are explicitly identified
all data accesses are ordered through synchronization

我们可以使用：

memory barrier
显式的锁，来隐式提供 fence

逻辑如下：

CBE31CAA-B88D-410F-9F82-21D0F68FE031

x86 的 xchg 显式的帮你完成这一切，你可以 acquire 来获得锁，在互斥区维护 invariant，release 来释放锁。锁前，锁中，锁后都是可以 re-order 的，但是它们不能越过锁。
MFENCE LFENCE MFENCE , fence会 stall 程序，把之前的操作执行完，来创建偏序。

RISC-V 中的同步

RISC-V 提供了 multi-copy model，然后提供了 amo 来支持原子操作，首先它们对操作的数据是原子完成的。首先有 aq 和 rl 两位：

如果提供了 aq，这条指令 “Happens Before” 之后的 load 和 store
如果提供了 rl，这条指令 “Happens After” 之前的 store 和 load

这里可以参考 spinlock:

// Acquire the lock.
// Loops (spins) until the lock is acquired.
void
acquire(struct spinlock *lk)
{
  // 如果:
  // * acquire
  // * timer interrupt
  // * timer interrupt 在调度之前拿同一把锁, 哦吼.
  push_off(); // disable interrupts to avoid deadlock.
  if(holding(lk))
    panic("acquire");

  // On RISC-V, sync_lock_test_and_set turns into an atomic swap:
  //   a5 = 1
  //   s1 = &lk->locked
  //   amoswap.w.aq a5, a5, (s1)
  //
  // 这个地方使用 acquire, 因为下面的内容不会被放到这前面.
  while(__sync_lock_test_and_set(&lk->locked, 1) != 0)
    ;

  // Tell the C compiler and the processor to not move loads or stores
  // past this point, to ensure that the critical section's memory
  // references happen strictly after the lock is acquired.
  // On RISC-V, this emits a fence instruction.
  //
  // 相当于手动给 lk->cpu 插入了一个 fencing, 让上面的东西不会被重拍下来
  __sync_synchronize();

  // Record info about lock acquisition for holding() and debugging.
  lk->cpu = mycpu();
}

// Release the lock.
void
release(struct spinlock *lk)
{
  if(!holding(lk))
    panic("release");

  // 这个不会被重排到下面去.
  lk->cpu = 0;

  // Tell the C compiler and the CPU to not move loads or stores
  // past this point, to ensure that all the stores in the critical
  // section are visible to other CPUs before the lock is released,
  // and that loads in the critical section occur strictly before
  // the lock is released.
  // On RISC-V, this emits a fence instruction.
  __sync_synchronize();

  // Release the lock, equivalent to lk->locked = 0.
  // This code doesn't use a C assignment, since the C standard
  // implies that an assignment might be implemented with
  // multiple store instructions.
  // On RISC-V, sync_lock_release turns into an atomic swap:
  //   s1 = &lk->locked
  //   amoswap.w zero, zero, (s1)
  __sync_lock_release(&lk->locked);

  pop_off();
}

(上面用的仍然是 gnu C 的 __sync_ 扩展，因为没有 amoswap.rl, 或者 ac，所以手动插入了一些 barrier)

关于 fence，可以参考：

D2ADEADD-86EA-4011-8033-2A4F28C1A960

举例而言，对于下面的图，fence w, w 和 fence r,r 之间构建了 barrier，来保证顺序：

D67A4594-E3A7-478D-A476-52922A7C3818

上述例子中，fence w, w 和 fence r, r 构建了一个偏序关系。sw data, (xdatap) 保证被 lw xdata, (xdatap) 感知了。

RISC-V 的支持：RV32A

RISC-V 具有宽松的内存一致性模型(relaxed memory consistency model)，因此其他线程看到的内存访问可以是乱序的。图 6.2 中，所有的 RV32A 指令都有一个请求位(aq)和一个释放位(rl)。aq 被置位的原子指令保证其它线程在随后的内存访问中看到顺序的 AMO 操作;rl 被置位的原子指令保证其它线程在此之前看到顺序的原子操作。想要了解更详细的有关知识，可以查看[Adve and Gharachorloo 1996]。

4390486F-3FF8-4C71-A8FB-DC0598E077CF

RV32A 没有直接提供了 CAS FAA 的指令，而是提供了两套指令。RV32A 认为这样有更好的可扩展性：

23F56E11-4461-4BB4-9D02-C2E38A64A60D

编译器乱序和语言的 `memory_order`

https://chromium.googlesource.com/external/leveldb/+/v1.15/port/atomic_pointer.h#61

除了内存，编译器也会完成乱序。最佳的例子是 LevelDB 之前的编译器 barrier。

语言会提供对应的语义，如 C++ 的六种 order （实话说我 consume 那几个完全不懂）: https://en.cppreference.com/w/cpp/atomic/memory_order

或者简单点可以看看 Golang 的（

References

CMU 15-418 Spring 2020
UCB CS152 Spring 2020
CPU Cache and Memory Ordering 何登成
RISC-V Reader
Computer Organization and Design RISC-V edition