Syntax of folly::ConcurrentHashMap

Java 的大部分语义是 share by reference 的,所以 ConcurrentHashMap 的增插删改不改变之前拿到的 Reference,这点我是能理解的。C++ / Rust 的语义则不是,在这两种语言中,也有不少 Concurrent Map 的基础设施,比如 datamap 或者 folly::ConcurrentHashMap,这种要如何保证 ConcurrentHashMap 的 value 语义?一边 overwrite/set,另一边 iterator 还在持有会怎么样?通用要共享会包一层 std::shared_ptr / Arc 吗?另外,这些 concurrent map 的 gc 又是怎么处理的?

我们直接看这个类型的代码: https://github.com/facebook/folly/blob/main/folly/concurrency/ConcurrentHashMap.h

我们首先会看到:

1
2
Readers are always wait-free.
Writers are sharded, but take a lock that only locks part of the map.

有并发设计经验的人会知道,shared_ptr 或者 Arc 本身不一定必要。这里内部实现用了 Hazptr,我们关注一下这里的引用语义:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
  be accessed while Iterators are still valid!

Therefore operator[] and at() return copies, since they do not
return an iterator. The returned value is const, to remind you
that changes do not affect the value in the map.

erase() calls the hash function, and may fail if the hash
function throws an exception.

The interface adds assign_if_equal, since find() doesn't take a lock.

Only const version of find() is supported, and const iterators.
Mutation must use functions provided, like assign().

iteration iterates over all the buckets in the table, unlike
std::unordered_map which iterates over a linked list of elements.
If the table is sparse, this may be more expensive.

1: ConcurrentHashMap, based on Java's ConcurrentHashMap.
Very similar to std::unordered_map in performance.

2: ConcurrentHashMapSIMD, based on F14ValueMap. If the map is
larger than the cache size, it has superior performance due to
vectorized key lookup.

USAGE FAQs

Q: Is simultaneous iteration and erase() threadsafe?
Example:

ConcurrentHashMap<int, int> map;

Thread 1: auto it = map.begin();
while (it != map.end()) {
// Do something with it
it++;
}

Thread 2: map.insert(2, 2); map.erase(2);

A: Yes, this is safe. However, the iterating thread is not
guaranteed to see (or not see) concurrent insertions and erasures.
Inserts may cause a rehash, but the old table is still valid as
long as any iterator pointing to it exists.

Q: How do I update an existing object atomically?

A: assign_if_equal is the recommended way - readers will see the
old value until the new value is completely constructed and
inserted.


Q: Are pointers to values safe to access withoutholding an
iterator?

A: The SIMD version guarantees that references to elements are
stable across rehashes, the non-SIMD version does not. Note that
unless you hold an iterator, you need to ensure there are no
concurrent deletes/updates to that key if you are accessing it via
reference.

上述内容其实告诉你:

  1. 提供了 assign_if_equal 等逻辑,允许你进行 epoch based 一些设计
  2. iterator 都是 const 的,然后承担了一些 hazptr 持有者(保证生命周期)的职责
  3. (其实)相当于写入都是写入新的值。

ConcurrentHashMap 很大,顺带一提,你随便往模版里写点啥 stack 上就 2KiB 了。

另一部分其实是内存的问题:

1
2
3
4
5
6
7
8
9
10
11
Q: Why do map.erase() and clear() not actually destroy elements?

A: Hazard Pointers are used to improve the performance of
concurrent access. They can be thought of as a simple Garbage
Collector. To reduce the GC overhead, a GC pass is only run after
reaching a certain memory bound. erase() will remove the element
from being accessed via the map, but actual destruction may happen
later, after iterators that may point to it have been deleted.

The only guarantee is that a GC pass will be run on map destruction
- no elements will remain after map destruction.

这里是 batch 的去做 GC,如果你的 value 是个复杂的 object,可能容易堆内存。