Integer & Endian

古早写过一篇:https://zhuanlan.zhihu.com/p/94406822

Two’s Complement(二进制补码)

虽然这是大部分入门计算机的人都学过的——不如说整篇文章都是,但是我们还是走流程复习一遍

如果表示 unsigned 相对来说很简单,那么表示 signed 就不行了。我们可以先验的知道一些结论,例如:

https://en.cppreference.com/w/cpp/language/types

  • char 的范围是 -128 to 127

那么,教科书可能会提到的一些朴素的实现:比方说用边缘的某一位表示 signed 或者 unsigned. 因为这样可能会有一个 +0 和一个 -0

不过我们回忆一下,也有这种模型来的不是,比如 IEEE 浮点数。

Sign and Magnitude: 最高位表示 signed

会引入 +0 -0 ,同时,加法/减法需要不同的行为。

https://en.wikipedia.org/wiki/Signed_number_representations#Signed_magnitude_representation

One’s Complement

positive numbers have leading 0s, negative numbers have leadings 1s.

https://en.wikipedia.org/wiki/Ones%27_complement

当然这个也有 +0 -0 的问题,One’s Complement 也是最高位表示 signed/unsigned 的

这个加减法都可以一定程度上统一,但是会带来 negative zero 的问题,而不带来 negative zero 又给计算增加了额外的复杂度。

Two’s complement

https://en.wikipedia.org/wiki/Two%27s_complement#Arithmetic_operations

https://zh.wikipedia.org/wiki/%E4%BA%8C%E8%A3%9C%E6%95%B8

现在有了 two complement, 这个问题就被很好地解决啦!

本节,完结!

Store data in memory

这一节或许需要回顾:https://zhuanlan.zhihu.com/p/118749234

当数据存在内存的时候,我们或许会有字节流:0x28872887,假如这里面全是32位有符号整数,我们怎么解释它们呢?

在 32位 RISC-V 中,我们是 Little-Endian 的,这意味着我们有如下布局:

DE4A8B7C-67CC-4304-A0BC-2721362FB6B1

发现没有,每个 byte 的顺序是不变的,但是一个 16 位整数中, byte 之间的顺序是不一样的,对于 0x28 来说,如果我们是 Little Endian, 那么,我们是 0x28, 而对于 BigEndian 来说,我们是 0x82.

那么再回到 0x28872887, 假设我们会有一个 int32_t s[4] 变成了以上的 0x28872887,或者有:

1
2
3
4
5
6
struct X {
int i1;
int i2;
int i3;
int i4;
}

那么,他们会怎么办?假如是大端序?答案是 i1 i2 i3 i4 仍然只代表 0x28 0x87 0x28 0x87 所在的那 16位,这个顺序是不会改变的。

Optimization

https://en.wikipedia.org/wiki/Endianness#Optimization

Little-Endian 提供了一个很好的性质,帮助用户完成上述的优化。

转化

你接触过 Unix 网络编程的话,可能会对下列的 api 有一定印象:

https://linux.die.net/man/3/htons

网络传输一般采用大端序,也被称之为网络字节序,或网络序。现在假设你做了一个程序,需要从网络读写,可以走上面这一套…

等等,我们是不是忽略了什么?那浮点数呢?

我们都知道浮点数有 IEEE 754 标准,那么:

Floating point[edit]

Although the ubiquitous x86 processors of today use little-endian storage for all types of data (integer, floating point), there are a number of hardware architectures where floating-point numbers are represented in big-endian form while integers are represented in little-endian form.[13] There are ARM processors that have half little-endian, half big-endian floating-point representation for double-precision numbers: both 32-bit words are stored in little-endian like integer registers, but the most significant one first. Because there have been many floating-point formats with no “network“ standard representation for them, the XDR standard uses big-endian IEEE 754 as its representation. It may therefore appear strange that the widespread IEEE 754 floating-point standard does not specify endianness.[14] Theoretically, this means that even standard IEEE floating-point data written by one machine might not be readable by another. However, on modern standard computers (i.e., implementing IEEE 754), one may in practice safely assume that the endianness is the same for floating-point numbers as for integers, making the conversion straightforward regardless of data type. (Small embedded systems using special floating-point formats may be another matter however.)

在 boost.endian中,甚至对 floating point 只有有限的支持:

Is there floating point support?

An attempt was made to support four-byte floats and eight-byte doubles, limited to IEEE 754 (also known as ISO/IEC/IEEE 60559) floating point and further limited to systems where floating point endianness does not differ from integer endianness. Even with those limitations, support for floating point types was not reliable and was removed. For example, simply reversing the endianness of a floating point number can result in a signaling-NAN.

Support for float and double has since been reinstated for endian_buffer, endian_arithmetic and the conversion functions that reverse endianness in place. The conversion functions that take and return by value still do not support floating point due to the above issues; reversing the bytes of a floating point number does not necessarily produce another valid floating point number.

走向 Reader

所以,当我们读取的时候,我们可能有大端序的整数读取:

1
2
3
4
5
6
7
8
9
10
template<typename T>
void Read_(const char* buf, T* v) {
static_assert(is_fundamental_v<T>);
*v = buf[0];
for (size_t i = 1; i < sizeof(T); ++i) {
*v <<= 8;
*v |= static_cast<uint8_t>(buf[i]);
}
}

整型转换

整型转换

任何整数类型或无作用域枚举类型的纯右值都可隐式转换成任何其他整数类型。若其转换列出于整数类型提升之下,则它是提升而非转换。

  • 若目标类型为无符号,则结果值是等于源值 2n
    \
    的最小无符号值,其中 n 是用于表示目标类型的位数。

  • 若目标类型有符号,则当源整数能以目标类型表示时,不更改其值。否则结果是实现定义的 (C++20 前)与源值对 2n
    \
    同余的唯一目标类型值,其中 n 是用于表示目标类型的位数 (C++20 起)(注意这不同于未定义的有符号整数算术溢出)。

  • 若源类型为 bool,则值 false 转换为目标类型的零,而值 true 转换成目标类型的一(注意若目标类型为 int,则这是整数类型提升,而非整数类型转换)。
  • 若目标类型为 bool,则这是布尔转换(见下文)。

事实上,TiKV 代码依赖了这点,unsigned/signed 在下层统一用 signed 表示,然后依赖转换的语义。