Arrow Scalar

Array & ArrayData

In earlier posts, https://blog.mwish.me/2023/05/04/Type-and-Array-in-Columnar-System/ and https://blog.mwish.me/2024/06/02/Arrow-Compute-Scalar-Function-Framework/, we briefly discussed Arrow Array, the core of Arrow data. A quick recap:

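  • Array: the various kinds of ::arrow::Array. Array wraps access to its elements (IsNull, typed accessors for the i-th element, iterators, and so on).
  • ArrayData: whereas Array is a base class, ArrayData is the actual payload of an Array. It is embedded in Array by composition and can be treated as a final type. ArrayData holds the type, the buffers, the null information (null_count), and any child nodes. The buffers include the data buffers, and the Arrow Columnar format specifies which buffers each type uses. The members of ArrayData are listed below: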
struct ARROW_EXPORT ArrayData {
  std::shared_ptr<DataType> type;
  int64_t length = 0;
  mutable std::atomic<int64_t> null_count{0};
  // The logical start point into the physical buffers (in values, not bytes).
  // Note that, for child data, this must be *added* to the child data's own offset.
  int64_t offset = 0;
  std::vector<std::shared_ptr<Buffer>> buffers;
  std::vector<std::shared_ptr<ArrayData>> child_data;

  // The dictionary for this Array, if any. Only used for dictionary type
  std::shared_ptr<ArrayData> dictionary;

  // The statistics for this Array.
  std::shared_ptr<ArrayStatistics> statistics;
};

Generally speaking, an Array can be constructed directly from an ArrayData.
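
To make the "payload embedded by composition" idea concrete, here is a minimal sketch using only the standard library. MiniArrayData and MiniInt32Array are hypothetical stand-ins invented for this post, not Arrow classes. The sketch only assumes what the member list above shows: shared buffers plus a logical length and offset counted in values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for ArrayData: a shared buffer plus a logical window.
struct MiniArrayData {
  std::shared_ptr<std::vector<int32_t>> values;  // plays the role of a data buffer
  int64_t length = 0;
  int64_t offset = 0;  // in values, not bytes

  // Zero-copy slice: same buffer, shifted window.
  MiniArrayData Slice(int64_t off, int64_t len) const {
    assert(off >= 0 && len >= 0 && off + len <= length);
    return MiniArrayData{values, len, offset + off};
  }
};

// Hypothetical stand-in for a typed Array: it embeds the payload by
// composition and only adds typed accessors on top of it.
class MiniInt32Array {
 public:
  explicit MiniInt32Array(MiniArrayData data) : data_(std::move(data)) {}

  int64_t length() const { return data_.length; }

  // Element access always goes through the logical offset.
  int32_t Value(int64_t i) const {
    return (*data_.values)[static_cast<std::size_t>(data_.offset + i)];
  }

  const MiniArrayData& data() const { return data_; }

 private:
  MiniArrayData data_;
};
```

Wrapping data.Slice(1, 3) in a second MiniInt32Array gives a view over the same buffer that starts at the second value. That is why the offset has to be applied on every access.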

Span of array

arrow::Buffer is a simple buffer that may or may not own its memory. In practice there is also a span type that represents a non-owning reference. For buffers it is BufferSpan:

/// \brief A non-owning Buffer reference
struct ARROW_EXPORT BufferSpan {
  // It is the user of this class's responsibility to ensure that
  // buffers that were const originally are not written to
  // accidentally.
  uint8_t* data = NULLPTR;
  int64_t size = 0;
  // Pointer back to buffer that owns this memory
  const std::shared_ptr<Buffer>* owner = NULLPTR;
};
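
The owning versus non-owning distinction can be sketched without Arrow at all. MiniBuffer, MiniBufferSpan and ViewOf below are hypothetical names made up for illustration. The point is only that building the span copies a raw pointer and a size, and never touches the reference count.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical owning buffer.
struct MiniBuffer {
  std::vector<uint8_t> bytes;
};

// Hypothetical non-owning view, shaped like the BufferSpan excerpt above.
struct MiniBufferSpan {
  uint8_t* data = nullptr;
  int64_t size = 0;
  // Pointer back to the shared_ptr that owns the memory (not a copy of it).
  const std::shared_ptr<MiniBuffer>* owner = nullptr;
};

// Builds a view; the caller must keep `buf` alive for as long as the view is used.
inline MiniBufferSpan ViewOf(const std::shared_ptr<MiniBuffer>& buf) {
  MiniBufferSpan span;
  span.data = buf->bytes.data();
  span.size = static_cast<int64_t>(buf->bytes.size());
  span.owner = &buf;
  return span;
}
```

Because the span does not extend the buffer's lifetime, it is cheap to copy, but every user has to make sure the owner outlives it.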

ArraySpan is likewise essentially a non-owning reference to an Array. Its struct looks like this:

/// \brief EXPERIMENTAL: A non-owning ArrayData reference that is cheaply
/// copyable and does not contain any shared_ptr objects. Do not use in public
/// APIs aside from compute kernels for now
struct ARROW_EXPORT ArraySpan {
  const DataType* type = NULLPTR;
  int64_t length = 0;
  mutable int64_t null_count = kUnknownNullCount;
  int64_t offset = 0;
  BufferSpan buffers[3];

  /// If dictionary-encoded, put dictionary in the first entry
  std::vector<ArraySpan> child_data;

  /// \brief Populate ArraySpan to look like an array of length 1 pointing at
  /// the data members of a Scalar value
  void FillFromScalar(const Scalar& value);

  std::shared_ptr<ArrayData> ToArrayData() const;
  std::shared_ptr<Array> ToArray() const;

  ArraySpan(const ArrayData& data);
  explicit ArraySpan(const Scalar& data) { FillFromScalar(data); }
};

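This part is easy to understand, and some compute kernels do in fact execute on top of ArraySpan. But notice that there is also ArraySpan(const Scalar& data), which builds a Span directly from a Scalar! Mysterious, right? So next we follow the path from Scalar to Span.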

Scalar

After all that preamble, we finally arrive at ::arrow::Scalar. Like ::arrow::Array, ::arrow::Scalar is a base class, and it carries many of the same operations as Array.

/// \brief Base class for scalar values
///
/// A Scalar represents a single value with a specific DataType.
/// Scalars are useful for passing single value inputs to compute functions,
/// or for representing individual array elements (with a non-trivial
/// wrapping cost, though).
struct ARROW_EXPORT Scalar : public std::enable_shared_from_this<Scalar>,
                             public util::EqualityComparable<Scalar> {
  /// \brief The type of the scalar value
  std::shared_ptr<DataType> type;

  /// \brief Whether the value is valid (not null) or not
  bool is_valid = false;
};

Primitive

Arrow provides base classes here for Scalars of non-Null types:

  • PrimitiveScalar takes a template parameter T as the Arrow type and uses CType as the internal value type.
namespace internal {

struct ARROW_EXPORT PrimitiveScalarBase : public Scalar {
  explicit PrimitiveScalarBase(std::shared_ptr<DataType> type)
      : Scalar(std::move(type), false) {}

  using Scalar::Scalar;
  /// \brief Get a const pointer to the value of this scalar. May be null.
  virtual const void* data() const = 0;
  /// \brief Get an immutable view of the value of this scalar as bytes.
  virtual std::string_view view() const = 0;
};

template <typename T, typename CType = typename T::c_type>
struct PrimitiveScalar : public PrimitiveScalarBase {
  using PrimitiveScalarBase::PrimitiveScalarBase;
  using TypeClass = T;
  using ValueType = CType;

  // Non-null constructor.
  PrimitiveScalar(ValueType value, std::shared_ptr<DataType> type)
      : PrimitiveScalarBase(std::move(type), true), value(value) {}

  explicit PrimitiveScalar(std::shared_ptr<DataType> type)
      : PrimitiveScalarBase(std::move(type), false) {}

  ValueType value{};

  const void* data() const override { return &value; }
  std::string_view view() const override {
    return std::string_view(reinterpret_cast<const char*>(&value), sizeof(ValueType));
  };
};

}  // namespace internal

A primitive scalar stores its data in a ValueType member. When an ArraySpan is generated from it, the flow is as follows (ignoring the filling of null-related data):

if (type_id == Type::BOOL) {
  const auto& scalar = checked_cast<const BooleanScalar&>(value);
  this->buffers[1].data = scalar.value ? &kTrueBit : &kFalseBit;
  this->buffers[1].size = 1;
} else if (is_primitive(type_id) || is_decimal(type_id) ||
           type_id == Type::DICTIONARY) {
  const auto& scalar = checked_cast<const internal::PrimitiveScalarBase&>(value);
  const uint8_t* scalar_data = reinterpret_cast<const uint8_t*>(scalar.view().data());
  this->buffers[1].data = const_cast<uint8_t*>(scalar_data);
  this->buffers[1].size = scalar.type->byte_width();
  if (type_id == Type::DICTIONARY) {
    // Populate dictionary data
    const auto& dict_scalar = checked_cast<const DictionaryScalar&>(value);
    this->child_data.resize(1);
    this->child_data[0].SetMembers(*dict_scalar.value.dictionary->data());
  }
}
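
The same flow can be reproduced in a self-contained toy. Everything below (ByteView, MiniArraySpan, MiniInt32Scalar, SpanFromInt32, SpanFromBool, ReadInt32) is hypothetical and uses only the standard library. It is a simplified sketch of the idea, not Arrow's implementation: the span of a scalar is a length-1 array whose buffers point either at static one-byte bitmaps or straight at the scalar's own value member.

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

// Static one-byte "bitmaps" shared by all scalar-backed spans
// (named after the constants in the excerpt above).
static const uint8_t kTrueBit = 0x01;
static const uint8_t kFalseBit = 0x00;

struct ByteView {
  const uint8_t* data = nullptr;
  int64_t size = 0;
};

struct MiniArraySpan {
  int64_t length = 0;
  int64_t null_count = 0;
  ByteView buffers[2];  // [0] = validity, [1] = data
};

struct MiniInt32Scalar {
  int32_t value = 0;
  bool is_valid = false;
  // Bytes of the value, in the spirit of PrimitiveScalar::view().
  std::string_view view() const {
    return std::string_view(reinterpret_cast<const char*>(&value), sizeof(value));
  }
};

// The span borrows the scalar's storage, so the scalar must outlive the span.
inline MiniArraySpan SpanFromInt32(const MiniInt32Scalar& s) {
  MiniArraySpan span;
  span.length = 1;
  span.null_count = s.is_valid ? 0 : 1;
  span.buffers[0] = ByteView{s.is_valid ? &kTrueBit : &kFalseBit, 1};
  span.buffers[1] = ByteView{reinterpret_cast<const uint8_t*>(s.view().data()),
                             static_cast<int64_t>(sizeof(int32_t))};
  return span;
}

// Booleans are bit-packed, so even the data buffer can be a static byte.
inline MiniArraySpan SpanFromBool(bool value, bool is_valid) {
  MiniArraySpan span;
  span.length = 1;
  span.null_count = is_valid ? 0 : 1;
  span.buffers[0] = ByteView{is_valid ? &kTrueBit : &kFalseBit, 1};
  span.buffers[1] = ByteView{value ? &kTrueBit : &kFalseBit, 1};
  return span;
}

// Reads element 0 back out of the data buffer.
inline int32_t ReadInt32(const MiniArraySpan& span) {
  int32_t out = 0;
  std::memcpy(&out, span.buffers[1].data, sizeof(out));
  return out;
}
```

No allocation happens anywhere in this path, which is presumably what makes viewing a scalar as a length-1 array cheap enough for kernels.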

Dict

The code above also contains the dictionary-related handling. The dictionary Scalar is implemented as follows:

/// \brief A Scalar value for DictionaryType
///
/// `is_valid` denotes the validity of the `index`, regardless of
/// the corresponding value in the `dictionary`.
struct ARROW_EXPORT DictionaryScalar : public internal::PrimitiveScalarBase {
  using TypeClass = DictionaryType;
  struct ValueType {
    // The index may be any of several integer types, so it is abstracted as a Scalar.
    std::shared_ptr<Scalar> index;
    std::shared_ptr<Array> dictionary;
  } value;
};
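
As a rough illustration of the "index plus dictionary" shape, here is a toy model. MiniDictionaryScalar and Decode are hypothetical names, the index is fixed to int32_t for brevity, and the dictionary is a plain vector of strings. As in the comment above, is_valid describes only the index. This toy ignores the possibility that the dictionary entry itself is null.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified dictionary scalar: one index into a shared dictionary.
struct MiniDictionaryScalar {
  bool is_valid = false;  // validity of `index` only
  int32_t index = 0;
  std::shared_ptr<const std::vector<std::string>> dictionary;
};

// Resolves the index against the dictionary; returns nullopt for a null
// scalar, a missing dictionary, or an out-of-range index.
inline std::optional<std::string> Decode(const MiniDictionaryScalar& s) {
  if (!s.is_valid || !s.dictionary) return std::nullopt;
  if (s.index < 0 || static_cast<std::size_t>(s.index) >= s.dictionary->size()) {
    return std::nullopt;
  }
  return (*s.dictionary)[static_cast<std::size_t>(s.index)];
}
```

The dictionary is shared rather than copied, so many scalars taken from the same dictionary array can point at one dictionary.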

Var-length & scratch space

An important piece here is scratch_space_. It effectively fabricates a piece of memory that plays the role of a buffer, and the constructor fills it via Impl::FillScratchSpace.

namespace internal {

constexpr auto kScalarScratchSpaceSize = sizeof(int64_t) * 2;

template <typename Impl>
struct ArraySpanFillFromScalarScratchSpace {
  // 16 bytes of scratch space to enable ArraySpan to be a view onto any
  // Scalar- including binary scalars where we need to create a buffer
  // that looks like two 32-bit or 64-bit offsets.
  alignas(int64_t) mutable uint8_t scratch_space_[kScalarScratchSpaceSize];

 private:
  template <typename... Args>
  explicit ArraySpanFillFromScalarScratchSpace(Args&&... args) {
    Impl::FillScratchSpace(scratch_space_, std::forward<Args>(args)...);
  }

  ArraySpanFillFromScalarScratchSpace() = delete;

  friend Impl;
};

}  // namespace internal

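What is the scratch space for? Let's review the buffers Arrow uses:

  1. Validity buffer
  2. Data buffer
  3. An extra buffer for variable-length data or other content. An Arrow ArrayData can itself hold more than three buffers, but an ArraySpan holds at most three, because even for a type like StringView a single string references at most one buffer.

For a Scalar, the validity buffer points to a static memory region (the data buffer of a Boolean Scalar does the same). For primitive types, the data buffer points to the address of the PrimitiveScalar's value member, which is easy to understand. For List and Binary types, the data buffer may be one of:

  • An offset buffer (for List/String)
  • An offset/pointer buffer (for StringView and similar)
  • Offset and length buffers (for ListView and similar)

BinaryScalar makes use of this scratch space: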

struct ARROW_EXPORT BinaryScalar
    : public BaseBinaryScalar,
      private internal::ArraySpanFillFromScalarScratchSpace<BinaryScalar> {
  using TypeClass = BinaryType;
  using ArraySpanFillFromScalarScratchSpace =
      internal::ArraySpanFillFromScalarScratchSpace<BinaryScalar>;

  explicit BinaryScalar(std::shared_ptr<DataType> type)
      : BaseBinaryScalar(std::move(type)),
        ArraySpanFillFromScalarScratchSpace(this->value) {}

  BinaryScalar(std::shared_ptr<Buffer> value, std::shared_ptr<DataType> type)
      : BaseBinaryScalar(std::move(value), std::move(type)),
        ArraySpanFillFromScalarScratchSpace(this->value) {}
  /// ...
};

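The scratch space provides enough fixed-size inline storage to fabricate the buffers a span needs, such as the offsets that sit in the data buffer slot. The actual bytes of a Binary value live in the base type: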

struct ARROW_EXPORT BaseBinaryScalar : public internal::PrimitiveScalarBase {
  using ValueType = std::shared_ptr<Buffer>;

  // The value is not supposed to be modified after construction, because subclasses have
  // a scratch space whose content need to be kept consistent with the value. It is also
  // the user of this class's responsibility to ensure that the buffer is not written to
  // accidentally.
  const std::shared_ptr<Buffer> value = NULLPTR;
};
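
To see why a binary scalar needs scratch space at all, consider a sketch in plain C++. MiniBinaryScalar, MiniBinarySpan, SpanFromBinary and ValueAt are hypothetical names, and 32-bit offsets are assumed. A binary array with one element needs an offsets buffer holding [0, size], but a scalar has no such buffer, so the constructor writes those two offsets into its own inline storage.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

constexpr std::size_t kScratchSize = sizeof(int64_t) * 2;  // 16 bytes, as in the excerpt

// Hypothetical binary scalar: the bytes live in `value`, the fake offsets in the scratch.
struct MiniBinaryScalar {
  const std::string value;  // const: the scratch must stay consistent with it
  alignas(int64_t) mutable uint8_t scratch_space_[kScratchSize] = {};

  explicit MiniBinaryScalar(std::string v) : value(std::move(v)) {
    // Two 32-bit offsets delimiting the single element: [0, size].
    const int32_t offsets[2] = {0, static_cast<int32_t>(value.size())};
    std::memcpy(scratch_space_, offsets, sizeof(offsets));
  }
};

struct MiniBinarySpan {
  const uint8_t* offsets = nullptr;  // points into the scalar's scratch space
  int64_t offsets_size = 0;
  const uint8_t* data = nullptr;     // points at the scalar's bytes
  int64_t data_size = 0;
};

inline MiniBinarySpan SpanFromBinary(const MiniBinaryScalar& s) {
  MiniBinarySpan span;
  span.offsets = s.scratch_space_;
  span.offsets_size = static_cast<int64_t>(2 * sizeof(int32_t));
  span.data = reinterpret_cast<const uint8_t*>(s.value.data());
  span.data_size = static_cast<int64_t>(s.value.size());
  return span;
}

// Reads element i the way a kernel would: offsets[i] .. offsets[i + 1].
inline std::string ValueAt(const MiniBinarySpan& span, int64_t i) {
  int32_t begin = 0;
  int32_t end = 0;
  std::memcpy(&begin, span.offsets + i * sizeof(int32_t), sizeof(int32_t));
  std::memcpy(&end, span.offsets + (i + 1) * sizeof(int32_t), sizeof(int32_t));
  return std::string(reinterpret_cast<const char*>(span.data) + begin,
                     static_cast<std::size_t>(end - begin));
}
```

This also shows why the value must not change after construction: the offsets are computed once in the constructor, so a later change in size would leave them stale.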

Struct

A StructArray is implemented as a validity buffer plus child arrays. The corresponding Scalar holds multiple child Scalars:

struct ARROW_EXPORT StructScalar : public Scalar {
  using TypeClass = StructType;
  using ValueType = std::vector<std::shared_ptr<Scalar>>;

  ScalarVector value;
  /// ...
};

It is converted into an ArraySpan as follows:

if (type_id == Type::STRUCT) {
  const auto& scalar = checked_cast<const StructScalar&>(value);
  this->child_data.resize(this->type->num_fields());
  DCHECK_EQ(this->type->num_fields(), static_cast<int>(scalar.value.size()));
  for (size_t i = 0; i < scalar.value.size(); ++i) {
    this->child_data[i].FillFromScalar(*scalar.value[i]);
  }
}
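
The recursion is the interesting part, so here is a stripped-down model of it. MiniScalar, MiniArraySpan, MakeLeaf and MakeStruct are hypothetical, leaves are restricted to int64_t, and a flag replaces the real type system. A struct scalar fills one child span per field, and each child is itself a length-1 span.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical scalar: either an int64 leaf or a struct of child scalars.
struct MiniScalar {
  bool is_valid = true;
  bool is_struct = false;
  int64_t value = 0;                                // used by leaves
  std::vector<std::shared_ptr<MiniScalar>> fields;  // used by structs
};

inline std::shared_ptr<MiniScalar> MakeLeaf(int64_t v) {
  auto s = std::make_shared<MiniScalar>();
  s->value = v;
  return s;
}

inline std::shared_ptr<MiniScalar> MakeStruct(
    std::vector<std::shared_ptr<MiniScalar>> fields) {
  auto s = std::make_shared<MiniScalar>();
  s->is_struct = true;
  s->fields = std::move(fields);
  return s;
}

// Hypothetical span: a length-1 view with one child span per struct field.
struct MiniArraySpan {
  int64_t length = 0;
  int64_t null_count = 0;
  const int64_t* data = nullptr;  // leaf data; null for structs
  std::vector<MiniArraySpan> child_data;

  void FillFromScalar(const MiniScalar& s) {
    length = 1;
    null_count = s.is_valid ? 0 : 1;
    data = s.is_struct ? nullptr : &s.value;
    child_data.resize(s.fields.size());
    for (std::size_t i = 0; i < s.fields.size(); ++i) {
      child_data[i].FillFromScalar(*s.fields[i]);  // recurse into each field
    }
  }
};
```

Unlike the primitive case, this path does allocate, because child_data is a vector. Deeply nested struct scalars are therefore somewhat more expensive to view as spans.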

Lists / Map

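List and Map are both nested structures whose children are "of the same type and of unknown length". Here, the child values are neatly represented with an ::arrow::Array.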

struct ARROW_EXPORT BaseListScalar : public Scalar {
  using ValueType = std::shared_ptr<Array>;
  // The value is not supposed to be modified after construction, because subclasses have
  // a scratch space whose content need to be kept consistent with the value. It is also
  // the user of this class's responsibility to ensure that the array is not modified
  // accidentally.
  const std::shared_ptr<Array> value;
};

Each kind of Scalar implements its own scratch-filling logic.
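
For a plain list, that logic plausibly amounts to writing the offsets [0, value->length()] into the scratch space, so that the one list element covers the whole child array. The sketch below illustrates that assumption with hypothetical names (MiniListScalar, MiniListSpan, SpanFromList, ListAt), 32-bit offsets, and an int32 child array. It is not taken from Arrow's source.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical list scalar: the child values are a shared array, and the
// fabricated offsets live in the scalar's inline scratch space.
struct MiniListScalar {
  const std::shared_ptr<const std::vector<int32_t>> value;
  alignas(int64_t) mutable uint8_t scratch_space_[16] = {};

  explicit MiniListScalar(std::shared_ptr<const std::vector<int32_t>> v)
      : value(std::move(v)) {
    // Offsets count child elements (not bytes): [0, child length].
    const int32_t offsets[2] = {0, value ? static_cast<int32_t>(value->size()) : 0};
    std::memcpy(scratch_space_, offsets, sizeof(offsets));
  }
};

struct MiniListSpan {
  const uint8_t* offsets = nullptr;       // points into the scratch space
  const int32_t* child_values = nullptr;  // points into the child array
};

inline MiniListSpan SpanFromList(const MiniListScalar& s) {
  return MiniListSpan{s.scratch_space_, s.value ? s.value->data() : nullptr};
}

// Materializes list element i from the offsets and the child values.
inline std::vector<int32_t> ListAt(const MiniListSpan& span, int64_t i) {
  int32_t begin = 0;
  int32_t end = 0;
  std::memcpy(&begin, span.offsets + i * sizeof(int32_t), sizeof(int32_t));
  std::memcpy(&end, span.offsets + (i + 1) * sizeof(int32_t), sizeof(int32_t));
  if (begin == end) return {};
  return std::vector<int32_t>(span.child_values + begin, span.child_values + end);
}
```

A ListView-style scalar would need an offset and a size instead of two offsets, and a 64-bit variant would need wider entries. That would explain why the scratch area is sized for two 64-bit integers.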