Arrow Scalar

mwish

2024-10-31

Array & ArrayData

In earlier posts, https://blog.mwish.me/2023/05/04/Type-and-Array-in-Columnar-System/ and https://blog.mwish.me/2024/06/02/Arrow-Compute-Scalar-Function-Framework/ , we briefly discussed Arrow Array, the core of Arrow data. A quick summary:

Array: the various kinds of ::arrow::Array. Array wraps access to its elements (IsNull, typed access to the i-th element, iterators, and so on).
ArrayData: where Array is a base class, ArrayData is the actual payload of an Array. It is embedded in Array by composition and can be treated as a final type. ArrayData holds the type, the buffers, the null count, and any child nodes; which buffers exist for each type is specified by the Arrow columnar format. The members of ArrayData are listed below:

struct ARROW_EXPORT ArrayData {
  std::shared_ptr<DataType> type;
  int64_t length = 0;
  mutable std::atomic<int64_t> null_count{0};
  // The logical start point into the physical buffers (in values, not bytes).
  // Note that, for child data, this must be *added* to the child data's own offset.
  int64_t offset = 0;
  std::vector<std::shared_ptr<Buffer>> buffers;
  std::vector<std::shared_ptr<ArrayData>> child_data;

  // The dictionary for this Array, if any. Only used for dictionary type
  std::shared_ptr<ArrayData> dictionary;

  // The statistics for this Array.
  std::shared_ptr<ArrayStatistics> statistics;
};

In general, an Array can be created directly from an ArrayData.
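To make the Array / ArrayData split concrete, here is a minimal self-contained sketch. It does not use the Arrow library: `ToyArrayData`, `ToyInt32Array`, and `MakeToyArray` are hypothetical names that only mimic the shape of the real types, assuming buffer 0 is a validity bitmap and buffer 1 holds the values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <utility>
#include <vector>

using ToyBuffer = std::vector<uint8_t>;

// Hypothetical stand-in for ArrayData: a plain payload without virtual methods.
struct ToyArrayData {
  int64_t length = 0;
  int64_t offset = 0;  // logical start, counted in values (not bytes)
  std::vector<std::shared_ptr<ToyBuffer>> buffers;  // [validity, values]
};

// Hypothetical stand-in for a typed Array: embeds the payload and adds accessors.
class ToyInt32Array {
 public:
  explicit ToyInt32Array(std::shared_ptr<ToyArrayData> data) : data_(std::move(data)) {}

  int64_t length() const { return data_->length; }

  bool IsNull(int64_t i) const {
    const std::shared_ptr<ToyBuffer>& validity = data_->buffers[0];
    if (!validity) return false;  // no bitmap means every slot is valid
    const int64_t bit = data_->offset + i;
    return (((*validity)[static_cast<std::size_t>(bit / 8)] >> (bit % 8)) & 1) == 0;
  }

  int32_t Value(int64_t i) const {
    int32_t out;
    const std::size_t pos = static_cast<std::size_t>(data_->offset + i) * sizeof(int32_t);
    std::memcpy(&out, data_->buffers[1]->data() + pos, sizeof(out));
    return out;
  }

  // Slicing shares the buffers and only adjusts offset and length.
  ToyInt32Array Slice(int64_t off, int64_t len) const {
    auto sliced = std::make_shared<ToyArrayData>(*data_);
    sliced->offset += off;
    sliced->length = len;
    return ToyInt32Array(std::move(sliced));
  }

 private:
  std::shared_ptr<ToyArrayData> data_;
};

// Builds the 4-element array {10, null, 30, 40} for demonstration.
inline ToyInt32Array MakeToyArray() {
  const int32_t raw[4] = {10, 0, 30, 40};
  auto values = std::make_shared<ToyBuffer>(sizeof(raw));
  std::memcpy(values->data(), raw, sizeof(raw));
  auto validity = std::make_shared<ToyBuffer>(1, uint8_t{0x0D});  // bits 1011, LSB first
  auto data = std::make_shared<ToyArrayData>();
  data->length = 4;
  data->buffers.push_back(validity);
  data->buffers.push_back(values);
  return ToyInt32Array(std::move(data));
}
```

The point of the sketch is that the typed wrapper carries no state of its own: everything, including the slice offset, lives in the payload.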

Span of array

arrow::Buffer is a simple buffer that may or may not own its memory. There is also a span type, BufferSpan, that refers to a Buffer without owning it:

/// \brief A non-owning Buffer reference
struct ARROW_EXPORT BufferSpan {
  // It is the user of this class's responsibility to ensure that
  // buffers that were const originally are not written to
  // accidentally.
  uint8_t* data = NULLPTR;
  int64_t size = 0;
  // Pointer back to buffer that owns this memory
  const std::shared_ptr<Buffer>* owner = NULLPTR;
};

ArraySpan is likewise essentially a reference to an Array. Its struct looks like this:

/// \brief EXPERIMENTAL: A non-owning ArrayData reference that is cheaply
/// copyable and does not contain any shared_ptr objects. Do not use in public
/// APIs aside from compute kernels for now
struct ARROW_EXPORT ArraySpan {
  const DataType* type = NULLPTR;
  int64_t length = 0;
  mutable int64_t null_count = kUnknownNullCount;
  int64_t offset = 0;
  BufferSpan buffers[3];
  
  /// If dictionary-encoded, put dictionary in the first entry
  std::vector<ArraySpan> child_data;
    
  /// \brief Populate ArraySpan to look like an array of length 1 pointing at
  /// the data members of a Scalar value
  void FillFromScalar(const Scalar& value);
  
  std::shared_ptr<ArrayData> ToArrayData() const;
  std::shared_ptr<Array> ToArray() const;
  
  ArraySpan(const ArrayData& data);
  explicit ArraySpan(const Scalar& data) { FillFromScalar(data); }
};
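As a rough illustration of what "non-owning" buys us, the sketch below builds a span over an owning payload and copies it without touching any reference count. It is a simplified model, not Arrow code: `ToyOwningData`, `ToyBufferSpan`, and `ToySpan` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

using ToyBuffer = std::vector<uint8_t>;

// Hypothetical owning payload, in the role of ArrayData.
struct ToyOwningData {
  int64_t length = 0;
  std::vector<std::shared_ptr<ToyBuffer>> buffers;
};

// Hypothetical non-owning buffer reference, in the role of BufferSpan.
struct ToyBufferSpan {
  uint8_t* data = nullptr;
  int64_t size = 0;
};

// Hypothetical non-owning view, in the role of ArraySpan: raw pointers only.
struct ToySpan {
  int64_t length = 0;
  ToyBufferSpan buffers[3];

  explicit ToySpan(const ToyOwningData& owner) : length(owner.length) {
    for (std::size_t i = 0; i < owner.buffers.size() && i < 3; ++i) {
      if (owner.buffers[i]) {
        buffers[i].data = owner.buffers[i]->data();
        buffers[i].size = static_cast<int64_t>(owner.buffers[i]->size());
      }
    }
  }
};
```

Copying a `ToySpan` is a plain copy of a few pointers and integers, so the owner must outlive every span created from it.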

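This part is easy to follow, and some compute kernels are in fact executed on top of ArraySpan. But notice that there is also an ArraySpan(const Scalar& data) constructor, which builds a span directly from a Scalar! Mysterious, right? So next we turn to Scalar.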

Scalar

After this long detour we finally arrive at ::arrow::Scalar. Like ::arrow::Array, ::arrow::Scalar is a base class, and it carries many of the same operations as Array.

/// \brief Base class for scalar values
///
/// A Scalar represents a single value with a specific DataType.
/// Scalars are useful for passing single value inputs to compute functions,
/// or for representing individual array elements (with a non-trivial
/// wrapping cost, though).
struct ARROW_EXPORT Scalar : public std::enable_shared_from_this<Scalar>,
                             public util::EqualityComparable<Scalar> {
  /// \brief The type of the scalar value
  std::shared_ptr<DataType> type;

  /// \brief Whether the value is valid (not null) or not
  bool is_valid = false;
};

Primitive

Arrow provides a base class for Scalars of non-Null types:

PrimitiveScalar takes T as the Arrow type and uses CType as the internal value type.

namespace internal {

struct ARROW_EXPORT PrimitiveScalarBase : public Scalar {
  explicit PrimitiveScalarBase(std::shared_ptr<DataType> type)
      : Scalar(std::move(type), false) {}

  using Scalar::Scalar;
  /// \brief Get a const pointer to the value of this scalar. May be null.
  virtual const void* data() const = 0;
  /// \brief Get an immutable view of the value of this scalar as bytes.
  virtual std::string_view view() const = 0;
};

template <typename T, typename CType = typename T::c_type>
struct PrimitiveScalar : public PrimitiveScalarBase {
  using PrimitiveScalarBase::PrimitiveScalarBase;
  using TypeClass = T;
  using ValueType = CType;

  // Non-null constructor.
  PrimitiveScalar(ValueType value, std::shared_ptr<DataType> type)
      : PrimitiveScalarBase(std::move(type), true), value(value) {}

  explicit PrimitiveScalar(std::shared_ptr<DataType> type)
      : PrimitiveScalarBase(std::move(type), false) {}

  ValueType value{};

  const void* data() const override { return &value; }
  std::string_view view() const override {
    return std::string_view(reinterpret_cast<const char*>(&value), sizeof(ValueType));
  };
};

}  // namespace internal

A primitive scalar stores its data in a ValueType member. When an ArraySpan is generated from it, the flow is as follows (ignoring how nulls and similar data are filled in):

if (type_id == Type::BOOL) {
  const auto& scalar = checked_cast<const BooleanScalar&>(value);
  this->buffers[1].data = scalar.value ? &kTrueBit : &kFalseBit;
  this->buffers[1].size = 1;
} else if (is_primitive(type_id) || is_decimal(type_id) ||
           type_id == Type::DICTIONARY) {
  const auto& scalar = checked_cast<const internal::PrimitiveScalarBase&>(value);
  const uint8_t* scalar_data = reinterpret_cast<const uint8_t*>(scalar.view().data());
  this->buffers[1].data = const_cast<uint8_t*>(scalar_data);
  this->buffers[1].size = scalar.type->byte_width();
  if (type_id == Type::DICTIONARY) {
    // Populate dictionary data
    const auto& dict_scalar = checked_cast<const DictionaryScalar&>(value);
    this->child_data.resize(1);
    this->child_data[0].SetMembers(*dict_scalar.value.dictionary->data());
  }
}
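The flow above can be approximated without Arrow as follows. This is only a sketch under simplifying assumptions: `ToyPrimitiveScalar`, `ToySpan`, `FillFromPrimitive`, and `FillFromBool` are hypothetical names, and the validity handling is my own simplified guess at the part the snippet above leaves out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct ToyBufferSpan {
  const uint8_t* data = nullptr;
  int64_t size = 0;
};

struct ToySpan {
  int64_t length = 0;
  int64_t null_count = 0;
  ToyBufferSpan buffers[3];
};

// One-byte static bitmaps shared by all scalars (bit 0 set or cleared).
static const uint8_t kToyTrueBit = 0x01;
static const uint8_t kToyFalseBit = 0x00;

template <typename CType>
struct ToyPrimitiveScalar {
  CType value{};
  bool is_valid = false;
};

// Makes the scalar look like a length-1 array. The span points into `s`,
// so `s` must outlive the returned span.
template <typename CType>
ToySpan FillFromPrimitive(const ToyPrimitiveScalar<CType>& s) {
  ToySpan span;
  span.length = 1;
  span.null_count = s.is_valid ? 0 : 1;
  span.buffers[0].data = s.is_valid ? &kToyTrueBit : &kToyFalseBit;
  span.buffers[0].size = 1;
  span.buffers[1].data = reinterpret_cast<const uint8_t*>(&s.value);
  span.buffers[1].size = static_cast<int64_t>(sizeof(CType));
  return span;
}

// Booleans are bit-packed, so the data buffer also points at a static bit.
inline ToySpan FillFromBool(const ToyPrimitiveScalar<bool>& s) {
  ToySpan span;
  span.length = 1;
  span.null_count = s.is_valid ? 0 : 1;
  span.buffers[0].data = s.is_valid ? &kToyTrueBit : &kToyFalseBit;
  span.buffers[0].size = 1;
  span.buffers[1].data = s.value ? &kToyTrueBit : &kToyFalseBit;
  span.buffers[1].size = 1;
  return span;
}
```

No allocation happens anywhere: the span borrows either the scalar's own `value` member or a static byte.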

Dict

The code above also covers the Dict case. The Dict Scalar is implemented as follows:

/// \brief A Scalar value for DictionaryType
///
/// `is_valid` denotes the validity of the `index`, regardless of
/// the corresponding value in the `dictionary`.
struct ARROW_EXPORT DictionaryScalar : public internal::PrimitiveScalarBase {
  using TypeClass = DictionaryType;
  struct ValueType {
    // The index may be any integer type, so it is abstracted as a Scalar.
    std::shared_ptr<Scalar> index;
    std::shared_ptr<Array> dictionary;
  } value;
};
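The index / dictionary split can be sketched like this. It is a deliberately simplified model: `ToyDictionaryScalar` and `Decode` are hypothetical, the index is a plain `int64_t` rather than a Scalar, and the dictionary is a vector of strings rather than an Array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical simplified dictionary scalar: `is_valid` describes the index only.
struct ToyDictionaryScalar {
  bool is_valid = false;
  int64_t index = 0;
  std::shared_ptr<const std::vector<std::string>> dictionary;
};

// Resolves the index against the dictionary; returns nullptr for a null scalar
// or an out-of-range index.
inline const std::string* Decode(const ToyDictionaryScalar& s) {
  if (!s.is_valid || !s.dictionary) return nullptr;
  if (s.index < 0 || static_cast<std::size_t>(s.index) >= s.dictionary->size()) {
    return nullptr;
  }
  return &(*s.dictionary)[static_cast<std::size_t>(s.index)];
}
```

Because the dictionary is shared, many scalars taken from the same array can point at one dictionary and differ only in their index.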

Var-length & scratch space

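An important piece here is scratch_space_. It effectively fabricates a small block of buffer memory inside the scalar itself, which is filled in the constructor through Impl::FillScratchSpace: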

namespace internal {

constexpr auto kScalarScratchSpaceSize = sizeof(int64_t) * 2;

template <typename Impl>
struct ArraySpanFillFromScalarScratchSpace {
  //  16 bytes of scratch space to enable ArraySpan to be a view onto any
  //  Scalar- including binary scalars where we need to create a buffer
  //  that looks like two 32-bit or 64-bit offsets.
  alignas(int64_t) mutable uint8_t scratch_space_[kScalarScratchSpaceSize];

 private:
  template <typename... Args>
  explicit ArraySpanFillFromScalarScratchSpace(Args&&... args) {
    Impl::FillScratchSpace(scratch_space_, std::forward<Args>(args)...);
  }

  ArraySpanFillFromScalarScratchSpace() = delete;

  friend Impl;
};

} // namespace internal

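What is the scratch space for? Let us review the buffers Arrow uses:

Validity Buffer
Data Buffer
An additional buffer for variable-length or other content. An Arrow ArrayData may have more than three buffers, but an ArraySpan has at most three, because even for a type like StringView a single string references at most one data buffer.

The Validity Buffer points to a static memory region (the Data Buffer of a Boolean Scalar does the same). For a primitive type, the Data Buffer points to the PrimitiveScalar's &value, which is easy to understand. For List and Binary, the buffer in that slot may instead be:

Offset Buffer (for List/String)
Offset/Pointer Buffer (for StringView and similar types)
Offset-Length (for ListView and similar types)

BinaryScalar borrows this scratch space: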

struct ARROW_EXPORT BinaryScalar
    : public BaseBinaryScalar,
      private internal::ArraySpanFillFromScalarScratchSpace<BinaryScalar> {
  using TypeClass = BinaryType;
  using ArraySpanFillFromScalarScratchSpace =
      internal::ArraySpanFillFromScalarScratchSpace<BinaryScalar>;
      
  explicit BinaryScalar(std::shared_ptr<DataType> type)
    : BaseBinaryScalar(std::move(type)),
      ArraySpanFillFromScalarScratchSpace(this->value) {}

  BinaryScalar(std::shared_ptr<Buffer> value, std::shared_ptr<DataType> type)
    : BaseBinaryScalar(std::move(value), std::move(type)),
      ArraySpanFillFromScalarScratchSpace(this->value) {}
  /// ...
};

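The scratch space provides enough fixed-size inline storage to back the offsets and similar buffers that have to be filled in. The Binary data buffer itself is stored on the base type: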

struct ARROW_EXPORT BaseBinaryScalar : public internal::PrimitiveScalarBase {
  using ValueType = std::shared_ptr<Buffer>;

  // The value is not supposed to be modified after construction, because subclasses have
  // a scratch space whose content need to be kept consistent with the value. It is also
  // the user of this class's responsibility to ensure that the buffer is not written to
  // accidentally.
  const std::shared_ptr<Buffer> value = NULLPTR;
};
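Putting the two pieces together, here is a sketch of the CRTP scratch-space pattern for a binary scalar. It is a standalone model rather than Arrow's implementation: `ToyScratchSpace`, `ToyBinaryBase`, `ToyBinaryScalar`, and `ReadThroughOffsets` are hypothetical, the value is a `std::string` instead of a Buffer, and the choice of two 32-bit offsets {0, size} is an assumption based on how a one-element binary array would be laid out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

constexpr std::size_t kToyScratchSize = sizeof(int64_t) * 2;

// CRTP base: owns the 16 bytes and asks Impl how to fill them.
template <typename Impl>
struct ToyScratchSpace {
  alignas(int64_t) mutable uint8_t scratch_space_[kToyScratchSize];

 private:
  template <typename... Args>
  explicit ToyScratchSpace(Args&&... args) {
    Impl::FillScratchSpace(scratch_space_, std::forward<Args>(args)...);
  }
  friend Impl;
};

// Holds the actual bytes, in the role of BaseBinaryScalar.
struct ToyBinaryBase {
  explicit ToyBinaryBase(std::string v) : value(std::move(v)) {}
  const std::string value;  // const: the scratch must stay consistent with it
};

// ToyBinaryBase is listed first so `value` is initialized before the scratch is filled.
struct ToyBinaryScalar : public ToyBinaryBase, private ToyScratchSpace<ToyBinaryScalar> {
  explicit ToyBinaryScalar(std::string v)
      : ToyBinaryBase(std::move(v)), ToyScratchSpace<ToyBinaryScalar>(this->value) {}

  // Assumption: two 32-bit offsets {0, size}, like a one-element binary array.
  static void FillScratchSpace(uint8_t* scratch, const std::string& value) {
    const int32_t offsets[2] = {0, static_cast<int32_t>(value.size())};
    std::memcpy(scratch, offsets, sizeof(offsets));
  }

  const uint8_t* offsets_buffer() const { return scratch_space_; }
};

// Reads the value the way a kernel would: through the offsets, not through size().
inline std::string ReadThroughOffsets(const ToyBinaryScalar& s) {
  int32_t offsets[2];
  std::memcpy(offsets, s.offsets_buffer(), sizeof(offsets));
  return s.value.substr(static_cast<std::size_t>(offsets[0]),
                        static_cast<std::size_t>(offsets[1] - offsets[0]));
}
```

This also shows why the value is declared const: if it changed after construction, the offsets cached in the scratch bytes would no longer describe it.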

Struct

StructArray itself is implemented as a validity bitmap plus child nodes. The corresponding Scalar contains multiple child Scalars:

struct ARROW_EXPORT StructScalar : public Scalar {
  using TypeClass = StructType;
  using ValueType = std::vector<std::shared_ptr<Scalar>>;

  ScalarVector value;
  /// ...
};

It is converted into an ArraySpan as follows:

if (type_id == Type::STRUCT) {
  const auto& scalar = checked_cast<const StructScalar&>(value);
  this->child_data.resize(this->type->num_fields());
  DCHECK_EQ(this->type->num_fields(), static_cast<int>(scalar.value.size()));
  for (size_t i = 0; i < scalar.value.size(); ++i) {
    this->child_data[i].FillFromScalar(*scalar.value[i]);
  }
}
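The recursion is perhaps easier to see in a stripped-down form. The sketch below is a hypothetical model (`ToyScalar` and `ToySpan` are not Arrow types) in which a leaf scalar holds one `int64_t` and a struct scalar holds only child scalars.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical scalar: a leaf when `fields` is empty, a struct otherwise.
struct ToyScalar {
  bool is_valid = true;
  int64_t value = 0;                               // used by leaf scalars only
  std::vector<std::shared_ptr<ToyScalar>> fields;  // used by struct scalars only
};

// Hypothetical span: every level looks like an array of length 1.
struct ToySpan {
  int64_t length = 0;
  const int64_t* values = nullptr;  // points into the leaf scalar
  std::vector<ToySpan> child_data;

  void FillFromScalar(const ToyScalar& s) {
    length = 1;
    if (s.fields.empty()) {
      values = &s.value;
      return;
    }
    // One child span per field, each filled recursively.
    child_data.resize(s.fields.size());
    for (std::size_t i = 0; i < s.fields.size(); ++i) {
      child_data[i].FillFromScalar(*s.fields[i]);
    }
  }
};
```

A nested struct simply recurses one level deeper, and every span in the tree still has length 1.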

Lists / Map

List and Map are both nested structures whose children share a single type but have an unknown length. Here the children are neatly represented by an ::arrow::Array.

struct ARROW_EXPORT BaseListScalar : public Scalar {
  using ValueType = std::shared_ptr<Array>;
  // The value is not supposed to be modified after construction, because subclasses have
  // a scratch space whose content need to be kept consistent with the value. It is also
  // the user of this class's responsibility to ensure that the array is not modified
  // accidentally.
  const std::shared_ptr<Array> value;
};

Each concrete Scalar type then implements its own scratch-filling logic.
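As a rough idea of what such per-type filling might look like, the sketch below writes a pair of values into 16 scratch bytes and reads it back under two interpretations. All names (`FillPair`, `ReadAt`, `ListLength`, `ListViewSize`) are hypothetical, and the layouts are assumptions: {0, n} as two offsets for a list, and (offset 0, size n) for a list view, with 32-bit or 64-bit entries depending on the type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kToyScratchSize = sizeof(int64_t) * 2;

// Writes two values of OffsetType back to back into the scratch bytes.
template <typename OffsetType>
void FillPair(uint8_t* scratch, OffsetType first, OffsetType second) {
  static_assert(2 * sizeof(OffsetType) <= kToyScratchSize, "scratch too small");
  const OffsetType pair[2] = {first, second};
  std::memcpy(scratch, pair, sizeof(pair));
}

// Reads the value stored in the given slot (0 or 1).
template <typename OffsetType>
OffsetType ReadAt(const uint8_t* scratch, std::size_t slot) {
  OffsetType out;
  std::memcpy(&out, scratch + slot * sizeof(OffsetType), sizeof(out));
  return out;
}

// List-like reading: the scratch is an offsets buffer [start, end].
template <typename OffsetType>
int64_t ListLength(const uint8_t* scratch) {
  return static_cast<int64_t>(ReadAt<OffsetType>(scratch, 1)) -
         static_cast<int64_t>(ReadAt<OffsetType>(scratch, 0));
}

// ListView-like reading: slot 0 is the offset, slot 1 is the size.
template <typename OffsetType>
int64_t ListViewSize(const uint8_t* scratch) {
  return static_cast<int64_t>(ReadAt<OffsetType>(scratch, 1));
}
```

The two readings coincide here only because the single element starts at offset 0; the 16-byte size is what lets the same scratch area serve both the 32-bit and the 64-bit variants.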