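ArrayData: whereas Array is a base class, ArrayData is the actual payload of an Array. It is embedded in Array by composition and can be treated as a final type. ArrayData holds the type, the buffers, the null count, and any child nodes. The buffers include the data buffer, and the Arrow Columnar format specifies which buffers exist for each type. The members of ArrayData are listed below: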
    struct ARROW_EXPORT ArrayData {
      // ...
      std::shared_ptr<DataType> type;
      int64_t length = 0;
      mutable std::atomic<int64_t> null_count{0};
      // The logical start point into the physical buffers (in values, not bytes).
      // Note that, for child data, this must be *added* to the child data's own offset.
      int64_t offset = 0;
      std::vector<std::shared_ptr<Buffer>> buffers;
      std::vector<std::shared_ptr<ArrayData>> child_data;

      // The dictionary for this Array, if any. Only used for dictionary type
      std::shared_ptr<ArrayData> dictionary;

      // The statistics for this Array.
      std::shared_ptr<ArrayStatistics> statistics;
    };
Generally, an Array can be created directly from an ArrayData (for example with arrow::MakeArray).
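The comment on `offset` above says it is a logical start point counted in values, not bytes. The following is a minimal sketch of that idea, not Arrow's implementation: `MiniArrayData`, `Slice`, and `ValueAt` are hypothetical names, and the model holds a single int32 buffer. It shows how a slice can share the physical buffer and differ only in `offset` and `length`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical, simplified stand-in for arrow::ArrayData holding int32 values.
// Only the fields relevant to slicing are modeled.
struct MiniArrayData {
  int64_t length = 0;
  int64_t offset = 0;  // logical start, counted in values, not bytes
  std::shared_ptr<std::vector<int32_t>> values;  // shared physical buffer
};

// Zero-copy slice: shares the buffer and only adjusts offset and length.
inline MiniArrayData Slice(const MiniArrayData& data, int64_t start, int64_t count) {
  assert(start >= 0 && count >= 0 && start + count <= data.length);
  MiniArrayData out = data;  // copies the shared_ptr, not the values
  out.offset = data.offset + start;
  out.length = count;
  return out;
}

// Reads logical element i by adding the offset to the physical index.
inline int32_t ValueAt(const MiniArrayData& data, int64_t i) {
  assert(i >= 0 && i < data.length);
  return (*data.values)[static_cast<std::size_t>(data.offset + i)];
}
```

Slicing a slice accumulates offsets, which is why every read has to go through `offset + i` rather than index the buffer directly.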
Span of array
arrow::Buffer is a simple buffer that may or may not own its memory. In fact, there is one more type here, ArraySpan, which represents a non-owning reference to an Array's data:
    /// \brief A non-owning Buffer reference
    struct ARROW_EXPORT BufferSpan {
      // It is the user of this class's responsibility to ensure that
      // buffers that were const originally are not written to
      // accidentally.
      uint8_t* data = NULLPTR;
      int64_t size = 0;
      // Pointer back to buffer that owns this memory
      const std::shared_ptr<Buffer>* owner = NULLPTR;
    };

    /// \brief EXPERIMENTAL: A non-owning ArrayData reference that is cheaply
    /// copyable and does not contain any shared_ptr objects. Do not use in public
    /// APIs aside from compute kernels for now
    struct ARROW_EXPORT ArraySpan {
      const DataType* type = NULLPTR;
      int64_t length = 0;
      mutable int64_t null_count = kUnknownNullCount;
      int64_t offset = 0;
      BufferSpan buffers[3];

      /// If dictionary-encoded, put dictionary in the first entry
      std::vector<ArraySpan> child_data;

      /// \brief Populate ArraySpan to look like an array of length 1 pointing at
      /// the data members of a Scalar value
      void FillFromScalar(const Scalar& value);

      std::shared_ptr<ArrayData> ToArrayData() const;
      std::shared_ptr<Array> ToArray() const;

      ArraySpan(const ArrayData& data);
      explicit ArraySpan(const Scalar& data) { FillFromScalar(data); }
    };
    /// \brief Base class for scalar values
    ///
    /// A Scalar represents a single value with a specific DataType.
    /// Scalars are useful for passing single value inputs to compute functions,
    /// or for representing individual array elements (with a non-trivial
    /// wrapping cost, though).
    struct ARROW_EXPORT Scalar : public std::enable_shared_from_this<Scalar>,
                                 public util::EqualityComparable<Scalar> {
      /// \brief The type of the scalar value
      std::shared_ptr<DataType> type;

      /// \brief Whether the value is valid (not null) or not
      bool is_valid = false;
    };
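To make the owning versus non-owning distinction concrete, here is a small sketch under simplifying assumptions. `OwningData`, `Int64ScalarLike`, and `Int64SpanLike` are hypothetical types for illustration only and do not match Arrow's real classes. The view stores raw pointers, so it is cheap to copy, and it can be pointed either at array data or at a single scalar value so that the scalar looks like an array of length 1.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical owning container, loosely modeled on ArrayData.
struct OwningData {
  int64_t length = 0;
  int64_t offset = 0;
  std::shared_ptr<std::vector<int64_t>> values;
};

// Hypothetical scalar, loosely modeled on a primitive Scalar.
struct Int64ScalarLike {
  bool is_valid = false;
  int64_t value = 0;
};

// Hypothetical non-owning view, loosely modeled on ArraySpan. It holds raw
// pointers only, so it must not outlive the object it points at.
struct Int64SpanLike {
  const int64_t* data = nullptr;
  int64_t length = 0;
  int64_t offset = 0;
  int64_t null_count = 0;

  // View over existing array data; no reference count is touched.
  static Int64SpanLike FromData(const OwningData& d) {
    Int64SpanLike s;
    s.data = d.values->data();
    s.length = d.length;
    s.offset = d.offset;
    return s;
  }

  // Make the view look like an array of length 1 pointing at the scalar's value.
  static Int64SpanLike FromScalar(const Int64ScalarLike& scalar) {
    Int64SpanLike s;
    s.data = &scalar.value;
    s.length = 1;
    s.null_count = scalar.is_valid ? 0 : 1;
    return s;
  }

  int64_t Get(int64_t i) const {
    assert(i >= 0 && i < length);
    return data[offset + i];
  }
};
```

With both sources reduced to the same view type, code that consumes the view does not need a separate path for scalar inputs.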
Primitive
A base class is provided here for Scalars of non-null types:
PrimitiveScalar takes T as the Arrow type and uses CType as its internal value type.
    struct ARROW_EXPORT PrimitiveScalarBase : public Scalar {
      using Scalar::Scalar;
      /// \brief Get a const pointer to the value of this scalar. May be null.
      virtual const void* data() const = 0;
      /// \brief Get an immutable view of the value of this scalar as bytes.
      virtual std::string_view view() const = 0;
    };

    template <typename T, typename CType = typename T::c_type>
    struct PrimitiveScalar : public PrimitiveScalarBase {
      using PrimitiveScalarBase::PrimitiveScalarBase;
      using TypeClass = T;
      using ValueType = CType;
      // ...
    };
    /// \brief A Scalar value for DictionaryType
    ///
    /// `is_valid` denotes the validity of the `index`, regardless of
    /// the corresponding value in the `dictionary`.
    struct ARROW_EXPORT DictionaryScalar : public internal::PrimitiveScalarBase {
      using TypeClass = DictionaryType;
      struct ValueType {
        // The index may be any integer type, so it is abstracted as a Scalar.
        std::shared_ptr<Scalar> index;
        std::shared_ptr<Array> dictionary;
      } value;
    };
    template <typename Impl>
    struct ArraySpanFillFromScalarScratchSpace {
      // 16 bytes of scratch space to enable ArraySpan to be a view onto any
      // Scalar- including binary scalars where we need to create a buffer
      // that looks like two 32-bit or 64-bit offsets.
      alignas(int64_t) mutable uint8_t scratch_space_[kScalarScratchSpaceSize];
      // ...
    };

    struct ARROW_EXPORT BaseBinaryScalar : public internal::PrimitiveScalarBase {
      using ValueType = std::shared_ptr<Buffer>;

      // The value is not supposed to be modified after construction, because subclasses have
      // a scratch space whose content need to be kept consistent with the value. It is also
      // the user of this class's responsibility to ensure that the buffer is not written to
      // accidentally.
      const std::shared_ptr<Buffer> value = NULLPTR;
    };
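The scratch-space comment above explains that a binary scalar needs a small buffer that looks like two offsets. The sketch below illustrates one way this could work, assuming 32-bit offsets. `BinaryScalarLike`, `BinarySpanLike`, and `ViewAsArray` are hypothetical names, and a `std::string` stands in for the `Buffer`. It is not Arrow's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical model of a binary scalar that carries scratch space.
struct BinaryScalarLike {
  const std::string value;  // stands in for const std::shared_ptr<Buffer>
  alignas(int64_t) mutable uint8_t scratch_space[16];

  explicit BinaryScalarLike(std::string v) : value(std::move(v)) {
    // Two 32-bit offsets: the single element spans bytes [0, value.size()).
    const int32_t offsets[2] = {0, static_cast<int32_t>(value.size())};
    std::memset(scratch_space, 0, sizeof(scratch_space));
    std::memcpy(scratch_space, offsets, sizeof(offsets));
  }
};

// Hypothetical length-1 view shaped like a variable-length binary array:
// an offsets buffer plus a data buffer.
struct BinarySpanLike {
  const uint8_t* offsets = nullptr;  // points into the scalar's scratch space
  const char* data = nullptr;        // points at the scalar's bytes
  int64_t length = 0;

  int32_t Offset(int64_t i) const {
    int32_t v;
    std::memcpy(&v, offsets + i * sizeof(int32_t), sizeof(v));
    return v;
  }

  std::string Get(int64_t i) const {
    assert(i >= 0 && i < length);
    const int32_t begin = Offset(i);
    const int32_t end = Offset(i + 1);
    return std::string(data + begin, static_cast<std::size_t>(end - begin));
  }
};

inline BinarySpanLike ViewAsArray(const BinaryScalarLike& scalar) {
  BinarySpanLike span;
  span.offsets = scalar.scratch_space;
  span.data = scalar.value.data();
  span.length = 1;
  return span;
}
```

Because the offsets are computed once from the value, changing the value afterwards would leave the scratch space stale, which matches the reason given in the comment for keeping `value` const.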
    struct ARROW_EXPORT StructScalar : public Scalar {
      using TypeClass = StructType;
      using ValueType = std::vector<std::shared_ptr<Scalar>>;

      ScalarVector value;
      // ...
    };
A StructScalar is converted into an ArraySpan as follows:
    if (type_id == Type::STRUCT) {
      const auto& scalar = checked_cast<const StructScalar&>(value);
      this->child_data.resize(this->type->num_fields());
      DCHECK_EQ(this->type->num_fields(), static_cast<int>(scalar.value.size()));
      for (size_t i = 0; i < scalar.value.size(); ++i) {
        this->child_data[i].FillFromScalar(*scalar.value[i]);
      }
    }
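The recursion in the snippet above can be seen in isolation with a toy model. `NodeScalar` and `NodeSpan` are hypothetical types invented for this sketch: a leaf holds one int64 value, a struct holds child scalars, and filling a span from a struct fills one child span per field.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical scalar tree: a leaf carries an int64 value, a struct carries children.
struct NodeScalar {
  bool is_struct = false;
  int64_t value = 0;                                  // used by leaves
  std::vector<std::shared_ptr<NodeScalar>> children;  // used by structs
};

// Hypothetical span: a length-1 view over a scalar, with one child span per field.
struct NodeSpan {
  int64_t length = 0;
  const int64_t* data = nullptr;  // set for leaves only
  std::vector<NodeSpan> child_data;

  void FillFromScalar(const NodeScalar& scalar) {
    length = 1;
    data = nullptr;
    child_data.clear();
    if (scalar.is_struct) {
      // One child span per field, each filled recursively from the field's scalar.
      child_data.resize(scalar.children.size());
      for (std::size_t i = 0; i < scalar.children.size(); ++i) {
        child_data[i].FillFromScalar(*scalar.children[i]);
      }
    } else {
      data = &scalar.value;
    }
  }
};
```

Nested structs need no extra handling in this model, since each child call takes the struct branch again when its scalar is itself a struct.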
Lists / Map
List and Map are both nested structures whose children are "of the same type and of unknown length". Here the children are neatly represented by an ::arrow::Array.
    struct ARROW_EXPORT BaseListScalar : public Scalar {
      using ValueType = std::shared_ptr<Array>;
      // The value is not supposed to be modified after construction, because subclasses have
      // a scratch space whose content need to be kept consistent with the value. It is also
      // the user of this class's responsibility to ensure that the array is not modified
      // accidentally.
      const std::shared_ptr<Array> value;
    };