
Commit 8bcdc0f

GH-41186: [C++][Parquet][Doc] Denote PARQUET:field_id in parquet.rst (apache#41187)

mapleFU and pitrou authored
### Rationale for this change

Denote PARQUET:field_id in parquet.rst

### What changes are included in this PR?

Just a doc improvement

### Are these changes tested?

No

### Are there any user-facing changes?

No

* GitHub Issue: apache#41186

Lead-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
1 parent c8f89d0 commit 8bcdc0f

1 file changed: +18 -4 lines changed

docs/source/cpp/parquet.rst

Lines changed: 18 additions & 4 deletions
@@ -522,17 +522,16 @@ An Arrow Dictionary type is written out as its value type. It can still
 be recreated at read time using Parquet metadata (see "Roundtripping Arrow
 types" below).
 
-Roundtripping Arrow types
-~~~~~~~~~~~~~~~~~~~~~~~~~
+Roundtripping Arrow types and schema
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 While there is no bijection between Arrow types and Parquet types, it is
 possible to serialize the Arrow schema as part of the Parquet file metadata.
 This is enabled using :func:`ArrowWriterProperties::store_schema`.
 
 On the read path, the serialized schema will be automatically recognized
 and will recreate the original Arrow data, converting the Parquet data as
-required (for example, a LargeList will be recreated from the Parquet LIST
-type).
+required.
 
 As an example, when serializing an Arrow LargeList to Parquet:
 
@@ -542,13 +541,28 @@ As an example, when serializing an Arrow LargeList to Parquet:
 :func:`ArrowWriterProperties::store_schema` was enabled when writing the file;
 otherwise, it is decoded as an Arrow List.
 
+Parquet field id
+""""""""""""""""
+
+The Parquet format supports an optional integer *field id* which can be assigned
+to a given field. This is used for example in the
+`Apache Iceberg specification <https://github.com/apache/iceberg/blob/main/format/spec.md#column-projection>`__.
+
+On the writer side, if ``PARQUET:field_id`` is present as a metadata key on an
+Arrow field, then its value is parsed as a non-negative integer and is used as
+the field id for the corresponding Parquet field.
+
+On the reader side, Arrow will convert such a field id to a metadata key named
+``PARQUET:field_id`` on the corresponding Arrow field.
+
 Serialization details
 """""""""""""""""""""
 
 The Arrow schema is serialized as a :ref:`Arrow IPC <format-ipc>` schema message,
 then base64-encoded and stored under the ``ARROW:schema`` metadata key in
 the Parquet file metadata.
 
+
 Limitations
 ~~~~~~~~~~~
 

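The field id rule documented in this change can be illustrated with a small standalone sketch. It does not use the Arrow API: field metadata is modeled as a plain ``std::map``, and the helper names ``FieldIdFromMetadata`` and ``MetadataFromFieldId`` are hypothetical. Treating a missing, malformed, or negative value as "no field id" is an assumption of this sketch; the documentation text above only states that the value is parsed as a non-negative integer.

```cpp
#include <charconv>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <system_error>

// Hypothetical stand-in for an Arrow field's key-value metadata.
using Metadata = std::map<std::string, std::string>;

// Writer-side sketch: look up "PARQUET:field_id" and parse its value as a
// non-negative integer. Returns std::nullopt when the key is absent or the
// value is not a non-negative integer (an assumption of this sketch).
std::optional<int32_t> FieldIdFromMetadata(const Metadata& metadata) {
  auto it = metadata.find("PARQUET:field_id");
  if (it == metadata.end()) return std::nullopt;

  const std::string& text = it->second;
  const char* begin = text.data();
  const char* end = begin + text.size();
  int32_t value = 0;
  auto [ptr, ec] = std::from_chars(begin, end, value);

  // Reject parse errors, trailing characters, and negative values.
  if (ec != std::errc() || ptr != end || value < 0) return std::nullopt;
  return value;
}

// Reader-side sketch: expose a Parquet field id as a "PARQUET:field_id"
// metadata entry. No entry is produced when there is no field id.
Metadata MetadataFromFieldId(std::optional<int32_t> field_id) {
  Metadata metadata;
  if (field_id.has_value()) {
    metadata["PARQUET:field_id"] = std::to_string(*field_id);
  }
  return metadata;
}
```

In this model, a field whose metadata maps ``PARQUET:field_id`` to ``"7"`` yields field id 7 on the writer side, and passing that id to ``MetadataFromFieldId`` reproduces the same metadata entry on the reader side. The sketch requires C++17 for ``std::optional`` and ``std::from_chars``.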