
Fix timestamp with timezone mapping in iceberg type converter #23534

Conversation

auden-woolfson
Contributor

@auden-woolfson auden-woolfson commented Aug 27, 2024

Description

Fixes bug described in #23529

== RELEASE NOTES ==

Iceberg Connector Changes
* Add logic to the Iceberg type converter for ``timestamp with time zone`` :pr:`23534`

@auden-woolfson auden-woolfson added bug iceberg Apache Iceberg related labels Aug 27, 2024
@auden-woolfson auden-woolfson self-assigned this Aug 27, 2024
@auden-woolfson auden-woolfson force-pushed the add_timestamptz_mapping_to_iceberg_connector branch from 8e1716e to 38919d7 Compare August 27, 2024 22:01
Member

@agrawalreetika agrawalreetika left a comment

Can we add some tests with columns of type timestamp with time zone?

@tdcmeehan
Contributor

+1. Let's add some end to end tests. Additionally, we may want to remove the validation added here, since I believe we should support this properly now: #22926

@tdcmeehan tdcmeehan self-assigned this Aug 28, 2024
@hantangwangd
Member

Could we also add test cases involving timestamp with time zone in filter conditions and partition columns? I'm a little concerned about the behavior in those scenarios.

@auden-woolfson
Contributor Author

Could we also add test cases involving timestamp with time zone in filter conditions and partition columns? I'm a little concerned about the behavior in those scenarios.

Just to clarify, do you want the timestamp with time zone to be part of the table that is being partitioned, or the type of the partition column itself? Currently it is not supported as one of the types for partition columns.

@hantangwangd
Member

Just to clarify, do you want the timestamp with time zone to be part of the table that is being partitioned, or the type of the partition column itself? Currently it is not supported as one of the types for partition columns.

Yes, that's right. But I think it's better for us to first figure out how to handle it in these cases when we start to support it.

A very important question is what long-value format we plan to actually store in data files for timestamp with time zone. Presto has a special encoding for timestamp with time zone values, which mixes the time zone information with the UTC value in milliseconds. Meanwhile, the Iceberg spec stores timestamptz data as a UTC value in microseconds and does not retain the source time zone.

If we store the data following the Iceberg spec, we lose the time zone information; if we store it following Presto's format, we may run into cross-engine compatibility problems.

cc: @tdcmeehan @agrawalreetika @ZacBlanco

@tdcmeehan
Contributor

@hantangwangd I don't see this as a choice: we must store the data according to the Iceberg spec, which means we'll lose the embedded time zone information. This is fine; semantically, it's the same thing. The only confusing part is that a user retrieving stored Iceberg timestamps will see that the time zones have been adjusted to UTC. But the point-in-time values will remain the same, and this is merely a limitation of the Iceberg table format.
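The trade-off discussed above can be sketched in plain Java. This is an illustration, not Presto or Iceberg source: the 12-bit zone-key layout mirrors Presto's packed encoding, but the class, method names, and zone-key values here are assumptions made for the example.

```java
// Sketch (illustrative only): how a Presto-style packed timestamp-with-timezone
// value maps onto Iceberg's timestamptz storage (UTC micros, no zone retained).
public final class TimestampEncodingSketch
{
    private static final int MILLIS_SHIFT = 12;  // low 12 bits hold the zone key
    private static final short UTC_KEY = 0;      // assumed UTC zone key

    // Presto-style packing: UTC millis in the high bits, zone key in the low 12.
    static long pack(long millisUtc, short zoneKey)
    {
        return (millisUtc << MILLIS_SHIFT) | (zoneKey & 0xFFF);
    }

    static long unpackMillisUtc(long packed)
    {
        return packed >> MILLIS_SHIFT;
    }

    // Writing to Iceberg: keep only the UTC instant, widened to micros.
    static long toIcebergMicros(long packed)
    {
        return unpackMillisUtc(packed) * 1000;
    }

    // Reading back: the instant survives, but the zone is now UTC.
    static long fromIcebergMicros(long microsUtc)
    {
        return pack(microsUtc / 1000, UTC_KEY);
    }

    public static void main(String[] args)
    {
        short losAngeles = 42;  // hypothetical zone key for the example
        long packed = pack(471341400000L, losAngeles);
        long roundTripped = fromIcebergMicros(toIcebergMicros(packed));
        // Same point in time, but the source zone key is gone.
        System.out.println(unpackMillisUtc(packed) == unpackMillisUtc(roundTripped)); // true
        System.out.println((roundTripped & 0xFFF) == UTC_KEY);                        // true
    }
}
```

The round trip shows exactly the behavior described: the instant is preserved, while the source zone collapses to UTC.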

@hantangwangd
Member

@tdcmeehan Completely agree with your viewpoint.

That means that, besides doing the type conversion, we need transformation logic for timestamp with time zone data when writing/reading, parsing filter conditions, and handling partition values. That's not to say we must complete all of this at once; it can be divided into a series of PRs.

Contributor

@ZacBlanco ZacBlanco left a comment

Minor nits. One question about removing the verifyTypeSupported method.

@auden-woolfson auden-woolfson force-pushed the add_timestamptz_mapping_to_iceberg_connector branch 2 times, most recently from a562f45 to d877bcc Compare September 13, 2024 21:50
Member

@agrawalreetika agrawalreetika left a comment

Overall looks good to me.

  1. Please add a document entry in https://prestodb.io/docs/current/connector/iceberg.html#type-mapping
  2. Squash all the commits into 1 "Fix timestamp with timezone mapping in iceberg type converter"

@@ -117,6 +117,10 @@ public static Type toPrestoType(org.apache.iceberg.types.Type type, TypeManager
case TIME:
return TimeType.TIME;
case TIMESTAMP:
Types.TimestampType timestampType = (Types.TimestampType) type.asPrimitiveType();
if (timestampType.shouldAdjustToUTC()) {
return TimestampWithTimeZoneType.TIMESTAMP_WITH_TIME_ZONE;
Member

Add static import for this

@auden-woolfson auden-woolfson force-pushed the add_timestamptz_mapping_to_iceberg_connector branch from 69a4878 to 361d23a Compare September 24, 2024 19:54
@@ -60,6 +60,9 @@ interface Int64TimeAndTimestampMicrosValuesDecoder
void readNext(long[] values, int offset, int length)
throws IOException;

void readNextWithTimezone(long[] values, int offset, int length)
throws IOException;

Contributor Author

This is necessary because it is the decoder, not the batch reader, that accesses the individual raw values from Parquet. The batch reader has the column descriptor (metadata), which should tell it whether or not to pack the value with a time zone, as is done in the regular column reader. The actual packing should be done within the decoder, so there needs to be a way for decoders to dynamically switch between with- and without-timezone modes.

The approach I took here is just to copy the readNext method from each implementation and add packDateTimeWithZone. A stateful approach (an instance variable like boolean withTimezone) would probably be better for future development and code execution, but for now I am seeing if this implementation works.

@auden-woolfson auden-woolfson force-pushed the add_timestamptz_mapping_to_iceberg_connector branch from 79d53ce to 607edf0 Compare September 25, 2024 17:59
@auden-woolfson auden-woolfson force-pushed the add_timestamptz_mapping_to_iceberg_connector branch from a381447 to c7deb43 Compare October 18, 2024 16:08
public Object[][] createTestTimestampWithTimezoneData()
{
return new Object[][] {
{getQueryRunner()},
Contributor

Wondering if we need a whole separate test for this. Can't we just create a data provider which passes in true/false values and lets us construct a valid session at the beginning of the test method? Then you can pass the session to all of the execute/assertQuery methods.

Contributor Author

Can you elaborate on this please? I'm not sure what you are referring to as the separate test. I can have the data provider pass in one true and one false value and add a condition inside the test function itself. Is that what you are asking for here? If so, what purpose would that serve? Thanks

Contributor

Creating an entirely new query runner is expensive. We should be able to test with the batch read optimization by just setting a session property.

In this data provider you can create two sessions based on the default session, one with batch reads enabled and one with them disabled, and then run the test with each. It will take far less time to run. You will need to update all of the query statements to pass the session variable, though.

public Object[][] batchReadSessions()
    {
        return new Object[][] {
                {Session.builder(getSession())
                        .setCatalogSessionProperty("iceberg", PARQUET_BATCH_READ_OPTIMIZATION_ENABLED, "true")
                        .build()},
                {Session.builder(getSession())
                        .setCatalogSessionProperty("iceberg", PARQUET_BATCH_READ_OPTIMIZATION_ENABLED, "false")
                        .build()}
        };
    }

This would also allow you to remove the additional logic inside of createQueryRunner

@@ -69,7 +91,13 @@ public void readNext(long[] values, int offset, int length)
final LongDictionary localDictionary = dictionary;
for (int srcIndex = currentBuffer.length - currentCount; destinationIndex < endIndex; srcIndex++) {
long dictionaryValue = localDictionary.decodeToLong(localBuffer[srcIndex]);
values[destinationIndex++] = MICROSECONDS.toMillis(dictionaryValue);
long millisValue = MICROSECONDS.toMillis(dictionaryValue);
if (isWithTimezone()) {
Contributor

Rather than having a branch statement inside the hot loop for a lot of these readers, I think we should set the "reading function" in the constructor of the reader, since the reading behavior shouldn't change. Do you see a performance impact when this conditional is introduced?
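The suggestion above can be sketched as follows. The class and field names here are illustrative, not the PR's actual code, and the shifted-millis transform is a hypothetical stand-in for packDateTimeWithZone with the UTC key; the point is simply that the transform is chosen once in the constructor so the per-value loop stays branch-free.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongUnaryOperator;

// Sketch: select the reading behavior at construction time instead of
// branching on withTimezone inside the hot decode loop.
public final class DecoderSketch
{
    private final LongUnaryOperator transform;

    DecoderSketch(boolean withTimezone)
    {
        // Stand-in for packDateTimeWithZone(millis, UTC_KEY): millis shifted
        // left 12 bits with a zone key of 0 in the low bits (an assumption).
        this.transform = withTimezone
                ? micros -> TimeUnit.MICROSECONDS.toMillis(micros) << 12
                : micros -> TimeUnit.MICROSECONDS.toMillis(micros);
    }

    void readNext(long[] source, long[] values, int offset, int length)
    {
        for (int i = 0; i < length; i++) {
            // No per-value conditional; the JIT can inline one stable target.
            values[offset + i] = transform.applyAsLong(source[i]);
        }
    }

    public static void main(String[] args)
    {
        long[] raw = {1_000_000L, 2_000_000L};  // raw micros from the page
        long[] out = new long[2];
        new DecoderSketch(false).readNext(raw, out, 0, 2);
        System.out.println(out[0]);  // 1000 (millis, no zone packing)
    }
}
```

Whether this measurably beats a branch in the loop depends on the JIT (a well-predicted branch is often free), which is presumably why the reviewer asks about the observed performance impact.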

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Jan 14, 2025
@prestodb-ci prestodb-ci requested review from a team, sh-shamsan and pdabre12 and removed request for a team January 14, 2025 23:56
@auden-woolfson auden-woolfson force-pushed the add_timestamptz_mapping_to_iceberg_connector branch 2 times, most recently from be489f1 to 8add1c4 Compare January 16, 2025 21:40
@auden-woolfson auden-woolfson force-pushed the add_timestamptz_mapping_to_iceberg_connector branch from 8add1c4 to 67895c0 Compare January 16, 2025 23:08
Contributor

@ZacBlanco ZacBlanco left a comment

Core logic is sound. Just some comments on the test cases.

Comment on lines 92 to 93
assertQuerySucceeds("CREATE TABLE test_timestamp_with_timezone (x timestamp with time zone)");
dropTable(getSession(), "test_timestamp_with_timezone");
Contributor

Why create and then drop? Can't we just remove L92+93 altogether?

String timestamptz = "TIMESTAMP '1984-12-08 00:10:00 America/Los_Angeles'";
String timestamp = "TIMESTAMP '1984-12-08 00:10:00'";

runner.execute("CREATE TABLE test_timestamptz(a TIMESTAMP WITH TIME ZONE, b TIMESTAMP, c TIMESTAMP WITH TIME ZONE)");
Contributor

let's do assertQuerySucceeds here

runner.execute("CREATE TABLE test_timestamptz(a TIMESTAMP WITH TIME ZONE, b TIMESTAMP, c TIMESTAMP WITH TIME ZONE)");
String row = "(" + timestamptz + ", " + timestamp + ", " + timestamptz + ")";
for (int i = 0; i < 10; i++) {
runner.execute("INSERT INTO test_timestamptz values " + row);
Contributor

use assertUpdate

assertTrue(types.get(0) instanceof TimestampWithTimeZoneType);
assertTrue(types.get(1) instanceof TimestampType);

runner.execute("CREATE TABLE test_timestamptz_partition(a TIMESTAMP WITH TIME ZONE, b TIMESTAMP, c TIMESTAMP WITH TIME ZONE) " +
Contributor

assertQuerySucceeds

runner.execute("INSERT INTO test_timestamptz values " + row);
}

MaterializedResult initialRows = runner.execute("SELECT * FROM test_timestamptz");
Contributor

let's assert on the content of the result here too

"WITH (PARTITIONING = ARRAY['b'])");
runner.execute("INSERT INTO test_timestamptz_partition (a, b, c) SELECT a, b, c FROM test_timestamptz");

MaterializedResult partitionRows = runner.execute("SELECT * FROM test_timestamptz");
Contributor

assert on content of the query as well

runner.execute("CREATE TABLE test_timestamptz_filter(a TIMESTAMP WITH TIME ZONE)");

for (int i = 0; i < 5; i++) {
runner.execute("INSERT INTO test_timestamptz_filter VALUES (" + earlyTimestamptz + ")");
Contributor

assertUpdate

assertTrue(partitionTypes.get(1) instanceof TimestampType);

String earlyTimestamptz = "TIMESTAMP '1980-12-08 00:10:00 America/Los_Angeles'";
runner.execute("CREATE TABLE test_timestamptz_filter(a TIMESTAMP WITH TIME ZONE)");
Contributor

assertQuerySucceeds

runner.execute("INSERT INTO test_timestamptz_filter VALUES (" + earlyTimestamptz + ")");
}
for (int i = 0; i < 5; i++) {
runner.execute("INSERT INTO test_timestamptz_filter VALUES (" + timestamptz + ")");
Contributor

assertUpdate

@auden-woolfson auden-woolfson force-pushed the add_timestamptz_mapping_to_iceberg_connector branch from d888615 to 3badf53 Compare January 27, 2025 21:43
@auden-woolfson
Contributor Author

@ZacBlanco thanks, just pushed the changes

@auden-woolfson auden-woolfson force-pushed the add_timestamptz_mapping_to_iceberg_connector branch from e1b5581 to fa8be89 Compare January 28, 2025 17:29
.build()},
{Session.builder(getSession())
.setCatalogSessionProperty("iceberg", PARQUET_BATCH_READ_OPTIMIZATION_ENABLED, "false")
.build()}
Member

Shouldn't this be different time zone data for the test, as the name suggests?

Contributor Author

This is just to test with the batch reader enabled and disabled. We are not changing the data used.

Member

Then I think naming the data provider and its method according to the session properties would be better?

@@ -179,10 +180,14 @@ public static DistributedQueryRunner createIcebergQueryRunner(
{
setupLogging();

Session session = testSessionBuilder()
Session.SessionBuilder sessionBuilder = testSessionBuilder()
Member

Nit: import SessionBuilder

for (int i = 0; i < block.getPositionCount(); i++) {
if (!block.isNull(i)) {
long value = unpackMillisUtc(type.getLong(block, i));
long scaledValue = scaleValueFunction.scaleValue(value);
Member

I think we can directly do this here, and then we don't need the ScaleValueFunction interface and the scaleValueFunction variable above:

long scaledValue = writeMicroseconds ? MILLISECONDS.toMicros(value) : value;

Comment on lines 51 to 56
if (withTimezone) {
packFunction = millis -> packDateTimeWithZone(millis, UTC_KEY);
}
else {
packFunction = millis -> millis;
}
Member

Suggested change
if (withTimezone) {
packFunction = millis -> packDateTimeWithZone(millis, UTC_KEY);
}
else {
packFunction = millis -> millis;
}
this.packFunction = withTimezone ? millis -> packDateTimeWithZone(millis, UTC_KEY) : millis -> millis;

Member

@hantangwangd hantangwangd left a comment

Thanks for the change; overall it looks good to me. Some small nits.

Contributor

@ZacBlanco ZacBlanco left a comment

Just some minor issues, otherwise LGTM.

@auden-woolfson auden-woolfson force-pushed the add_timestamptz_mapping_to_iceberg_connector branch from f5c9447 to 0ce3cc8 Compare January 29, 2025 22:56
@auden-woolfson auden-woolfson merged commit ec9e904 into prestodb:master Jan 30, 2025
55 checks passed
@tdcmeehan
Contributor

Thank you for the reviews @ZacBlanco @agrawalreetika @hantangwangd, and thanks @auden-woolfson for the fix.

Successfully merging this pull request may close these issues.

Iceberg timestamptz should map to Presto TIMESTAMP WITH TIME ZONE type