Fix timestamp with timezone mapping in iceberg type converter #23534
8e1716e to 38919d7
Can we add some tests with column types Timestamp with timezone?
+1. Let's add some end-to-end tests. Additionally, we may want to remove the validation added here, since I believe we should support this properly now: #22926
Also add test cases involving |
Just to clarify: do you want a timestamp with timezone column to be part of a table that is being partitioned, or to be the partition column itself? Currently it is not supported as one of the types for partition columns.
Yes, that's right. But I think it's better for us to first figure out how to handle it in these cases when we start to support it. A very important question is, what format of long type data do we plan to actually store in data files for If we store the data following Iceberg spec, then we will lose the information of time zone; and if we store the data following Presto's format, then we may meet problems involving cross-engine compatibility. |
@hantangwangd I don't see this as a choice: we must store the data according to the Iceberg spec, which means we'll lose the embedded time zone information. This is fine. Semantically it's the same thing, and the only thing that might confuse users is that, when retrieving stored Iceberg timestamps, they will see that the time zones have been adjusted to UTC. But the point-in-time values will remain the same, and this is merely a limitation of the Iceberg table format.
@tdcmeehan Completely agree with your viewpoint. That means we need to perform transformation logics for data with type of |
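To make the point in this thread concrete, here is a minimal sketch using only `java.time`. It is not Presto or Iceberg code; the class and method names are hypothetical. It assumes the stored value is a count of microseconds since the epoch in UTC, and shows that the zone label is dropped on write while the point in time survives the round trip.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical illustration; not code from this PR.
public final class TimestampTzRoundTrip
{
    private TimestampTzRoundTrip() {}

    // Reduce a zoned timestamp to microseconds since the epoch (UTC).
    // The zone ID is not part of the result, so it cannot be recovered later.
    public static long toEpochMicros(ZonedDateTime value)
    {
        return ChronoUnit.MICROS.between(Instant.EPOCH, value.toInstant());
    }

    // Reading back can only produce a UTC-labelled value.
    public static ZonedDateTime fromEpochMicros(long micros)
    {
        return Instant.EPOCH.plus(micros, ChronoUnit.MICROS).atZone(ZoneOffset.UTC);
    }

    public static void main(String[] args)
    {
        ZonedDateTime written = ZonedDateTime.of(1984, 12, 8, 0, 10, 0, 0, ZoneId.of("America/Los_Angeles"));
        ZonedDateTime read = fromEpochMicros(toEpochMicros(written));

        System.out.println(written); // labelled America/Los_Angeles
        System.out.println(read);    // labelled UTC
        // The two values denote the same instant even though the labels differ.
        System.out.println(written.toInstant().equals(read.toInstant()));
    }
}
```

A user who inserts a Los Angeles timestamp would therefore read back the equivalent UTC wall-clock time, which is the behavior described above.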
Minor nits. One question about removing the verifyTypeSupported method
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergHiveMetadata.java
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergNativeMetadata.java
presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergTypes.java
a562f45 to d877bcc
Overall looks good to me.
- Please add a documentation entry in https://prestodb.io/docs/current/connector/iceberg.html#type-mapping
- Squash all the commits into one: "Fix timestamp with timezone mapping in iceberg type converter"
@@ -117,6 +117,10 @@ public static Type toPrestoType(org.apache.iceberg.types.Type type, TypeManager
            case TIME:
                return TimeType.TIME;
            case TIMESTAMP:
                Types.TimestampType timestampType = (Types.TimestampType) type.asPrimitiveType();
                if (timestampType.shouldAdjustToUTC()) {
                    return TimestampWithTimeZoneType.TIMESTAMP_WITH_TIME_ZONE;
Add static import for this
presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergTypes.java
...arquet/src/main/java/com/facebook/presto/parquet/reader/LongTimestampMicrosColumnReader.java
69a4878 to 361d23a
@@ -60,6 +60,9 @@ interface Int64TimeAndTimestampMicrosValuesDecoder
        void readNext(long[] values, int offset, int length)
                throws IOException;

        void readNextWithTimezone(long[] values, int offset, int length)
                throws IOException;
This is necessary because the decoder, not the batch reader, accesses the individual raw values from Parquet. The batch reader has the column descriptor (metadata), which should tell it whether or not to pack the value with a timezone, as is done in the regular column reader. The actual packing should be done within the decoder, so there needs to be a way for decoders to dynamically switch between with-timezone and without-timezone modes.
The approach I took here is just to copy the readNext method from each implementation and add packDateTimeWithZone. A stateful approach (an instance variable such as bool withTimestamp) would probably be better for future development and for execution efficiency, but for now I am seeing if this implementation works.
79d53ce to 607edf0
a381447 to c7deb43
    public Object[][] createTestTimestampWithTimezoneData()
    {
        return new Object[][] {
                {getQueryRunner()},
Wondering if we need a whole separate test for this. Can't we just create a dataProvider which passes in true/false values and lets us construct a valid session in the beginning of the test method? Then you can pass the session to all of the execute/assertQuery methods?
Can you elaborate on this please? I'm not sure what you are referring to as the separate test. I can have the data provider pass in one true and one false value and add a condition inside the test function itself, is that what you are asking for here? If so what purpose would that serve? Thanks
Creating an entirely new query runner is expensive. We should be able to test with the batch read optimization by just setting a session property.
In this data provider you can create two sessions from the default session, one with batch read enabled and one with it disabled, and then run the test against each. It will take far less time to run. You will need to update all of the query statements to pass the session variable, though.
public Object[][] batchReadSessions()
{
    return new Object[][] {
            {Session.builder(getSession())
                    .setCatalogSessionProperty("iceberg", PARQUET_BATCH_READ_OPTIMIZATION_ENABLED, "true")
                    .build()},
            {Session.builder(getSession())
                    .setCatalogSessionProperty("iceberg", PARQUET_BATCH_READ_OPTIMIZATION_ENABLED, "false")
                    .build()}
    };
}
This would also allow you to remove the additional logic inside of createQueryRunner
...to-parquet/src/main/java/com/facebook/presto/parquet/batchreader/decoders/ValuesDecoder.java
@@ -69,7 +91,13 @@ public void readNext(long[] values, int offset, int length)
             final LongDictionary localDictionary = dictionary;
             for (int srcIndex = currentBuffer.length - currentCount; destinationIndex < endIndex; srcIndex++) {
                 long dictionaryValue = localDictionary.decodeToLong(localBuffer[srcIndex]);
-                values[destinationIndex++] = MICROSECONDS.toMillis(dictionaryValue);
+                long millisValue = MICROSECONDS.toMillis(dictionaryValue);
+                if (isWithTimezone()) {
Rather than having a branch statement inside the hot loop for a lot of these readers, I think we should set the "reading function" in the constructor of the reader as the reading behavior shouldn't change. Do you see a performance impact when this conditional is introduced?
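As a rough illustration of that suggestion, the sketch below chooses the per-value function once in the constructor so the read loop has no conditional. It is a simplified stand-in, not the decoder from this PR: the class name is hypothetical, the input is a plain `long[]` of microsecond values rather than a Parquet page, and the packed layout (milliseconds shifted left by 12 bits with a 12-bit zone key, UTC as key 0) is an assumption made for the example.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongUnaryOperator;

// Hypothetical sketch; not the decoder from this PR.
public final class MicrosDecoderSketch
{
    // Assumed packed layout: millis in the high bits, zone key in the low 12 bits.
    private static final int MILLIS_SHIFT = 12;
    private static final int TIME_ZONE_MASK = 0xFFF;
    private static final short UTC_KEY = 0; // assumed key for UTC

    private final long[] sourceMicros;
    private final LongUnaryOperator packFunction;
    private int position;

    public MicrosDecoderSketch(long[] sourceMicros, boolean withTimezone)
    {
        this.sourceMicros = sourceMicros;
        // Decided once here, so readNext needs no per-value branch.
        this.packFunction = withTimezone
                ? millis -> packDateTimeWithZone(millis, UTC_KEY)
                : LongUnaryOperator.identity();
    }

    // Convert the next `length` values to millis and apply the chosen function.
    public void readNext(long[] values, int offset, int length)
    {
        for (int i = 0; i < length; i++) {
            long millis = TimeUnit.MICROSECONDS.toMillis(sourceMicros[position++]);
            values[offset + i] = packFunction.applyAsLong(millis);
        }
    }

    static long packDateTimeWithZone(long millisUtc, short timeZoneKey)
    {
        return (millisUtc << MILLIS_SHIFT) | (timeZoneKey & TIME_ZONE_MASK);
    }
}
```

Whether the indirection through a functional interface actually beats a well-predicted branch is something only a benchmark can answer, which is why the comment above asks about measured impact.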
...in/java/com/facebook/presto/parquet/writer/valuewriter/TimestampWithTimezoneValueWriter.java
be489f1 to 8add1c4
8add1c4 to 67895c0
Core logic is sound. Just some comments on the test cases
        assertQuerySucceeds("CREATE TABLE test_timestamp_with_timezone (x timestamp with time zone)");
        dropTable(getSession(), "test_timestamp_with_timezone");
Why create and then drop? Can't we just remove L92+93 altogether?
        String timestamptz = "TIMESTAMP '1984-12-08 00:10:00 America/Los_Angeles'";
        String timestamp = "TIMESTAMP '1984-12-08 00:10:00'";

        runner.execute("CREATE TABLE test_timestamptz(a TIMESTAMP WITH TIME ZONE, b TIMESTAMP, c TIMESTAMP WITH TIME ZONE)");
Let's do assertQuerySucceeds here
        runner.execute("CREATE TABLE test_timestamptz(a TIMESTAMP WITH TIME ZONE, b TIMESTAMP, c TIMESTAMP WITH TIME ZONE)");
        String row = "(" + timestamptz + ", " + timestamp + ", " + timestamptz + ")";
        for (int i = 0; i < 10; i++) {
            runner.execute("INSERT INTO test_timestamptz values " + row);
use assertUpdate
        assertTrue(types.get(0) instanceof TimestampWithTimeZoneType);
        assertTrue(types.get(1) instanceof TimestampType);

        runner.execute("CREATE TABLE test_timestamptz_partition(a TIMESTAMP WITH TIME ZONE, b TIMESTAMP, c TIMESTAMP WITH TIME ZONE) " +
assertQuerySucceeds
            runner.execute("INSERT INTO test_timestamptz values " + row);
        }

        MaterializedResult initialRows = runner.execute("SELECT * FROM test_timestamptz");
let's assert on the content of the result here too
                "WITH (PARTITIONING = ARRAY['b'])");
        runner.execute("INSERT INTO test_timestamptz_partition (a, b, c) SELECT a, b, c FROM test_timestamptz");

        MaterializedResult partitionRows = runner.execute("SELECT * FROM test_timestamptz");
assert on content of the query as well
        runner.execute("CREATE TABLE test_timestamptz_filter(a TIMESTAMP WITH TIME ZONE)");

        for (int i = 0; i < 5; i++) {
            runner.execute("INSERT INTO test_timestamptz_filter VALUES (" + earlyTimestamptz + ")");
assertUpdate
        assertTrue(partitionTypes.get(1) instanceof TimestampType);

        String earlyTimestamptz = "TIMESTAMP '1980-12-08 00:10:00 America/Los_Angeles'";
        runner.execute("CREATE TABLE test_timestamptz_filter(a TIMESTAMP WITH TIME ZONE)");
assertQuerySucceeds
            runner.execute("INSERT INTO test_timestamptz_filter VALUES (" + earlyTimestamptz + ")");
        }
        for (int i = 0; i < 5; i++) {
            runner.execute("INSERT INTO test_timestamptz_filter VALUES (" + timestamptz + ")");
assertUpdate
d888615 to 3badf53
@ZacBlanco thanks, just pushed the changes
presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergTypes.java
e1b5581 to fa8be89
                        .build()},
                {Session.builder(getSession())
                        .setCatalogSessionProperty("iceberg", PARQUET_BATCH_READ_OPTIMIZATION_ENABLED, "false")
                        .build()}
Shouldn't this provide different timezone data for the test, as the name suggests?
This is just to test with batch reader enabled and disabled. We are not changing the data used
Then I think it would be better to name the DataProvider and its Object[][] method after the session properties they vary.
@@ -179,10 +180,14 @@ public static DistributedQueryRunner createIcebergQueryRunner(
     {
         setupLogging();
 
-        Session session = testSessionBuilder()
+        Session.SessionBuilder sessionBuilder = testSessionBuilder()
Nit: import SessionBuilder
...in/java/com/facebook/presto/parquet/writer/valuewriter/TimestampWithTimezoneValueWriter.java
        for (int i = 0; i < block.getPositionCount(); i++) {
            if (!block.isNull(i)) {
                long value = unpackMillisUtc(type.getLong(block, i));
                long scaledValue = scaleValueFunction.scaleValue(value);
I think we can directly do this here? And we don't need the ScaleValueFunction interface and the scaleValueFunction variable above.
long scaledValue = writeMicroseconds ? MILLISECONDS.toMicros(value) : value;
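For context, a self-contained sketch of how the write path might look with the inlined ternary is shown here. It is not the writer from this PR: the class and method names are hypothetical, and the packed layout (milliseconds in the bits above a 12-bit zone key) is an assumption made for the example.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the value writer from this PR.
public final class TimestampTzWriteSketch
{
    // Assumed packed layout: millis in the high bits, zone key in the low 12 bits.
    private static final int MILLIS_SHIFT = 12;

    private TimestampTzWriteSketch() {}

    // Drop the zone key and keep only the UTC milliseconds.
    static long unpackMillisUtc(long packed)
    {
        return packed >> MILLIS_SHIFT;
    }

    // Scale inline instead of going through a separate functional interface.
    public static long toStoredValue(long packed, boolean writeMicroseconds)
    {
        long value = unpackMillisUtc(packed);
        return writeMicroseconds ? TimeUnit.MILLISECONDS.toMicros(value) : value;
    }
}
```

Because `writeMicroseconds` is fixed for the lifetime of a writer, the ternary replaces the extra interface and field without changing behavior.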
.../parquet/batchreader/decoders/rle/Int64TimeAndTimestampMicrosRLEDictionaryValuesDecoder.java
...et/batchreader/decoders/delta/Int64TimeAndTimestampMicrosDeltaBinaryPackedValuesDecoder.java
        if (withTimezone) {
            packFunction = millis -> packDateTimeWithZone(millis, UTC_KEY);
        }
        else {
            packFunction = millis -> millis;
        }
        this.packFunction = withTimezone ? millis -> packDateTimeWithZone(millis, UTC_KEY) : millis -> millis;
Thanks for the change, overall looks good to me. Some little nits.
...va/com/facebook/presto/parquet/batchreader/Int64TimeAndTimestampMicrosNestedBatchReader.java
...to-parquet/src/main/java/com/facebook/presto/parquet/batchreader/decoders/ValuesDecoder.java
...et/batchreader/decoders/delta/Int64TimeAndTimestampMicrosDeltaBinaryPackedValuesDecoder.java
...presto/parquet/batchreader/decoders/plain/Int64TimeAndTimestampMicrosPlainValuesDecoder.java
.../parquet/batchreader/decoders/rle/Int64TimeAndTimestampMicrosRLEDictionaryValuesDecoder.java
presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedSmokeTestBase.java
Just some minor issues, otherwise lgtm
presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergQueryRunner.java
presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedSmokeTestBase.java
f5c9447 to 0ce3cc8
Thank you for the reviews @ZacBlanco @agrawalreetika @hantangwangd, and thanks @auden-woolfson for the fix.
Description
Fixes bug described in #23529