Include Table Metadata in Operator Logs (prestodb#24590)

mkarrmann · NikhilCollooru · commit 3e059408ea2e · 2025-03-10T15:18:47.000-07:00
Summary: Pull Request resolved: prestodb#24590 X-link: facebookexternal/presto-facebook#3126 > Functional Requirement: Log stats in a way such that we can provide Tetris per-Table Warm Storage read metrics (including number of bytes read, and the region of the WS VC we read from) Adds the Table name and Warm Storage VC fields to the Presto Query Operator Stats logs, which well be set for all Prism Operators which read a Table (and null otherwise). This is accomplished by adding the query's `planNodeIdMap` to its `QueryCompletedEvent` (in OSS Presto), and then using this map in `PrismEventListener` to map from Operator stats to information from their respective Plan Nodes. # Context Runtime Stats in a Presto feature which conveniently supports aggregating statistics from different sources (in our case, aggregating the number of WS bytes read across all splits by Operator). In Java, the Runtime Stats already includes the number of WS Bytes read. Becase this diff adds the WS Region we're reading from, and we already know which region the query executes in, this diff is sufficient for logging all the information that Tetris needs. However, this data is not yet logged in C++. This is tracked here: https://www.internalfb.com/tasks/my_tasks?t=212787929. This diff will be needed in order to provide Tetris everything it needs, and I'll make sure we complete that task shortly. # Alternatives Considered There are quite a few places in the code that at first glance appear that they already should provide the information that we need, but just quite don't. That also means there's many different places this change could be made to accomplish the functional requirement. ## Scope Runtime Stats by Table My original approach for this: https://www.internalfb.com/diffs/browse?phabricator_diff_id=1099167592009927. "scope" the runtime stats by appending the Table name to the stats themselves. Even at the time I thought this was a quick-and-dirty hack, but after getting more context on how Runtime Stats are used (in particular, that they are user-facing because they're displayed in the Query UI), I realized that this approach wasn't acceptable. There just isn't a clean way to make this a) not hideous, hacky, fragile code (e.g. adding hacks to hide specific Runtime Stats on an adhoc basis, encoding metadata in the Stat name itself, etc) b) presentable to users (since Runtime Stats are displayed in UIs, and therefore are a user-facing feature) c) satisfy the functional requirements (include stats broken down by Table and Region read from) ## SplitStats The split stats contain an Info field which includes the Warm Storage file path they read from, which would be sufficient for our purposes! It would also seem, looking at our code, that we already log SplitStats. However, looking at [Configerator](https://www.internalfb.com/code/fbsource/[3baf3cc56bb1971bfd5c932f63018870821b520f]/fbcode/tupperware/config/presto/include/warehouse_event_listener.cinc?lines=85), we see it logs to the "blackhole" topic. That is, it's a dummy log. This was set in order to avoid logging too much, so probably shouldn't be messed with. ## OperatorStats Info/InfoUnion In addition to the Info field (which may come from one of the Splits), OperatorStats has the InfoUnion field which should contain all of the Info fields for its corresponding Splits. This would also be sufficient for our purposes. However, these fields are never set, at least in our codepaths for Prism reads. I didn't look into this enough to be completely confident about why these aren't being populated, but there may be be dead code here. While this approch is promising, I know there's already concerns around the amount of CPU cycles Coordinators burn manipulating these sets Objects: https://docs.google.com/document/d/1rzJjMcMXebazQy8PEOEAThY4HOl0e3WhA3GIXUKrRDI/edit?tab=t.0. If these stats are being nulled out, then perhaps it's for the best in order to minimize overhead. Of the alternative approaches I list here, this the second most-promising behind IO Metadata: ## IO Metadata https://www.internalfb.com/code/fbsource/fbcode/github/presto-trunk/presto-spi/src/main/java/com/facebook/presto/spi/eventlistener/QueryInputMetadata.java This one is almost perfect: this class includes statistics, broken down by Table, and is included in our Query logs. There's just two problems: - Currently, we don't actually populate the `statistics` field for this class. This means we can't use this for mapping from Tables to the statistics we need. I haven't tracked this down to determine whether or not this is intentional, but at the very least this would be a larger lift to get working. - This is part of the SPI and OSS Presto. However, we need to be able to log Meta-internal information (e.g. whether the traffic is x-region). This is a solvable problem, but it'd potentially be a hassle to do deal with. With that being said, if I needed to choose another approach for this diff, I would spend more time trying to make this work. ## Various Other "Plan Stats" Classes I spent a while looking at a few promising looking classes, e.g. [TableStats](https://www.internalfb.com/code/fbsource/fbcode/github/presto-trunk/presto-spi/src/main/java/com/facebook/presto/spi/statistics/TableStatistics.java) before realizing that these are actually meant to be use by the Query Planner, and "Stats" specifically refers certain things used to estimate Costs for the Query Planner. This isn't exactly the same type of stat I needed. However, this distinction becomes a bit blurred with HBO-stuff and when when we set the `confidence` parameter to `1.0`, indicating the stats are a "fact" pulled from actual Runtime Stats. It was tempting to pursue this because these classes already manage a) mapping from physical plan stats to logical plan nodes b) aggregating stats per-node. Perhaps in an ideal world all of our logging and stats tracking would be coherently unified so that we only computing every stat exactly once, but for now it seemed best to avoid coupling to any of this Planner/HBO stuff. # Justification for chosen approach There's two main advantages to this approach: - Simple (evidenced by the small number of lines changed) - Natural and extensible - While this requires making a change to OSS Presto, that change itself isn't coupled to our specific requirements. Wanting to look up the Plan Nodes in a Query Completed Event handler in order to enrich logs is a reasonable use case. We may even have future use cases which benefit from this. Differential Revision: D69730285
diff --git a/presto-main/src/main/java/com/facebook/presto/event/QueryMonitor.java b/presto-main/src/main/java/com/facebook/presto/event/QueryMonitor.java
@@ -274,7 +274,8 @@ public void queryImmediateFailureEvent(BasicQueryInfo queryInfo, ExecutionFailur
                 ImmutableSet.of(),
                 ImmutableSet.of(),
                 Optional.empty(),
-                ImmutableMap.of()));
+                ImmutableMap.of(),
+                Optional.empty()));
 
         logQueryTimeline(queryInfo);
     }
@@ -316,7 +317,8 @@ public void queryCompletedEvent(QueryInfo queryInfo)
                         queryInfo.getAggregateFunctions(),
                         queryInfo.getWindowFunctions(),
                         queryInfo.getPrestoSparkExecutionContext(),
-                        getPlanHash(queryInfo.getPlanCanonicalInfo(), historyBasedPlanStatisticsTracker.getStatsEquivalentPlanRootNode(queryInfo.getQueryId()))));
+                        getPlanHash(queryInfo.getPlanCanonicalInfo(), historyBasedPlanStatisticsTracker.getStatsEquivalentPlanRootNode(queryInfo.getQueryId())),
+                        Optional.of(queryInfo.getPlanIdNodeMap())));
 
         logQueryTimeline(queryInfo);
     }
diff --git a/presto-spi/src/main/java/com/facebook/presto/spi/eventlistener/QueryCompletedEvent.java b/presto-spi/src/main/java/com/facebook/presto/spi/eventlistener/QueryCompletedEvent.java
@@ -16,6 +16,7 @@
 import com.facebook.presto.common.plan.PlanCanonicalizationStrategy;
 import com.facebook.presto.common.resourceGroups.QueryType;
 import com.facebook.presto.spi.PrestoWarning;
+import com.facebook.presto.spi.plan.PlanNode;
 import com.facebook.presto.spi.plan.PlanNodeId;
 import com.facebook.presto.spi.prestospark.PrestoSparkExecutionContext;
 import com.facebook.presto.spi.statistics.PlanStatisticsWithSourceInfo;
@@ -57,6 +58,7 @@ public class QueryCompletedEvent
     private final Set<String> windowFunctions;
     private final Optional<PrestoSparkExecutionContext> prestoSparkExecutionContext;
     private final Map<PlanCanonicalizationStrategy, String> hboPlanHash;
+    private final Optional<Map<PlanNodeId, PlanNode>> planIdNodeMap;
 
     public QueryCompletedEvent(
             QueryMetadata metadata,
@@ -84,7 +86,8 @@ public QueryCompletedEvent(
             Set<String> aggregateFunctions,
             Set<String> windowFunctions,
             Optional<PrestoSparkExecutionContext> prestoSparkExecutionContext,
-            Map<PlanCanonicalizationStrategy, String> hboPlanHash)
+            Map<PlanCanonicalizationStrategy, String> hboPlanHash,
+            Optional<Map<PlanNodeId, PlanNode>> planNodeIdMap)
     {
         this.metadata = requireNonNull(metadata, "metadata is null");
         this.statistics = requireNonNull(statistics, "statistics is null");
@@ -112,6 +115,7 @@ public QueryCompletedEvent(
         this.windowFunctions = requireNonNull(windowFunctions, "windowFunctions is null");
         this.prestoSparkExecutionContext = requireNonNull(prestoSparkExecutionContext, "prestoSparkExecutionContext is null");
         this.hboPlanHash = requireNonNull(hboPlanHash, "planHash is null");
+        this.planIdNodeMap = requireNonNull(planNodeIdMap, "planNodeIdMap is null");
     }
 
     public QueryMetadata getMetadata()
@@ -243,4 +247,9 @@ public Map<PlanCanonicalizationStrategy, String> getHboPlanHash()
     {
         return hboPlanHash;
     }
+
+    public Optional<Map<PlanNodeId, PlanNode>> getPlanNodeIdMap()
+    {
+        return planIdNodeMap;
+    }
 }