Skip to content

Commit c8434d6

Browse files
author
Etsuro Fujita
committed
Allow partitionwise joins in more cases.
Previously, the partitionwise join technique only allowed partitionwise join when input partitioned tables had exactly the same partition bounds. This commit extends the technique to some cases when the tables have different partition bounds, by using an advanced partition-matching algorithm introduced by this commit. For both the input partitioned tables, the algorithm checks whether every partition of one input partitioned table only matches one partition of the other input partitioned table at most, and vice versa. In such a case the join between the tables can be broken down into joins between the matching partitions, so the algorithm produces the pairs of the matching partitions, plus the partition bounds for the join relation, to allow partitionwise join for computing the join. Currently, the algorithm works for list-partitioned and range-partitioned tables, but not hash-partitioned tables. See comments in partition_bounds_merge(). Ashutosh Bapat and Etsuro Fujita, most of regression tests by Rajkumar Raghuwanshi, some of the tests by Mark Dilger and Amul Sul, reviewed by Dmitry Dolgov and Amul Sul, with additional review at various points by Ashutosh Bapat, Mark Dilger, Robert Haas, Antonin Houska, Amit Langote, Justin Pryzby, and Tomas Vondra Discussion: https://postgr.es/m/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com
1 parent 41a194f commit c8434d6

File tree

11 files changed

+5392
-83
lines changed

11 files changed

+5392
-83
lines changed

doc/src/sgml/config.sgml

+3-3
Original file line numberDiff line numberDiff line change
@@ -4749,9 +4749,9 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
47494749
which allows a join between partitioned tables to be performed by
47504750
joining the matching partitions. Partitionwise join currently applies
47514751
only when the join conditions include all the partition keys, which
4752-
must be of the same data type and have exactly matching sets of child
4753-
partitions. Because partitionwise join planning can use significantly
4754-
more CPU time and memory during planning, the default is
4752+
must be of the same data type and have one-to-one matching sets of
4753+
child partitions. Because partitionwise join planning can use
4754+
significantly more CPU time and memory during planning, the default is
47554755
<literal>off</literal>.
47564756
</para>
47574757
</listitem>

src/backend/nodes/outfuncs.c

+2
Original file line numberDiff line numberDiff line change
@@ -2309,6 +2309,8 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
23092309
WRITE_BOOL_FIELD(has_eclass_joins);
23102310
WRITE_BOOL_FIELD(consider_partitionwise_join);
23112311
WRITE_BITMAPSET_FIELD(top_parent_relids);
2312+
WRITE_BOOL_FIELD(partbounds_merged);
2313+
WRITE_BITMAPSET_FIELD(all_partrels);
23122314
WRITE_NODE_FIELD(partitioned_child_rels);
23132315
}
23142316

src/backend/optimizer/README

+27
Original file line numberDiff line numberDiff line change
@@ -1106,6 +1106,33 @@ into joins between their partitions is called partitionwise join. We will use
11061106
term "partitioned relation" for either a partitioned table or a join between
11071107
compatibly partitioned tables.
11081108

1109+
Even if the joining relations don't have exactly the same partition bounds,
1110+
partitionwise join can still be applied by using an advanced
1111+
partition-matching algorithm. For both the joining relations, the algorithm
1112+
checks wether every partition of one joining relation only matches one
1113+
partition of the other joining relation at most. In such a case the join
1114+
between the joining relations can be broken down into joins between the
1115+
matching partitions. The join relation can then be considerd partitioned.
1116+
The algorithm produces the pairs of the matching partitions, plus the
1117+
partition bounds for the join relation, to allow partitionwise join for
1118+
computing the join. The algorithm is implemented in partition_bounds_merge().
1119+
For an N-way join relation considered partitioned this way, not every pair of
1120+
joining relations can use partitionwise join. For example:
1121+
1122+
(A leftjoin B on (Pab)) innerjoin C on (Pac)
1123+
1124+
where A, B, and C are partitioned tables, and A has an extra partition
1125+
compared to B and C. When considering partitionwise join for the join {A B},
1126+
the extra partition of A doesn't have a matching partition on the nullable
1127+
side, which is the case that the current implementation of partitionwise join
1128+
can't handle. So {A B} is not considered partitioned, and the pair of {A B}
1129+
and C considered for the 3-way join can't use partitionwise join. On the
1130+
other hand, the pair of {A C} and B can use partitionwise join because {A C}
1131+
is considered partitioned by eliminating the extra partition (see identity 1
1132+
on outer join reordering). Whether an N-way join can use partitionwise join
1133+
is determined based on the first pair of joining relations that are both
1134+
partitioned and can use partitionwise join.
1135+
11091136
The partitioning properties of a partitioned relation are stored in its
11101137
RelOptInfo. The information about data types of partition keys are stored in
11111138
PartitionSchemeData structure. The planner maintains a list of canonical

src/backend/optimizer/path/joinrels.c

+236-27
Original file line numberDiff line numberDiff line change
@@ -45,6 +45,13 @@ static void try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1,
4545
static SpecialJoinInfo *build_child_join_sjinfo(PlannerInfo *root,
4646
SpecialJoinInfo *parent_sjinfo,
4747
Relids left_relids, Relids right_relids);
48+
static void compute_partition_bounds(PlannerInfo *root, RelOptInfo *rel1,
49+
RelOptInfo *rel2, RelOptInfo *joinrel,
50+
SpecialJoinInfo *parent_sjinfo,
51+
List **parts1, List **parts2);
52+
static void get_matching_part_pairs(PlannerInfo *root, RelOptInfo *joinrel,
53+
RelOptInfo *rel1, RelOptInfo *rel2,
54+
List **parts1, List **parts2);
4855

4956

5057
/*
@@ -1354,25 +1361,29 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
13541361
{
13551362
bool rel1_is_simple = IS_SIMPLE_REL(rel1);
13561363
bool rel2_is_simple = IS_SIMPLE_REL(rel2);
1357-
int nparts;
1364+
List *parts1 = NIL;
1365+
List *parts2 = NIL;
1366+
ListCell *lcr1 = NULL;
1367+
ListCell *lcr2 = NULL;
13581368
int cnt_parts;
13591369

13601370
/* Guard against stack overflow due to overly deep partition hierarchy. */
13611371
check_stack_depth();
13621372

13631373
/* Nothing to do, if the join relation is not partitioned. */
1364-
if (!IS_PARTITIONED_REL(joinrel))
1374+
if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
13651375
return;
13661376

13671377
/* The join relation should have consider_partitionwise_join set. */
13681378
Assert(joinrel->consider_partitionwise_join);
13691379

13701380
/*
1371-
* Since this join relation is partitioned, all the base relations
1372-
* participating in this join must be partitioned and so are all the
1373-
* intermediate join relations.
1381+
* We can not perform partitionwise join if either of the joining relations
1382+
* is not partitioned.
13741383
*/
1375-
Assert(IS_PARTITIONED_REL(rel1) && IS_PARTITIONED_REL(rel2));
1384+
if (!IS_PARTITIONED_REL(rel1) || !IS_PARTITIONED_REL(rel2))
1385+
return;
1386+
13761387
Assert(REL_HAS_ALL_PART_PROPS(rel1) && REL_HAS_ALL_PART_PROPS(rel2));
13771388

13781389
/* The joining relations should have consider_partitionwise_join set. */
@@ -1386,42 +1397,51 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
13861397
Assert(joinrel->part_scheme == rel1->part_scheme &&
13871398
joinrel->part_scheme == rel2->part_scheme);
13881399

1389-
/*
1390-
* Since we allow partitionwise join only when the partition bounds of the
1391-
* joining relations exactly match, the partition bounds of the join
1392-
* should match those of the joining relations.
1393-
*/
1394-
Assert(partition_bounds_equal(joinrel->part_scheme->partnatts,
1395-
joinrel->part_scheme->parttyplen,
1396-
joinrel->part_scheme->parttypbyval,
1397-
joinrel->boundinfo, rel1->boundinfo));
1398-
Assert(partition_bounds_equal(joinrel->part_scheme->partnatts,
1399-
joinrel->part_scheme->parttyplen,
1400-
joinrel->part_scheme->parttypbyval,
1401-
joinrel->boundinfo, rel2->boundinfo));
1400+
Assert(!(joinrel->partbounds_merged && (joinrel->nparts <= 0)));
14021401

1403-
nparts = joinrel->nparts;
1402+
compute_partition_bounds(root, rel1, rel2, joinrel, parent_sjinfo,
1403+
&parts1, &parts2);
1404+
1405+
if (joinrel->partbounds_merged)
1406+
{
1407+
lcr1 = list_head(parts1);
1408+
lcr2 = list_head(parts2);
1409+
}
14041410

14051411
/*
14061412
* Create child-join relations for this partitioned join, if those don't
14071413
* exist. Add paths to child-joins for a pair of child relations
14081414
* corresponding to the given pair of parent relations.
14091415
*/
1410-
for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
1416+
for (cnt_parts = 0; cnt_parts < joinrel->nparts; cnt_parts++)
14111417
{
1412-
RelOptInfo *child_rel1 = rel1->part_rels[cnt_parts];
1413-
RelOptInfo *child_rel2 = rel2->part_rels[cnt_parts];
1414-
bool rel1_empty = (child_rel1 == NULL ||
1415-
IS_DUMMY_REL(child_rel1));
1416-
bool rel2_empty = (child_rel2 == NULL ||
1417-
IS_DUMMY_REL(child_rel2));
1418+
RelOptInfo *child_rel1;
1419+
RelOptInfo *child_rel2;
1420+
bool rel1_empty;
1421+
bool rel2_empty;
14181422
SpecialJoinInfo *child_sjinfo;
14191423
List *child_restrictlist;
14201424
RelOptInfo *child_joinrel;
14211425
Relids child_joinrelids;
14221426
AppendRelInfo **appinfos;
14231427
int nappinfos;
14241428

1429+
if (joinrel->partbounds_merged)
1430+
{
1431+
child_rel1 = lfirst_node(RelOptInfo, lcr1);
1432+
child_rel2 = lfirst_node(RelOptInfo, lcr2);
1433+
lcr1 = lnext(parts1, lcr1);
1434+
lcr2 = lnext(parts2, lcr2);
1435+
}
1436+
else
1437+
{
1438+
child_rel1 = rel1->part_rels[cnt_parts];
1439+
child_rel2 = rel2->part_rels[cnt_parts];
1440+
}
1441+
1442+
rel1_empty = (child_rel1 == NULL || IS_DUMMY_REL(child_rel1));
1443+
rel2_empty = (child_rel2 == NULL || IS_DUMMY_REL(child_rel2));
1444+
14251445
/*
14261446
* Check for cases where we can prove that this segment of the join
14271447
* returns no rows, due to one or both inputs being empty (including
@@ -1519,6 +1539,8 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
15191539
child_sjinfo,
15201540
child_sjinfo->jointype);
15211541
joinrel->part_rels[cnt_parts] = child_joinrel;
1542+
joinrel->all_partrels = bms_add_members(joinrel->all_partrels,
1543+
child_joinrel->relids);
15221544
}
15231545

15241546
Assert(bms_equal(child_joinrel->relids, child_joinrelids));
@@ -1570,3 +1592,190 @@ build_child_join_sjinfo(PlannerInfo *root, SpecialJoinInfo *parent_sjinfo,
15701592

15711593
return sjinfo;
15721594
}
1595+
1596+
/*
1597+
* compute_partition_bounds
1598+
* Compute the partition bounds for a join rel from those for inputs
1599+
*/
1600+
static void
1601+
compute_partition_bounds(PlannerInfo *root, RelOptInfo *rel1,
1602+
RelOptInfo *rel2, RelOptInfo *joinrel,
1603+
SpecialJoinInfo *parent_sjinfo,
1604+
List **parts1, List **parts2)
1605+
{
1606+
/*
1607+
* If we don't have the partition bounds for the join rel yet, try to
1608+
* compute those along with pairs of partitions to be joined.
1609+
*/
1610+
if (joinrel->nparts == -1)
1611+
{
1612+
PartitionScheme part_scheme = joinrel->part_scheme;
1613+
PartitionBoundInfo boundinfo = NULL;
1614+
int nparts = 0;
1615+
1616+
Assert(joinrel->boundinfo == NULL);
1617+
Assert(joinrel->part_rels == NULL);
1618+
1619+
/*
1620+
* See if the partition bounds for inputs are exactly the same, in
1621+
* which case we don't need to work hard: the join rel have the same
1622+
* partition bounds as inputs, and the partitions with the same
1623+
* cardinal positions form the pairs.
1624+
*
1625+
* Note: even in cases where one or both inputs have merged bounds,
1626+
* it would be possible for both the bounds to be exactly the same, but
1627+
* it seems unlikely to be worth the cycles to check.
1628+
*/
1629+
if (!rel1->partbounds_merged &&
1630+
!rel2->partbounds_merged &&
1631+
rel1->nparts == rel2->nparts &&
1632+
partition_bounds_equal(part_scheme->partnatts,
1633+
part_scheme->parttyplen,
1634+
part_scheme->parttypbyval,
1635+
rel1->boundinfo, rel2->boundinfo))
1636+
{
1637+
boundinfo = rel1->boundinfo;
1638+
nparts = rel1->nparts;
1639+
}
1640+
else
1641+
{
1642+
/* Try merging the partition bounds for inputs. */
1643+
boundinfo = partition_bounds_merge(part_scheme->partnatts,
1644+
part_scheme->partsupfunc,
1645+
part_scheme->partcollation,
1646+
rel1, rel2,
1647+
parent_sjinfo->jointype,
1648+
parts1, parts2);
1649+
if (boundinfo == NULL)
1650+
{
1651+
joinrel->nparts = 0;
1652+
return;
1653+
}
1654+
nparts = list_length(*parts1);
1655+
joinrel->partbounds_merged = true;
1656+
}
1657+
1658+
Assert(nparts > 0);
1659+
joinrel->boundinfo = boundinfo;
1660+
joinrel->nparts = nparts;
1661+
joinrel->part_rels =
1662+
(RelOptInfo **) palloc0(sizeof(RelOptInfo *) * nparts);
1663+
}
1664+
else
1665+
{
1666+
Assert(joinrel->nparts > 0);
1667+
Assert(joinrel->boundinfo);
1668+
Assert(joinrel->part_rels);
1669+
1670+
/*
1671+
* If the join rel's partbounds_merged flag is true, it means inputs
1672+
* are not guaranteed to have the same partition bounds, therefore we
1673+
* can't assume that the partitions at the same cardinal positions form
1674+
* the pairs; let get_matching_part_pairs() generate the pairs.
1675+
* Otherwise, nothing to do since we can assume that.
1676+
*/
1677+
if (joinrel->partbounds_merged)
1678+
{
1679+
get_matching_part_pairs(root, joinrel, rel1, rel2,
1680+
parts1, parts2);
1681+
Assert(list_length(*parts1) == joinrel->nparts);
1682+
Assert(list_length(*parts2) == joinrel->nparts);
1683+
}
1684+
}
1685+
}
1686+
1687+
/*
1688+
* get_matching_part_pairs
1689+
* Generate pairs of partitions to be joined from inputs
1690+
*/
1691+
static void
1692+
get_matching_part_pairs(PlannerInfo *root, RelOptInfo *joinrel,
1693+
RelOptInfo *rel1, RelOptInfo *rel2,
1694+
List **parts1, List **parts2)
1695+
{
1696+
bool rel1_is_simple = IS_SIMPLE_REL(rel1);
1697+
bool rel2_is_simple = IS_SIMPLE_REL(rel2);
1698+
int cnt_parts;
1699+
1700+
*parts1 = NIL;
1701+
*parts2 = NIL;
1702+
1703+
for (cnt_parts = 0; cnt_parts < joinrel->nparts; cnt_parts++)
1704+
{
1705+
RelOptInfo *child_joinrel = joinrel->part_rels[cnt_parts];
1706+
RelOptInfo *child_rel1;
1707+
RelOptInfo *child_rel2;
1708+
Relids child_relids1;
1709+
Relids child_relids2;
1710+
1711+
/*
1712+
* If this segment of the join is empty, it means that this segment
1713+
* was ignored when previously creating child-join paths for it in
1714+
* try_partitionwise_join() as it would not contribute to the join
1715+
* result, due to one or both inputs being empty; add NULL to each of
1716+
* the given lists so that this segment will be ignored again in that
1717+
* function.
1718+
*/
1719+
if (!child_joinrel)
1720+
{
1721+
*parts1 = lappend(*parts1, NULL);
1722+
*parts2 = lappend(*parts2, NULL);
1723+
continue;
1724+
}
1725+
1726+
/*
1727+
* Get a relids set of partition(s) involved in this join segment that
1728+
* are from the rel1 side.
1729+
*/
1730+
child_relids1 = bms_intersect(child_joinrel->relids,
1731+
rel1->all_partrels);
1732+
Assert(bms_num_members(child_relids1) == bms_num_members(rel1->relids));
1733+
1734+
/*
1735+
* Get a child rel for rel1 with the relids. Note that we should have
1736+
* the child rel even if rel1 is a join rel, because in that case the
1737+
* partitions specified in the relids would have matching/overlapping
1738+
* boundaries, so the specified partitions should be considered as ones
1739+
* to be joined when planning partitionwise joins of rel1, meaning that
1740+
* the child rel would have been built by the time we get here.
1741+
*/
1742+
if (rel1_is_simple)
1743+
{
1744+
int varno = bms_singleton_member(child_relids1);
1745+
1746+
child_rel1 = find_base_rel(root, varno);
1747+
}
1748+
else
1749+
child_rel1 = find_join_rel(root, child_relids1);
1750+
Assert(child_rel1);
1751+
1752+
/*
1753+
* Get a relids set of partition(s) involved in this join segment that
1754+
* are from the rel2 side.
1755+
*/
1756+
child_relids2 = bms_intersect(child_joinrel->relids,
1757+
rel2->all_partrels);
1758+
Assert(bms_num_members(child_relids2) == bms_num_members(rel2->relids));
1759+
1760+
/*
1761+
* Get a child rel for rel2 with the relids. See above comments.
1762+
*/
1763+
if (rel2_is_simple)
1764+
{
1765+
int varno = bms_singleton_member(child_relids2);
1766+
1767+
child_rel2 = find_base_rel(root, varno);
1768+
}
1769+
else
1770+
child_rel2 = find_join_rel(root, child_relids2);
1771+
Assert(child_rel2);
1772+
1773+
/*
1774+
* The join of rel1 and rel2 is legal, so is the join of the child
1775+
* rels obtained above; add them to the given lists as a join pair
1776+
* producing this join segment.
1777+
*/
1778+
*parts1 = lappend(*parts1, child_rel1);
1779+
*parts2 = lappend(*parts2, child_rel2);
1780+
}
1781+
}

src/backend/optimizer/util/inherit.c

+2
Original file line numberDiff line numberDiff line change
@@ -376,6 +376,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
376376
/* Create the otherrel RelOptInfo too. */
377377
childrelinfo = build_simple_rel(root, childRTindex, relinfo);
378378
relinfo->part_rels[i] = childrelinfo;
379+
relinfo->all_partrels = bms_add_members(relinfo->all_partrels,
380+
childrelinfo->relids);
379381

380382
/* If this child is itself partitioned, recurse */
381383
if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)

0 commit comments

Comments
 (0)