Skip to content

Commit 0804053

Browse files
committed
Add vops_as_123 file
1 parent b9c9497 commit 0804053

File tree

2 files changed

+221
-1
lines changed

2 files changed

+221
-1
lines changed

Diff for: README.md

+1-1
Original file line numberDiff line numberDiff line change
@@ -789,7 +789,7 @@ The `create_projection(PNAME,...)` functions does the following:
789789
5. Insert information about created projection in `vops_projections` table. This table is used by optimizer to
790790
automatically substitute table with partition.
791791

792-
The `order_by` attribute is on of the VOPS projection vector columns by which data is sorted. Usually it is some kind of timestamp
792+
The `order_by` attribute is one of the VOPS projection vector columns by which data is sorted. Usually it is some kind of timestamp
793793
used in *time series* (for example trade date). Presence of such column in projection allows to incrementally update projection.
794794
Generated `PNAME_refresh()` method calls `populate` method with correspondent values of `predicate` and
795795
`sort` parameters, selecting from original table only rows with `order_by` column value greater than maximal

Diff for: vops_as_123.md

+220
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,220 @@
1+
## <span id="query_transform">Query transformation</span>
2+
3+
VOPS is vectorized executor for Postgres. It is implemented as extension and can be used with any version of Postgres
4+
and doesn't require any changes in Postgres core. Using vector (tile) types VOPS provides columnar format (like Apache Arrow)
5+
on top of standard Postgres heap. And custom operators for VOPS types implement vector operations, minimizing interpretation overhead.
6+
Comparing with other vertical solutions for Postgres: cstore_fwd and zedstore,
7+
VOPS provides more than 10 times improvement of execution speed on some OLAP queries.
8+
9+
VOPS was developed more than three yeas ago and while main paradigm is not changed,
10+
my vision of how it should be used was significantly changed.
11+
My original indention was to define VOPS-specific tile types (i.e. vops_int4,...)
12+
which provides the same operations as native scalar types and thanks to Postgres
13+
user defined operators allow to write queries for VOPS tables almost in the same was as
14+
for standard tables.
15+
16+
But very soon it becomes clear that although it is possible to redefine most of operators
17+
and aggregates, for some of them it is not possible: AND, OR, BETWEEN, COUNT(*).
18+
And my attempt to replace them with VOPS analogs, i.e. BETWIXT,&,|,...
19+
require users to learn VOPS "dialect" of SQL. Also there are other cases,
20+
when VOPS requires changing of standard queries, for example adding explicit type cast for
21+
string literal.
22+
23+
Another my idea was to use foreign-data-wrappers (FDW) mechanism
24+
to provide "normal" (row-based) access to VOPS table. vops_fdw should transpose columns into rows.
25+
Unfortunately limitation of Postgres FDW mechanism, like not supporting parallel query execution,
26+
devalue any advantages of columnar storage and performance is several times worse than for normal tables.
27+
28+
Finally I come to the conclusion that data should be kept in normal (horizontal)
29+
table which is most efficient for OLTP operations and importing data.
30+
And user should create one or more VOPS projections of this table,
31+
where some subset of columns are "tiles": scalar types are replaced with VOPS types.
32+
Grouping and ordering of rows in the project allows to store data in the format most efficient for
33+
some particular subclass of queries. For example Q1 query of TPC-H benchmark performs group by
34+
linestatus and returnflag, so we can create projection grouped by this columns.
35+
36+
Postgres planner hook tries to substitute query to main table with tranformed query to one of the available
37+
VOPS projections. If it is possible (requirements for queries which can be executed on projections
38+
will be listed below), then it transforms query using VOPS operators.
39+
If it is not possible, then query is executed as usual on the original table.
40+
41+
So user writes query using standard SQL. If there is some existed application, then
42+
it should not be changed. But if query can be executed on some of the available VOPS projections,
43+
then it's speed can be increased several times (for example speedup on TPC-H Q1 is about ten times).
44+
45+
Synchronization of main table and VOPS projects is responsibly of user.
46+
The straightforward solution is to use on-insert trigger to propagate changes to projection.
47+
But it is actually bad idea: to efficiently pack records in vertical format we need to
48+
perform bulk inserts. So we need to hold data in some buffer and then perform massive update.
49+
Fortunately VOPS is useful for OLAP queries and such queries usually do not require most recent data.
50+
So it is possible to periodically transfer data from main table to projects (for example at night
51+
when system is idle). If projection is ordered by some monotonic key (i.e timestamp),
52+
then VOPS is smart enough to append only most recent records which are not parent in the projection.
53+
If there is no such key, then projections should be recreated from scratch.
54+
55+
As far as main table and projections may not be synchronized, but default
56+
such implicit query transformation is disabled.
57+
You should explicitly switch it on using `vops.auto_substitute_projections` parameter.
58+
It can be done locally for the current session or for the whole Postgres instance.
59+
60+
## <span id="projection_creation">Creation of projections</span>
61+
62+
Now consider creation of partitions more precisely.
63+
64+
In future it may be added to SQL grammar, so that it is possible to write
65+
`CREATE PROJECTION xxx OF TABLE yyy(column1, column2,...) GROUP BY (column1, column2, ...)`.
66+
But right now it can be done using `create_projection(projection_name text, source_table regclass, vector_columns text[], scalar_columns text[] default null, order_by text default null)` function.
67+
First argument of this function specifies name of the projection, second refers to existed Postgres table, `vector_columns` is array of
68+
column names which should be stores as VOPS tiles, `scalar_columns` is array of grouping columns which type is preserved and
69+
optional `order_by` parameter specifies name of ordering attribute (explained below).
70+
The `create_projection(PNAME,...)` functions does the following:
71+
72+
1. Creates projection table with specified name and attributes.
73+
2. Creates PNAME_refresh() functions which can be used to update projection.
74+
3. Creates functional BRIN indexes for `first()` and `last()` functions of ordering attribute (if any)
75+
4. Creates BRIN index on grouping attributes (if any)
76+
5. Insert information about created projection in `vops_projections` table. This table is used by optimizer to
77+
automatically substitute table with partition.
78+
79+
The `order_by` attribute is one of the VOPS projection vector columns by which data is sorted. Usually it is some kind of timestamp
80+
used in *time series* (for example trade date). Presence of such column in projection allows to incrementally update projection.
81+
Generated `PNAME_refresh()` method calls `populate` method with correspondent values of `predicate` and
82+
`sort` parameters, selecting from original table only rows with `order_by` column value greater than maximal
83+
value of this column in the projection. It assumes that `order_by` is unique or at least refresh is done at the moment when there is some gap
84+
in collected events. In addition to `order_by`, sort list for `populate` includes all scalar (grouping) columns.
85+
It allows to efficiently group imported data by scalar columns and fill VOPS tiles (vector columns) with data.
86+
87+
When `order_by` attribute is specified, VOPS creates two functional BRIN indexes on `first()` and `last()`
88+
functions of this attribute. Presence of such indexes allows to efficiently select time slices. If original query contains
89+
predicate like `(trade_date between '01-01-2017' and '01-01-2018')` then VOPS projection substitution mechanism adds
90+
`(first(trade_date) >= '01-01-2017' and last(trade_date) >= '01-01-2018')` conjuncts which allow Postgres optimizer to use BRIN
91+
index to locate affected pages.
92+
93+
In in addition to BRIN indexes for `order_by` attribute, VOPS also creates BRIN index for grouping (scalar) columns.
94+
Such index allows to efficiently select groups and perform index join.
95+
96+
Presence of scalar columns in VOPS projections allows to used them in index search, grouping or ordering.
97+
Please notice, that if you are importing data in VOPS projection with scalar columns,
98+
then input data should be sorted by these columns. And number of duplicated combinations of this columns should be large enough (greater than hundreds).
99+
Only in this case tiles will be efficiently filled with data. Otherwise you will only loose disk space without any positive effect on performance.
100+
101+
102+
Right now projections can be automatically substituted only if:
103+
104+
1. Query doesn't contain joins.
105+
2. Query performs aggregation of vector (tile) columns.
106+
3. All other expressions in target list, `ORDER BY` / `GROUP BY` clauses refer only to scalar attributes of projection.
107+
108+
Projection can be removed using `drop_projection(projection_name text)` function.
109+
It not only drops the correspondent table, but also removes information about it from `vops_partitions` table
110+
and drops generated refresh function.
111+
112+
## <span id="example">Example of using projections</span>
113+
114+
115+
```
116+
create extension vops;
117+
118+
create table lineitem(
119+
l_orderkey integer,
120+
l_partkey integer,
121+
l_suppkey integer,
122+
l_linenumber integer,
123+
l_quantity real,
124+
l_extendedprice real,
125+
l_discount real,
126+
l_tax real,
127+
l_returnflag "char",
128+
l_linestatus "char",
129+
l_shipdate date,
130+
l_commitdate date,
131+
l_receiptdate date,
132+
l_shipinstruct char(25),
133+
l_shipmode char(10),
134+
l_comment char(44),
135+
l_dummy char(1));
136+
137+
-- Create VOPS projection
138+
select create_projection('vops_lineitem','lineitem',array['l_shipdate','l_quantity','l_extendedprice','l_discount','l_tax'],array['l_returnflag','l_linestatus']);
139+
140+
\timing
141+
142+
-- Load data in main table
143+
copy lineitem from '/mnt/data/lineitem.tbl' delimiter '|' csv;
144+
145+
-- Transfer data from main table to projections.
146+
select vops_lineitem_refresh();
147+
148+
-- Allow query substition
149+
set vops.auto_substitute_projections TO on;
150+
151+
-- Now let VOPS planner hook to use VOPS projections instead of main table
152+
select
153+
l_returnflag,
154+
l_linestatus,
155+
sum(l_quantity) as sum_qty,
156+
sum(l_extendedprice) as sum_base_price,
157+
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
158+
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
159+
avg(l_quantity) as avg_qty,
160+
avg(l_extendedprice) as avg_price,
161+
avg(l_discount) as avg_disc,
162+
count(*) as count_order
163+
from
164+
lineitem
165+
where
166+
l_shipdate <= '1998-12-01'
167+
group by
168+
l_returnflag,
169+
l_linestatus
170+
order by
171+
l_returnflag,
172+
l_linestatus;
173+
```
174+
175+
## <span id="vops_rules">Three rules of using VOPS</span>
176+
177+
Now lets formulate 1-2-3 rules of using VOPS.
178+
179+
1. Consider whether VOPS can help to speed-up your application. VOPS will be useful if
180+
your applications runs OLAP queries on large volumes of data and these queries mostly do some
181+
filtering and aggregation (may be with grouping). If queries contain joins, then VOPS can not help you
182+
unless you perform join on foreign key which was left as scalar in projections.
183+
But even in the last case automatic substations of query will not work and you have to write query manually.
184+
185+
2. Find out whether your queries perform some particular grouping. Size of each group should be large enough
186+
(thousands) and number of different grouping combination should be small (because maintaining more than few projections
187+
will be too expensive). Create projection for each group.
188+
189+
3. Define projection refresh policy. If original table contains some monotonic key (like timestamp or
190+
auto-generated column), then specify it as ordering field during projection creation. It will allow to incrementally update projection,
191+
appending only new records. VOPS is mostly oriented on work with append-only tables. There is no way to efficiently handle updates.
192+
To make transformation from horizontal to vertical format as efficient as possible, you should perform bulk update,
193+
adding relatively large number of records.
194+
Adding records to projection one-by-one leads to very bad space utilization, because VOPS is not able to append data to existed tiles.
195+
196+
Once you created projections and populated them with data, you can try to run queries.
197+
Please use `EXPLAIN` to check whether plan for projection was chosen.
198+
199+
## <span id="vops_pros_and_contras">VOPS pros and contras</span>
200+
201+
Advantages of VOPS:
202+
1. Significant (several times) increase of execution speed on some OLAP queries.
203+
VOPS provides speed comparable with Yandex Clickhouse, preserving all power of Postgres relational model.
204+
2. Can be used with any version of Postgres: no need to install Postgres forks and migrate data to them
205+
(unlike CitusDB/GreenPlum)
206+
3. Fully transnational, data can be managed using standard Postgres utilities, like basebackup, pg_dump...
207+
4. VOPS doesn't affect original table, so presence of VOPS doesn't somehow affect data consistency/durability,
208+
as well as performance of OLTP operations on original table, for example data insertion speed.
209+
5. Doesn't require rewriting queries and changing your application (in case of using automatic query substitution).
210+
211+
Certainly VOPS approach has some limitations and disadvantages:
212+
1. Extra space needed for VOPS projections: maintaining N projections requires N times more storage space.
213+
2. Original table and projections are not synchronized, so online analytic is not possible
214+
(OLAP queries are not working with most recent data).
215+
3. Transferring data from main table to projection may take significant amount of time and right now this operation can not be done in parallel.
216+
4. Only small subset of OLAP queries can be efficiently handled by VOPS (joins, subqueries, window functions and many other things are not supported).
217+
5. No columnar specific compression is performed (VOPS relies on standard Postgres compression mechanism which is not so efficient).
218+
6. Some query plans may return incorrect results (for example VOPS correctly works for hash aggregate, but not for sort aggregate).
219+
220+
I hope that this small document can help you to understand whether VOPS may be useful in your case.

0 commit comments

Comments
 (0)