F.85. vops — support for vector operations #
The vops
module is an experimental Postgres Pro Enterprise extension that provides support for vector operations, which allows to speed up OLAP queries with filtering and aggregation more than 10 times. Performance improvements are achieved by implementing a vertical data model that allows storing data in tiles, or groups of column values. Such a storage format is also called Parquet in some analytical databases. Reducing the overhead of unpacking tuples, vertical data model allows to speed up query execution without any radical changes in Postgres Pro planner and executor.
vops
provides its own set of vector functions that can be used inside predicates and aggregate expressions. To avoid learning new syntax, you can use postgres_fdw as an abstraction layer and run regular SQL queries on vectorized data. For query types other than filtering and aggregation, vops
can also use postgres_fdw
to present vectorized data as a regular table with scalar column types and process this data as regular tuples. For ease of use, vops
also provides an automated way of generating and accessing vectorized data, as explained in Section F.85.5.5.
F.85.1. Limitations #
Tile-based tables only allow
INSERT
andAPPEND
operations. Once imported, vectorized data cannot be updated.Vector operations are only available for filtering and aggregation. Other query types are supported via
postgres_fdw
.Aggregate expressions must be of the same type.
JOIN
operations are not supported for vector types. You have to usepostgres_fdw
to transform vectorized data back to regular tuples to run such queries.Since
postgres_fdw
does not support parallel query execution, you can only run parallel queries when using vector operators directly.
F.85.2. Installation and Setup #
The vops
extension is included into Postgres Pro Enterprise. Once you have Postgres Pro Enterprise installed, complete the following steps to enable vops
:
Add
vops
to the shared_preload_libraries parameter in thepostgresql.conf
file:shared_preload_libraries = 'vops'
Note
If the extension library is not added to the
shared_preload_libraries
list, it will be loaded on demand after the first function invocation of this extension. In this case, the first time you execute a vectorized query, it will not be processed correctly and can return a wrong result.Restart the Postgres Pro Enterprise instance for the changes to take effect.
Create
vops
extension:CREATE EXTENSION vops;
Once vops
is enabled, you can create vectorized tables and query data using vector operations in your database.
If you would like to use regular SQL syntax for vector types, you must also enable postgres_fdw.
F.85.3. Motivation #
For OLTP workloads, Postgres Pro looks very competitive as compared to other mainstream databases. However, for OLAP queries, which require processing larger volumes of data, analytics-oriented database management systems can provide a much better speed. There are several main factors that limit Postgres Pro performance:
Tuple unpacking overhead. To access column values, Postgres Pro needs to unpack the tuple. As tables can have columns of variable-length data types, to extract the N-th column, the preceding N-1 columns need to be unpacked. Thus, deforming a tuple is quite an expensive operation, especially for tables with a large number of attributes. Besides, values can be compressed or stored in another page (TOAST). In queries like Query 6 of the TPC-H benchmark, deforming tuples takes about 40% of the total query execution time.
Interpretation overhead. Postgres Pro planner and optimizer build a tree representing the query execution plan. Query executor recursively calls functions that evaluate the nodes of this tree. Some nodes may also contain switches used to select the requested action. Thus, query plan is interpreted by Postgres Pro query executor rather than directly executed. The interpreter is usually about 10 times slower than the native code. If you eliminate interpretation overhead, you can increase query speed several times, especially for queries with complex predicates that have high evaluation costs.
Abstraction penalty. Support for abstract (user-defined) types and operations is one of the key features of Postgres Pro. However, the price of such flexibility is that each operation requires a separate function call. Instead of adding two integers directly, Postgres Pro executor invokes a function that performs addition. In this case, function call overhead is much higher than the cost the operation itself. Function call overhead is also increased because Postgres Pro requires passing parameter values through memory.
Pull model overhead. In Postgres Pro, operators pull the operand values. This approach simplifies the implementation of operators and the executor, but has a negative impact on performance. In this model, leaf nodes that fetch a tuple from the heap or index pages have to do a lot of extra work while saving and restoring their context.
MVCC overhead. Postgres Pro provides multi-version concurrency control, which allows multiple transactions to work with the same record in parallel without blocking each other. It is good for frequently updated data (OLTP), but for read-only or append-only data in OLAP scenarios it only adds both space overhead (about 20 extra bytes per tuple) and CPU overhead (checking visibility of each tuple).
Vector operations can address most of these issues without radical changes in the Postgres Pro executor, as explained in Section F.85.4.
F.85.4. Architecture #
The vops
extension implements a vertical data model for Postgres Pro. In this model, vertical columns are used as a data storing unit, with the values of the corresponding table attribute forming a vector.
The vertical data model used by vops
has the following advantages:
Reduces the size of fetched data: only the columns used in the query need to be fetched.
Achieves higher compression rates: storing all values of the same attribute together makes it possible to compress them much better and faster. For example, you can use delta encoding.
Minimizes interpretation overhead: a set of values can be processed by a single operation instead of running separate operations for each value.
Uses CPU vector instructions (SIMD) to process data.
A traditional Postgres Pro query executor deals with a single row of data at each moment of time. For example, to evaluate an expression (x+y)
, it first fetches the value of x
, then fetches the value of y
, performs the addition, and returns the result value to the upper node. In contrast, a vectorized executor can process multiple values within a single operation. In this case, x
and y
represent vectors of values instead of single scalars. The returned result is also a vector. In vector execution model, interpretation and function call overhead is divided by the vector size. The price of performing a function call is the same, but as the function processes N values instead of a single one, this overhead becomes less critical. The larger the vector, the smaller the per-row overhead.
However, performing an operation on the whole column may also turn out to be ineffective. Working with large vectors prevents efficient utilization of CPU cache levels. If the table is very large, the vector may not fit in memory at all. To avoid these issues, vops
splits the columns into relatively small chunks, or tiles. The size of a tile is currently set to 64 elements. It allows to keep all operands of vector operations in cache, even for complex expressions.
The vops
extension implements special vector types to be used instead of scalar types for table columns. The available types are listed in Section F.85.6.1. To use vops
, you have to create vectorized projections of original tables, with at least some attributes using these tile types, as explained in Section F.85.5.1.
The original table can be treated as a write-optimized storage: if it has no indexes, Postgres Pro can provide a very fast insertion speed, comparable to raw disk write speed. Vectorized projection can be treated as a read-optimized storage, which is most efficient for running OLAP queries.
Once all the data is in the tile-based format, you can run vectorized queries on this data. vops
provides a set of vector operators to work with vector types. Using these operators, you can write queries for filtering and aggregation, similar to SQL queries. For details, see Section F.85.5.2.
For other query types, vops
allows to work with vectorized data via postgres_fdw
. You can add vectorized data to your database cluster as a foreign table and use regular SQL for running any types of queries on this data. Using the implicit type casts and post-parse ANALYZE
hook, the vops
extension either transforms the query to use vector operators for filtering and aggregation, or processes the data using scalar operators for other query types. For details, see Section F.85.5.3.
F.85.5. Usage #
F.85.5.1. Converting Data into Vectorized Format #
To start using vector operations, you have to convert your data into a vectorized format, as explained below.
Create an empty vectorized table.
To use vector operations, you need to load data into a tile-based vectorized table, which can be treated as a projection of the original table. It can map all columns of the original table, or include some of the most frequently used columns only. To create a vectorized table, you can use the regular CREATE TABLE syntax, but at least some of the columns must be of vector types.
For example, suppose you have the following table from the TPC-H benchmark:
CREATE TABLE lineitem( l_orderkey integer, l_partkey integer, l_suppkey integer, l_linenumber integer, l_quantity double precision, l_extendedprice double precision, l_discount double precision, l_tax double precision, l_returnflag "char", l_linestatus "char", l_shipdate date, l_commitdate date, l_receiptdate date, l_shipinstruct char(25), l_shipmode char(10), l_comment char(44));
To create a vectorized projection that includes only a subset of the original columns, you can run:
CREATE TABLE vops_lineitem( l_shipdate vops_date not null, l_quantity vops_float8 not null, l_extendedprice vops_float8 not null, l_discount vops_float8 not null, l_tax vops_float8 not null, l_returnflag vops_char not null, l_linestatus vops_char not null );
Alternatively, you can keep the scalar type for some of the columns. In this case, you can sort the data by these columns when loading the data. For example, let's create another vectorized projection where
l_returnflag
andl_linestatus
fields remain scalar:CREATE TABLE vops_lineitem_projection( l_shipdate vops_date not null, l_quantity vops_float8 not null, l_extendedprice vops_float8 not null, l_discount vops_float8 not null, l_tax vops_float8 not null, l_returnflag "char" not null, l_linestatus "char" not null );
In this table,
l_returnflag
andl_linestatus
fields are scalar, while all the other fields are of the vector types that can be used in vector operations.Load the data into the vectorized table.
If the data to load is already stored in your database, use the populate() function to preprocess the data and unite attribute values of several rows inside a single tile. This function loads the data from the original table into its vectorized projection, using vector types instead of scalar types, as specified in the created empty table.
For example:
SELECT populate(destination := 'vops_lineitem'::regclass, source := 'lineitem'::regclass);
Once the table is populated, you can start running queries performing sequential scan.
Optionally, you can sort the data by one or more scalar columns when loading the data:
SELECT populate(destination := 'vops_lineitem_projection'::regclass, source := 'lineitem'::regclass, sort := 'l_returnflag,l_linestatus');
If you sort the original table by the scalar columns that have a lot of duplicates,
vops
can effectively collapse them and store the corresponding attribute values in tiles, which can reduce the occupied space, speed up queries, and facilitate index creation on these scalar columns.If your data is not yet loaded into the database and you would like to avoid having two copies of the same dataset, you can import the data into a vectorized table directly from a CSV file using the import() function, bypassing creation of a regular table. For example:
SELECT import(destination := 'vops_lineitem'::regclass, csv_path := '/mnt/data/lineitem.csv', separator := '|');
For detailed description and syntax of the populate()
and import()
functions, see Section F.85.6.4.
Once your data is in the tile-based format, you can run queries on this data using vector operators, as described in Section F.85.5.2. If you create a foreign vectorized table, you can also run regular SQL queries. See Section F.85.5.3 for details.
F.85.5.2. Running Queries on Vectorized Data #
vops
implements its own vector functions and operators for working with vectorized data. See Section F.85.6 for the full list of vector functions and operators and their usage specifics.
F.85.5.2.1. Aggregating Vectorized Data #
OLAP queries often aggregate large volumes of data. To enable aggregate operations for vector types, vops
implements the following aggregate functions for vectorized data: count
, min
, max
, avg
, var_pop
, var_sampl
, variance
, stddev_pop
, stddev_samp
, stddev
. Besides, vops
provides the wavg
aggregate function that calculates volume-weighted average price (VWAP).
The vops
extension supports the following types of aggregates:
grand aggregates calculated for the whole table
aggregates for subsets of table rows defined by the
GROUP BY
clause
F.85.5.2.1.1. Calculating Grand Aggregates #
To calculate grand aggregates, you can use the provided vector aggregate functions exactly like in regular SQL queries.
Examples:
To calculate volume-weighted average price, run the following query:
SELECT wavg(l_extendedprice,l_quantity) FROM vops_lineitem;
To calculate the company revenue for the year 1996 based on discounted prices of the shipped goods, run:
SELECT sum(l_extendedprice*l_discount) AS revenue FROM vops_lineitem WHERE filter(betwixt(l_shipdate, '1996-01-01', '1997-01-01') & betwixt(l_discount, 0.08, 0.1) & (l_quantity < 24));
F.85.5.2.1.2. Calculating Grouped Aggregates #
To support grouped aggregates, vops
provides map()
and reduce()
functions:
The
map()
function collects aggregate states for all groups in a hash table. Its syntax is as follows:map(
group_by_expression
,aggregate_list
,expr
{,expr
})The
reduce()
function iterates through the hash table constructed bymap
and returns the set. Thereduce
function is needed because the result of aggregation in Postgres Pro cannot be a set.
Consider the following example:
SELECT reduce(map(l_returnflag||l_linestatus, 'sum,sum,sum,sum,avg,avg,avg', l_quantity, l_extendedprice, l_extendedprice*(1-l_discount), l_extendedprice*(1-l_discount)*(1+l_tax), l_quantity, l_extendedprice, l_discount)) FROM vops_lineitem WHERE filter(l_shipdate <= '1998-12-01'::date);
The concatenation operator ||
performs grouping by two columns. The vops
extension supports grouping only by an integer type. The map
function accepts aggregation arguments as a variadic array, so all elements of this array should have the same type. For example, you cannot calculate aggregates for vops_float4
and vops_int8
columns in a single operation.
An aggregate string in map
function should contain the list of aggregate functions to call, in a comma-separated format. Standard lowercase names should be used: count, sum, agg, min, max
. Count is executed for the particular column: count(x)
. There is no need to explicitly specify count(*)
because the number of records in each group is returned by reduce
function in any case.
reduce
function returns set of the vops_aggregate
type. It contains three components: the value of the GROUP BY
expression, the number of records in the group, and an array of floats with aggregate values. The values of all aggregates, including count
and min/max
, are returned as floats.
Alternatively, you can partition the vectorized table by the GROUP BY
fields to calculate grouped aggregates. In this case, grouping keys are stored as regular scalar values, while other fields are stored inside tiles. Now Postgres Pro executor will execute vops aggregates for each group:
SELECT l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) AS sum_base_price, sum(l_extendedprice*(1-l_discount)) AS sum_disc_price, sum(l_extendedprice*(1-l_discount)*(1+l_tax)) AS sum_charge, avg(l_quantity) AS avg_qty, avg(l_extendedprice) AS avg_price, avg(l_discount) AS avg_disc, count(*) AS count_order FROM vops_lineitem_projection WHERE filter(l_shipdate <= '1998-12-01'::date) GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag, l_linestatus;
In this example, l_returnflag
and l_linestatus
fields of the vops_lineitem_projection
table have the char
type, while all the other used fields are of vector types. The query above is executed even faster than query with the map()
and reduce()
functions. The main problem with this approach is that you have to create a separate projection for each combination of the GROUP BY
keys to be used in queries.
F.85.5.2.2. Using Vector Window Functions #
The vops
extension provides limited support for Postgres Pro window functions. It provides mcount
, msum
, mmin
, mmax
, mavg
, and lag
functions that correspond to the standard Postgres Pro window functions count
, sum
, min
, max
, avg
, and lag
functions, respectively.
Note the following important restrictions:
Filtering, grouping, and sorting can be done only by scalar attributes.
Vector window functions only support window frames specified with
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
.
Examples:
SELECT vops_unnest(t.*) FROM (SELECT mcount(*) over w,mcount(x) over w,msum(x) over w,mavg(x) over w,mmin(x) over w,mmax(x) over w,x - lag(x) over w FROM v window w AS (rows between unbounded preceding and current row)) t;
F.85.5.2.3. Converting Results to Horizontal Format #
A query from the vops
projection returns a set of tiles. The output function of a tile type can print the tile contents. However, if you want to convert the result to the horizontal format where each tuple represents a single record, use the vops-unnest function. For example:
SELECT vops_unnest(l.*) FROM vops_lineitem l WHERE filter(l_shipdate >= '1998-12-01'::date);
F.85.5.3. Using Standard SQL for Vector Operations #
Vector operators can efficiently execute only filtering and aggregation queries. For other query types, you have to run regular SQL queries. The vops
extension offers the following options:
Use the original table, if any.
Enable a foreign data wrapper (FDW) using
postgres_fdw
to present a vectorized table to Postgres Pro as a regular table with scalar column types.
If you use postgres_fdw
with the vops
extension, vectorized data is not really foreign: it is stored inside the database in an alternative vectorized format. FDW allows to hide specifics of the vectorized format and provides the following opportunities for working with vectorized data:
Extract the data from a vectorized table and present it in the horizontal format, so that it can be processed by upper nodes in the query execution plan.
Push filtering and aggregation queries down to vector operations.
Get statistic for the underlying table by running
ANALYZE
for the foreign table. The collected statistic can be used by the query optimizer, so query execution plan should be almost the same as for regular tables.
Thus, by placing a vectorized projection under FDW, you can use standard SQL without any vops
-specific functions to perform sequential scan and aggregation queries as if they are explicitly written for the vops
table, and execute any other queries on this data at the same time, including JOIN
operations, common table expressions (CTE), etc. The queries that can be efficiently executed by vector operations will be pushed by Postgres Pro query optimizer to vops
FDW and will be executed using vector operators. For other queries, the data is fetched from vectorized tables as standard tuples. If it is determined that vector operations are required, vops
uses a post-parse ANALYZE
hook that replaces scalar boolean operations with vector boolean operations:
Original Expression | Transformed Expression |
---|---|
NOT filter(o1) | filter(vops_bool_not(o1)) |
filter(o1) AND filter(o2) | filter(vops_bool_and(o1, o2)) |
filter(o1) OR filter(o2) | filter(vops_bool_or(o1, o2)) |
There is only one difference between standard SQL and its vectorized extension. You still have to perform explicit type cast for string literals. For example, l_shipdate <= '1998-12-01'
will not work for the l_shipdate
column with vector type. Postgres Pro Enterprise has two overloaded versions of the <= operator to choose the appropriate scalar or vector type:
vops_date
<=vops_date
vops_date
<=date
The example below shows creation of a vectorized table via postgres_fdw
and some queries on it:
CREATE FOREIGN TABLE lineitem_fdw ( l_suppkey int4 not null, l_orderkey int4 not null, l_partkey int4 not null, l_shipdate date not null, l_quantity float8 not null, l_extendedprice float8 not null, l_discount float8 not null, l_tax float8 not null, l_returnflag "char" not null, l_linestatus "char" not null ) SERVER vops_server OPTIONS (table_name 'vops_lineitem'); EXPLAIN SELECT sum(l_extendedprice*l_discount) AS revenue FROM lineitem_fdw WHERE l_shipdate BETWEEN '1996-01-01' AND '1997-01-01' AND l_discount BETWEEN 0.08 AND 0.1 AND l_quantity < 24; QUERY PLAN --------------------------------------------------------- Foreign Scan (cost=1903.26..1664020.23 rows=1 width=4) (1 row) -- Filter was pushed down to FDW EXPLAIN SELECT n_name, count(*), sum(l_extendedprice * (1-l_discount)) AS revenue FROM customer_fdw JOIN orders_fdw ON c_custkey = o_custkey JOIN lineitem_fdw ON l_orderkey = o_orderkey JOIN supplier_fdw ON l_suppkey = s_suppkey JOIN nation ON c_nationkey = n_nationkey JOIN region ON n_regionkey = r_regionkey WHERE c_nationkey = s_nationkey AND r_name = 'ASIA' AND o_orderdate >= '1996-01-01' AND o_orderdate < '1997-01-01' GROUP BY n_name ORDER BY revenue desc; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Sort (cost=2337312.28..2337312.78 rows=200 width=48) Sort Key: (sum((lineitem_fdw.l_extendedprice * ('1'::double precision - lineitem_fdw.l_discount)))) DESC -> GroupAggregate (cost=2336881.54..2337304.64 rows=200 width=48) Group Key: nation.n_name -> Sort (cost=2336881.54..2336951.73 rows=28073 width=40) Sort Key: nation.n_name -> Hash Join (cost=396050.65..2334807.39 rows=28073 width=40) Hash Cond: ((orders_fdw.o_custkey = customer_fdw.c_custkey) AND (nation.n_nationkey = customer_fdw.c_nationkey)) -> Hash Join (cost=335084.53..2247223.46 rows=701672 width=52) Hash Cond: (lineitem_fdw.l_orderkey = orders_fdw.o_orderkey) -> Hash Join (cost=2887.07..1786058.18 rows=4607421 width=52) Hash Cond: (lineitem_fdw.l_suppkey = supplier_fdw.s_suppkey) -> Foreign Scan on lineitem_fdw (cost=0.00..1512151.52 rows=59986176 width=16) -> Hash (cost=2790.80..2790.80 rows=7702 width=44) -> Hash Join (cost=40.97..2790.80 rows=7702 width=44) Hash Cond: (supplier_fdw.s_nationkey = nation.n_nationkey) -> Foreign Scan on supplier_fdw (cost=0.00..2174.64 rows=100032 width=8) -> Hash (cost=40.79..40.79 rows=15 width=36) -> Hash Join (cost=20.05..40.79 rows=15 width=36) Hash Cond: (nation.n_regionkey = region.r_regionkey) -> Seq Scan on nation (cost=0.00..17.70 rows=770 width=40) -> Hash (cost=20.00..20.00 rows=4 width=4) -> Seq Scan on region (cost=0.00..20.00 rows=4 width=4) Filter: ((r_name)::text = 'ASIA'::text) -> Hash (cost=294718.76..294718.76 rows=2284376 width=8) -> Foreign Scan on orders_fdw (cost=0.00..294718.76 rows=2284376 width=8) -> Hash (cost=32605.64..32605.64 rows=1500032 width=8) -> Foreign Scan on customer_fdw (cost=0.00..32605.64 rows=1500032 width=8) -- filter on orders range is pushed to FDW
F.85.5.4. Building Indexes for Vectorized Tables #
Analytic queries are usually performed on the data for which no indexes are defined. Vector operations on tiles are most effective in this case. However, you can still use indexes with vectorized data.
Since each tile represents multiple values, an index can be used only for some preliminary, non-precise filtering of data. The vops
extension provides the following pairs of functions to obtain boundary values stored in the tile:
first()
andlast()
functions should be used for sorted dataset. In this case, the first and the last values in the tile are the smallest and the largest values in the tile, respectively.high()
andlow()
functions should be used for unsorted data. These functions are more expensive because they need to inspect all tile values.
Using the returned values, you can construct functional indexes for vectorized data. The BRIN index seems to be the best choice:
CREATE INDEX low_boundary ON trades USING brin(first(day)); -- trades table is ordered by day CREATE INDEX high_boundary ON trades USING brin(last(day)); -- trades table is ordered by day
Now you can use the created indexes in queries. However, you have to recheck the precise condition because such an index only gives an approximate result:
SELECT sum(price) FROM trades WHERE first(day) >= '2015-01-01' AND last(day) <= '2016-01-01' AND filter(betwixt(day, '2015-01-01', '2016-01-01'));
F.85.5.5. Automating Projection Usage #
vops
can automate most of the operations required to create and maintain vectorized data projections. In this case, you do not have to use postgres_fdw: vops
can redirect regular SQL queries to projection tables automatically. However, these queries must satisfy the following criteria to get redirected:
The query does not contain joins.
The query performs aggregation of vector columns.
All other expressions in the target list, as well as
ORDER BY
andGROUP BY
clauses refer only to scalar columns.
To enable automatic query redirection to a vectorized projection of your data, do the following:
Create a vectorized projection of the table using the create_projection() function.
For example, to create a projection for the
lineitem
table shown above, similar to thevops_lineitem
, run:SELECT create_projection('auto_vops_lineitem','lineitem', array['l_shipdate','l_quantity','l_extendedprice','l_discount','l_tax'], array['l_returnflag','l_linestatus']);
In this example, the first two arguments specify the projection table and the original table, respectively, the first array defines all the vector columns to be included into the projection, and the second array specifies the scalar columns that will be used for grouping vectorized data into tiles.
Additionally, you can provide an optional
order_by
parameter to speed up subsequent projection refresh operations. Theorder_by
values should be unique. For example, to sort all table entries byl_shipdate
, you can create a projection table as follows:SELECT create_projection('auto_vops_lineitem_ordered','lineitem', array['l_shipdate','l_quantity','l_extendedprice','l_discount','l_tax'], array['l_returnflag','l_linestatus'], 'l_shipdate');
Set the vops.auto_substitute_projections parameter to
on
.Now all queries that
vops
can redirect to theauto_vops_lineitem
projection table will be run on the vectorized data.
Projections are not updated automatically, so once the original table gets updated, implicitly redirected queries can start returning inaccurate results. To avoid this, you have to refresh the projections before running queries on the vectorized data. For this purpose, vops
provides the
functions, where the pname
_refresh()pname
prefix is the name of the projection you need to refresh. For example, to refresh the auto_vops_lineitem
projection, run:
SELECT auto_vops_lineitem_refresh();
This function calls the populate()
method to update all the fields present in the projection. If you have specified the optional order_by
parameter when creating the projection, only the new data will be imported from the original table, selecting only the rows with the order_by
column value greater than the maximal value of this column in the projection. Otherwise, the whole table will be re-written. The populate()
method also takes the grouping columns into account to efficiently group the imported data in tiles.
If you have specified the order_by
attribute when creating a projection, vops
also creates the following indexes:
Two functional BRIN indexes on
first()
andlast()
functions of this attribute. Such indexes allow to efficiently select time slices. For example, if the original query contains predicates like(l_shipdate between '01-01-2017' and '01-01-2018')
,vops
adds the(first(l_shipdate) >= '01-01-2017' and last(l_shipdate) >= '01-01-2018')
conjuncts when redirecting the query, so that Postgres Pro optimizer can quickly locate the affected pages using BRIN indexes.A BRIN index for scalar columns used for grouping. Such index allows to efficiently select groups and perform index joins.
F.85.6. Reference #
F.85.6.1. Vector Types #
The vops
extension supports all basic Postgres Pro numeric types: 1-, 2-, 4-, 8-byte integers, as well as 4- and 8-byte floats. The date
and timestamp
types are using the same implementation as int4
and int8
, respectively. The table below illustrates the available vector types, as compared to the corresponding SQL and C types.
SQL Type | C Type | vops Vector Type |
---|---|---|
bool | bool | vops_bool |
"char" | char | vops_char |
date | DateADT | vops_date |
int2 | int16 | vops_int2 |
int4 | int32 | vops_int4 |
int8 | int64 | vops_int8 |
float4 | float4 | vops_float4 |
float8 | float8 | vops_float8 |
timestamp | Timestamp | vops_timestamp |
char(N ), varchar(N ) | text | vops_text(N ) |
Note
For performance reasons, using vops_text
for single-character strings is not recommended. If you are using strings as identifiers, you can place them in a dictionary and use integer identifiers instead of the original strings.
F.85.6.2. Vector Operators #
The vops
extension provides a set of operators for vector types.
F.85.6.2.1. Mathematical Operators #
For mathematical operations, vops
overloads regular SQL operators, so they can take operands of both vector and scalar types.
Operator | Description |
---|---|
+ | Addition |
- | Binary subtraction or unary negation |
* | Multiplication |
/ | Division |
|| | Concatenation. This operator is only supported for char , int2 , and int4 types. It returns an integer type of a doubled size: (char || char) -> int2 (int2 || int2) -> int4 (int4 || int4) -> int8 |
F.85.6.2.2. Comparison Operators #
For comparison operations, vops
overloads regular SQL operators, so they can take operands of both vector and scalar types.
Operator | Description |
---|---|
= | Equals |
<> | Not equals |
< | Less than |
<= | Less than or equals |
> | Greater than |
>= | Greater than or equals |
F.85.6.2.3. Boolean Operators #
Boolean operators AND
, OR
, and NOT
cannot be overloaded, so vops
provides its own implementation of boolean operators for vector types:
Operator | Description |
---|---|
& | Boolean AND |
| | Boolean OR |
! | Boolean NOT |
Vector boolean operators have the following usage specifics:
The precedence of vector boolean operators is different from
AND
,OR
, andNOT
, so you have to enclose the operands into parentheses. For example:(x=1) | (x=2)
The result of vector boolean operations is
vops_bool
. Since Postgres Pro requires predicate expressions to have a regular boolean type, you have to convert the result to thebool
type using thefilter()
function. Taking a vector boolean expression, this function calculates afilter_mask
value that defines the result of this boolean operation for each vector element. If all mask bits are zero (the predicate is false for all vector elements), thenfilter()
returnsfalse
, and Postgres Pro executor skips this record. Otherwise,true
is returned, so vector operators will check thefilter_mask
value for the selected vector elements in subsequent operations.When using vector boolean operators, you have to use explicit cast of string constants to the required data type.
F.85.6.3. Comparison Functions #
Function | Description |
---|---|
bitwixt(x,low,high) | Same as the x BETWEEN low AND high predicate. |
is_null(x) | Same as the expression IS NULL predicate. |
is_not_null(x) | Same as the expression IS NOT NULL predicate. |
ifnull(x,subst) | Same as the COALESCE function. |
F.85.6.4. Conversion Functions #
-
create_projection(
#projection_name
text
,source_table
regclass
,vector_columns
text[]
,scalar_columns
text[]
DEFAULT null,order_by
text
DEFAULT null) Creates a projection table with specified name and attributes and updates the
vops_projections
table to include the information about this projection, so that the optimizer can automatically redirect queries from the original table to its projection. It also creates the pname_refresh() function that can update the projection.Arguments:
projection_name
— the name of the projection table.source_table
— the original Postgres Pro table for which to create a vectorized projection.vector_columns
— an array of column names that will be stored asvops
tiles.scalar_columns
— an array of column names that will preserve its original scalar type. These columns will be used to group table rows intovops
tiles.vops
automatically builds a BRIN index over such columns.order_by
— an optional vector column that can be used to sort data in the projection table. Sorting can be useful to incrementally update the projection after the original table updates using the pname_refresh() function. This column should contain unique timestamp values, such as trade dates.vops
automatically builds a functional BRIN index forfirst()
andlast()
functions of this column.
Examples:
SELECT create_projection('vops_lineitem','lineitem',array['l_shipdate','l_quantity','l_extendedprice','l_discount','l_tax'],array['l_returnflag','l_linestatus']);
-
#pname
_refresh() Refreshes the
pname
projection table.Examples:
Refresh the
vops_lineitem
projection:SELECT vops_lineitem_refresh();
-
drop_projection(
#projection_name
text
) Drops the specified projection and the corresponding refresh function, as well as removes the information about this projection from the
vops_projections
table.Examples:
Drop the
vops_lineitem
projection:SELECT drop_projection(vops_lineitem);
-
populate(
#destination
regclass
,source
regclass
,predicate
cstring
DEFAULT null,sort
cstring
DEFAULT null) Copies the data from an existing table to its vectorized projection and returns the number of imported rows (
bigint
).Arguments:
destination
— target table to copy the data into.Type:
regclass
source
— source table to copy the data from.Type:
regclass
predicate
— restricts the amount of imported data (optional). Using this argument, you can upload only the most recent records.Type:
cstring
sort
— sorts the source data to define the order in which the data needs to be loaded (optional).Type:
cstring
Examples:
SELECT populate(destination := 'vops_lineitem'::regclass, source := 'lineitem'::regclass);
SELECT populate(destination := 'vops_lineitem_projection'::regclass, source := 'lineitem_projection'::regclass, sort := 'l_returnflag,l_linestatus');
-
import(
#destination
regclass
,csv_path
cstring
,separator
cstring
DEFAULT ','skip
integer
DEFAULT 0) Loads the data into a
vops
table directly from a CSV file, bypassing creation of a regular table, and returns the number of imported rows (bigint
). Use this function to avoid storing two copies of the same dataset.Arguments:
destination
— target table to copy the data into.Type:
regclass
csv_path
— the path to the CSV file to copy the data from.Type:
cstring
separator
— specifies the field separator used in the CSV file (optional). By default, the comma is assumed to be the field separator.Type:
cstring
skip
— specifies the number of rows in the CSV header that do not need to be imported (optional).Default: 0
Type:
cstring
Examples:
SELECT import(destination := 'vops_lineitem'::regclass, csv_path := '/mnt/data/lineitem.csv', separator := '|');
-
#vops_unnest
(anyelement
) Scatters records with
vops
types into records with scalar types.
F.85.7. Authors #
Postgres Professional, Moscow, Russia