F.82. vops — support for vector operations #

The vops module is an experimental Postgres Pro Enterprise extension that provides support for vector operations, which allows to speed up OLAP queries with filtering and aggregation more than 10 times. Performance improvements are achieved by implementing a vertical data model that allows storing data in tiles, or groups of column values. Such a storage format is also called Parquet in some analytical databases. Reducing the overhead of unpacking tuples, vertical data model allows to speed up query execution without any radical changes in Postgres Pro planner and executor.

vops provides its own set of vector functions that can be used inside predicates and aggregate expressions. To avoid learning new syntax, you can use postgres_fdw as an abstraction layer and run regular SQL queries on vectorized data. For query types other than filtering and aggregation, vops can also use postgres_fdw to present vectorized data as a regular table with scalar column types and process this data as regular tuples. For ease of use, vops also provides an automated way of generating and accessing vectorized data, as explained in Section F.82.5.5.

F.82.1. Limitations #

  • Tile-based tables only allow INSERT and APPEND operations. Once imported, vectorized data cannot be updated.

  • Vector operations are only available for filtering and aggregation. Other query types are supported via postgres_fdw.

  • Aggregate expressions must be of the same type.

  • JOIN operations are not supported for vector types. You have to use postgres_fdw to transform vectorized data back to regular tuples to run such queries.

  • Since postgres_fdw does not support parallel query execution, you can only run parallel queries when using vector operators directly.

F.82.2. Installation and Setup #

The vops extension is included into Postgres Pro Enterprise. Once you have Postgres Pro Enterprise installed, complete the following steps to enable vops:

  1. Add vops to the shared_preload_libraries parameter in the postgresql.conf file:

    shared_preload_libraries = 'vops'
    

    Note

    If the extension library is not added to the shared_preload_libraries list, it will be loaded on demand after the first function invocation of this extension. In this case, the first time you execute a vectorized query, it will not be processed correctly and can return a wrong result.

  2. Restart the Postgres Pro Enterprise instance for the changes to take effect.

  3. Create vops extension:

    CREATE EXTENSION vops;
    

Once vops is enabled, you can create vectorized tables and query data using vector operations in your database.

If you would like to use regular SQL syntax for vector types, you must also enable postgres_fdw.

F.82.3. Motivation #

For OLTP workloads, Postgres Pro looks very competitive as compared to other mainstream databases. However, for OLAP queries, which require processing larger volumes of data, analytics-oriented database management systems can provide a much better speed. There are several main factors that limit Postgres Pro performance:

  • Tuple unpacking overhead. To access column values, Postgres Pro needs to unpack the tuple. As tables can have columns of variable-length data types, to extract the N-th column, the preceding N-1 columns need to be unpacked. Thus, deforming a tuple is quite an expensive operation, especially for tables with a large number of attributes. Besides, values can be compressed or stored in another page (TOAST). In queries like Query 6 of the TPC-H benchmark, deforming tuples takes about 40% of the total query execution time.

  • Interpretation overhead. Postgres Pro planner and optimizer build a tree representing the query execution plan. Query executor recursively calls functions that evaluate the nodes of this tree. Some nodes may also contain switches used to select the requested action. Thus, query plan is interpreted by Postgres Pro query executor rather than directly executed. The interpreter is usually about 10 times slower than the native code. If you eliminate interpretation overhead, you can increase query speed several times, especially for queries with complex predicates that have high evaluation costs.

  • Abstraction penalty. Support for abstract (user-defined) types and operations is one of the key features of Postgres Pro. However, the price of such flexibility is that each operation requires a separate function call. Instead of adding two integers directly, Postgres Pro executor invokes a function that performs addition. In this case, function call overhead is much higher than the cost the operation itself. Function call overhead is also increased because Postgres Pro requires passing parameter values through memory.

  • Pull model overhead. In Postgres Pro, operators pull the operand values. This approach simplifies the implementation of operators and the executor, but has a negative impact on performance. In this model, leaf nodes that fetch a tuple from the heap or index pages have to do a lot of extra work while saving and restoring their context.

  • MVCC overhead. Postgres Pro provides multi-version concurrency control, which allows multiple transactions to work with the same record in parallel without blocking each other. It is good for frequently updated data (OLTP), but for read-only or append-only data in OLAP scenarios it only adds both space overhead (about 20 extra bytes per tuple) and CPU overhead (checking visibility of each tuple).

Vector operations can address most of these issues without radical changes in the Postgres Pro executor, as explained in Section F.82.4.

F.82.4. Architecture #

The vops extension implements a vertical data model for Postgres Pro. In this model, vertical columns are used as a data storing unit, with the values of the corresponding table attribute forming a vector.

The vertical data model used by vops has the following advantages:

  • Reduces the size of fetched data: only the columns used in the query need to be fetched.

  • Achieves higher compression rates: storing all values of the same attribute together makes it possible to compress them much better and faster. For example, you can use delta encoding.

  • Minimizes interpretation overhead: a set of values can be processed by a single operation instead of running separate operations for each value.

  • Uses CPU vector instructions (SIMD) to process data.

A traditional Postgres Pro query executor deals with a single row of data at each moment of time. For example, to evaluate an expression (x+y), it first fetches the value of x, then fetches the value of y, performs the addition, and returns the result value to the upper node. In contrast, a vectorized executor can process multiple values within a single operation. In this case, x and y represent vectors of values instead of single scalars. The returned result is also a vector. In vector execution model, interpretation and function call overhead is divided by the vector size. The price of performing a function call is the same, but as the function processes N values instead of a single one, this overhead becomes less critical. The larger the vector, the smaller the per-row overhead.

However, performing an operation on the whole column may also turn out to be ineffective. Working with large vectors prevents efficient utilization of CPU cache levels. If the table is very large, the vector may not fit in memory at all. To avoid these issues, vops splits the columns into relatively small chunks, or tiles. The size of a tile is currently set to 64 elements. It allows to keep all operands of vector operations in cache, even for complex expressions.

The vops extension implements special vector types to be used instead of scalar types for table columns. The available types are listed in Section F.82.6.1. To use vops, you have to create vectorized projections of original tables, with at least some attributes using these tile types, as explained in Section F.82.5.1.

The original table can be treated as a write-optimized storage: if it has no indexes, Postgres Pro can provide a very fast insertion speed, comparable to raw disk write speed. Vectorized projection can be treated as a read-optimized storage, which is most efficient for running OLAP queries.

Once all the data is in the tile-based format, you can run vectorized queries on this data. vops provides a set of vector operators to work with vector types. Using these operators, you can write queries for filtering and aggregation, similar to SQL queries. For details, see Section F.82.5.2.

For other query types, vops allows to work with vectorized data via postgres_fdw. You can add vectorized data to your database cluster as a foreign table and use regular SQL for running any types of queries on this data. Using the implicit type casts and post-parse ANALYZE hook, the vops extension either transforms the query to use vector operators for filtering and aggregation, or processes the data using scalar operators for other query types. For details, see Section F.82.5.3.

F.82.5. Usage #

F.82.5.1. Converting Data into Vectorized Format #

To start using vector operations, you have to convert your data into a vectorized format, as explained below.

  1. Create an empty vectorized table.

    To use vector operations, you need to load data into a tile-based vectorized table, which can be treated as a projection of the original table. It can map all columns of the original table, or include some of the most frequently used columns only. To create a vectorized table, you can use the regular CREATE TABLE syntax, but at least some of the columns must be of vector types.

    For example, suppose you have the following table from the TPC-H benchmark:

    CREATE TABLE lineitem(
       l_orderkey integer,
       l_partkey integer,
       l_suppkey integer,
       l_linenumber integer,
       l_quantity double precision,
       l_extendedprice double precision,
       l_discount double precision,
       l_tax double precision,
       l_returnflag "char",
       l_linestatus "char",
       l_shipdate date,
       l_commitdate date,
       l_receiptdate date,
       l_shipinstruct char(25),
       l_shipmode char(10),
       l_comment char(44));
    

    To create a vectorized projection that includes only a subset of the original columns, you can run:

    CREATE TABLE vops_lineitem(
       l_shipdate vops_date not null,
       l_quantity vops_float8 not null,
       l_extendedprice vops_float8 not null,
       l_discount vops_float8 not null,
       l_tax vops_float8 not null,
       l_returnflag vops_char not null,
       l_linestatus vops_char not null
    );
    

    Alternatively, you can keep the scalar type for some of the columns. In this case, you can sort the data by these columns when loading the data. For example, let's create another vectorized projection where l_returnflag and l_linestatus fields remain scalar:

    CREATE TABLE vops_lineitem_projection(
       l_shipdate vops_date not null,
       l_quantity vops_float8 not null,
       l_extendedprice vops_float8 not null,
       l_discount vops_float8 not null,
       l_tax vops_float8 not null,
       l_returnflag "char" not null,
       l_linestatus "char" not null
    );
    

    In this table, l_returnflag and l_linestatus fields are scalar, while all the other fields are of the vector types that can be used in vector operations.

  2. Load the data into the vectorized table.

    • If the data to load is already stored in your database, use the populate() function to preprocess the data and unite attribute values of several rows inside a single tile. This function loads the data from the original table into its vectorized projection, using vector types instead of scalar types, as specified in the created empty table.

      For example:

      SELECT populate(destination := 'vops_lineitem'::regclass, source := 'lineitem'::regclass);
      

      Once the table is populated, you can start running queries performing sequential scan.

      Optionally, you can sort the data by one or more scalar columns when loading the data:

      SELECT populate(destination := 'vops_lineitem_projection'::regclass, source := 'lineitem'::regclass, sort := 'l_returnflag,l_linestatus');
      

      If you sort the original table by the scalar columns that have a lot of duplicates, vops can effectively collapse them and store the corresponding attribute values in tiles, which can reduce the occupied space, speed up queries, and facilitate index creation on these scalar columns.

    • If your data is not yet loaded into the database and you would like to avoid having two copies of the same dataset, you can import the data into a vectorized table directly from a CSV file using the import() function, bypassing creation of a regular table. For example:

      SELECT import(destination := 'vops_lineitem'::regclass, csv_path := '/mnt/data/lineitem.csv', separator := '|');
      

For detailed description and syntax of the populate() and import() functions, see Section F.82.6.4.

Once your data is in the tile-based format, you can run queries on this data using vector operators, as described in Section F.82.5.2. If you create a foreign vectorized table, you can also run regular SQL queries. See Section F.82.5.3 for details.

F.82.5.2. Running Queries on Vectorized Data #

vops implements its own vector functions and operators for working with vectorized data. See Section F.82.6 for the full list of vector functions and operators and their usage specifics.

F.82.5.2.1. Aggregating Vectorized Data #

OLAP queries often aggregate large volumes of data. To enable aggregate operations for vector types, vops implements the following aggregate functions for vectorized data: count, min, max, avg, var_pop, var_sampl, variance, stddev_pop, stddev_samp, stddev. Besides, vops provides the wavg aggregate function that calculates volume-weighted average price (VWAP).

The vops extension supports the following types of aggregates:

  • grand aggregates calculated for the whole table

  • aggregates for subsets of table rows defined by the GROUP BY clause

F.82.5.2.1.1. Calculating Grand Aggregates #

To calculate grand aggregates, you can use the provided vector aggregate functions exactly like in regular SQL queries.

Examples:

To calculate volume-weighted average price, run the following query:

SELECT wavg(l_extendedprice,l_quantity) FROM vops_lineitem;

To calculate the company revenue for the year 1996 based on discounted prices of the shipped goods, run:

SELECT sum(l_extendedprice*l_discount) AS revenue
  FROM vops_lineitem
  WHERE filter(betwixt(l_shipdate, '1996-01-01', '1997-01-01')
        & betwixt(l_discount, 0.08, 0.1)
        & (l_quantity < 24));
F.82.5.2.1.2. Calculating Grouped Aggregates #

To support grouped aggregates, vops provides map() and reduce() functions:

  • The map() function collects aggregate states for all groups in a hash table. Its syntax is as follows:

    map(group_by_expression,
          aggregate_list, expr {,
          expr })
    

  • The reduce() function iterates through the hash table constructed by map and returns the set. The reduce function is needed because the result of aggregation in Postgres Pro cannot be a set.

Consider the following example:

SELECT reduce(map(l_returnflag||l_linestatus, 'sum,sum,sum,sum,avg,avg,avg',
    l_quantity,
    l_extendedprice,
    l_extendedprice*(1-l_discount),
    l_extendedprice*(1-l_discount)*(1+l_tax),
    l_quantity,
    l_extendedprice,
    l_discount))
FROM vops_lineitem
WHERE filter(l_shipdate <= '1998-12-01'::date);

The concatenation operator || performs grouping by two columns. The vops extension supports grouping only by an integer type. The map function accepts aggregation arguments as a variadic array, so all elements of this array should have the same type. For example, you cannot calculate aggregates for vops_float4 and vops_int8 columns in a single operation.

An aggregate string in map function should contain the list of aggregate functions to call, in a comma-separated format. Standard lowercase names should be used: count, sum, agg, min, max. Count is executed for the particular column: count(x). There is no need to explicitly specify count(*) because the number of records in each group is returned by reduce function in any case.

reduce function returns set of the vops_aggregate type. It contains three components: the value of the GROUP BY expression, the number of records in the group, and an array of floats with aggregate values. The values of all aggregates, including count and min/max, are returned as floats.

Alternatively, you can partition the vectorized table by the GROUP BY fields to calculate grouped aggregates. In this case, grouping keys are stored as regular scalar values, while other fields are stored inside tiles. Now Postgres Pro executor will execute vops aggregates for each group:

SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice*(1-l_discount)) AS sum_disc_price,
    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM
    vops_lineitem_projection
WHERE
    filter(l_shipdate <= '1998-12-01'::date)
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;

In this example, l_returnflag and l_linestatus fields of the vops_lineitem_projection table have the char type, while all the other used fields are of vector types. The query above is executed even faster than query with the map() and reduce() functions. The main problem with this approach is that you have to create a separate projection for each combination of the GROUP BY keys to be used in queries.

F.82.5.2.2. Using Vector Window Functions #

The vops extension provides limited support for Postgres Pro window functions. It provides mcount, msum, mmin, mmax, mavg, and lag functions that correspond to the standard Postgres Pro window functions count, sum, min, max, avg, and lag functions, respectively.

Note the following important restrictions:

  1. Filtering, grouping, and sorting can be done only by scalar attributes.

  2. Vector window functions only support window frames specified with ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Examples:

SELECT vops_unnest(t.*)
  FROM (SELECT mcount(*) over w,mcount(x) over w,msum(x) over w,mavg(x) over w,mmin(x) over w,mmax(x) over w,x - lag(x) over w 
  FROM v window w
  AS (rows between unbounded preceding and current row)) t;
F.82.5.2.3. Converting Results to Horizontal Format #

A query from the vops projection returns a set of tiles. The output function of a tile type can print the tile contents. However, if you want to convert the result to the horizontal format where each tuple represents a single record, use the vops-unnest function. For example:

SELECT vops_unnest(l.*)
  FROM vops_lineitem l
  WHERE filter(l_shipdate >= '1998-12-01'::date);

F.82.5.3. Using Standard SQL for Vector Operations #

Vector operators can efficiently execute only filtering and aggregation queries. For other query types, you have to run regular SQL queries. The vops extension offers the following options:

  • Use the original table, if any.

  • Enable a foreign data wrapper (FDW) using postgres_fdw to present a vectorized table to Postgres Pro as a regular table with scalar column types.

If you use postgres_fdw with the vops extension, vectorized data is not really foreign: it is stored inside the database in an alternative vectorized format. FDW allows to hide specifics of the vectorized format and provides the following opportunities for working with vectorized data:

  1. Extract the data from a vectorized table and present it in the horizontal format, so that it can be processed by upper nodes in the query execution plan.

  2. Push filtering and aggregation queries down to vector operations.

  3. Get statistic for the underlying table by running ANALYZE for the foreign table. The collected statistic can be used by the query optimizer, so query execution plan should be almost the same as for regular tables.

Thus, by placing a vectorized projection under FDW, you can use standard SQL without any vops-specific functions to perform sequential scan and aggregation queries as if they are explicitly written for the vops table, and execute any other queries on this data at the same time, including JOIN operations, common table expressions (CTE), etc. The queries that can be efficiently executed by vector operations will be pushed by Postgres Pro query optimizer to vops FDW and will be executed using vector operators. For other queries, the data is fetched from vectorized tables as standard tuples. If it is determined that vector operations are required, vops uses a post-parse ANALYZE hook that replaces scalar boolean operations with vector boolean operations:

Original Expression Transformed Expression
NOT filter(o1)filter(vops_bool_not(o1))
filter(o1) AND filter(o2)filter(vops_bool_and(o1, o2))
filter(o1) OR filter(o2)filter(vops_bool_or(o1, o2))

There is only one difference between standard SQL and its vectorized extension. You still have to perform explicit type cast for string literals. For example, l_shipdate <= '1998-12-01' will not work for the l_shipdate column with vector type. Postgres Pro Enterprise has two overloaded versions of the <= operator to choose the appropriate scalar or vector type:

  • vops_date <= vops_date

  • vops_date <= date

The example below shows creation of a vectorized table via postgres_fdw and some queries on it:

CREATE FOREIGN TABLE lineitem_fdw (
   l_suppkey int4 not null,
   l_orderkey int4 not null,
   l_partkey int4 not null,
   l_shipdate date not null,
   l_quantity float8 not null,
   l_extendedprice float8 not null,
   l_discount float8 not null,
   l_tax      float8 not null,
   l_returnflag "char" not null,
   l_linestatus "char" not null
) SERVER vops_server OPTIONS (table_name 'vops_lineitem');

EXPLAIN SELECT
   sum(l_extendedprice*l_discount) AS revenue
FROM
   lineitem_fdw
WHERE
   l_shipdate BETWEEN '1996-01-01' AND '1997-01-01'
   AND l_discount BETWEEN 0.08 AND 0.1
   AND l_quantity < 24;
                       QUERY PLAN                        
---------------------------------------------------------
 Foreign Scan  (cost=1903.26..1664020.23 rows=1 width=4)
(1 row)

-- Filter was pushed down to FDW

EXPLAIN SELECT
    n_name,
    count(*),
    sum(l_extendedprice * (1-l_discount)) AS revenue
FROM
    customer_fdw JOIN orders_fdw ON c_custkey = o_custkey
    JOIN lineitem_fdw ON l_orderkey = o_orderkey
    JOIN supplier_fdw ON l_suppkey = s_suppkey
    JOIN nation ON c_nationkey = n_nationkey
    JOIN region ON n_regionkey = r_regionkey
WHERE
    c_nationkey = s_nationkey
    AND r_name = 'ASIA'
    AND o_orderdate >= '1996-01-01'
    AND o_orderdate < '1997-01-01'
GROUP BY
    n_name
ORDER BY
    revenue desc;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2337312.28..2337312.78 rows=200 width=48)
   Sort Key: (sum((lineitem_fdw.l_extendedprice * ('1'::double precision - lineitem_fdw.l_discount)))) DESC
   ->  GroupAggregate  (cost=2336881.54..2337304.64 rows=200 width=48)
         Group Key: nation.n_name
         ->  Sort  (cost=2336881.54..2336951.73 rows=28073 width=40)
               Sort Key: nation.n_name
               ->  Hash Join  (cost=396050.65..2334807.39 rows=28073 width=40)
                     Hash Cond: ((orders_fdw.o_custkey = customer_fdw.c_custkey) AND (nation.n_nationkey = customer_fdw.c_nationkey))
                     ->  Hash Join  (cost=335084.53..2247223.46 rows=701672 width=52)
                           Hash Cond: (lineitem_fdw.l_orderkey = orders_fdw.o_orderkey)
                           ->  Hash Join  (cost=2887.07..1786058.18 rows=4607421 width=52)
                                 Hash Cond: (lineitem_fdw.l_suppkey = supplier_fdw.s_suppkey)
                                 ->  Foreign Scan on lineitem_fdw  (cost=0.00..1512151.52 rows=59986176 width=16)
                                 ->  Hash  (cost=2790.80..2790.80 rows=7702 width=44)
                                       ->  Hash Join  (cost=40.97..2790.80 rows=7702 width=44)
                                             Hash Cond: (supplier_fdw.s_nationkey = nation.n_nationkey)
                                             ->  Foreign Scan on supplier_fdw  (cost=0.00..2174.64 rows=100032 width=8)
                                             ->  Hash  (cost=40.79..40.79 rows=15 width=36)
                                                   ->  Hash Join  (cost=20.05..40.79 rows=15 width=36)
                                                         Hash Cond: (nation.n_regionkey = region.r_regionkey)
                                                         ->  Seq Scan on nation  (cost=0.00..17.70 rows=770 width=40)
                                                         ->  Hash  (cost=20.00..20.00 rows=4 width=4)
                                                               ->  Seq Scan on region  (cost=0.00..20.00 rows=4 width=4)
                                                                     Filter: ((r_name)::text = 'ASIA'::text)
                           ->  Hash  (cost=294718.76..294718.76 rows=2284376 width=8)
                                 ->  Foreign Scan on orders_fdw  (cost=0.00..294718.76 rows=2284376 width=8)
                     ->  Hash  (cost=32605.64..32605.64 rows=1500032 width=8)
                           ->  Foreign Scan on customer_fdw  (cost=0.00..32605.64 rows=1500032 width=8)

-- filter on orders range is pushed to FDW

F.82.5.4. Building Indexes for Vectorized Tables #

Analytic queries are usually performed on the data for which no indexes are defined. Vector operations on tiles are most effective in this case. However, you can still use indexes with vectorized data.

Since each tile represents multiple values, an index can be used only for some preliminary, non-precise filtering of data. The vops extension provides the following pairs of functions to obtain boundary values stored in the tile:

  • first() and last() functions should be used for sorted dataset. In this case, the first and the last values in the tile are the smallest and the largest values in the tile, respectively.

  • high() and low() functions should be used for unsorted data. These functions are more expensive because they need to inspect all tile values.

Using the returned values, you can construct functional indexes for vectorized data. The BRIN index seems to be the best choice:

CREATE INDEX low_boundary ON trades
USING brin(first(day)); -- trades table is ordered by day
CREATE INDEX high_boundary ON trades
USING brin(last(day)); -- trades table is ordered by day

Now you can use the created indexes in queries. However, you have to recheck the precise condition because such an index only gives an approximate result:

SELECT sum(price) FROM trades
WHERE first(day) >= '2015-01-01' AND last(day) <= '2016-01-01'
AND filter(betwixt(day, '2015-01-01', '2016-01-01'));

F.82.5.5. Automating Projection Usage #

vops can automate most of the operations required to create and maintain vectorized data projections. In this case, you do not have to use postgres_fdw: vops can redirect regular SQL queries to projection tables automatically. However, these queries must satisfy the following criteria to get redirected:

  • The query does not contain joins.

  • The query performs aggregation of vector columns.

  • All other expressions in the target list, as well as ORDER BY and GROUP BY clauses refer only to scalar columns.

To enable automatic query redirection to a vectorized projection of your data, do the following:

  1. Create a vectorized projection of the table using the create_projection() function.

    For example, to create a projection for the lineitem table shown above, similar to the vops_lineitem, run:

    SELECT create_projection('auto_vops_lineitem','lineitem',
                             array['l_shipdate','l_quantity','l_extendedprice','l_discount','l_tax'],
                             array['l_returnflag','l_linestatus']);
    

    In this example, the first two arguments specify the projection table and the original table, respectively, the first array defines all the vector columns to be included into the projection, and the second array specifies the scalar columns that will be used for grouping vectorized data into tiles.

    Additionally, you can provide an optional order_by parameter to speed up subsequent projection refresh operations. The order_by values should be unique. For example, to sort all table entries by l_shipdate, you can create a projection table as follows:

    SELECT create_projection('auto_vops_lineitem_ordered','lineitem',
                             array['l_shipdate','l_quantity','l_extendedprice','l_discount','l_tax'],
                             array['l_returnflag','l_linestatus'],
                             'l_shipdate');
    

  2. Set the vops.auto_substitute_projections parameter to on.

    Now all queries that vops can redirect to the auto_vops_lineitem projection table will be run on the vectorized data.

Projections are not updated automatically, so once the original table gets updated, implicitly redirected queries can start returning inaccurate results. To avoid this, you have to refresh the projections before running queries on the vectorized data. For this purpose, vops provides the pname_refresh() functions, where the pname prefix is the name of the projection you need to refresh. For example, to refresh the auto_vops_lineitem projection, run:

SELECT auto_vops_lineitem_refresh();

This function calls the populate() method to update all the fields present in the projection. If you have specified the optional order_by parameter when creating the projection, only the new data will be imported from the original table, selecting only the rows with the order_by column value greater than the maximal value of this column in the projection. Otherwise, the whole table will be re-written. The populate() method also takes the grouping columns into account to efficiently group the imported data in tiles.

If you have specified the order_by attribute when creating a projection, vops also creates the following indexes:

  • Two functional BRIN indexes on first() and last() functions of this attribute. Such indexes allow to efficiently select time slices. For example, if the original query contains predicates like (l_shipdate between '01-01-2017' and '01-01-2018'), vops adds the (first(l_shipdate) >= '01-01-2017' and last(l_shipdate) >= '01-01-2018') conjuncts when redirecting the query, so that Postgres Pro optimizer can quickly locate the affected pages using BRIN indexes.

  • A BRIN index for scalar columns used for grouping. Such index allows to efficiently select groups and perform index joins.

F.82.6. Reference #

F.82.6.1. Vector Types #

The vops extension supports all basic Postgres Pro numeric types: 1-, 2-, 4-, 8-byte integers, as well as 4- and 8-byte floats. The date and timestamp types are using the same implementation as int4 and int8, respectively. The table below illustrates the available vector types, as compared to the corresponding SQL and C types.

SQL Type C Type vops Vector Type
bool bool vops_bool
"char" char vops_char
date DateADT vops_date
int2 int16 vops_int2
int4 int32 vops_int4
int8 int64 vops_int8
float4 float4 vops_float4
float8 float8 vops_float8
timestamp Timestamp vops_timestamp
char(N), varchar(N) text vops_text(N)

Note

For performance reasons, using vops_text for single-character strings is not recommended. If you are using strings as identifiers, you can place them in a dictionary and use integer identifiers instead of the original strings.

F.82.6.2. Vector Operators #

The vops extension provides a set of operators for vector types.

F.82.6.2.1. Mathematical Operators #

For mathematical operations, vops overloads regular SQL operators, so they can take operands of both vector and scalar types.

Operator Description
+ Addition
- Binary subtraction or unary negation
* Multiplication
/ Division
|| Concatenation. This operator is only supported for char, int2, and int4 types. It returns an integer type of a doubled size:
(char || char) -> int2
(int2 || int2) -> int4
(int4 || int4) -> int8
F.82.6.2.2. Comparison Operators #

For comparison operations, vops overloads regular SQL operators, so they can take operands of both vector and scalar types.

Operator Description
= Equals
<> Not equals
< Less than
<= Less than or equals
> Greater than
>= Greater than or equals
F.82.6.2.3. Boolean Operators #

Boolean operators AND, OR, and NOT cannot be overloaded, so vops provides its own implementation of boolean operators for vector types:

Operator Description
& Boolean AND
| Boolean OR
! Boolean NOT

Vector boolean operators have the following usage specifics:

  • The precedence of vector boolean operators is different from AND, OR, and NOT, so you have to enclose the operands into parentheses. For example:

    (x=1) | (x=2)
    

  • The result of vector boolean operations is vops_bool. Since Postgres Pro requires predicate expressions to have a regular boolean type, you have to convert the result to the bool type using the filter() function. Taking a vector boolean expression, this function calculates a filter_mask value that defines the result of this boolean operation for each vector element. If all mask bits are zero (the predicate is false for all vector elements), then filter() returns false, and Postgres Pro executor skips this record. Otherwise, true is returned, so vector operators will check the filter_mask value for the selected vector elements in subsequent operations.

  • When using vector boolean operators, you have to use explicit cast of string constants to the required data type.

F.82.6.3. Comparison Functions #

Function Description
bitwixt(x,low,high) Same as the x BETWEEN low AND high predicate.
is_null(x) Same as the expression IS NULL predicate.
is_not_null(x) Same as the expression IS NOT NULL predicate.
ifnull(x,subst) Same as the COALESCE function.

F.82.6.4. Conversion Functions #

create_projection(projection_name text, source_table regclass, vector_columns text[], scalar_columns text[] DEFAULT null, order_by text DEFAULT null) #

Creates a projection table with specified name and attributes and updates the vops_projections table to include the information about this projection, so that the optimizer can automatically redirect queries from the original table to its projection. It also creates the pname_refresh() function that can update the projection.

Arguments:

  • projection_name — the name of the projection table.

  • source_table — the original Postgres Pro table for which to create a vectorized projection.

  • vector_columns — an array of column names that will be stored as vops tiles.

  • scalar_columns — an array of column names that will preserve its original scalar type. These columns will be used to group table rows into vops tiles. vops automatically builds a BRIN index over such columns.

  • order_by — an optional vector column that can be used to sort data in the projection table. Sorting can be useful to incrementally update the projection after the original table updates using the pname_refresh() function. This column should contain unique timestamp values, such as trade dates. vops automatically builds a functional BRIN index for first() and last() functions of this column.

Examples:

SELECT create_projection('vops_lineitem','lineitem',array['l_shipdate','l_quantity','l_extendedprice','l_discount','l_tax'],array['l_returnflag','l_linestatus']);
pname_refresh() #

Refreshes the pname projection table.

Examples:

Refresh the vops_lineitem projection:

SELECT vops_lineitem_refresh();
drop_projection(projection_name text) #

Drops the specified projection and the corresponding refresh function, as well as removes the information about this projection from the vops_projections table.

Examples:

Drop the vops_lineitem projection:

SELECT drop_projection(vops_lineitem);
populate(destination regclass, source regclass, predicate cstring DEFAULT null, sort cstring DEFAULT null) #

Copies the data from an existing table to its vectorized projection and returns the number of imported rows (bigint).

Arguments:

  • destination — target table to copy the data into.

    Type: regclass

  • source — source table to copy the data from.

    Type: regclass

  • predicate — restricts the amount of imported data (optional). Using this argument, you can upload only the most recent records.

    Type: cstring

  • sort — sorts the source data to define the order in which the data needs to be loaded (optional).

    Type: cstring

Examples:

SELECT populate(destination := 'vops_lineitem'::regclass, source := 'lineitem'::regclass);
SELECT populate(destination := 'vops_lineitem_projection'::regclass, source := 'lineitem_projection'::regclass, sort := 'l_returnflag,l_linestatus');
import(destination regclass, csv_path cstring, separator cstring DEFAULT ',' skip integer DEFAULT 0) #

Loads the data into a vops table directly from a CSV file, bypassing creation of a regular table, and returns the number of imported rows (bigint). Use this function to avoid storing two copies of the same dataset.

Arguments:

  • destination — target table to copy the data into.

    Type: regclass

  • csv_path — the path to the CSV file to copy the data from.

    Type: cstring

  • separator — specifies the field separator used in the CSV file (optional). By default, the comma is assumed to be the field separator.

    Type: cstring

  • skip — specifies the number of rows in the CSV header that do not need to be imported (optional).

    Default: 0

    Type: cstring

Examples:

SELECT import(destination := 'vops_lineitem'::regclass, csv_path := '/mnt/data/lineitem.csv', separator := '|');
vops_unnest(anyelement) #

Scatters records with vops types into records with scalar types.

F.82.6.5. Configuration Parameters #

vops.auto_substitute_projections (boolean) #

Enables automatic query redirection to projection tables. For correct results, you have to ensure that the projection is synchronized with the original table.

Default: off

F.82.7. Authors #

Postgres Professional, Moscow, Russia