Re: Yet another vectorized engine - Mailing list pgsql-hackers

From Hubert Zhang
Subject Re: Yet another vectorized engine
Date
Msg-id CAB0yrenQ7miZf1zu7au-R0A+HdEEdtummpST+VKn5ZckgsWUrw@mail.gmail.com
Whole thread Raw
In response to Re: Yet another vectorized engine  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Responses Re: Yet another vectorized engine
List pgsql-hackers
Hi Konstantin,
Thanks for your reply.

On Fri, Nov 29, 2019 at 12:09 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
On 28.11.2019 12:23, Hubert Zhang wrote:
We just want to introduce another POC for vectorized execution engine https://github.com/zhangh43/vectorize_engine and want to get some feedback on the idea.
But I do not completely understand why you are proposing to implement it as extension.
Yes, custom nodes makes it possible to provide vector execution without affecting Postgres core.
But for efficient integration of zedstore and vectorized executor we need to extend table-AM (VectorTupleTableSlot and correspondent scan functions).
Certainly it is easier to contribute vectorized executor as extension, but sooner or later I think it should be added to Postgres core.

As far as I understand you already have some prototype implementation (otherwise how you got the performance results)?
If so, are you planning to publish it or you think that executor should be developed from scratch?

The prototype extension is at https://github.com/zhangh43/vectorize_engine
 I agree vectorized executor should be added to Postgres core some days. But it is such a huge feature and need to change from not only the extended table-AM you mentioned and also every executor node , such as Agg,Join,Sort node etc. What's more, the expression evaluation function and aggregate's transition function, combine function etc. We all need to supply a vectorized version for them. Hence, implementing it as an extension first and if it is popular among community and stable, we could merge it into Postgres core whenever we want.

We do want to get some feedback from the community about CustomScan. CustomScan is just an abstract layer. It's typically used to support user defined scan node, but some other PG extensions(pgstorm) have already used it as a general CustomNode e.g. Agg, Join etc. Since vectorized engine need to support vectorized processing in all executor node, follow the above idea, our choice is to use CustomScan.
 
 Some my concerns based on VOPS experience:

1. Vertical (columnar) model is preferable for some kind of queries, but there are some classes of queries for which it is less efficient.
Moreover, data is used to be imported in the database in row format. Inserting it in columnar store record-by-record is very inefficient.
So you need some kind of bulk loader which will be able to buffer input data before loading it in columnar store.
Actually this problem it is more related with data model rather than vectorized executor. But what I want to express here is that it may be better to have both representation (horizontal and vertical)
and let optimizer choose most efficient one for particular query.


Yes, in general, for OLTP queries, row format is better and for OLAP queries column format is better.
As for storage type(or data model), I think DBA should choose row or column store to use for a specific table.
As for executor, it's a good idea to let optimizer to choose based on cost. It is a long term goal and our extension now will fallback to original row executor for Insert,Update,IndexScan cases in a rough way.
We want our extension could be enhanced in a gradual way.
 
2. Columnar store and vectorized executor are most efficient for query like "select sum(x) from T where ...".
Unfortunately such simple queries are rarely used in real life. Usually analytic queries contain group-by and joins.
And here vertical model is not always optimal (you have to reconstruct rows from columns to perform join or grouping).
To provide efficient execution of queries you may need to create multiple different projections of the same data (sorted by different subset of attributes).
This is why Vertica (one of the most popular columnar store DBMS) is supporting projections.
The same can be done in VOPS: using create_projection function you can specify which attributes should be scalar (grouping attributes) and which vectorized.
In this case you can perform grouping and joins using standard Postgres executor, while perform vectorized operations for filtering and accumulating aggregates. 

This is why Q1 is 20 times faster in VOPS and not 2 times as in your prototype.
So I think that columnar store should make it possible to maintain several projections of table and optimizer should be able to automatically choose one of them for particular query.
Definitely synchronization of projections is challenged problem. Fortunately OLAP usually not require most recent data.


Projection in Vertica is useful. I tested, VOPS is really faster. It could be nice if you could contribute it to PG core. Our extension is aimed to not change any Postgres code as well as user's sql and existing table.
We will continue to optimize our vectorize implementation. Vectorized hashagg need vectorized hashtable implementation, e.g. calculate hashkey in a batched way, probe hashtable in a batched way. Original hashtable in PG is not a vectorised hash table of course.
 
3. I wonder if vectorized executor should support only built-in types and predefined operators? Or it should be able to work with any user defined types, operators and aggregates?
Certainly it is much easier to support only built-in scalar types. But it contradicts to open and extensible nature of Postgres.

Yes, we should support user defined type. This could be done by introducing a register layer which mapping the row type with vector type. E.g. int4->vint4 and also for each operator. 

4. Did you already think about format of storing data in VectorTupleTableSlot? Should it be array of Datum? Or we need to represent vector in more low level format (for example
as array of floats for real4 type)?


Our perf results show that datum conversion is not effective, and we prepare to implement to datum array as low level format array as you mentioned.
--
Thanks

Hubert Zhang

pgsql-hackers by date:

Previous
From: Tatsuo Ishii
Date:
Subject: Re: Implementing Incremental View Maintenance
Next
From: Tatsuo Ishii
Date:
Subject: Re: Implementing Incremental View Maintenance