Thread: Yet another vectorized engine

Yet another vectorized engine

From
Hubert Zhang
Date:
Hi hackers,

We would like to introduce another POC of a vectorized execution engine, https://github.com/zhangh43/vectorize_engine, and get some feedback on the idea.

The basic idea is to extend TupleTableSlot and introduce VectorTupleTableSlot, which stores an array of Datums per projected column. Each column's Datum array is contiguous in memory, which makes expression evaluation cache friendly and lets SIMD be utilized. So far we have refactored SeqScanNode and AggNode to support VectorTupleTableSlot.

Below are the main features of our design.
1. Pure extension. We don't hack any code in the Postgres kernel.

2. CustomScan node. We use the CustomScan framework to replace original executor nodes such as SeqScan, Agg, etc. Based on CustomScan, we extend CustomScanState and the BeginCustomScan(), ExecCustomScan(), EndCustomScan() interfaces to implement the vectorized executor logic.

3. Post-planner hook. After the plan is generated, we use plan_tree_walker to traverse the plan tree and check whether it can be vectorized. If yes, the non-vectorized nodes (SeqScan, Agg, etc.) are replaced with vectorized nodes (in the form of CustomScan nodes) and the vectorized executor is used. If not, we revert to the original plan and use the non-vectorized executor. This part could be enhanced in the future: for example, instead of reverting to the original plan when some nodes cannot be vectorized, we could add Batch/Unbatch nodes to generate a plan containing both vectorized and non-vectorized nodes.

4. Vectorized executor nodes can be implemented gradually. We have currently vectorized only SeqScan and Agg, but other queries, including those with joins, can still run when the vectorize extension is enabled.

5. Inherit original executor code. Instead of rewriting the whole executor, we chose a smoother path: modify the current Postgres executor nodes to make them vectorized. We copy the current executor node's C file into our extension and add the vectorization logic on top of it. When Postgres enhances its executor, we can merge the changes back relatively easily. We would like to know whether this is a good way to write a vectorized executor extension.

6. Pluggable storage. Postgres now supports pluggable storage, and TupleTableSlot has been refactored into the abstract TupleTableSlotOps struct. VectorTupleTableSlot could be implemented under this framework when we upgrade the extension to the latest PG.

We ran the TPC-H (10GB) benchmark; the Q1 result is 50 sec (PG) vs. 28 sec (vectorized PG). The performance gain can be further improved in several areas:
1. Heap tuple deforming occupies many CPU cycles. We will try zedstore in the future, since a vectorized executor is more compatible with a column store.

2. The vectorized Agg is not fully vectorized, and there are many optimizations still to do, for example batch-computing hash values and optimizing the hash table for vectorized HashAgg.

3. The conversion cost between Datum and the actual type (and vice versa), e.g. DatumGetFloat4 and Float4GetDatum, is also high. One optimization may be to store the actual type in VectorTupleTableSlot directly, instead of an array of Datums.

Related work:
1. VOPS is a vectorized execution extension (https://github.com/postgrespro/vops).
It doesn't use the custom scan framework; it uses UDFs to perform the vectorized operations, e.g. it changes the SQL syntax for aggregation.

2. The Citus vectorized executor is another POC (https://github.com/citusdata/postgres_vectorization_test).
It uses ExecutorRun_hook to run the vectorized executor and cstore_fdw for column storage.

Note that the vectorized executor engine is currently based on PG 9.6, but it could be ported to master / zedstore with some effort. We would appreciate some feedback before moving further in that direction.

Thanks, 
Hubert Zhang, Gang Xiong, Ning Yu, Asim Praveen

Re: Yet another vectorized engine

From
Konstantin Knizhnik
Date:


On 28.11.2019 12:23, Hubert Zhang wrote:

Hi,

I think a vectorized executor is an absolutely necessary thing for Postgres, especially taking into account that we now have a columnar store prototype (zedstore).
To take full advantage of a columnar store we definitely need a vectorized executor.

But I do not completely understand why you are proposing to implement it as an extension.
Yes, custom nodes make it possible to provide vector execution without affecting the Postgres core.
But for efficient integration of zedstore and the vectorized executor we need to extend the table AM (VectorTupleTableSlot and the corresponding scan functions).
Certainly it is easier to contribute a vectorized executor as an extension, but sooner or later I think it should be added to the Postgres core.

As far as I understand, you already have some prototype implementation (otherwise how did you get the performance results)?
If so, are you planning to publish it, or do you think the executor should be developed from scratch?

Some of my concerns, based on VOPS experience:

1. The vertical (columnar) model is preferable for some kinds of queries, but there are classes of queries for which it is less efficient.
Moreover, data is usually imported into the database in row format; inserting it into a columnar store record by record is very inefficient.
So you need some kind of bulk loader that can buffer input data before loading it into the columnar store.
Actually this problem is more related to the data model than to the vectorized executor. But what I want to express here is that it may be better to have both representations (horizontal and vertical)
and let the optimizer choose the most efficient one for a particular query.

2. A columnar store and vectorized executor are most efficient for queries like "select sum(x) from T where ...".
Unfortunately such simple queries are rarely used in real life. Analytic queries usually contain group-by and joins.
And here the vertical model is not always optimal (you have to reconstruct rows from columns to perform the join or grouping).
To execute queries efficiently you may need to create multiple different projections of the same data (sorted by different subsets of attributes).
This is why Vertica (one of the most popular columnar-store DBMSes) supports projections.
The same can be done in VOPS: using the create_projection function you can specify which attributes should be scalar (grouping attributes) and which vectorized.
In this case you can perform grouping and joins using the standard Postgres executor, while performing vectorized operations for filtering and accumulating aggregates.

This is why Q1 is 20 times faster in VOPS, and not 2 times as in your prototype.
So I think the columnar store should make it possible to maintain several projections of a table, and the optimizer should be able to automatically choose one of them for a particular query.
Admittedly, synchronization of projections is a challenging problem. Fortunately, OLAP usually does not require the most recent data.

3. I wonder whether the vectorized executor should support only built-in types and predefined operators, or whether it should be able to work with any user-defined types, operators and aggregates.
Certainly it is much easier to support only built-in scalar types, but that contradicts the open and extensible nature of Postgres.

4. Have you already thought about the format for storing data in VectorTupleTableSlot? Should it be an array of Datums? Or do we need to represent a vector in a lower-level format (for example, as an array of floats for the float4 type)?

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Yet another vectorized engine

From
Michael Paquier
Date:
On Thu, Nov 28, 2019 at 05:23:59PM +0800, Hubert Zhang wrote:
> Note that the vectorized executor engine is based on PG9.6 now, but it
> could be ported to master / zedstore with some effort.  We would appreciate
> some feedback before moving further in that direction.

There has been no feedback yet, unfortunately.  The patch does not
apply anymore, so a rebase is necessary.  For now I am moving the
patch to next CF, waiting on author.
--
Michael


Re: Yet another vectorized engine

From
Hubert Zhang
Date:
Hi Konstantin,
Thanks for your reply.

On Fri, Nov 29, 2019 at 12:09 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
On 28.11.2019 12:23, Hubert Zhang wrote:
We just want to introduce another POC for vectorized execution engine https://github.com/zhangh43/vectorize_engine and want to get some feedback on the idea.
But I do not completely understand why you are proposing to implement it as extension.
Yes, custom nodes makes it possible to provide vector execution without affecting Postgres core.
But for efficient integration of zedstore and vectorized executor we need to extend table-AM (VectorTupleTableSlot and correspondent scan functions).
Certainly it is easier to contribute vectorized executor as extension, but sooner or later I think it should be added to Postgres core.

As far as I understand you already have some prototype implementation (otherwise how you got the performance results)?
If so, are you planning to publish it or you think that executor should be developed from scratch?

The prototype extension is at https://github.com/zhangh43/vectorize_engine
I agree a vectorized executor should be added to the Postgres core some day. But it is such a huge feature: it needs changes not only in the table AM you mentioned but also in every executor node (Agg, Join, Sort, etc.), and moreover in the expression evaluation functions and the aggregates' transition and combine functions; we need to supply a vectorized version of all of them. Hence we implement it as an extension first, and if it becomes popular in the community and stable, we could merge it into the Postgres core whenever we want.

We do want to get some feedback from the community about CustomScan. CustomScan is just an abstraction layer. It is typically used to support user-defined scan nodes, but some other PG extensions (PG-Strom) have already used it as a general custom node, e.g. for Agg, Join, etc. Since a vectorized engine needs to support vectorized processing in every executor node, following the above idea, our choice is to use CustomScan.
 
 Some my concerns based on VOPS experience:

1. Vertical (columnar) model is preferable for some kind of queries, but there are some classes of queries for which it is less efficient.
Moreover, data is used to be imported in the database in row format. Inserting it in columnar store record-by-record is very inefficient.
So you need some kind of bulk loader which will be able to buffer input data before loading it in columnar store.
Actually this problem it is more related with data model rather than vectorized executor. But what I want to express here is that it may be better to have both representation (horizontal and vertical)
and let optimizer choose most efficient one for particular query.


Yes, in general, for OLTP queries row format is better and for OLAP queries column format is better.
As for the storage type (or data model), I think the DBA should choose a row or column store for each specific table.
As for the executor, it's a good idea to let the optimizer choose based on cost. That is a long-term goal; for now our extension falls back to the original row executor for Insert, Update, and IndexScan cases in a rough way.
We want the extension to be enhanced gradually.
 
2. Columnar store and vectorized executor are most efficient for query like "select sum(x) from T where ...".
Unfortunately such simple queries are rarely used in real life. Usually analytic queries contain group-by and joins.
And here vertical model is not always optimal (you have to reconstruct rows from columns to perform join or grouping).
To provide efficient execution of queries you may need to create multiple different projections of the same data (sorted by different subset of attributes).
This is why Vertica (one of the most popular columnar store DBMS) is supporting projections.
The same can be done in VOPS: using create_projection function you can specify which attributes should be scalar (grouping attributes) and which vectorized.
In this case you can perform grouping and joins using standard Postgres executor, while perform vectorized operations for filtering and accumulating aggregates. 

This is why Q1 is 20 times faster in VOPS and not 2 times as in your prototype.
So I think that columnar store should make it possible to maintain several projections of table and optimizer should be able to automatically choose one of them for particular query.
Definitely synchronization of projections is challenged problem. Fortunately OLAP usually not require most recent data.


Projections in Vertica are useful. I tested VOPS, and it really is faster; it would be nice if you could contribute it to the PG core. Our extension aims not to change any Postgres code, nor the user's SQL or existing tables.
We will continue to optimize our vectorized implementation. Vectorized HashAgg needs a vectorized hash table implementation, e.g. calculating hash keys in a batched way and probing the hash table in a batched way. The original hash table in PG is of course not a vectorized hash table.
 
3. I wonder if vectorized executor should support only built-in types and predefined operators? Or it should be able to work with any user defined types, operators and aggregates?
Certainly it is much easier to support only built-in scalar types. But it contradicts to open and extensible nature of Postgres.

Yes, we should support user-defined types. This could be done by introducing a registration layer which maps each row type to a vector type, e.g. int4 -> vint4, and likewise for each operator.

4. Did you already think about format of storing data in VectorTupleTableSlot? Should it be array of Datum? Or we need to represent vector in more low level format (for example
as array of floats for real4 type)?


Our perf results show that Datum conversion is not efficient, and we plan to replace the Datum array with a native low-level array format, as you mentioned.
--
Thanks

Hubert Zhang

Re: Yet another vectorized engine

From
Hubert Zhang
Date:
On Sun, Dec 1, 2019 at 10:05 AM Michael Paquier <michael@paquier.xyz> wrote:


Thanks we'll rebase and resubmit the patch.
--
Thanks

Hubert Zhang

Re: Yet another vectorized engine

From
Konstantin Knizhnik
Date:


On 02.12.2019 4:15, Hubert Zhang wrote:

The prototype extension is at https://github.com/zhangh43/vectorize_engine

I am very sorry that I had not followed this link.
A few questions concerning your design decisions:

1. Would it be more efficient to use native arrays in vtype instead of an array of Datums? I think it would allow the compiler to generate more efficient code for operations with float4 and int32 types.
It is possible to use a union to keep vtype a fixed size.
2. Why does VectorTupleSlot contain an array (batch) of heap tuples rather than vectors (arrays of vtype)?
3. Why did you have to implement your own plan_tree_mutator instead of using expression_tree_mutator?
4. As far as I understand, you currently always try to replace SeqScan with your custom vectorized scan. But that makes sense only if there are quals for the scan or aggregation is performed.
In other cases batch+unbatch just adds extra overhead, doesn't it?
5. Throwing and catching an exception for queries which cannot be vectorized seems to be neither the safest nor the most efficient way of handling such cases.
Maybe it is better to return an error code from plan_tree_mutator and propagate this error upwards?
6. Have you experimented with different batch sizes? I did similar experiments in VOPS and found that tile sizes larger than 128 do not provide a noticeable increase in performance.
You are currently using batch size 1024, which is significantly larger than the typical number of tuples on one page.
7. How can a vectorized scan be combined with parallel execution (it is already supported in 9.6, isn't it)?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Yet another vectorized engine

From
Hubert Zhang
Date:
Thanks Konstantin for your detailed review!

On Tue, Dec 3, 2019 at 5:58 PM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:


On 02.12.2019 4:15, Hubert Zhang wrote:

The prototype extension is at https://github.com/zhangh43/vectorize_engine

I am very sorry, that I have no followed this link.
Few questions concerning your design decisions:

1. Will it be more efficient to use native arrays in vtype instead of array of Datum? I think it will allow compiler to generate more efficient code for operations with float4 and int32 types.
It is possible to use union to keep fixed size of vtype.
 
Yes. I'm also considering that when scanning a column store, the column batch is loaded into a contiguous memory region. For int32, the size of this region is 4*BATCHSIZE, while for int16 it is 2*BATCHSIZE. So with native arrays, a single memcpy could fill the vtype batch.
 
2. Why VectorTupleSlot contains array (batch) of heap tuples rather than vectors (array of vtype)?

a. VectorTupleSlot stores the array of vtype in the tts_values field, which reduces the code changes and lets us reuse functions like ExecProject. Of course we could use a separate field to store the vtypes.
b. VectorTupleSlot also contains an array of heap tuples; this is used for heap tuple deforming. In fact, the tuples in a batch may span many pages, so we also need to pin an array of related pages instead of just one page.

3. Why you have to implement your own plan_tree_mutator and not using expression_tree_mutator?

I also want to replace plan nodes, e.g. Agg -> CustomScan (with a VectorAgg implementation). expression_tree_mutator cannot be used to mutate plan nodes such as Agg, am I right?
 
4. As far as I understand you now always try to replace SeqScan with your custom vectorized scan. But it makes sense only if there are quals for this scan or aggregation is performed.
In other cases batch+unbatch just adds extra overhead, doesn't it?

Probably there is extra overhead for the heap format and a query like 'select i from t;' without quals, projection, or aggregation.
But with a column store, VectorScan can read batches directly, with no additional batching cost. A column store is the better choice for OLAP queries.
Can we conclude that it would be better to use the vector engine for OLAP queries and the row engine for OLTP queries?

5. Throwing and catching exception for queries which can not be vectorized seems to be not the safest and most efficient way of handling such cases.
May be it is better to return error code in plan_tree_mutator and propagate this error upstairs? 
 
Yes. As for efficiency, another way is to enable some plan nodes to be vectorized, leave the other nodes non-vectorized, and add a batch/unbatch layer between them (is this what you meant by "propagate this error upstairs"?). As you mentioned, this could introduce additional overhead. Are there any other good approaches?
What do you mean by "not safest"? PG_CATCH will receive the ERROR, and we fall back to the original non-vectorized plan.


6. Have you experimented with different batch size? I have done similar experiments in VOPS and find out that tile size larger than 128 are not providing noticable increase of performance.
You are currently using batch size 1024 which is significantly larger than typical amount of tuples on one page.

Good point. We will run some experiments on it.

7. How vectorized scan can be combined with parallel execution (it is already supported in9.6, isn't it?)

We didn't implement it yet, but the idea is the same as in the non-parallel case: copy the current parallel scan and implement a vectorized Gather, keeping VectorTupleTableSlot as their interface.
Our basic idea is to reuse most of the current PG executor logic, make it vectorized, and then tune performance gradually.

--
Thanks

Hubert Zhang

Re: Yet another vectorized engine

From
Konstantin Knizhnik
Date:


On 04.12.2019 12:13, Hubert Zhang wrote:
3. Why you have to implement your own plan_tree_mutator and not using expression_tree_mutator?

I also want to replace plan node, e.g. Agg->CustomScan(with VectorAgg implementation). expression_tree_mutator cannot be used to mutate plan node such as Agg, am I right?

Oh, sorry, I see.
 
4. As far as I understand you now always try to replace SeqScan with your custom vectorized scan. But it makes sense only if there are quals for this scan or aggregation is performed.
In other cases batch+unbatch just adds extra overhead, doesn't it?

Probably extra overhead for heap format and query like 'select i from t;' without qual, projection, aggregation.
But with column store, VectorScan could directly read batch, and no additional batch cost. Column store is the better choice for OLAP queries.

Generally, yes.
But will it be true for a query with a lot of joins?

select * from T1 join T2 on (T1.pk=T2.fk) join T3 on (T2.pk=T3.fk) join T4 ...

How can batching improve performance in this case?
Also, if a query contains a LIMIT clause or cursors, then batching can cause fetching of useless records (which will never be requested by the client).

Can we conclude that it would be better to use vector engine for OLAP queries and row engine for OLTP queries.

5. Throwing and catching exception for queries which can not be vectorized seems to be not the safest and most efficient way of handling such cases.
May be it is better to return error code in plan_tree_mutator and propagate this error upstairs? 
 
Yes, as for efficiency, another way is to enable some plan node to be vectorized and leave other nodes not vectorized and add batch/unbatch layer between them(Is this what you said "propagate this error upstairs"). As you mentioned, this could introduce additional overhead. Is there any other good approaches?
What do you mean by not safest?
PG catch will receive the ERROR, and fallback to the original non-vectorized plan.

The problem with catching and ignoring exceptions has been discussed many times on hackers.
Unfortunately, the Postgres PG_TRY/PG_CATCH mechanism is not an analog of the exception mechanisms in higher-level languages like C++ or Java:
it doesn't perform stack unwinding. If some resources (files, locks, memory, ...) were obtained before the error was thrown, they are not reclaimed.
Only a transaction rollback is guaranteed to release all resources, and that is what actually happens in normal error processing.
But if you catch and ignore an exception, trying to continue execution, it can cause many problems.

Maybe in your case it is not a problem, because you know for sure where the error can happen: it is thrown by plan_tree_mutator,
and it looks like there are no resources obtained by this function. But in any case the overhead of setjmp is much higher than that of explicit checks of return codes.
So checking return codes will not actually add any noticeable overhead, apart from complicating the code with extra checks,
and that can be hidden in macros which are used in any case (like MUTATE).

7. How vectorized scan can be combined with parallel execution (it is already supported in9.6, isn't it?)

We didn't implement it yet. But the idea is the same as non parallel one. Copy the current parallel scan and implement vectorized Gather, keeping their interface to be VectorTupleTableSlot.
Our basic idea to reuse most of the current PG executor logic, and make them vectorized, then tuning performance gradually.

A parallel scan scatters pages between parallel workers.
To fill a VectorTupleSlot with data you may need more than one page (unless you decide that it can fetch tuples only from a single page),
so it has to take the specifics of parallel scans into account somehow.
Also, there are special nodes for parallel execution, so if we want to provide parallel execution for vectorized operations we also need to substitute these nodes with custom nodes.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Yet another vectorized engine

From
Hubert Zhang
Date:
Thanks Konstantin,
Your suggestions are very helpful. I have added them as issues in the vectorize_engine repo.



--
Thanks

Hubert Zhang

Re: Yet another vectorized engine

From
Konstantin Knizhnik
Date:
I have done some performance comparisons.
First of all, I failed to run the vectorized version of Q1 with the master branch of your repository and the PG9_6_STABLE branch of Postgres:

NOTICE:  query can't be vectorized
DETAIL:  Non plain agg is not supported

I had to switch to the pg96 branch.

Results (seconds) of Q1 execution are the following:

max_parallel_workers_per_gather | PG9_6, enable_vectorize_engine=off | PG9_6, enable_vectorize_engine=on | master (jit=on)
--------------------------------+------------------------------------+-----------------------------------+----------------
                              0 |                                 36 |                                20 |              10
                              4 |                                 10 |                                 - |               5

I failed to run parallel version of Q1 with enable_vectorize_engine=on because of the same error: "Non plain agg is not supported"


So it looks like PG-13 provides significant advantages in OLAP queries compared with 9.6!
That definitely doesn't mean that a vectorized executor is not needed for new versions of Postgres.
Once ported, I expect it to provide a comparable performance improvement.

But in any case, I think that a vectorized executor makes sense only combined with a columnar store.



-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Yet another vectorized engine

From
Pavel Stehule
Date:


On Mon, Feb 10, 2020 at 6:20 PM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
[...]


So it looks like PG-13 provides significant advantages in OLAP queries compared with 9.6!
That definitely doesn't mean that a vectorized executor is not needed for new versions of Postgres.
Once ported, I expect it to provide a comparable performance improvement.

But in any case, I think that a vectorized executor makes sense only combined with a columnar store.

+1

Pavel




-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Yet another vectorized engine

From
Hubert Zhang
Date:
On Tue, Feb 11, 2020 at 1:20 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

So it looks like PG-13 provides significant advantages in OLAP queries compared with 9.6!
That definitely doesn't mean that a vectorized executor is not needed for new versions of Postgres.
Once ported, I expect it to provide a comparable performance improvement.

But in any case, I think that a vectorized executor makes sense only combined with a columnar store.

Thanks for the test. +1 on vectorization should be combined with a columnar store. I think when we support this extension
on master, we could try the new zedstore.
I'm not active on this work now, but will continue when I have time. Feel free to join and bring VOPS's features into this extension.
 
Thanks

Hubert Zhang

Re: Yet another vectorized engine

From
Konstantin Knizhnik
Date:


On 12.02.2020 13:12, Hubert Zhang wrote:
[...]

Thanks for the test. +1 on vectorization should be combined with a columnar store. I think when we support this extension
on master, we could try the new zedstore.
I'm not active on this work now, but will continue when I have time. Feel free to join and bring VOPS's features into this extension.
 
Thanks

Hubert Zhang

I have ported vectorize_engine to master.
It took longer than I expected: a lot of things have changed in the executor.

Results are the following:


par.workers | PG9_6         | PG9_6        | master        | master        | master       | master
            | vectorize=off | vectorize=on | vectorize=off | vectorize=off | vectorize=on | vectorize=on
            |               |              | jit=on        | jit=off       | jit=on       | jit=off
------------+---------------+--------------+---------------+---------------+--------------+-------------
          0 |            36 |           20 |            16 |          25.5 |           15 |         17.5
          4 |            10 |            - |            57 |             - |            - |

So it confirms the theory that JIT provides almost the same speedup as a vectorized executor (both eliminate interpretation overhead, but in different ways).
I am still not sure that we need a vectorized executor: with standard heap it provides almost no improvement compared with the current JIT version.
But in any case I am going to test it with vertical storage (zedstore or cstore).

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Yet another vectorized engine

From
Hubert Zhang
Date:
Hi

On Sat, Feb 22, 2020 at 12:58 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:


[...]

So it confirms the theory that JIT provides almost the same speedup as a vectorized executor (both eliminate interpretation overhead, but in different ways).
I am still not sure that we need a vectorized executor: with standard heap it provides almost no improvement compared with the current JIT version.
But in any case I am going to test it with vertical storage (zedstore or cstore).


Thanks for the porting and testing.
Yes, PG master and 9.6 differ in many ways, not only in the executor but also in the TupleTableSlot interface.

What matters for the performance of JIT and vectorization is the implementation. This is just the beginning of the vectorization work; as your VOPS extension reported, vectorization could run 10 times faster in PG. With the overhead of row storage (heap) we may not reach that speedup, but I think we can do better. Also +1 on vertical storage.

BTW, welcome to submit your PR for the PG master version.


--
Thanks

Hubert Zhang

Re: Yet another vectorized engine

From
Konstantin Knizhnik
Date:


On 24.02.2020 05:08, Hubert Zhang wrote:
[...]

Thanks for the porting and testing.
Yes, PG master and 9.6 differ in many ways, not only in the executor but also in the TupleTableSlot interface.

What matters for the performance of JIT and vectorization is the implementation. This is just the beginning of the vectorization work; as your VOPS extension reported, vectorization could run 10 times faster in PG. With the overhead of row storage (heap) we may not reach that speedup, but I think we can do better. Also +1 on vertical storage.

BTW, welcome to submit your PR for the PG master version.


Sorry, but I have no permission to push changes to your repository.
I can certainly create my own fork of vectorize_engine, but I think it would be better if I push a pg13 branch to your repository.


Re: Yet another vectorized engine

From
Hubert Zhang
Date:
Hi Konstantin,
I have added you as a collaborator on GitHub. Please accept the invitation and try again.
I think non-collaborators can also open pull requests.

On Mon, Feb 24, 2020 at 8:02 PM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:


[...]


Sorry, but I have no permission to push changes to your repository.
I can certainly create my own fork of vectorize_engine, but I think it would be better if I push a pg13 branch to your repository.




--
Thanks

Hubert Zhang

Re: Yet another vectorized engine

From
Hubert Zhang
Date:
Hi Konstantin,

I checked out your pg13 branch in repo https://github.com/zhangh43/vectorize_engine
After fixing some compile errors, I tested Q1 on TPCH-10G.
The results differ from yours, and the vectorized version is too slow. Note that I disabled parallel workers by default.

no JIT, no vectorize: 36 secs
JIT only:             23 secs
vectorize only:       33 secs
JIT + vectorize:      29 secs

My configure options are `CFLAGS='-O3 -g -march=native' --prefix=/usr/local/pgsql/ --disable-cassert --enable-debug --with-llvm`.
I will investigate why the vectorized version is so slow. Could you please provide your compile options, the TPCH dataset size, and your queries (standard Q1?) to help me debug it.

On Mon, Feb 24, 2020 at 8:43 PM Hubert Zhang <hzhang@pivotal.io> wrote:
[...]


--
Thanks

Hubert Zhang

Re: Yet another vectorized engine

From
Konstantin Knizhnik
Date:


On 25.02.2020 11:06, Hubert Zhang wrote:
Hi Konstantin,

I checkout your branch pg13 in repo https://github.com/zhangh43/vectorize_engine
After fixing some compile errors, I tested Q1 on TPCH-10G.
The results differ from yours, and the vectorized version is too slow. Note that I disabled parallel workers by default.

no JIT, no vectorize: 36 secs
JIT only:             23 secs
vectorize only:       33 secs
JIT + vectorize:      29 secs

My configure options are `CFLAGS='-O3 -g -march=native' --prefix=/usr/local/pgsql/ --disable-cassert --enable-debug --with-llvm`.
I will investigate why the vectorized version is so slow. Could you please provide your compile options, the TPCH dataset size, and your queries (standard Q1?) to help me debug it.



Hi, Hubert

Sorry, it looks like I used a slightly outdated snapshot of master, so I did not notice some problems.
Fixes are committed.

Most of the time is spent unpacking heap tuples (tts_buffer_heap_getsomeattrs):

  24.66%  postgres  postgres             [.] tts_buffer_heap_getsomeattrs
   8.28%  postgres  vectorize_engine.so  [.] VExecStoreColumns
   5.94%  postgres  postgres             [.] HeapTupleSatisfiesVisibility
   4.21%  postgres  postgres             [.] bpchareq
   4.12%  postgres  vectorize_engine.so  [.] vfloat8_accum


In my version of nodeSeqscan I do not keep all 1024 fetched heap tuples; instead, their attribute values are stored into vector columns immediately.
But to avoid extracting useless data, it is necessary to know the list of used columns.
The same problem is solved in zedstore, but unfortunately there is no existing method in Postgres to get the list
of used attributes. I implemented it, but my last implementation contained an error which caused loading of all columns.
The fixed version is committed.

Now profile without JIT is:

 15.52%  postgres  postgres             [.] tts_buffer_heap_getsomeattrs
  10.25%  postgres  postgres             [.] ExecInterpExpr
   6.54%  postgres  postgres             [.] HeapTupleSatisfiesVisibility
   5.12%  postgres  vectorize_engine.so  [.] VExecStoreColumns
   4.86%  postgres  postgres             [.] bpchareq
   4.80%  postgres  vectorize_engine.so  [.] vfloat8_accum
   3.78%  postgres  postgres             [.] tts_minimal_getsomeattrs
   3.66%  postgres  vectorize_engine.so  [.] VExecAgg
   3.38%  postgres  postgres             [.] hashbpchar

and with JIT:

 13.88%  postgres  postgres             [.] tts_buffer_heap_getsomeattrs
   7.15%  postgres  vectorize_engine.so  [.] vfloat8_accum
   6.03%  postgres  postgres             [.] HeapTupleSatisfiesVisibility
   5.55%  postgres  postgres             [.] bpchareq
   4.42%  postgres  vectorize_engine.so  [.] VExecStoreColumns
   4.19%  postgres  postgres             [.] hashbpchar
   4.09%  postgres  vectorize_engine.so  [.] vfloat8pl

[...]

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Yet another vectorized engine

From
Konstantin Knizhnik
Date:
I have ported vectorize_engine to work with zedstore (a vertical table AM).
Results of TPCH-10G Q1 are the following:

par.workers | PG9_6   | PG9_6  | master  | master  | master | master  | zedstore | zedstore | zedstore | zedstore
            | vec=off | vec=on | vec=off | vec=off | vec=on | vec=on  | vec=off  | vec=off  | vec=on   | vec=on
            |         |        | jit=on  | jit=off | jit=on | jit=off | jit=on   | jit=off  | jit=on   | jit=off
------------+---------+--------+---------+---------+--------+---------+----------+----------+----------+---------
          0 |      36 |     20 |      16 |    25.5 |     15 |    17.5 |       18 |       26 |       17 |      19
          4 |      10 |      - |      57 |       - |      - |       5 |        7 |        - |        - |

As you can see from this table, query execution time without vectorization is almost the same for zedstore as for standard heap.
It means that expression evaluation overhead dominates in this case, regardless of the underlying storage.
Enabling the vectorized engine increases the speed of zedstore as well as of standard heap,
but standard heap is still faster.

Maybe my implementation of extracting data from zedstore is not optimal: I just call zsbt_tid_scan_next + zsbt_attr_fetch in a loop.
I attached my implementation of zedstoream_getnexttile (I have added scan_getnexttile to the tableAM interface).

Also I noticed that zedstore currently doesn't correctly calculate the set of used attributes and so extracts useless data.
For example, a query like "select sum(x) from foo" causes fetching of all attributes of foo although we need just "x".

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: Yet another vectorized engine

From
Konstantin Knizhnik
Date:


On 25.02.2020 19:40, Konstantin Knizhnik wrote:
I have ported vectorize_engine  for zedstore (vertical table AM).
Results of TPCH-10G/Q1 are the following:

par.workers
PG9_6
vectorize=off
PG9_6
vectorize=on
master
vectorize=off
jit=on
master
vectorize=off
jit=off
master
vectorize=on
jit=on
master
vectorize=on
jit=off
zedstore vectorize=off
jit=on
zedstore vectorize=off
jit=off
zedstore vectorize=on
jit=on
zedstore vectorize=on
jit=off
0
36
20
16
25.5
15
17.5
18
26
17
19
4
10
-
57
-
-5
7
-
-


After correctly calculating the used-columns bitmapset and passing it to the table_beginscan_with_column_projection function, zedstore + vectorize_engine
shows the best result (without parallel execution):


par.workers | PG9_6   | PG9_6  | master  | master  | master | master  | zedstore | zedstore | zedstore | zedstore
            | vec=off | vec=on | vec=off | vec=off | vec=on | vec=on  | vec=off  | vec=off  | vec=on   | vec=on
            |         |        | jit=on  | jit=off | jit=on | jit=off | jit=on   | jit=off  | jit=on   | jit=off
------------+---------+--------+---------+---------+--------+---------+----------+----------+----------+---------
          0 |      36 |     20 |      16 |    25.5 |     15 |    17.5 |       18 |       26 |       14 |      16
          4 |      10 |      - |      57 |       - |      - |       5 |        7 |        - |        - |


but still the difference with vanilla is minimal.


Profiler top is the following:
 
  16.30%  postgres  postgres             [.] zedstoream_getnexttile
   6.98%  postgres  postgres             [.] decode_chunk
   6.68%  postgres  liblz4.so.1.7.1      [.] LZ4_decompress_safe
   5.37%  postgres  vectorize_engine.so  [.] vfloat8_accum
   5.23%  postgres  postgres             [.] bpchareq

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Yet another vectorized engine

From
Hubert Zhang
Date:
Hi Konstantin,

On Tue, Feb 25, 2020 at 6:44 PM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:


[...]


I also tested Q1 with your latest code. The vectorized version is still slow.
PG13 native:    38 secs
PG13 vec:       30 secs
PG13 JIT:       23 secs
PG13 JIT + vec: 27 secs

My perf results are below. There are three parts:
1. lookup_hash_entry (43.5%): this part is not vectorized yet.
2. the scan part: fetch_input_tuple (36%).
3. the Vadvance_aggregates part (20%).
I also profiled the PG9.6 vectorized version and got similar perf results, and the running times of vectorized PG9.6 and PG13 are similar. But PG13 itself is much faster than PG9.6. So I wonder whether we should merge all the latest executor code of PG13 into the vectorized PG13 branch?

- agg_fill_hash_table
   - 43.50% lookup_hash_entry (inlined)
      + 39.07% LookupTupleHashEntry
        0.56% ExecClearTuple (inlined)
   - 36.06% fetch_input_tuple
      - ExecProcNode (inlined)
         - 36.03% VExecScan
            - 34.60% ExecScanFetch (inlined)
               - ExecScanFetch (inlined)
                  - VSeqNext
                     + 16.64% table_scan_getnextslot (inlined)
                     - 10.29% slot_getsomeattrs (inlined)
                        - 10.17% slot_getsomeattrs_int
                           + tts_buffer_heap_getsomeattrs
                       7.14% VExecStoreColumns
            + 1.38% ExecQual (inlined)
   - 20.30% Vadvance_aggregates (inlined)
      - 17.46% Vadvance_transition_function (inlined)
         + 11.95% vfloat8_accum
         + 4.74% vfloat8pl
           0.75% vint8inc_any
      + 2.77% ExecProject (inlined)

--
Thanks

Hubert Zhang

Re: Yet another vectorized engine

From
Konstantin Knizhnik
Date:


On 26.02.2020 13:11, Hubert Zhang wrote:


[...]


I also tested Q1 with your latest code. The vectorized version is still slow.
PG13 native:    38 secs
PG13 vec:       30 secs
PG13 JIT:       23 secs
PG13 JIT + vec: 27 secs


It is strange that your results are much slower than mine and your profile is very different.
Which postgres configuration are you using?


My perf results are below. There are three parts:
1. lookup_hash_entry (43.5%): this part is not vectorized yet.
It is vectorized in some sense: lookup_hash_entry performs a bulk of hash lookups and passes an array with the results of those lookups to the aggregate transition functions.
It would be possible to significantly increase the speed of HashAgg if we stored data in the order of the grouping attributes and used RLE (run-length encoding) to perform just one
hash lookup per group of values. But that requires the creation of special partitions (as is done in Vertica and VOPS).

2. the scan part: fetch_input_tuple (36%).
3. the Vadvance_aggregates part (20%).
I also profiled the PG9.6 vectorized version and got similar perf results, and the running times of vectorized PG9.6 and PG13 are similar. But PG13 itself is much faster than PG9.6. So I wonder whether we should merge all the latest executor code of PG13 into the vectorized PG13 branch?

Sorry, I do not understand the question. vectorize_engine contains patched versions of nodeSeqscan and nodeAgg from the standard executor.
When porting to PG13, I took the latest version of nodeAgg and tried to apply your patches to it. That was not always possible, and I had to rewrite a lot of places. Concerning nodeSeqscan, I took the old version from vectorize_engine and ported it to PG13.


- agg_fill_hash_table
   - 43.50% lookup_hash_entry (inlined)
      + 39.07% LookupTupleHashEntry
        0.56% ExecClearTuple (inlined)
   - 36.06% fetch_input_tuple
      - ExecProcNode (inlined)
         - 36.03% VExecScan
            - 34.60% ExecScanFetch (inlined)
               - ExecScanFetch (inlined)
                  - VSeqNext
                     + 16.64% table_scan_getnextslot (inlined)
                     - 10.29% slot_getsomeattrs (inlined)
                        - 10.17% slot_getsomeattrs_int
                           + tts_buffer_heap_getsomeattrs
                       7.14% VExecStoreColumns
            + 1.38% ExecQual (inlined)
   - 20.30% Vadvance_aggregates (inlined)
      - 17.46% Vadvance_transition_function (inlined)
         + 11.95% vfloat8_accum
         + 4.74% vfloat8pl
           0.75% vint8inc_any
      + 2.77% ExecProject (inlined)


It is strange that I am not seeing lookup_hash_entry in the profile in my case.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Yet another vectorized engine

From
Hubert Zhang
Date:


On Wed, Feb 26, 2020 at 7:59 PM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:


On 26.02.2020 13:11, Hubert Zhang wrote:


and with JIT:

 13.88%  postgres  postgres             [.] tts_buffer_heap_getsomeattrs
   7.15%  postgres  vectorize_engine.so  [.] vfloat8_accum
   6.03%  postgres  postgres             [.] HeapTupleSatisfiesVisibility
   5.55%  postgres  postgres             [.] bpchareq
   4.42%  postgres  vectorize_engine.so  [.] VExecStoreColumns
   4.19%  postgres  postgres             [.] hashbpchar
   4.09%  postgres  vectorize_engine.so  [.] vfloat8pl


I also tested Q1 with your latest code. The vectorized version is still slower than JIT:
PG13 native: 38 secs
PG13 Vec: 30 secs
PG13 JIT: 23 secs
PG13 JIT+Vec: 27 secs


It is strange that your results are much slower than mine and the profile is very different.
Which postgres configuration you are using?


./configure CFLAGS="-O3 -g -march=native" --prefix=/usr/local/pgsql/ --disable-cassert --enable-debug --with-llvm
I also use `PGXS := $(shell $(PG_CONFIG) --pgxs)` to compile vectorize_engine, so it shares the same compile configuration.

My perf result is as below. There are three parts:
1. lookup_hash_entry (43.5%): this part is not vectorized yet.
It is vectorized in some sense: lookup_hash_entry performs the bulk of hash lookups and passes an array with the results of those lookups to the aggregate transition functions.
It would be possible to significantly increase the speed of HashAgg if we stored data in the order of the grouping attributes and used RLE (run-length encoding) to perform just one
hash lookup for a whole run of equal values. But it requires the creation of special partitions (as is done in Vertica and VOPS).


Yes, Vertica's partitions need to be pre-sorted on user-defined columns, so for TPC-H Q1 on Postgres we cannot make that assumption. And my Q1 plan uses HashAgg instead of GroupAgg based on cost.
 
2. scan part: fetch_input_tuple (36%)
3. vadvance_aggregates part (20%)
I also perfed the PG96 vectorized version and got similar perf results, and the running times of vectorized PG96 and PG13 are also similar. But native PG13 is much faster than PG96. So I just wonder whether we have merged all the latest executor code of PG13 into the vectorized PG13 branch?

Sorry, I do not understand the question. vectorize_executor contains patched versions of nodeSeqscan and nodeAgg from the standard executor.
When porting to PG13, I took the latest version of nodeAgg and tried to apply your patches to it. It was certainly not always possible, and I had to rewrite a lot of places. Concerning nodeSeqscan, I took the old version from vectorize_executor and ported it to PG13.


It is strange that I am not seeing lookup_hash_entry in the profile in my case.


So you already have the PG13 nodeAgg, that is good.
Yes, it is strange; hash table probing is always the costly part.
My perf command is `perf record --call-graph dwarf -p pid`.
Could you share your lineitem schema and Q1 query?
My schema and Q1 query are:
CREATE TABLE lineitem (
        l_orderkey BIGINT NOT NULL,
        l_partkey INTEGER NOT NULL,
        l_suppkey INTEGER NOT NULL,
        l_linenumber INTEGER NOT NULL,
        l_quantity double precision NOT NULL,
        l_extendedprice double precision NOT NULL,
        l_discount double precision NOT NULL,
        l_tax double precision NOT NULL,
        l_returnflag CHAR(1) NOT NULL,
        l_linestatus CHAR(1) NOT NULL,
        l_shipdate DATE NOT NULL,
        l_commitdate DATE NOT NULL,
        l_receiptdate DATE NOT NULL,
        l_shipinstruct CHAR(25) NOT NULL,
        l_shipmode CHAR(10) NOT NULL,
        l_comment VARCHAR(44) NOT NULL
);
select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(l_discount) as count_order
from
    lineitem
where
    l_shipdate <= date '1998-12-01' - interval '106 day'
group by
    l_returnflag,
    l_linestatus
;


--
Thanks

Hubert Zhang

Re: Yet another vectorized engine

From
Hubert Zhang
Date:
Hi Konstantin,
I also vimdiff'ed nodeAgg.c in your PG13 branch against nodeAgg.c in pg's main repo.
Many functions have changed from PG96 to PG13, e.g. 'advance_aggregates' and 'lookup_hash_entry'.
The vectorized nodeAgg seems to still follow the PG96 way of implementing these functions.
In general, I think we'd better port the executor of PG13 to a vectorized executor of PG13 instead of merging some PG13 code into the vectorized executor of PG96 to make it work, because it's hard to determine which functions need to be merged and it's buggy if the executor code of both PG13 and PG96 exists in one branch.

What's your opinion?

Re: Yet another vectorized engine

From
Konstantin Knizhnik
Date:

On 27.02.2020 11:09, Hubert Zhang wrote:
> Hi Konstantin,
> I also vimdiff nodeAgg.c in your PG13 branch with nodeAgg.c in pg's 
> main repo.
> Many functions have changed from PG96 to PG13, e.g. 
> 'advance_aggregates' and 'lookup_hash_entry'.
> The vectorized nodeAgg seems to still follow the PG96 way of implementing 
> these functions.
> In general, I think we'd better port the executor of PG13 to a vectorized 
> executor of PG13 instead of merging some PG13 code into the vectorized 
> executor of PG96 to make it work, because it's hard to determine 
> which functions need to be merged and it's buggy if the executor code 
> of both PG13 and PG96 exists in one branch.
>
> What's your opinion?
>

In the new version of Postgres all the logic of aggregate transitions is 
encapsulated in expressions and performed by execExprInterp or generated 
JIT code.
If we are not going to embed the vectorize engine in the kernel and continue to 
develop it as an extension, then I do not have any good idea how to achieve 
this without
copying and patching the code of ExecInterpExpr.

In any case, the current prototype doesn't show any noticeable 
performance improvement compared with the existing executor with JIT enabled.
And providing a vectorized version of ExecInterpExpr will not help to 
increase speed (according to the profile, time is spent in other places).

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company