Re: [HACKERS] [WIP] Vertical Clustered Index (columnar store extension) - Mailing list pgsql-hackers

From: Konstantin Knizhnik
Subject: Re: [HACKERS] [WIP] Vertical Clustered Index (columnar store extension)
Msg-id: e112b3a9-1150-e388-9efc-798f45bf9484@postgrespro.ru
In response to: Re: [HACKERS] [WIP] Vertical Clustered Index (columnar store extension) (Haribabu Kommi <kommi.haribabu@gmail.com>)
Responses: Re: [HACKERS] [WIP] Vertical Clustered Index (columnar store extension)
List: pgsql-hackers
Hi,
I wonder if it is possible to somehow benchmark your clustered index implementation.
I tried to create a VCI index for the lineitem table from TPC-H and run the Q6 query.
After index creation Postgres no longer uses a parallel execution plan, but the speed of the sequential plan is unchanged,
and nothing in the query execution plan indicates that the VCI index is used:
postgres=# explain select
sum(l_extendedprice*l_discount) as revenue
from
lineitem_projection
where
l_shipdate between '1996-01-01' and '1997-01-01'
and l_discount between 0.08 and 0.1
and l_quantity < 24;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=608333.85..608333.86 rows=1 width=4)
-> Gather (cost=608333.23..608333.84 rows=6 width=4)
Workers Planned: 6
-> Partial Aggregate (cost=607333.23..607333.24 rows=1 width=4)
-> Parallel Seq Scan on lineitem_projection (cost=0.00..607024.83 rows=61680 width=8)
Filter: ((l_shipdate >= '1996-01-01'::date) AND (l_shipdate <= '1997-01-01'::date) AND (l_discount >= '0.08'::double precision) AND (l_discount <= '0.1'::double precision) AND (l_quantity < '24'::double precision))
(6 rows)
postgres=# select
sum(l_extendedprice*l_discount) as revenue
from
lineitem_projection
where
l_shipdate between '1996-01-01' and '1997-01-01'
and l_discount between 0.08 and 0.1
and l_quantity < 24;
revenue
-------------
6.21111e+08
(1 row)
Time: 1171.324 ms (00:01.171)
postgres=# create index vci_idx on lineitem_projection using vci(l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus);
CREATE INDEX
Time: 4.705 ms
postgres=# explain select
* from
lineitem_projection
where
l_shipdate between '1996-01-01' and '1997-01-01'
and l_discount between 0.08 and 0.1
and l_quantity < 24;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on lineitem_projection (cost=0.00..382077.00 rows=1 width=22)
Filter: ((l_shipdate >= '1996-01-01'::date) AND (l_shipdate <= '1997-01-01'::date) AND (l_discount >= '0.08'::double precision) AND (l_discount <= '0.1'::double precision) AND (l_quantity < '24'::double precision))
(2 rows)
postgres=# select
sum(l_extendedprice*l_discount) as revenue
from
lineitem_projection
where
l_shipdate between '1996-01-01' and '1997-01-01'
and l_discount between 0.08 and 0.1
and l_quantity < 24;
revenue
------------
6.2112e+08
(1 row)
Time: 4304.355 ms (00:04.304)
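As a minimal sanity check (assuming the vci AM stores its data in the index relation and exposes cost estimates to the planner), one could look at the index size and re-run EXPLAIN with sequential scans discouraged, so that any usable index path shows up in the plan:

-- does the index relation actually contain data? (vci_idx from the session above)
select pg_size_pretty(pg_relation_size('vci_idx'));

-- make sequential scans unattractive so the planner prefers an index path if one exists
set enable_seqscan = off;
explain select
sum(l_extendedprice*l_discount) as revenue
from
lineitem_projection
where
l_shipdate between '1996-01-01' and '1997-01-01'
and l_discount between 0.08 and 0.1
and l_quantity < 24;
reset enable_seqscan;

If the plan still shows a sequential scan even with enable_seqscan = off, then the access method is most likely not offering any index path to the planner at all.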
I wonder if there is any query which can demonstrate the advantages of using the VCI index?
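For example, a TPC-H Q1-style aggregate touches only the columns covered by the vci index created above (l_shipdate, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus), so, assuming VCI is aimed at aggregate-heavy scans over indexed columns, something along these lines might be a better candidate than Q6:

select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem_projection
where
l_shipdate <= date '1998-12-01' - interval '90' day
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;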
On 06.02.2017 04:26, Haribabu Kommi wrote:
> On Fri, Feb 3, 2017 at 8:28 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
>> On 30.12.2016 06:55, Haribabu Kommi wrote:
>>> Hi All,
>>> Fujitsu was interested in developing a columnar storage extension with minimal changes to the server backend.
>>
>> We in PostgresPRO are also very interested in developing vertical storage (VS) for Postgres.
>> After considering many alternatives, we came to the conclusion that the approach based on representing a columnar store as an access method (index) is the most promising one.
>> It allows us to:
>> 1. Implement VS as an extension without affecting the Postgres core.
>> 2. Have both ROS and WOS.
>> 3. Create multiple projections (as in Vertica).
>> 4. Optimize insert speed by supporting batch inserts and using a flexible recovery model for VS.
>>
>> So it is very similar to your approach. But there are a few differences:
>>
>> 1. Our intention is to completely eliminate changes in the Postgres core.
>> You wrote:
>> But I still do not completely understand why it is not possible to use VS in index-only scans without any changes and the standard Postgres executor?
>
> Yes, it is a mix of both index and table access methods. The current design of the Vertical Clustered Index needs both access methods, and for this reason we used both.
>
>> Why is it not possible to rely on the standard rules for applying indexes in the Postgres optimizer, based on the costs provided by our AM implementation?
>
> In our storage design, we used a TID-CRID map to identify a record from the heap in the columnar storage. Because of HOT updates, new data will not be inserted into indexes, but this would be a problem for the columnar storage, so we added a hook to insert index data even if the update is HOT. We also added another hook for initializing the parameters during execution. Most of the other added hooks can be replaced with existing hooks plus some extra code.
>
>> 2. You are accessing VS pages through the Postgres buffer manager. This certainly has a lot of advantages. First of all, it significantly simplifies the implementation of VS and allows reuse of the Postgres cache and lock managers.
>> But it all leads to some limitations:
>> - For VS it is preferable to have larger pages (in Vertica a page can be several megabytes).
>> - VS is optimized for sequential access, so caching its pages in the buffer manager is not needed and can only evict other useful pages from the cache.
>> - It makes it impossible to implement an in-memory version of VS.
>> - Access to the buffer manager adds extra synchronization overhead, which becomes noticeable on MPP systems.
>> So I wonder if you have considered an approach with a VS-specific implementation of the storage layer?
>
> Currently we are just using the existing PostgreSQL buffer manager and didn't evaluate any columnar-storage-specific storage implementation. We have some plans to evaluate dynamic shared memory.
>
>> 3. To take full advantage of the vertical model, we should provide vector execution.
>> Without it, a columnar store can only reduce the amount of fetched data by selectively fetching the accessed columns and compressing them better.
>> But this is what the existing cstore_fdw extension for Postgres also does.
>> We are going to use executor hooks or custom nodes to implement vector operations for some nodes (filter, grand aggregate, aggregation with group by, ...).
>> Something similar to https://github.com/citusdata/postgres_vectorization_test
>> What is your vision of optimizing the executor to work with VS?
>
> Yes, we implemented something similar to the above by copying most of the aggregate and related code into the extension to provide the vector execution support. Without this vector execution and parallelism support, there will not be much performance benefit.
>
>> 4. How do you see adding parallelism support to VS? Should it be handled inside the VS implementation, or should we use standard Postgres parallel execution (parallel index-only scan)?
>
> Currently we implemented our own parallelism in the columnar storage with some base infrastructure of OSS, but we are planning to change/integrate it according to the OSS implementation.
>
> Regards,
> Hari Babu
> Fujitsu Australia
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company