Re: Zedstore - compressed in-core columnar storage - Mailing list pgsql-hackers
From: Taylor Vesely
Subject: Re: Zedstore - compressed in-core columnar storage
Msg-id: CAFaX_4JSDxyiPavZ-2bdjTjY18Szm74fHMKntPGHQuSBuhuteg@mail.gmail.com
In response to: Re: Zedstore - compressed in-core columnar storage (Taylor Vesely <tvesely@pivotal.io>)
List: pgsql-hackers
Alex Wang and I have been doing some performance analysis of the most
recent version of the zedstore branch, and have some interesting
statistics to share.
We specifically focused on TPC-DS query 2, because it plays to what
should be a strength of zedstore: it does a full table scan of only a
subset of columns. I've attached the explain verbose output for
reference.
We scan two columns of 'catalog_sales' and two columns of 'web_sales':
-> Parallel Append
-> Parallel Seq Scan on tpcds.catalog_sales
Output: catalog_sales.cs_ext_sales_price, catalog_sales.cs_sold_date_sk
-> Parallel Seq Scan on tpcds.web_sales
Output: web_sales.ws_ext_sales_price, web_sales.ws_sold_date_sk
For heap, the query needs to do a full table scan of both tables,
reading both of them in their entirety. For our dataset, that totals
around 119GB of data.
***HEAP***
tpcds=# select pg_size_pretty(pg_relation_size('web_sales'));
pg_size_pretty
----------------
39 GB
(1 row)
tpcds=# select pg_size_pretty(pg_relation_size('catalog_sales'));
pg_size_pretty
----------------
80 GB
(1 row)
***/HEAP***
With zedstore the total relation size is smaller because of
compression. When scanning the table, we only read the blocks that
hold the columns we are interested in, and leave the rest alone. So
the data we need to scan for these tables comes to around 4GB:
***ZEDSTORE***
zedstore=# select pg_size_pretty(pg_relation_size('web_sales'));
pg_size_pretty
----------------
20 GB
(1 row)
zedstore=# select pg_size_pretty(pg_relation_size('catalog_sales'));
pg_size_pretty
----------------
40 GB
(1 row)
zedstore=# with zedstore_tables as (select d.oid, f.*
zedstore(# from (select c.oid
zedstore(# from pg_am am
zedstore(# join pg_class c on (c.relam = am.oid)
zedstore(# where am.amname = 'zedstore') d,
zedstore(# pg_zs_btree_pages(d.oid) f)
zedstore-# select zs.attno, att.attname, zs.oid::regclass, count(zs.attno) as pages,
zedstore-# pg_size_pretty(count(zs.attno) * 8 * 1024) from zedstore_tables zs
zedstore-# left join pg_attribute att on zs.attno = att.attnum
zedstore-# and zs.oid = att.attrelid
zedstore-# where zs.oid in ('catalog_sales'::regclass, 'web_sales'::regclass)
zedstore-# and (att.attname in ('cs_ext_sales_price','cs_sold_date_sk','ws_ext_sales_price','ws_sold_date_sk')
zedstore(# or zs.attno = 0)
zedstore-# group by zs.attno, att.attname, zs.oid
zedstore-# order by zs.oid , zs.attno;
attno | attname | oid | pages | pg_size_pretty
-------+--------------------+---------------+--------+----------------
0 | | catalog_sales | 39549 | 309 MB
1 | cs_sold_date_sk | catalog_sales | 2441 | 19 MB
24 | cs_ext_sales_price | catalog_sales | 289158 | 2259 MB
0 | | web_sales | 20013 | 156 MB
1 | ws_sold_date_sk | web_sales | 17578 | 137 MB
24 | ws_ext_sales_price | web_sales | 144860 | 1132 MB
***/ZEDSTORE***
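Summing those per-column figures (309 + 19 + 2259 + 156 + 137 + 1132 MB)
comes to roughly 3.9GB, which is where the ~4GB estimate above comes
from. A variant of the same query (just a sketch; it reuses the
zedstore_tables CTE and the 8kB page-size assumption from above) can
produce the total directly:

with zedstore_tables as (select d.oid, f.*
    from (select c.oid
          from pg_am am
          join pg_class c on (c.relam = am.oid)
          where am.amname = 'zedstore') d,
         pg_zs_btree_pages(d.oid) f)
select pg_size_pretty(count(*) * 8 * 1024) as scanned_size   -- pages * 8kB
from zedstore_tables zs
left join pg_attribute att on zs.attno = att.attnum and zs.oid = att.attrelid
where zs.oid in ('catalog_sales'::regclass, 'web_sales'::regclass)
  and (att.attname in ('cs_ext_sales_price', 'cs_sold_date_sk',
                       'ws_ext_sales_price', 'ws_sold_date_sk')
       or zs.attno = 0);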
On our test machine, the tables were stored on a single spinning disk,
so our read speed was pretty abysmal for this query. The query is
I/O bound for us, so disk read throughput was the single largest
factor in run time. With heap, the tables are scanned sequentially, so
the disk can deliver around 150MB of table data per second:
***HEAP***
avg-cpu: %user %nice %system %iowait %steal %idle
8.54 0.00 1.85 11.62 0.00 77.98
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdd 1685.33 0.00 157069.33 0.00 18.67 0.00 1.10 0.00 1.56 0.00 2.62 93.20 0.00 0.59 100.00
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdd 1655.33 0.00 154910.67 0.00 21.33 0.00 1.27 0.00 1.62 0.00 2.68 93.58 0.00 0.60 100.13
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdd 1746.33 0.00 155121.33 0.00 28.00 0.00 1.58 0.00 1.48 0.00 2.61 88.83 0.00 0.57 100.00
***/HEAP***
Because zedstore's access pattern resembled random I/O, read speed was
significantly hindered on our single disk. As a result, we saw roughly
150x lower read throughput (around 1MB/s instead of 150MB/s):
***ZEDSTORE***
avg-cpu: %user %nice %system %iowait %steal %idle
6.24 0.00 1.22 6.34 0.00 86.20
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdb 129.33 0.00 1034.67 0.00 0.00 0.00 0.00 0.00 15.89 0.00 2.05 8.00 0.00 7.67 99.20
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdb 120.67 0.00 965.33 0.00 0.00 0.00 0.00 0.00 16.51 0.00 1.99 8.00 0.00 8.21 99.07
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdb 121.00 0.00 968.00 0.00 0.00 0.00 0.00 0.00 16.76 0.00 2.02 8.00 0.00 8.19 99.07
***/ZEDSTORE***
The total query times:
***HEAP***
Execution Time: 758807.571 ms
***/HEAP***
***ZEDSTORE***
Execution Time: 2111576.259 ms
***/ZEDSTORE***
Every attribute in zedstore is stored in its own btree keyed by TID.
Unlike heap, the TID is a logical address, not a physical one, and the
pages of one attribute are interspersed with the pages of all the
other attributes. When you do a sequential scan on zedstore, the pages
of any given attribute are therefore not laid out sequentially on
disk, so the access pattern can resemble random I/O.
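One way to see that interleaving is to look at which attribute each
physical block belongs to. This is only a sketch: it assumes
pg_zs_btree_pages also exposes the physical block number of each btree
page as a blkno column, which may not match the function's actual
output:

-- Assumed columns: blkno (physical block number) and attno. If
-- consecutive blocks belong to different attributes, a scan of a
-- couple of columns has to skip around the file rather than read it
-- front to back.
select blkno, attno
from pg_zs_btree_pages('catalog_sales'::regclass)
order by blkno
limit 20;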
On our system, query time for zedstore was around 3x slower than heap
for this query (about 2112 seconds vs. 759 seconds). If your storage
does not handle semi-random reads well, zedstore can be very slow.
This setup was close to a worst case, because our random read
throughput was about 150x lower than our sequential read throughput.
On hardware with better random I/O, zedstore would really shine.
On a side note, a second run of this query with zedstore finished in
around 57 seconds, because the ~4GB of column data was already cached
in memory. The cached footprint is small because only the pages of the
relevant columns are read, and the datums in them are compressed and
encoded. By contrast, running the same query again with heap still
takes around 750 seconds, because our system cannot hold 119GB of
relation data in shared buffers or the OS page cache.
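Incidentally, if you want to see how much of a relation survives in
shared buffers between runs, the stock pg_buffercache extension gives
a rough number (a sketch only; it covers shared buffers, not the OS
page cache, and is not zedstore-specific):

-- Requires: create extension pg_buffercache;
select c.relname, pg_size_pretty(count(*) * 8192) as cached_in_shared_buffers
from pg_buffercache b
join pg_class c on c.relfilenode = b.relfilenode
where b.reldatabase = (select oid from pg_database
                       where datname = current_database())
  and c.relname in ('catalog_sales', 'web_sales')
group by c.relname;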
Our main takeaway is that anything we can do to group together data
that is accessed together will help zedstore issue larger and more
frequent sequential reads.
On Mon, Oct 28, 2019 at 3:22 PM Taylor Vesely <tvesely@pivotal.io> wrote:
> When a zedstore table is queried using *invalid* ctid, the server
> crashes due to assertion failure. See below,
>
> postgres=# select * from t2 where ctid = '(0, 0)';
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
Thank you for pointing that out! I will look into fixing that some
time this week. If we run without assertions, the query still fails
with this error, because zedstoream_tuple_tid_valid incorrectly
reports the TID as valid:
ERROR: arrived at incorrect block 2 while descending zedstore btree
> I believe above should have either returned 1 rows or failed with some
> user friendly error.
Agreed. I think it should match the behavior of heap as closely as
possible.
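For comparison, my understanding is that heap simply returns zero rows
for a TID with offset 0, since that is never a valid item pointer. A
minimal sketch (hypothetical table name, not from the report above):

create table t2_heap (a int);
insert into t2_heap values (1);
select * from t2_heap where ctid = '(0, 0)';   -- expect 0 rows, no error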