Thread: Zedstore - compressed in-core columnar storage
Heikki and I have been hacking for the past few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and a link to the github branch [1] to
follow along.
The objective is to gather feedback on the design and approach. The
implementation has the core pieces working, but it is not close to
complete.
A big thank you to Andres, Haribabu and team for the table access method
APIs. We leveraged the APIs to implement zedstore, and they proved to be
in very good shape. We had to enhance them only minimally and in general
didn't have to touch the executor much.
Motivations / Objectives
* Performance improvement for queries selecting a subset of columns
  (reduced IO).
* Reduced on-disk footprint compared to a heap table, thanks to shorter
  tuple headers and compression of similar-type data.
* Be a first-class citizen in the Postgres architecture (table data can
  just independently live in columnar storage).
* Fully MVCC compliant.
* All indexes supported.
* Hybrid row-column store, where some columns are stored together, and
  others separately. Provide flexibility of granularity on how to
  divide the columns. Columns accessed together can be stored
  together.
* Provide better control over bloat (similar to zheap).
* Eliminate the need for separate toast tables.
* Faster add / drop column or changing the data type of a column, by
  avoiding a full rewrite of the table.
High-level Design - B-trees for the win!
========================================
To start simple, let's ignore the column store aspect for a moment and
consider it as a compressed row store. The column store is a natural
extension of this concept, explained in the next section.
The basic on-disk data structure is a B-tree, indexed by TID. The
B-tree is a great data structure: fast and versatile. Note that this
does not refer to the existing B-tree indexes, but to a net new,
separate B-tree used for table data storage.
TID - logical row identifier:
TID is just a 48-bit row identifier. The traditional division into
block and offset numbers is meaningless. In order to find a tuple with
a given TID, one must always descend the B-tree. Having a logical TID
provides the flexibility to move tuples to different pages when page
splits or page merges are performed.
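For illustration only, here is a minimal, hedged C sketch (not the actual
patch code; ZSTid and the helper names are made up) of the idea that a
48-bit logical TID is just a counter, which can be split into a fake
"block" and "offset" part only when the rest of the system insists on an
ItemPointer-shaped value:

#include <assert.h>
#include <stdint.h>

/* Hypothetical 48-bit logical row identifier, carried in a uint64. */
typedef uint64_t ZSTid;

#define ZSTID_MAX ((UINT64_C(1) << 48) - 1)

/* The split carries no physical meaning: it does not say where the
 * tuple actually lives on disk. */
static inline uint32_t zstid_fake_block(ZSTid tid)  { return (uint32_t) (tid >> 16); }
static inline uint16_t zstid_fake_offset(ZSTid tid) { return (uint16_t) (tid & 0xFFFF); }

static inline ZSTid zstid_from_parts(uint32_t block, uint16_t offset)
{
    return ((ZSTid) block << 16) | offset;
}

int main(void)
{
    ZSTid tid = zstid_from_parts(123, 7);
    assert(tid <= ZSTID_MAX);
    assert(zstid_fake_block(tid) == 123);
    assert(zstid_fake_offset(tid) == 7);
    return 0;
}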
The internal pages of the B-tree are super simple and boring. Each
internal page just stores an array of TID and downlink pairs. Let's
focus on the leaf level. Leaf blocks have a short uncompressed header,
followed by B-tree items. Two kinds of items exist:
- a plain item, which holds one tuple or one datum, with an uncompressed payload
- a "container item", which holds multiple plain items, with a compressed payload
+-----------------------------
| Fixed-size page header:
|
| LSN
| TID low and hi key (for Lehman & Yao B-tree operations)
| left and right page pointers
|
| Items:
|
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | undo pointer | payload (plain item)
| TID | size | flags | undo pointer | payload (plain item)
| ...
|
+----------------------------
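As a rough C rendering of the leaf item layout sketched above (the field
names and widths here are assumptions for illustration, not the structs
actually used in the patch):

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: a plain item carries one uncompressed datum/tuple. */
typedef struct ZSPlainItemSketch
{
    uint64_t tid;                 /* 48-bit logical TID (low bits used) */
    uint16_t size;                /* total item size in bytes */
    uint16_t flags;               /* item kind and status bits */
    uint64_t undo_ptr;            /* pointer into the UNDO log */
    /* uncompressed payload follows */
} ZSPlainItemSketch;

/* Illustrative only: a container item packs many plain items, compressed. */
typedef struct ZSContainerItemSketch
{
    uint64_t first_tid;           /* TID of the first contained item */
    uint16_t size;                /* compressed payload size in bytes */
    uint16_t flags;
    uint32_t uncompressed_size;   /* payload size after decompression */
    uint64_t last_tid;            /* TID of the last contained item */
    /* compressed payload follows */
} ZSContainerItemSketch;

int main(void)
{
    printf("plain header: %zu bytes, container header: %zu bytes\n",
           sizeof(ZSPlainItemSketch), sizeof(ZSContainerItemSketch));
    return 0;
}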
Row store
---------
The tuples are stored one after another, sorted by TID. For each
tuple, we store its 48-bit TID, an undo record pointer, and the actual
tuple data, uncompressed.
In uncompressed form, the page can be arbitrarily large. But after
compression, it must fit into a physical 8k block. If on insert or
update of a tuple, the page cannot be compressed below 8k anymore, the
page is split. Note that because TIDs are logical rather than physical
identifiers, we can freely move tuples from one physical page to
another during page split. A tuple's TID never changes.
The buffer cache caches compressed blocks. Likewise, WAL-logging,
full-page images etc. work on compressed blocks. Decompression is done
on-the-fly, in backend-private memory, as and when needed while
reading. For some compression schemes, like run-length encoding (RLE)
or delta encoding, tuples can be constructed directly from the
compressed data.
Column store
------------
A column store uses the same structure but we have *multiple* B-trees,
one for each column, all indexed by TID. The B-trees for all columns
are stored in the same physical file.
A metapage at block 0 has links to the roots of the B-trees. Leaf
pages look the same, but instead of storing the whole tuple, they store
just a single attribute. To reconstruct a row with a given TID, a scan
descends down the B-trees of all the columns using that TID, and
fetches all attributes. Likewise, a sequential scan walks all the
B-trees in lockstep.
So, in summary, you can imagine Zedstore as a forest of B-trees, one
for each column, all indexed by TIDs.
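To make the lockstep access pattern concrete, here is a toy sketch in C:
one "tree" per column, all keyed by the same TID. The per-column B-tree
descent is replaced by a trivial in-memory lookup table purely for
illustration; none of these names come from the patch.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t ZSTid;
typedef int64_t  Datum;              /* stand-in for a column value */

/* Stand-in for one column's B-tree: datums indexed by TID. */
typedef struct ColumnTree
{
    Datum values[16];
    bool  present[16];
} ColumnTree;

/* Stand-in for descending one column's B-tree with a TID. */
static bool column_fetch(const ColumnTree *tree, ZSTid tid, Datum *out)
{
    if (tid >= 16 || !tree->present[tid])
        return false;
    *out = tree->values[tid];
    return true;
}

/* Reconstruct a row: descend every column's tree with the same TID. */
static bool fetch_row(const ColumnTree *trees, int ncolumns, ZSTid tid, Datum *row)
{
    for (int col = 0; col < ncolumns; col++)
        if (!column_fetch(&trees[col], tid, &row[col]))
            return false;
    return true;
}

int main(void)
{
    static ColumnTree cols[2];       /* two columns, zero-initialized */
    Datum row[2];

    cols[0].values[3] = 42;  cols[0].present[3] = true;
    cols[1].values[3] = 100; cols[1].present[3] = true;

    if (fetch_row(cols, 2, 3, row))
        printf("tid 3 -> (%lld, %lld)\n", (long long) row[0], (long long) row[1]);
    return 0;
}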
This way of laying out the data also easily allows for a hybrid
row-column store, where some columns are stored together, and others
have a dedicated B-tree. User-facing syntax will be needed to allow
specifying how to group the columns.
Main reasons for storing data this way
--------------------------------------
* Lay out the data/tuples in a mapped fashion instead of keeping the
  logical-to-physical mapping separate from the actual data. That is,
  keep the meta-data and data logically in a single stream in the file,
  avoiding the need for separate forks/files to store meta-data and
  data.
* Stick to fixed-size physical blocks. Variable-size blocks require
  more logical-to-physical mapping maintenance, plus they impose
  restrictions on the concurrency of writes and reads to files. Hence,
  adopt compression to fit into fixed-size blocks instead of the other
  way round.
MVCC
----
MVCC in zedstore works very similarly to zheap. Undo record pointers
are used to implement MVCC. Transaction information is not directly
stored with the data. In zheap, there's a small, fixed number of
"transaction slots" on each page, but zedstore stores an undo pointer
with each item directly; in normal cases, the compression squeezes this
down to almost nothing.
Implementation
==============
Insert:
Inserting a new row splits the row into datums. Then, for the first
column, we decide which block to insert it into, pick a TID for it, and
write an undo record. The rest of the columns are inserted using the
same TID and point to the same undo position.
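A sketch of that flow, with the per-column B-tree insert and undo-record
writing stubbed out (all names here are hypothetical, not the patch's
API); the point is that only the first column allocates the TID and the
undo record, and the remaining columns reuse both:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t ZSTid;
typedef uint64_t UndoPtr;
typedef int64_t  Datum;

static ZSTid next_tid = 1;                    /* stand-in TID allocator */

/* Stand-ins for the real undo logging and per-column B-tree insert. */
static UndoPtr write_undo_record(ZSTid tid) { return 0xABC0 + tid; }
static void column_insert(int col, ZSTid tid, Datum d, UndoPtr undo)
{
    printf("col %d: tid=%llu datum=%lld undo=%llx\n",
           col, (unsigned long long) tid, (long long) d,
           (unsigned long long) undo);
}

static ZSTid insert_row(const Datum *values, int ncolumns)
{
    ZSTid   tid  = next_tid++;                /* picked for the first column */
    UndoPtr undo = write_undo_record(tid);    /* one undo record for the row */

    for (int col = 0; col < ncolumns; col++)
        column_insert(col, tid, values[col], undo);
    return tid;
}

int main(void)
{
    Datum row[3] = {1, 20, 300};
    insert_row(row, 3);
    return 0;
}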
Compression:
Items are added to the B-tree in uncompressed form. If the page is full
and a new item can't be added, compression kicks in. The existing
uncompressed items (plain items) on the page are passed to the
compressor for compression. Already-compressed items are added back as
is. The page is rewritten with the compressed data, with the new item
added to it. If the item can't be added to the page even after
compression, the page is split.
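The decision can be summarized roughly as below (a sketch assuming an 8k
block and size estimates supplied by the caller; the real code is more
involved):

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192

/*
 * Sketch only: given the total size of the page's items left uncompressed,
 * their size if the plain items were compressed, and the size of the new
 * item, decide whether the item fits or the page has to be split.
 */
static bool item_fits(size_t uncompressed_total, size_t compressed_total,
                      size_t new_item_size)
{
    if (uncompressed_total + new_item_size <= BLCKSZ)
        return true;                 /* fits without compressing anything */
    if (compressed_total + new_item_size <= BLCKSZ)
        return true;                 /* fits after compressing plain items */
    return false;                    /* even compression doesn't help: split */
}

int main(void)
{
    assert(item_fits(4000, 1500, 100));    /* plenty of room */
    assert(item_fits(9000, 3000, 100));    /* fits only after compression */
    assert(!item_fits(9000, 8150, 100));   /* page split required */
    return 0;
}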
Toast:
When an overly large datum is stored, it is divided into chunks, and
each chunk is stored on a dedicated toast page within the same
physical file. The toast pages of a datum form a list; each page has a
next/prev pointer.
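Roughly, chunking a large datum could look like the following sketch
(the usable chunk size and the way pages are numbered and chained are
assumptions, not taken from the patch):

#include <stddef.h>
#include <stdio.h>

#define BLCKSZ           8192
#define TOAST_CHUNK_SIZE (BLCKSZ - 64)   /* assumed payload per toast page */

/* Sketch only: walk a large datum and emit one toast page per chunk,
 * remembering the previous page so the pages can be chained prev/next. */
static void toast_datum(size_t len)
{
    int prev_page = -1;                  /* -1 = no previous page yet */

    for (size_t off = 0; off < len; off += TOAST_CHUNK_SIZE)
    {
        size_t chunk = (len - off < TOAST_CHUNK_SIZE) ? len - off : TOAST_CHUNK_SIZE;
        int    page  = (int) (off / TOAST_CHUNK_SIZE);   /* stand-in block number */

        printf("toast page %d: bytes [%zu, %zu), prev=%d\n",
               page, off, off + chunk, prev_page);
        prev_page = page;
    }
}

int main(void)
{
    toast_datum(20000);                  /* a ~20 kB datum needs 3 toast pages */
    return 0;
}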
Select:
A property is added to the table AM to convey whether the AM leverages
column projection for scans. While scanning a table whose AM has this
property, the executor parses the plan, leveraging the target list and
quals to find the columns required by the query. This list is passed
down to the AM on beginscan. Zedstore uses this column projection list
to only pull data from the selected columns. A virtual tuple table
slot is used to pass back the datums for the subset of columns.
The current table AM API requires enhancement here to pass the column
projection down to the AM. The patch showcases two different ways to do
this (a sketch of the underlying idea follows the list):
* For sequential scans, a new beginscan_with_column_projection() API
  was added. The executor checks the AM property and, if the AM
  leverages column projection, uses this new API, otherwise the normal
  beginscan() API.
* For index scans, instead of modifying the begin-scan API, a new API
  was added to specifically pass the column projection list after
  calling begin scan to populate the scan descriptor, but before
  fetching the tuples.
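For illustration, a minimal sketch of the underlying idea: the executor
collects the attribute numbers referenced by the target list and the
quals, and that set is what gets handed to the AM when the scan begins.
The representation (a simple boolean mask) and the function name are
assumptions for this sketch, not the actual patch API.

#include <stdbool.h>
#include <stdio.h>

#define MAX_ATTS 32

/* Sketch: mark every attribute referenced by either the target list or
 * the quals; the resulting mask is what would be passed at beginscan time. */
static void build_projection(const int *tlist_atts, int n_tlist,
                             const int *qual_atts, int n_qual,
                             bool needed[MAX_ATTS] /* 1-based attnos */)
{
    for (int i = 0; i < MAX_ATTS; i++)
        needed[i] = false;
    for (int i = 0; i < n_tlist; i++)
        needed[tlist_atts[i]] = true;
    for (int i = 0; i < n_qual; i++)
        needed[qual_atts[i]] = true;
}

int main(void)
{
    /* e.g. SELECT a FROM t WHERE c > 0: target list uses attno 1, qual attno 3 */
    int  tlist[] = {1};
    int  qual[]  = {3};
    bool needed[MAX_ATTS];

    build_projection(tlist, 1, qual, 1, needed);
    for (int att = 1; att < MAX_ATTS; att++)
        if (needed[att])
            printf("column %d must be scanned\n", att);
    return 0;
}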
Index Support:
Building an index also leverages the columnar storage and only scans
the columns required to build the index. Indexes work pretty much the
same as for heap tables: data is inserted into the table and the TID of
the tuple gets stored in the index. On index scans, the B-trees of the
required columns are scanned for the given TID and the datums are
passed back using a virtual tuple.
Page Format:
A zedstore table contains different kinds of pages, all in the same
file: the meta-page, per-attribute B-tree internal and leaf pages,
UNDO log pages, and toast pages. Each page type has its own distinct
data storage format.
Block 0 is always a metapage. It contains the block numbers of the
other data structures stored within the file, like the per-attribute
B-trees, and the UNDO log.
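A hedged sketch of what the metapage contents could look like (the
fields here are illustrative guesses; the actual page layout in the
patch may differ, and as discussed later in the thread the meta-data can
overflow into a chain of meta-pages):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

#define MAX_ATTRS_PER_METAPAGE 128   /* assumed, not a real constant */

/* Illustrative only: block 0 records where everything else lives. */
typedef struct ZSMetaPageSketch
{
    uint32_t    magic;               /* identifies a zedstore relation */
    uint32_t    nattributes;         /* number of per-attribute B-trees */
    BlockNumber undo_log_head;       /* first UNDO log page */
    BlockNumber next_metapage;       /* overflow chain if more attributes
                                      * than fit on this metapage */
    BlockNumber attr_root[MAX_ATTRS_PER_METAPAGE]; /* B-tree root per attribute */
} ZSMetaPageSketch;

int main(void)
{
    printf("sketched metapage payload: %zu bytes\n", sizeof(ZSMetaPageSketch));
    return 0;
}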
Enhancements to design:
=======================
Instead of compressing all the tuples on a page in one batch, we could
store a small "dictionary", e.g. in page header or meta-page, and use
it to compress each tuple separately. That could make random reads and
updates of individual tuples faster.
When adding a column, we just need to create a new B-tree for the
newly added column and link it to the meta-page. No existing content
needs to be rewritten.
When a column is dropped, we can scan the B-tree of that column, and
immediately mark all its pages as free in the FSM. But we don't
actually have to scan the leaf level: all leaf tuples have a downlink
in the parent, so we can scan just the internal pages. Unless the
column is very wide, that's only a small fraction of the data. That
makes the space immediately reusable for new insertions, but it won't
return the space to the Operating System. In order to do that, we'd
still need to defragment, moving pages from the end of the file closer
to the beginning, and truncate the file.
In this design, we only cache compressed pages in the page cache. If
we want to cache uncompressed pages instead, or in addition to that,
we need to invent a whole new kind of a buffer cache that can deal
with the variable-size blocks.
If you do a lot of updates, the file can get fragmented, with lots of
unused space on pages. Losing the correlation between TIDs and
physical order is also bad, because it will make SeqScans slow, as
they're not actually doing sequential I/O anymore. We can write a
defragmenter to fix things up. Half-empty pages can be merged, and
pages can be moved to restore TID/physical correlation. This format
doesn't have the same MVCC problems with moving tuples around that the
Postgres heap does, so it can fairly easily be done on-line.
Min-max values can be stored per block to easily skip scanning blocks
whose column values don't fall in the queried range.
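The block-skipping idea in a nutshell (a sketch only; this is a possible
enhancement, not something zedstore implements yet):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-block summary that would be kept alongside the data. */
typedef struct BlockMinMax
{
    int64_t min;
    int64_t max;
} BlockMinMax;

/* A block can be skipped entirely if the query's range [lo, hi] cannot
 * overlap the block's [min, max]. */
static bool block_can_be_skipped(const BlockMinMax *s, int64_t lo, int64_t hi)
{
    return hi < s->min || lo > s->max;
}

int main(void)
{
    BlockMinMax s = { .min = 100, .max = 200 };
    assert(block_can_be_skipped(&s, 300, 400));    /* entirely above the block */
    assert(!block_can_be_skipped(&s, 150, 160));   /* overlaps: must scan */
    return 0;
}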
Notes about current patch
=========================
Basic (core) functionality is implemented to showcase and play with.
Two compression algorithms are supported: Postgres pg_lzcompress and
LZ4. Compiling the server with --with-lz4 enables LZ4 compression for
zedstore; otherwise pg_lzcompress is the default. LZ4 is definitely
much faster at compressing and decompressing.
Not all of the table AM APIs are implemented. Functionality not
implemented yet will ERROR out as not supported. A zedstore table can
be created using the command:
CREATE TABLE <name> (column listing) USING zedstore;
Bulk load can be performed using COPY. INSERT, SELECT, UPDATE and
DELETE work. B-tree indexes can be created. B-tree and bitmap index
scans work. The test in src/test/regress/sql/zedstore.sql showcases all
the functionality working currently. Updates are currently implemented
as "cold": they always create new items and are not performed in-place.
TIDs currently can't leverage the full 48-bit range; instead they need
to be limited to values that are considered valid ItemPointers. Also,
MaxHeapTuplesPerPage poses restrictions on the values they can
currently have. Refer to [7] for details.
Extremely basic UNDO logging has been implemented, just for the MVCC
perspective. MVCC is missing tuple locks right now. Plus, it doesn't
actually perform any undo yet. No WAL logging exists currently, hence
it's not crash safe either.
Helper functions are provided to report how many pages of each type
are present in a zedstore table, and to find the compression ratio.
Running the test mentioned in the thread "Column lookup in a row
performance" [6], a good example query for zedstore, locally on a
laptop using LZ4 shows:
postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; -- heap
avg
---------------------
500000.500000000000
(1 row)
Time: 4679.026 ms (00:04.679)
postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; -- zedstore
avg
---------------------
500000.500000000000
(1 row)
Time: 379.710 ms
Important note:
---------------
The planner has not been modified yet to leverage the columnar
storage. Hence, plans using the "physical tlist" optimization, or
similar optimizations that are good for a row store, currently miss
out on leveraging the columnar nature. That is why the subquery with
OFFSET 0 is needed above, to disable the optimization and scan only
the required column.
The current proposal and discussion is focused on the AM layer work
first. Hence, it currently and intentionally skips discussing planner
and executor "feature" enhancements such as vectorized execution and
that family of features.
Previous discussions and implementations of column stores, namely
Vertical cluster index [2], in-core columnar storage [3] and [4], and
cstore_fdw [5], were referred to in order to distill the objectives
and come up with a design and implementation that avoids the concerns
raised earlier. Learnings from the Greenplum Database column store were
also leveraged while designing and implementing zedstore.
Credit: the design is mostly the brain child of Heikki, or actually
his epiphany to be exact. I acted as an idea-bouncing board and
contributed enhancements. We are both having a lot of fun writing the
code for this.
References
[1] https://github.com/greenplum-db/postgres/tree/zedstore
[2] https://www.postgresql.org/message-id/flat/CAJrrPGfaC7WC9NK6PTTy6YN-NN%2BhCy8xOLAh2doYhVg5d6HsAA%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/20150611230316.GM133018%40postgresql.org
[4] https://www.postgresql.org/message-id/flat/20150831225328.GM2912%40alvherre.pgsql
[5] https://github.com/citusdata/cstore_fdw
[6] https://www.postgresql.org/message-id/flat/CAOykqKfko-n5YiBJtk-ocVdp%2Bj92Apu5MJBwbGGh4awRY5NCuQ%40mail.gmail.com
[7] https://www.postgresql.org/message-id/d0fc97bd-7ec8-2388-e4a6-0fda86d71a43%40iki.fi
Hi,

On 2019-04-08 17:27:05 -0700, Ashwin Agrawal wrote:
> Heikki and I have been hacking recently for few weeks to implement in-core columnar storage for PostgreSQL. Here's the design and initial implementation of Zedstore, compressed in-core columnar storage (table access method).

That's very cool.

> Motivations / Objectives
>
> * Performance improvement for queries selecting subset of columns (reduced IO).
> * Reduced on-disk footprint compared to heap table. Shorter tuple headers and also leveraging compression of similar type data
> * Be first-class citizen in the Postgres architecture (tables data can just independently live in columnar storage)
> * Fully MVCC compliant
> * All Indexes supported
> * Hybrid row-column store, where some columns are stored together, and others separately. Provide flexibility of granularity on how to divide the columns. Columns accessed together can be stored together.
> * Provide better control over bloat (similar to zheap)
> * Eliminate need for separate toast tables
> * Faster add / drop column or changing data type of column by avoiding full rewrite of the table.

Is storage going through the bufmgr.c or separately?

> In uncompressed form, the page can be arbitrarily large. But after compression, it must fit into a physical 8k block. If on insert or update of a tuple, the page cannot be compressed below 8k anymore, the page is split. Note that because TIDs are logical rather than physical identifiers, we can freely move tuples from one physical page to another during page split. A tuple's TID never changes.

When does compression happen? After each modification of the expanded
"page"? Are repeated expansions prevented somehow, e.g. if I
insert/delete rows into the same page one-by-one?

> A metapage at block 0, has links to the roots of the B-trees. Leaf pages look the same, but instead of storing the whole tuple, stores just a single attribute. To reconstruct a row with given TID, scan descends down the B-trees for all the columns using that TID, and fetches all attributes. Likewise, a sequential scan walks all the B-trees in lockstep.

Does the size of the metapage limit the number of column [groups]? Or is
there some overflow / tree of trees / whatnot happening?

> Insert:
> Inserting a new row, splits the row into datums. Then for first column decide which block to insert the same to, and pick a TID for it, and write undo record for the same. Rest of the columns are inserted using that same TID and point to same undo position.

Is there some buffering? Without that it seems like retail inserts are
going to be pretty slow?

> Property is added to Table AM to convey if column projection is leveraged by AM for scans. While scanning tables with AM leveraging this property, executor parses the plan. Leverages the target list and quals to find the required columns for query. This list is passed down to AM on beginscan. Zedstore uses this column projection list to only pull data from selected columns. Virtual tuple table slot is used to pass back the datums for subset of columns.
>
> Current table am API requires enhancement here to pass down column projection to AM. The patch showcases two different ways for the same.
>
> * For sequential scans added new beginscan_with_column_projection() API. Executor checks AM property and if it leverages column projection uses this new API else normal beginscan() API.
>
> * For index scans instead of modifying the begin scan API, added new API to specifically pass column projection list after calling begin scan to populate the scan descriptor but before fetching the tuples.

FWIW, I don't quite think this is the right approach. I've only a vague
sketch of this in my head, but I think we should want a general API to
pass that down to *any* scan. Even for heap, not deforming leading
columns that are uninteresting, but precede relevant columns, would be
quite a noticeable performance win. I don't think the projection list is
the right approach for that.

> Extremely basic UNDO logging has be implemented just for MVCC perspective. MVCC is missing tuple lock right now. Plus, doesn't actually perform any undo yet. No WAL logging exist currently hence its not crash safe either.

Have you looked at the undo APIs developed for zheap, as discussed on
the list? Seems important that they're suitable for this too.

> Test mentioned in thread "Column lookup in a row performance" [6], good example query for zedstore locally on laptop using lz4 shows
>
> postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; -- heap
>          avg
> ---------------------
>  500000.500000000000
> (1 row)
>
> Time: 4679.026 ms (00:04.679)
>
> postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; -- zedstore
>          avg
> ---------------------
>  500000.500000000000
> (1 row)
>
> Time: 379.710 ms

Well, I'm not sure I'm actually impressed by that. What does the
performance look like if you select i0 instead?

> Important note:
> ---------------
> Planner has not been modified yet to leverage the columnar storage. Hence, plans using "physical tlist" optimization or such good for row store miss out to leverage the columnar nature currently. Hence, can see the need for subquery with OFFSET 0 above to disable the optimization and scan only required column.

I'm more and more thinking that we should just nix the physical tlist
stuff and start afresh.

Congrats again, this is cool stuff.

- Andres
On Mon, Apr 8, 2019 at 6:04 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-08 17:27:05 -0700, Ashwin Agrawal wrote:
> Heikki and I have been hacking recently for few weeks to implement
> in-core columnar storage for PostgreSQL. Here's the design and initial
> implementation of Zedstore, compressed in-core columnar storage (table
> access method).
That's very cool.
> Motivations / Objectives
>
> * Performance improvement for queries selecting subset of columns
> (reduced IO).
> * Reduced on-disk footprint compared to heap table. Shorter tuple
> headers and also leveraging compression of similar type data
> * Be first-class citizen in the Postgres architecture (tables data can
> just independently live in columnar storage)
> * Fully MVCC compliant
> * All Indexes supported
> * Hybrid row-column store, where some columns are stored together, and
> others separately. Provide flexibility of granularity on how to
> divide the columns. Columns accessed together can be stored
> together.
> * Provide better control over bloat (similar to zheap)
> * Eliminate need for separate toast tables
> * Faster add / drop column or changing data type of column by avoiding
> full rewrite of the table.
Is storage going through the bufmgr.c or separately?
Yes, below the access method it's pretty much the same as heap. All reads and writes flow via the buffer cache. The implementation sits nicely in between, modifying just how data is stored in pages in the access method code; above and below the AM everything basically behaves the same as the heap code.
> In uncompressed form, the page can be arbitrarily large. But after
> compression, it must fit into a physical 8k block. If on insert or
> update of a tuple, the page cannot be compressed below 8k anymore, the
> page is split. Note that because TIDs are logical rather than physical
> identifiers, we can freely move tuples from one physical page to
> another during page split. A tuple's TID never changes.
When does compression happen? After each modifcation of the expanded
"page"? Are repeated expansions prevented somehow, e.g. if I
insert/delete rows into the same page one-by-one?
Compression of new data is performed only if the page becomes full; till then uncompressed data is added to the page. If data cannot be added to the page even after compression, then a page split is performed. Already-compressed data is not compressed again on the next insert. A new compressed block is created for the newly added uncompressed items.
> A metapage at block 0, has links to the roots of the B-trees. Leaf
> pages look the same, but instead of storing the whole tuple, stores
> just a single attribute. To reconstruct a row with given TID, scan
> descends down the B-trees for all the columns using that TID, and
> fetches all attributes. Likewise, a sequential scan walks all the
> B-trees in lockstep.
Does the size of the metapage limit the number of column [groups]? Or is
there some overflow / tree of trees / whatnot happening?
In the design it doesn't limit the number of columns, as we can have a chain of meta-pages to store the required meta-data, with page 0 still being the start of the chain.
> Insert:
> Inserting a new row, splits the row into datums. Then for first column
> decide which block to insert the same to, and pick a TID for it, and
> write undo record for the same. Rest of the columns are inserted using
> that same TID and point to same undo position.
Is there some buffering? Without that it seems like retail inserts are
going to be pretty slow?
Yes, regular buffer cache.
> Property is added to Table AM to convey if column projection is
> leveraged by AM for scans. While scanning tables with AM leveraging
> this property, executor parses the plan. Leverages the target list and
> quals to find the required columns for query. This list is passed down
> to AM on beginscan. Zedstore uses this column projection list to only
> pull data from selected columns. Virtual tuple table slot is used to
> pass back the datums for subset of columns.
>
> Current table am API requires enhancement here to pass down column
> projection to AM. The patch showcases two different ways for the same.
>
> * For sequential scans added new beginscan_with_column_projection()
> API. Executor checks AM property and if it leverages column
> projection uses this new API else normal beginscan() API.
>
> * For index scans instead of modifying the begin scan API, added new
> API to specifically pass column projection list after calling begin
> scan to populate the scan descriptor but before fetching the tuples.
FWIW, I don't quite think this is the right approach. I've only a vague
sketch of this in my head, but I think we should want a general API to
pass that down to *any* scan. Even for heap, not deforming leading
columns that a uninteresting, but precede relevant columns, would be
quite a noticable performance win. I don't think the projection list is
the right approach for that.
Sure, would love to hear more on it, and we can enhance this to make it more usable for AMs.
> Extremely basic UNDO logging has be implemented just for MVCC
> perspective. MVCC is missing tuple lock right now. Plus, doesn't
> actually perform any undo yet. No WAL logging exist currently hence
> its not crash safe either.
Have you looked at the undo APIs developed for zheap, as discussed on
the list? Seems important that they're suitable for this too.
Not in detail yet, but yes, the plan is to leverage the same common framework and undo log API as zheap. Will look into the details. With the current zedstore implementation the requirements from the undo are pretty clear.
> Test mentioned in thread "Column lookup in a row performance" [6],
> good example query for zedstore locally on laptop using lz4 shows
>
> postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; --
> heap
> avg
> ---------------------
> 500000.500000000000
> (1 row)
>
> Time: 4679.026 ms (00:04.679)
>
> postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; --
> zedstore
> avg
> ---------------------
> 500000.500000000000
> (1 row)
>
> Time: 379.710 ms
Well, I'm not sure I'm actually impressed by that. What does the
performance look like if you select i0 instead?
Just for a quick test I used 100 instead of 200 columns (with 200 the results would diverge even more); this is what it reports:
postgres=# SELECT AVG(i0) FROM (select i0 from layout offset 0) x; -- heap
avg
------------------------
1.00000000000000000000
(1 row)
Time: 183.865 ms
postgres=# SELECT AVG(i0) FROM (select i0 from zlayout offset 0) x; -- zedstore
avg
------------------------
1.00000000000000000000
(1 row)
Time: 47.624 ms
Hi,

On 09.04.2019 3:27, Ashwin Agrawal wrote:
> Heikki and I have been hacking recently for few weeks to implement in-core columnar storage for PostgreSQL. Here's the design and initial implementation of Zedstore, compressed in-core columnar storage (table access method). Attaching the patch and link to github branch [1] to follow along.

Thank you for publishing this patch. IMHO Postgres is really missing
normal support of a columnar store, and the table access method API is
the best way of integrating it.

I wanted to compare the memory footprint and performance of zedstore
with the standard Postgres heap and my VOPS extension. As test data I
used the TPC-H benchmark (actually only the lineitem table, generated
with the tpch-dbgen utility with scale factor 10, an ~8Gb database). I
attached the script which I used to populate the data (you have to
download, build and run tpch-dbgen yourself; you can also comment out
the code related to VOPS).

Unfortunately I failed to load data in zedstore:

postgres=# insert into zedstore_lineitem_projection (select
l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char"
from lineitem);
psql: ERROR: compression failed. what now?
Time: 237804.775 ms (03:57.805)

Then I tried to check if there is something in the
zedstore_lineitem_projection table:

postgres=# select count(*) from zedstore_lineitem_projection;
psql: WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
psql: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 145710.828 ms (02:25.711)

The backend consumed 16GB of RAM and 16GB of swap and was killed by the
OOM killer (undo log?). A subsequent attempt to run the same command
failed with the following error:

postgres=# select count(*) from zedstore_lineitem_projection;
psql: ERROR: unexpected level encountered when descending tree

So the only thing I can do at this moment is report the size of the
tables on disk:

postgres=# select pg_relation_size('lineitem');
 pg_relation_size
------------------
      10455441408
(1 row)

postgres=# select pg_relation_size('lineitem_projection');
 pg_relation_size
------------------
       3129974784
(1 row)

postgres=# select pg_relation_size('vops_lineitem_projection');
 pg_relation_size
------------------
       1535647744
(1 row)

postgres=# select pg_relation_size('zedstore_lineitem_projection');
 pg_relation_size
------------------
       2303688704
(1 row)

But I do not know how much data was actually loaded into the zedstore
table... Actually the main question is why this table is not empty if
the INSERT statement failed?

Please let me know if I can somehow help you to reproduce and
investigate the problem.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 09.04.2019 17:09, Konstantin Knizhnik wrote:
> Hi,
>
> On 09.04.2019 3:27, Ashwin Agrawal wrote:
>> Heikki and I have been hacking recently for few weeks to implement in-core columnar storage for PostgreSQL. Here's the design and initial implementation of Zedstore, compressed in-core columnar storage (table access method). Attaching the patch and link to github branch [1] to follow along.
>
> Thank you for publishing this patch. IMHO Postgres is really missing normal support of columnar store and table access method API is the best way of integrating it.
>
> I wanted to compare memory footprint and performance of zedstore with standard Postgres heap and my VOPS extension. As test data I used TPC-H benchmark (actually only one lineitem table generated with tpch-dbgen utility with scale factor 10 (~8Gb database). I attached script which I have use to populate data (you have to download, build and run tpch-dbgen yourself, also you can comment code related with VOPS).
>
> Unfortunately I failed to load data in zedstore:
>
> postgres=# insert into zedstore_lineitem_projection (select l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char" from lineitem);
> psql: ERROR: compression failed. what now?
> Time: 237804.775 ms (03:57.805)
>
> Then I try to check if there is something in zedstore_lineitem_projection table:
>
> postgres=# select count(*) from zedstore_lineitem_projection;
> psql: WARNING: terminating connection because of crash of another server process
> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> HINT: In a moment you should be able to reconnect to the database and repeat your command.
> psql: server closed the connection unexpectedly
> This probably means the server terminated abnormally before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> Time: 145710.828 ms (02:25.711)
>
> Backend consumes 16GB of RAM and 16Gb of swap and was killed by OOM killer (undo log?) Subsequent attempt to run the same command is failed with the following error:
>
> postgres=# select count(*) from zedstore_lineitem_projection;
> psql: ERROR: unexpected level encountered when descending tree
>
> So the only thing I can do at this moment is report size of tables on the disk:
>
> postgres=# select pg_relation_size('lineitem');
>  pg_relation_size
> ------------------
>       10455441408
> (1 row)
>
> postgres=# select pg_relation_size('lineitem_projection');
>  pg_relation_size
> ------------------
>        3129974784
> (1 row)
>
> postgres=# select pg_relation_size('vops_lineitem_projection');
>  pg_relation_size
> ------------------
>        1535647744
> (1 row)
>
> postgres=# select pg_relation_size('zedstore_lineitem_projection');
>  pg_relation_size
> ------------------
>        2303688704
> (1 row)
>
> But I do not know how much data was actually loaded in zedstore table... Actually the main question is why this table is not empty if INSERT statement was failed?
>
> Please let me know if I can somehow help you to reproduce and investigate the problem.

Looks like the original problem was caused by the internal postgres
compressor: I had not configured Postgres to use lz4. When I configured
Postgres --with-lz4, data was correctly inserted into the zedstore
table, but it looks like it is not compressed at all:

postgres=# select pg_relation_size('zedstore_lineitem_projection');
 pg_relation_size
------------------
       9363010640

No wonder that zedstore shows the worst results:

lineitem                        6240.261 ms
lineitem_projection             5390.446 ms
zedstore_lineitem_projection   23310.341 ms
vops_lineitem_projection         439.731 ms

Updated version of vstore_bench.sql is attached (sorry, there were some
errors in the previous version of this script).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 09/04/2019 18:00, Konstantin Knizhnik wrote:
> On 09.04.2019 17:09, Konstantin Knizhnik wrote:
>> standard Postgres heap and my VOPS extension. As test data I used TPC-H benchmark (actually only one lineitem table generated with tpch-dbgen utility with scale factor 10 (~8Gb database). I attached script which I have use to populate data (you have to download, build and run tpch-dbgen yourself, also you can comment code related with VOPS).

Cool, thanks!

>> Unfortunately I failed to load data in zedstore:
>>
>> postgres=# insert into zedstore_lineitem_projection (select l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char" from lineitem);
>> psql: ERROR: compression failed. what now?
>> Time: 237804.775 ms (03:57.805)

Yeah, it's still early days, it will crash and burn in a lot of cases.
We wanted to publish this early, to gather ideas and comments on the
high level design, and to validate that the table AM API that's in v12
is usable.

> Looks like the original problem was caused by internal postgres compressor: I have not configured Postgres to use lz4. When I configured Postgres --with-lz4, data was correctly inserted in zedstore table, but looks it is not compressed at all:
>
> postgres=# select pg_relation_size('zedstore_lineitem_projection');
>  pg_relation_size
> ------------------
>        9363010640

The single-insert codepath isn't very optimized yet. If you populate
the table with large "INSERT ... SELECT ...", you end up with a huge
undo log. Try loading it with COPY.

You can also see how many pages of each type there is with:

select count(*), pg_zs_page_type('zedstore_lineitem_projection', g)
from generate_series(0, pg_table_size('zedstore_lineitem_projection') / 8192 - 1) g group by 2;

- Heikki
On 09.04.2019 18:08, Heikki Linnakangas wrote:
> On 09/04/2019 18:00, Konstantin Knizhnik wrote:
>> On 09.04.2019 17:09, Konstantin Knizhnik wrote:
>>> standard Postgres heap and my VOPS extension. As test data I used TPC-H benchmark (actually only one lineitem table generated with tpch-dbgen utility with scale factor 10 (~8Gb database). I attached script which I have use to populate data (you have to download, build and run tpch-dbgen yourself, also you can comment code related with VOPS).
>
> Cool, thanks!
>
>>> Unfortunately I failed to load data in zedstore:
>>>
>>> postgres=# insert into zedstore_lineitem_projection (select l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char" from lineitem);
>>> psql: ERROR: compression failed. what now?
>>> Time: 237804.775 ms (03:57.805)
>
> Yeah, it's still early days, it will crash and burn in a lot of cases. We wanted to publish this early, to gather ideas and comments on the high level design, and to validate that the table AM API that's in v12 is usable.
>
>> Looks like the original problem was caused by internal postgres compressor: I have not configured Postgres to use lz4. When I configured Postgres --with-lz4, data was correctly inserted in zedstore table, but looks it is not compressed at all:
>>
>> postgres=# select pg_relation_size('zedstore_lineitem_projection');
>>  pg_relation_size
>> ------------------
>>        9363010640
>
> The single-insert codepath isn't very optimized yet. If you populate the table with large "INSERT ... SELECT ...", you end up with a huge undo log. Try loading it with COPY.
>
> You can also see how many pages of each type there is with:
>
> select count(*), pg_zs_page_type('zedstore_lineitem_projection', g) from generate_series(0, pg_table_size('zedstore_lineitem_projection') / 8192 - 1) g group by 2;
>
> - Heikki

postgres=# copy zedstore_lineitem from '/mnt/data/lineitem.tbl' delimiter '|' csv;
COPY 59986052
Time: 232802.257 ms (03:52.802)

postgres=# select pg_relation_size('zedstore_lineitem');
 pg_relation_size
------------------
      10346504192
(1 row)

postgres=# select count(*), pg_zs_page_type('zedstore_lineitem', g)
from generate_series(0, pg_table_size('zedstore_lineitem') / 8192 - 1) g group by 2;
  count  | pg_zs_page_type
---------+-----------------
       1 | META
 1262308 | BTREE
     692 | UNDO
(3 rows)

And now performance is much worse:

Time: 99819.476 ms (01:39.819)

It is strange, because the main advantage of a columnar store is that
it has to fetch only the accessed columns. What I see is that in
non-parallel mode (max_parallel_workers_per_gather = 0) the backend
consumes about 11GB of memory. That fits in my desktop RAM (16GB) and
the speed is ~58 seconds. But once I start 4 parallel workers, they
cause huge swapping:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28195 knizhnik  20   0 11.823g 6.553g 5.072g D   7.6 42.2   0:17.19 postgres
28074 knizhnik  20   0 11.848g 6.726g 5.223g D   7.3 43.3   4:14.96 postgres
28192 knizhnik  20   0 11.854g 6.586g 5.075g D   7.3 42.4   0:17.18 postgres
28193 knizhnik  20   0 11.870g 6.594g 5.064g D   7.3 42.4   0:17.19 postgres
28194 knizhnik  20   0 11.854g 6.589g 5.078g D   7.3 42.4   0:17.09 postgres

which is also strange, because the data should be present in shared buffers.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2019-Apr-09, Konstantin Knizhnik wrote:

> On 09.04.2019 3:27, Ashwin Agrawal wrote:
> > Heikki and I have been hacking recently for few weeks to implement in-core columnar storage for PostgreSQL. Here's the design and initial implementation of Zedstore, compressed in-core columnar storage (table access method). Attaching the patch and link to github branch [1] to follow along.
>
> Thank you for publishing this patch. IMHO Postgres is really missing normal support of columnar store

Yep.

> and table access method API is the best way of integrating it.

This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.

We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09.04.2019 18:51, Alvaro Herrera wrote:
> On 2019-Apr-09, Konstantin Knizhnik wrote:
>> On 09.04.2019 3:27, Ashwin Agrawal wrote:
>>> Heikki and I have been hacking recently for few weeks to implement in-core columnar storage for PostgreSQL. Here's the design and initial implementation of Zedstore, compressed in-core columnar storage (table access method). Attaching the patch and link to github branch [1] to follow along.
>> Thank you for publishing this patch. IMHO Postgres is really missing normal support of columnar store
> Yep.
>
>> and table access method API is the best way of integrating it.
> This is not surprising, considering that columnar store is precisely the reason for starting the work on table AMs.
>
> We should certainly look into integrating some sort of columnar storage in mainline. Not sure which of zedstore or VOPS is the best candidate, or maybe we'll have some other proposal. My feeling is that having more than one is not useful; if there are optimizations to one that can be borrowed from the other, let's do that instead of duplicating effort.

There are two different aspects:
1. Store format.
2. Vector execution.

1. VOPS is using a mixed format, something similar to Apache Parquet.
Tuples are stored vertically, but only inside one page. It tries to
minimize the trade-offs between a true horizontal and a true vertical
storage: the first is most optimal for selecting all rows, while the
second is best for selecting a small subset of rows. To make this
approach more efficient, it is better to use a large page size; the
default Postgres 8k page is not enough.

From my point of view such a format is better than a pure vertical
storage, which will be very inefficient if a query accesses a larger
number of columns. This problem can be somewhat addressed by creating
projections: grouping several columns together. But that requires more
space for storing multiple projections.

2. No matter which format we choose, to take full advantage of the
vertical representation we need to use vector operations. And the
Postgres executor doesn't support them now. This is why VOPS is using
some hacks, which is definitely not good and does not work in all
cases. Zedstore is not using such hacks and ... this is why it can
never reach VOPS performance. The right solution is to add vector
operation support to the Postgres planner and executor. But that is
much harder than developing the columnar store itself.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 09/04/2019 18:00, Konstantin Knizhnik wrote:
> Looks like the original problem was caused by internal postgres
> compressor: I have not configured Postgres to use lz4.
> When I configured Postgres --with-lz4, data was correctly inserted in
> zedstore table, but looks it is not compressed at all:
>
> postgres=# select pg_relation_size('zedstore_lineitem_projection');
>  pg_relation_size
> ------------------
>        9363010640
>
> No wonder that zedstore shows the worst results:
>
> lineitem                        6240.261 ms
> lineitem_projection             5390.446 ms
> zedstore_lineitem_projection   23310.341 ms
> vops_lineitem_projection         439.731 ms
>
> Updated version of vstore_bench.sql is attached (sorry, there was some
> errors in previous version of this script).

I tried this quickly, too. With default work_mem and no parallelism, and 1 gb table size, it seems that the query chooses a different plan with heap and zedstore, with a sort+group for zedstore and hash agg for heap. There's no ANALYZE support in zedstore yet, and we haven't given much thought to parallelism either. With work_mem='1GB' and no parallelism, both queries use a hash agg, and the numbers are much closer than what you saw, about 6 s for heap, and 9 s for zedstore.

- Heikki
On 09.04.2019 19:19, Heikki Linnakangas wrote:
> I tried this quickly, too. With default work_mem and no parallelism,
> and 1 gb table size, it seems that the query chooses a different plan
> with heap and zedstore, with a sort+group for zedstore and hash agg
> for heap. There's no ANALYZE support in zedstore yet, and we haven't
> given much thought to parallelism either. With work_mem='1GB' and no
> parallelism, both queries use a hash agg, and the numbers are much
> closer than what you saw, about 6 s for heap, and 9 s for zedstore.
>
> - Heikki

Yes, you were right. The plan for zedstore uses GroupAggregate instead of HashAggregate. Increasing work_mem forces the optimizer to use HashAggregate in all cases.
But it doesn't prevent memory overflow in my case. And that is very strange to me, because there are just 4 groups in the result, so it should not consume any memory.
Yet another strange thing is that the size of the zedstore table is 10Gb according to pg_relation_size. The Q1 query accesses only a subset of the "lineitem" columns, not touching the largest ones (with text). I have configured 12Gb of shared buffers, and all of this 11Gb is used! Looks like all columns are fetched from disk.
And besides this 11Gb of shared data, the backend (and each parallel worker) is also consuming several gigabytes of heap memory. As a result, the total memory used during parallel query execution with 4 workers exceeds 20GB and causes terrible swapping on my system.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> This is not surprising, considering that columnar store is precisely the
> reason for starting the work on table AMs.
>
> We should certainly look into integrating some sort of columnar storage
> in mainline. Not sure which of zedstore or VOPS is the best candidate,
> or maybe we'll have some other proposal. My feeling is that having more
> than one is not useful; if there are optimizations to one that can be
> borrowed from the other, let's do that instead of duplicating effort.

I think that conclusion may be premature. There seem to be a bunch of different ways of doing columnar storage, so I don't know how we can be sure that one size will fit all, or that the first thing we accept will be the best thing.

Of course, we probably do not want to accept a ton of storage manager implementations is core. I think if people propose implementations that are poor quality, or missing important features, or don't have significantly different use cases from the ones we've already got, it's reasonable to reject those. But I wouldn't be prepared to say that if we have two significantly different column store that are both awesome code with a complete feature set and significantly disjoint use cases, we should reject the second one just because it is also a column store. I think that won't get out of control because few people will be able to produce really high-quality implementations. This stuff is hard, which I think is also why we only have 6.5 index AMs in core after many, many years. And our standards have gone up over the years - not all of those would pass muster if they were proposed today.

BTW, can I express a small measure of disappointment that the name for the thing under discussion on this thread chose to be called "zedstore"? That seems to invite confusion with "zheap", especially in parts of the world where the last letter of the alphabet is pronounced "zed," where people are going to say zed-heap and zed-store. Brr.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Apr 9, 2019 at 11:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> This is not surprising, considering that columnar store is precisely the
> reason for starting the work on table AMs.
>
> We should certainly look into integrating some sort of columnar storage
> in mainline. Not sure which of zedstore or VOPS is the best candidate,
> or maybe we'll have some other proposal. My feeling is that having more
> than one is not useful; if there are optimizations to one that can be
> borrowed from the other, let's do that instead of duplicating effort.
I think that conclusion may be premature. There seem to be a bunch of
different ways of doing columnar storage, so I don't know how we can
be sure that one size will fit all, or that the first thing we accept
will be the best thing.
Of course, we probably do not want to accept a ton of storage manager
implementations is core. I think if people propose implementations
that are poor quality, or missing important features, or don't have
significantly different use cases from the ones we've already got,
it's reasonable to reject those. But I wouldn't be prepared to say
that if we have two significantly different column store that are both
awesome code with a complete feature set and significantly disjoint
use cases, we should reject the second one just because it is also a
column store. I think that won't get out of control because few
people will be able to produce really high-quality implementations.
This stuff is hard, which I think is also why we only have 6.5 index
AMs in core after many, many years. And our standards have gone up
over the years - not all of those would pass muster if they were
proposed today.
+1
BTW, can I express a small measure of disappointment that the name for
the thing under discussion on this thread chose to be called
"zedstore"? That seems to invite confusion with "zheap", especially
in parts of the world where the last letter of the alphabet is
pronounced "zed," where people are going to say zed-heap and
zed-store. Brr.
Surprised it's felt this thread initiates the invitation to confusion. Based on internal and meetup discussions over quite some time now, the confusion already exists for the zheap pronunciation, for exactly the reason mentioned: the last letter of the alphabet is not pronounced the same everywhere. Hence we explicitly called it zedstore, to learn from that and make the pronunciation worldwide universal for the new thing at least.
On Tue, Apr 9, 2019 at 9:13 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
On 09.04.2019 18:51, Alvaro Herrera wrote:
> On 2019-Apr-09, Konstantin Knizhnik wrote:
>
>> On 09.04.2019 3:27, Ashwin Agrawal wrote:
>>> Heikki and I have been hacking recently for few weeks to implement
>>> in-core columnar storage for PostgreSQL. Here's the design and initial
>>> implementation of Zedstore, compressed in-core columnar storage (table
>>> access method). Attaching the patch and link to github branch [1] to
>>> follow along.
>> Thank you for publishing this patch. IMHO Postgres is really missing normal
>> support of columnar store
> Yep.
>
>> and table access method API is the best way of integrating it.
> This is not surprising, considering that columnar store is precisely the
> reason for starting the work on table AMs.
>
> We should certainly look into integrating some sort of columnar storage
> in mainline. Not sure which of zedstore or VOPS is the best candidate,
> or maybe we'll have some other proposal. My feeling is that having more
> than one is not useful; if there are optimizations to one that can be
> borrowed from the other, let's do that instead of duplicating effort.
>
There are two different aspects:
1. Store format.
2. Vector execution.
1. VOPS is using mixed format, something similar with Apache parquet.
Tuples are stored vertically, but only inside one page.
It tries to minimize trade-offs between true horizontal and true
vertical storage:
first is most optimal for selecting all rows, while second - for
selecting small subset of rows.
To make this approach more efficient, it is better to use large page
size - default Postgres 8k pages is not enough.
From my point of view such format is better than pure vertical storage
which will be very inefficient if query access larger number of columns.
This problem can be somehow addressed by creating projections: grouping
several columns together. But it requires more space for storing
multiple projections.
Right, storing all the columns in a single page doesn't give any savings on IO.
2. Doesn't matter which format we choose, to take all advantages of
vertical representation we need to use vector operations.
And Postgres executor doesn't support them now. This is why VOPS is
using some hacks, which is definitely not good and not working in all cases.
zedstore is not using such hacks and ... this is why it never can reach
VOPS performance.
Vectorized execution is orthogonal to the storage format. It can even be applied to a row store, with performance gains. Similarly, a column store without vectorized execution also gives performance gains, better compression ratios, and similar benefits. A column store combined with vectorized execution is a lot more performant, agreed. Zedstore is currently focused on getting the AM piece in place, so that it fits the postgres ecosystem and supports all the features heap does.
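To make the distinction concrete, here is a tiny standalone illustration (not from any patch) of what a vectorized aggregate step looks like once the executor is handed a whole batch of one column's values instead of one row at a time. A column store can produce such batches cheaply, but the executor change itself is a separate piece of work:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Toy vectorized aggregate: the transition function is called once per
 * batch of column values, not once per row, so the inner loop is a tight
 * array scan that the compiler can unroll or SIMD-ize.
 */
static int64_t
sum_int32_batch(const int32_t *values, const uint8_t *isnull, size_t n)
{
    int64_t sum = 0;

    for (size_t i = 0; i < n; i++)
        if (!isnull[i])
            sum += values[i];
    return sum;
}

int
main(void)
{
    int32_t values[] = {1, 2, 3, 4};
    uint8_t isnull[] = {0, 0, 1, 0};

    printf("%lld\n", (long long) sum_int32_batch(values, isnull, 4)); /* prints 7 */
    return 0;
}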
On 09/04/2019 23:24, Ashwin Agrawal wrote:
> Hence we explicitly called it zedstore, to learn from that and make the
> pronunciation worldwide universal for the new thing at least.

Yeah, you can blame me for the name. It's a pun on zheap. I'm hoping we come up with a better name before this matures; I'm thinking it could be just "column store" or something like that in the end, but it's good to have a more unique name during development.

- Heikki
C-Tree?
Peter Geoghegan
(Sent from my phone)
On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
> Row store
> ---------
>
> The tuples are stored one after another, sorted by TID. For each
> tuple, we store its 48-bit TID, a undo record pointer, and the actual
> tuple data uncompressed.

Storing the undo record pointer with each tuple can take quite a lot of space in cases where you can't compress them.

Have you thought about how you will implement the multi-locker scheme in this design? In zheap, we have used undo for the same, and it is easy to imagine when you have separate transaction slots for each transaction. I am not sure how you will implement the same here.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 10/04/2019 09:29, Amit Kapila wrote:
> Storing the undo record pointer with each tuple can take quite a lot of
> space in cases where you can't compress them.

Yeah. This does depend on compression to eliminate the unused fields quite heavily at the moment. But you could have a flag in the header to indicate "no undo pointer needed", and just leave it out when it's not needed.

> Have you thought about how you will implement the multi-locker scheme in
> this design? In zheap, we have used undo for the same, and it is easy
> to imagine when you have separate transaction slots for each
> transaction. I am not sure how you will implement the same here.

I've been thinking that the undo record would store all the XIDs involved. So if there are multiple lockers, the UNDO record would store a list of XIDs. Alternatively, I suppose you could store multiple UNDO pointers for the same tuple.

- Heikki
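For illustration only, one way the "list of XIDs in the undo record" idea could be laid out; the type and field names below are made up and not taken from the zedstore or zheap patches:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/*
 * Hypothetical variable-length undo record for lockers: a fixed header
 * followed by the XIDs of every transaction currently holding a lock.
 */
typedef struct ZSUndoRecLockers
{
    uint64_t    undorecptr;      /* this record's location in the undo log */
    uint64_t    prev_undorecptr; /* earlier undo for the same tuple, if any */
    uint16_t    nlockers;        /* number of XIDs in the array below */
    TransactionId lockers[];     /* flexible array: all locking XIDs */
} ZSUndoRecLockers;

/* Append one more locker, producing a new record (old record untouched). */
static ZSUndoRecLockers *
undo_add_locker(const ZSUndoRecLockers *old, TransactionId newxid)
{
    uint16_t    n = old ? old->nlockers : 0;
    ZSUndoRecLockers *rec;

    rec = malloc(sizeof(ZSUndoRecLockers) + (n + 1) * sizeof(TransactionId));
    if (old)
        memcpy(rec, old, sizeof(ZSUndoRecLockers) + n * sizeof(TransactionId));
    else
        memset(rec, 0, sizeof(ZSUndoRecLockers));
    rec->lockers[n] = newxid;
    rec->nlockers = n + 1;
    return rec;
}

As the sketch makes visible, every new locker has to read and copy the previous record, which is the cost pointed out further down in the thread.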
On 10.04.2019 10:25, Heikki Linnakangas wrote:
> I've been thinking that the undo record would store all the XIDs
> involved. So if there are multiple lockers, the UNDO record would
> store a list of XIDs. Alternatively, I suppose you could store
> multiple UNDO pointers for the same tuple.

I am also a little bit confused about UNDO records and MVCC support in Zedstore. A columnar store is mostly needed for analytics, i.e. for read-only or append-only data. One of the disadvantages of Postgres is the rather large per-record space overhead caused by MVCC.
It may be critical if you want to store huge timeseries with a relatively small number of columns (like measurements of some sensor).
It would be nice to have a storage format which reduces this overhead when it is not needed (when data is not updated).
Right now, even without UNDO pages, the size of a zedstore table is larger than the size of the original Postgres table. That seems very strange.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 10/04/2019 10:38, Konstantin Knizhnik wrote:
> I am also a little bit confused about UNDO records and MVCC support in
> Zedstore. A columnar store is mostly needed for analytics, i.e. for
> read-only or append-only data. One of the disadvantages of Postgres is
> the rather large per-record space overhead caused by MVCC.
> It may be critical if you want to store huge timeseries with a
> relatively small number of columns (like measurements of some sensor).
> It would be nice to have a storage format which reduces this overhead
> when it is not needed (when data is not updated).

Sure. Definitely something we need to optimize.

> Right now, even without UNDO pages, the size of a zedstore table is
> larger than the size of the original Postgres table. That seems very
> strange.

If you have a table with a lot of columns, but each column is small, e.g. lots of boolean columns, the item headers that zedstore currently stores for each datum take up a lot of space. We'll need to squeeze those harder to make this competitive. Instead of storing a header for each datum, if a group of consecutive tuples have the same visibility information, we could store the header just once, with an array of the datums, for example.

- Heikki
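A rough sketch of that "one header for a run of datums" idea, with invented names and field widths (nothing here is from the actual patch):

#include <stdint.h>

typedef uint64_t zstid;         /* 48-bit logical TID carried in 64 bits */

/*
 * Hypothetical "array item": one header shared by a run of consecutive
 * datums whose visibility information is identical, followed by the
 * datums packed back to back.  The per-datum header cost is amortized
 * over the whole run, which matters for tables with many narrow columns.
 */
typedef struct ZSArrayItem
{
    zstid       firsttid;       /* TID of the first datum in the run */
    uint16_t    nelements;      /* number of consecutive datums covered */
    uint16_t    flags;
    uint64_t    undoptr;        /* one undo pointer for the whole run */
    uint32_t    datasz;         /* size of the packed datum area, in bytes */
    char        data[];         /* nelements datums, packed */
} ZSArrayItem;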
On Wed, Apr 10, 2019 at 12:55 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I've been thinking that the undo record would store all the XIDs
> involved. So if there are multiple lockers, the UNDO record would store
> a list of XIDs.

This will be quite tricky. Whenever a new locker arrives, you first need to fetch the previous undo to see which XIDs already have a lock on the tuple. Not only that, it will make discarding undo way more complicated. We considered this approach before implementing the current approach in zheap.

> Alternatively, I suppose you could store multiple UNDO
> pointers for the same tuple.

This will not only make the length of the tuple unnecessarily long but would also make it much harder to reclaim that space once the corresponding undo is discarded.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 9/04/19 12:27 PM, Ashwin Agrawal wrote:
> Heikki and I have been hacking recently for few weeks to implement
> in-core columnar storage for PostgreSQL. Here's the design and initial
> implementation of Zedstore, compressed in-core columnar storage (table
> access method). Attaching the patch and link to github branch [1] to
> follow along.

Very nice. I realize that it is very early days, but applying this patch I've managed to stumble over some compression bugs doing some COPY's:

benchz=# COPY dim1 FROM '/data0/dump/dim1.dat'
USING DELIMITERS ',';
psql: ERROR: compression failed. what now?
CONTEXT: COPY dim1, line 458

The log has:

2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed. what now?
2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458
2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, zedstore_compression.c:287
2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM '/data0/dump/dim1.dat'
USING DELIMITERS ',';

The dataset is generated from an old DW benchmark I wrote (https://sourceforge.net/projects/benchw/). The rows concerned look like:

457,457th interesting measure,1th measure type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp
458,458th interesting measure,2th measure type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg

I'll see if changing to LZ4 makes any difference.

best wishes
Mark
On 11/04/19 4:01 PM, Mark Kirkwood wrote:
> I'll see if changing to LZ4 makes any difference.

The COPY works with LZ4 configured.
> On Apr 10, 2019, at 9:08 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>
> The COPY works with LZ4 configured.

Thank you for trying it out. Yes, noticed for certain patterns pg_lzcompress() actually requires much larger output buffers. Like for one 86 len source it required 2296 len output buffer. Current zedstore code doesn't handle this case and errors out. LZ4 for same patterns works fine, would highly recommend using LZ4 only, as anyways speed is very fast as well with it.
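For what it's worth, one common way to make the "compression failed, what now?" case impossible is to size the output buffer for the worst case and fall back to storing the bytes uncompressed when compression does not shrink them. A standalone sketch using the liblz4 API (this is not the patch's code):

#include <stdlib.h>
#include <string.h>
#include <lz4.h>

/*
 * Compress src into a freshly allocated buffer sized for the worst case.
 * If compression fails or does not actually save space, keep the raw
 * bytes and mark the result as uncompressed.
 */
static char *
compress_or_store_raw(const char *src, int srclen,
                      int *outlen, int *is_compressed)
{
    int     bound = LZ4_compressBound(srclen);  /* worst-case output size */
    char   *buf = malloc(bound);
    int     clen = LZ4_compress_default(src, buf, srclen, bound);

    if (clen > 0 && clen < srclen)
    {
        *outlen = clen;
        *is_compressed = 1;
        return buf;
    }

    /* Incompressible (or grew): store verbatim, flagged as uncompressed. */
    memcpy(buf, src, srclen);
    *outlen = srclen;
    *is_compressed = 0;
    return buf;
}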
On 11.04.2019 8:03, Ashwin Agrawal wrote:
> Thank you for trying it out. Yes, noticed for certain patterns pg_lzcompress() actually requires much larger output buffers. Like for one 86 len source it required 2296 len output buffer. Current zedstore code doesn't handle this case and errors out. LZ4 for same patterns works fine, would highly recommend using LZ4 only, as anyways speed is very fast as well with it.
The internal Postgres lz compressor is really very inefficient compared with other compression algorithms.
But in any case you should never assume that the size of the compressed data will be smaller than the size of the plain data.
Moreover, if you are trying to compress already compressed data, the result will almost always be larger.
If the size of the compressed data is larger than (or even not significantly smaller than) the size of the raw data, then you should store the original data.
lz4 is actually very fast. But it doesn't provide a good compression ratio.
These are my results of compressing pbench data using different compressors:
Configuration | Size (Gb) | Time (sec) |
no compression | 15.31 | 92 |
zlib (default level) | 2.37 | 284 |
zlib (best speed) | 2.43 | 191 |
postgres internal lz | 3.89 | 214 |
lz4 | 4.12 | 95 |
snappy | 5.18 | 99 |
lzfse | 2.80 | 1099 |
(apple) 2.80 1099 | 1.69 | 125 |
You see that zstd provides an almost 2 times better compression ratio at almost the same speed.
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Tue, 9 Apr 2019 at 02:27, Ashwin Agrawal <aagrawal@pivotal.io> wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to follow along.
The objective is to gather feedback on design and approach to the
same. The implementation has core basic pieces working but not close
to complete.
Big thank you to Andres, Haribabu and team for the table access method
API's. Leveraged the API's for implementing zedstore, and proves API
to be in very good shape. Had to enhance the same minimally but
in-general didn't had to touch executor much.
Motivations / Objectives
* Performance improvement for queries selecting subset of columns
(reduced IO).
* Reduced on-disk footprint compared to heap table. Shorter tuple
headers and also leveraging compression of similar type data
* Be first-class citizen in the Postgres architecture (tables data can
just independently live in columnar storage)
* Fully MVCC compliant
* All Indexes supported
* Hybrid row-column store, where some columns are stored together, and
others separately. Provide flexibility of granularity on how to
divide the columns. Columns accessed together can be stored
together.
* Provide better control over bloat (similar to zheap)
* Eliminate need for separate toast tables
* Faster add / drop column or changing data type of column by avoiding
full rewrite of the table.
High-level Design - B-trees for the win!
========================================
To start simple, let's ignore column store aspect for a moment and
consider it as compressed row store. The column store is natural
extension of this concept, explained in next section.
The basic on-disk data structure leveraged is a B-tree, indexed by
TID. BTree being a great data structure, fast and versatile. Note this
is not referring to existing Btree indexes, but instead net new
separate BTree for table data storage.
TID - logical row identifier:
TID is just a 48-bit row identifier. The traditional division into
block and offset numbers is meaningless. In order to find a tuple with
a given TID, one must always descend the B-tree. Having logical TID
provides flexibility to move the tuples around different pages on page
splits or page merges can be performed.
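Purely for illustration, a minimal standalone sketch of a logical 48-bit TID and one possible packing into the familiar block/offset shape. The encoding below is hypothetical and not the patch's actual one, which additionally has to stay within the valid ItemPointer and MaxHeapTuplesPerPage ranges mentioned later in this post:

#include <assert.h>
#include <stdint.h>

typedef uint64_t zstid;                 /* logical 48-bit row identifier */

#define ZSTID_BITS  48
#define ZSTID_MAX   ((((uint64_t) 1) << ZSTID_BITS) - 1)

/* One straightforward split into a 32-bit "block" and a 16-bit "offset". */
static inline void
zstid_to_block_offset(zstid tid, uint32_t *block, uint16_t *offset)
{
    assert(tid <= ZSTID_MAX);
    *block = (uint32_t) (tid >> 16);
    *offset = (uint16_t) (tid & 0xFFFF);
}

static inline zstid
zstid_from_block_offset(uint32_t block, uint16_t offset)
{
    return ((uint64_t) block << 16) | offset;
}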
The internal pages of the B-tree are super simple and boring. Each
internal page just stores an array of TID and downlink pairs. Let's
focus on the leaf level. Leaf blocks have short uncompressed header,
followed by btree items. Two kinds of items exist:
- plain item, holds one tuple or one datum, uncompressed payload
- a "container item", holds multiple plain items, compressed payload
+-----------------------------
| Fixed-size page header:
|
| LSN
| TID low and hi key (for Lehman & Yao B-tree operations)
| left and right page pointers
|
| Items:
|
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | undo pointer | payload (plain item)
| TID | size | flags | undo pointer | payload (plain item)
| ...
|
+----------------------------
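To spell out the diagram, the two leaf item kinds might be described by headers along these lines; the field widths and names are illustrative only, not the patch's actual structs:

#include <stdint.h>

typedef uint64_t zstid;          /* 48-bit logical TID, carried in 64 bits */

/* Plain item: one uncompressed datum/tuple plus its undo pointer. */
typedef struct ZSPlainItemHdr
{
    zstid       tid;
    uint16_t    size;            /* total item size, including this header */
    uint16_t    flags;
    uint64_t    undoptr;         /* pointer into the UNDO log */
    /* uncompressed payload follows */
} ZSPlainItemHdr;

/* Container item: many plain items compressed into a single payload. */
typedef struct ZSContainerItemHdr
{
    zstid       tid;             /* first TID covered by the container */
    uint16_t    size;            /* on-page (compressed) size */
    uint16_t    flags;
    uint32_t    uncompressed_size;
    zstid       lasttid;         /* last TID covered, so a scan can skip it */
    /* compressed payload follows */
} ZSContainerItemHdr;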
Row store
---------
The tuples are stored one after another, sorted by TID. For each
tuple, we store its 48-bit TID, a undo record pointer, and the actual
tuple data uncompressed.
In uncompressed form, the page can be arbitrarily large. But after
compression, it must fit into a physical 8k block. If on insert or
update of a tuple, the page cannot be compressed below 8k anymore, the
page is split. Note that because TIDs are logical rather than physical
identifiers, we can freely move tuples from one physical page to
another during page split. A tuple's TID never changes.
The buffer cache caches compressed blocks. Likewise, WAL-logging,
full-page images etc. work on compressed blocks. Uncompression is done
on-the-fly, as and when needed in backend-private memory, when
reading. For some compressions like rel encoding or delta encoding
tuples can be constructed directly from compressed data.
Column store
------------
A column store uses the same structure but we have *multiple* B-trees,
one for each column, all indexed by TID. The B-trees for all columns
are stored in the same physical file.
A metapage at block 0, has links to the roots of the B-trees. Leaf
pages look the same, but instead of storing the whole tuple, stores
just a single attribute. To reconstruct a row with given TID, scan
descends down the B-trees for all the columns using that TID, and
fetches all attributes. Likewise, a sequential scan walks all the
B-trees in lockstep.
So, in summary can imagine Zedstore as forest of B-trees, one for each
column, all indexed by TIDs.
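As a toy model of that "forest of B-trees" idea, the following standalone snippet reconstructs a row by looking up the same TID in every per-column structure (here plain sorted arrays stand in for the per-attribute B-trees; none of these names come from the patch):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t zstid;

/* Toy stand-in for one column's B-tree: a sorted array of (tid, value). */
typedef struct { zstid tid; int value; } ColEntry;
typedef struct { const ColEntry *entries; size_t n; } Column;

/* "Descend" one column with a TID (binary search in this toy model). */
static bool
col_fetch(const Column *col, zstid tid, int *value)
{
    size_t  lo = 0, hi = col->n;

    while (lo < hi)
    {
        size_t  mid = (lo + hi) / 2;

        if (col->entries[mid].tid < tid)
            lo = mid + 1;
        else if (col->entries[mid].tid > tid)
            hi = mid;
        else
        {
            *value = col->entries[mid].value;
            return true;
        }
    }
    return false;
}

/* Reconstruct a row: fetch the same TID from every projected column. */
static bool
fetch_row(const Column *cols, int ncols, zstid tid, int *values)
{
    for (int i = 0; i < ncols; i++)
        if (!col_fetch(&cols[i], tid, &values[i]))
            return false;
    return true;
}

int
main(void)
{
    const ColEntry a[] = {{1, 10}, {2, 20}, {3, 30}};
    const ColEntry b[] = {{1, 100}, {2, 200}, {3, 300}};
    const Column cols[] = {{a, 3}, {b, 3}};
    int     row[2];

    if (fetch_row(cols, 2, 2, row))
        printf("tid 2 -> (%d, %d)\n", row[0], row[1]);   /* (20, 200) */
    return 0;
}

Column projection falls out naturally: only the trees of the requested attributes need to be descended.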
This way of laying out the data also easily allows for hybrid
row-column store, where some columns are stored together, and others
have a dedicated B-tree. Need to have user facing syntax to allow
specifying how to group the columns.
Main reasons for storing data this way
--------------------------------------
* Layout the data/tuples in mapped fashion instead of keeping the
logical to physical mapping separate from actual data. So, keep the
meta-data and data logically in single stream of file, avoiding the
need for separate forks/files to store meta-data and data.
* Stick to fixed size physical blocks. Variable size blocks pose need
for increased logical to physical mapping maintenance, plus
restrictions on concurrency of writes and reads to files. Hence
adopt compression to fit fixed size blocks instead of other way
round.
MVCC
----
MVCC works very similar to zheap for zedstore. Undo record pointers
are used to implement MVCC. Transaction information if not directly
stored with the data. In zheap, there's a small, fixed, number of
"transaction slots" on each page, but zedstore has undo pointer with
each item directly; in normal cases, the compression squeezes this
down to almost nothing.
Implementation
==============
Insert:
Inserting a new row, splits the row into datums. Then for first column
decide which block to insert the same to, and pick a TID for it, and
write undo record for the same. Rest of the columns are inserted using
that same TID and point to same undo position.
Compression:
Items are added to Btree in uncompressed form. If page is full and new
item can't be added, compression kicks in. Existing uncompressed items
(plain items) of the page are passed to compressor for
compression. Already compressed items are added back as is. Page is
rewritten with compressed data with new item added to it. If even
after compression, can't add item to page, then page split happens.
Toast:
When an overly large datum is stored, it is divided into chunks, and
each chunk is stored on a dedicated toast page within the same
physical file. The toast pages of a datum form list, each page has a
next/prev pointer.
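A possible shape for those toast chunk pages, again with invented names; only the linked-list idea is taken from the description above:

#include <stdint.h>

typedef uint64_t zstid;
typedef uint32_t BlockNumber;

#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Hypothetical header of a zedstore toast page: a large datum is cut into
 * chunks, and the chunk pages of one datum form a linked list inside the
 * same relation file (no separate toast relation needed).
 */
typedef struct ZSToastPageHdr
{
    zstid       tid;            /* row this toasted datum belongs to */
    uint32_t    attno;          /* attribute number of the datum */
    uint32_t    total_size;     /* full datum size across all chunks */
    uint32_t    chunk_size;     /* number of payload bytes on this page */
    BlockNumber prevblk;        /* previous chunk, or InvalidBlockNumber */
    BlockNumber nextblk;        /* next chunk, or InvalidBlockNumber */
    /* chunk bytes follow */
} ZSToastPageHdr;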
Select:
Property is added to Table AM to convey if column projection is
leveraged by AM for scans. While scanning tables with AM leveraging
this property, executor parses the plan. Leverages the target list and
quals to find the required columns for query. This list is passed down
to AM on beginscan. Zedstore uses this column projection list to only
pull data from selected columns. Virtual tuple table slot is used to
pass back the datums for subset of columns.
Current table am API requires enhancement here to pass down column
projection to AM. The patch showcases two different ways for the same.
* For sequential scans added new beginscan_with_column_projection()
API. Executor checks AM property and if it leverages column
projection uses this new API else normal beginscan() API.
* For index scans instead of modifying the begin scan API, added new
API to specifically pass column projection list after calling begin
scan to populate the scan descriptor but before fetching the tuples.
Index Support:
Building index also leverages columnar storage and only scans columns
required to build the index. Indexes work pretty similar to heap
tables. Data is inserted into tables and TID for the tuple same gets
stored in index. On index scans, required column Btrees are scanned
for given TID and datums passed back using virtual tuple.
Page Format:
ZedStore table contains different kinds of pages, all in the same
file. Kinds of pages are meta-page, per-attribute btree internal and
leaf pages, UNDO log page, and toast pages. Each page type has its own
distinct data storage format.
Block 0 is always a metapage. It contains the block numbers of the
other data structures stored within the file, like the per-attribute
B-trees, and the UNDO log.
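Again as an illustrative sketch only (field names invented), the metapage described above could be as simple as:

#include <stdint.h>

typedef uint32_t BlockNumber;

#define ZS_META_BLK 0           /* block 0 is always the metapage */

/*
 * Hypothetical metapage: the block numbers of each attribute's B-tree
 * root and of the head of the UNDO log, all within the same file.
 */
typedef struct ZSMetaPage
{
    uint32_t    magic;
    uint32_t    nattributes;    /* number of per-attribute trees below */
    BlockNumber undo_head;      /* first UNDO log page */
    BlockNumber attr_root[];    /* root block of each attribute's B-tree */
} ZSMetaPage;

Adding a column then only needs a new entry pointing to the root of a freshly created (empty) B-tree, which is what makes adding a column cheap in this design.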
Enhancements to design:
=======================
Instead of compressing all the tuples on a page in one batch, we could
store a small "dictionary", e.g. in page header or meta-page, and use
it to compress each tuple separately. That could make random reads and
updates of individual tuples faster.
When adding column, just need to create new Btree for newly added
column and linked to meta-page. No existing content needs to be
rewritten.
When the column is dropped, can scan the B-tree of that column, and
immediately mark all the pages as free in the FSM. But we don't
actually have to scan the leaf level: all leaf tuples have a downlink
in the parent, so we can scan just the internal pages. Unless the
column is very wide, that's only a small fraction of the data. That
makes the space immediately reusable for new insertions, but it won't
return the space to the Operating System. In order to do that, we'd
still need to defragment, moving pages from the end of the file closer
to the beginning, and truncate the file.
In this design, we only cache compressed pages in the page cache. If
we want to cache uncompressed pages instead, or in addition to that,
we need to invent a whole new kind of a buffer cache that can deal
with the variable-size blocks.
If you do a lot of updates, the file can get fragmented, with lots of
unused space on pages. Losing the correlation between TIDs and
physical order is also bad, because it will make SeqScans slow, as
they're not actually doing sequential I/O anymore. We can write a
defragmenter to fix things up. Half-empty pages can be merged, and
pages can be moved to restore TID/physical correlation. This format
doesn't have the same MVCC problems with moving tuples around that the
Postgres heap does, so it can be fairly easily be done on-line.
Min-Max values can be stored for block to easily skip scanning if
column values doesn't fall in range.
Notes about current patch
=========================
Basic (core) functionality is implemented to showcase and play with.
Two compression algorithms are supported Postgres pg_lzcompress and
lz4. Compiling server with --with-lz4 enables the LZ4 compression for
zedstore else pg_lzcompress is default. Definitely LZ4 is super fast
at compressing and uncompressing.
Not all the table AM API's are implemented. For the functionality not
implmented yet will ERROR out with not supported. Zedstore Table can
be created using command:
CREATE TABLE <name> (column listing) USING zedstore;
Bulk load can be performed using COPY. INSERT, SELECT, UPDATE and
DELETES work. Btree indexes can be created. Btree and bitmap index
scans work. Test in src/test/regress/sql/zedstore.sql showcases all
the functionality working currently. Updates are currently implemented
as cold, means always creates new items and not performed in-place.
TIDs currently can't leverage the full 48 bit range but instead need
to limit to values which are considered valid ItemPointers. Also,
MaxHeapTuplesPerPage pose restrictions on the values currently it can
have. Refer [7] for the same.
Extremely basic UNDO logging has be implemented just for MVCC
perspective. MVCC is missing tuple lock right now. Plus, doesn't
actually perform any undo yet. No WAL logging exist currently hence
its not crash safe either.
Helpful functions to find how many pages of each type is present in
zedstore table and also to find compression ratio is provided.
Test mentioned in thread "Column lookup in a row performance" [6],
good example query for zedstore locally on laptop using lz4 shows
postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; -- heap
avg
---------------------
500000.500000000000
(1 row)
Time: 4679.026 ms (00:04.679)
postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; -- zedstore
avg
---------------------
500000.500000000000
(1 row)
Time: 379.710 ms
Important note:
---------------
Planner has not been modified yet to leverage the columnar
storage. Hence, plans using "physical tlist" optimization or such good
for row store miss out to leverage the columnar nature
currently. Hence, can see the need for subquery with OFFSET 0 above to
disable the optimization and scan only required column.
The current proposal and discussion is more focused on AM layer work
first. Hence, currently intentionally skips to discuss the planner or
executor "feature" enhancements like adding vectorized execution and
family of features.
Previous discussions or implementations for column store Vertical
cluster index [2], Incore columnar storage [3] and [4], cstore_fdw [5]
were refered to distill down objectives and come up with design and
implementations to avoid any earlier concerns raised. Learnings from
Greenplum Database column store also leveraged while designing and
implementing the same.
Credit: Design is moslty brain child of Heikki, or actually his
epiphany to be exact. I acted as idea bouncing board and contributed
enhancements to the same. We both are having lot of fun writing the
code for this.
References
1] https://github.com/greenplum-db/postgres/tree/zedstore
2] https://www.postgresql.org/message-id/flat/CAJrrPGfaC7WC9NK6PTTy6YN-NN%2BhCy8xOLAh2doYhVg5d6HsAA%40mail.gmail.com
3] https://www.postgresql.org/message-id/flat/20150611230316.GM133018%40postgresql.org
4] https://www.postgresql.org/message-id/flat/20150831225328.GM2912%40alvherre.pgsql
5] https://github.com/citusdata/cstore_fdw
6] https://www.postgresql.org/message-id/flat/CAOykqKfko-n5YiBJtk-ocVdp%2Bj92Apu5MJBwbGGh4awRY5NCuQ%40mail.gmail.com
7] https://www.postgresql.org/message-id/d0fc97bd-7ec8-2388-e4a6-0fda86d71a43%40iki.fi
Reading about it reminds me of this work -- TAG column storage (http://www09.sigmod.org/sigmod/record/issues/0703/03.article-graefe.pdf). Isn't this storage system inspired by that, with TID as the TAG?
It is not referenced here, so it made me wonder.
-- Regards,
Rafia Sabih
On Tue, 9 Apr 2019 at 20:29, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> This is not surprising, considering that columnar store is precisely the
> reason for starting the work on table AMs.
>
> We should certainly look into integrating some sort of columnar storage
> in mainline. Not sure which of zedstore or VOPS is the best candidate,
> or maybe we'll have some other proposal. My feeling is that having more
> than one is not useful; if there are optimizations to one that can be
> borrowed from the other, let's do that instead of duplicating effort.
I think that conclusion may be premature. There seem to be a bunch of
different ways of doing columnar storage, so I don't know how we can
be sure that one size will fit all, or that the first thing we accept
will be the best thing.
Of course, we probably do not want to accept a ton of storage manager
implementations is core. I think if people propose implementations
that are poor quality, or missing important features, or don't have
significantly different use cases from the ones we've already got,
it's reasonable to reject those. But I wouldn't be prepared to say
that if we have two significantly different column store that are both
awesome code with a complete feature set and significantly disjoint
use cases, we should reject the second one just because it is also a
column store. I think that won't get out of control because few
people will be able to produce really high-quality implementations.
This stuff is hard, which I think is also why we only have 6.5 index
AMs in core after many, many years. And our standards have gone up
over the years - not all of those would pass muster if they were
proposed today.
BTW, can I express a small measure of disappointment that the name for
the thing under discussion on this thread chose to be called
"zedstore"? That seems to invite confusion with "zheap", especially
in parts of the world where the last letter of the alphabet is
pronounced "zed," where people are going to say zed-heap and
zed-store. Brr.
+1 on Brr. Looks like Thomas's and your thought that having a 'z' makes things popular/stylish was right after all; I was skeptical back then.
Regards,
Rafia Sabih
On 11/04/2019 16:12, Rafia Sabih wrote:
> +1 on Brr. Looks like Thomas's and your thought that having a 'z' makes
> things popular/stylish was right after all; I was skeptical back then.

BrrStore works for me, too ;-).

- Heikki
On 4/11/19 10:46 AM, Konstantin Knizhnik wrote:
This my results of compressing pbench data using different compressors:
Configuration Size (Gb) Time (sec) no compression 15.31 92 zlib (default level) 2.37 284 zlib (best speed) 2.43 191 postgres internal lz 3.89 214 lz4 4.12 95 snappy 5.18 99 lzfse 2.80 1099 (apple) 2.80 1099 1.69 125
You see that zstd provides almost 2 times better compression ration and almost at the same speed.
What is "(apple) 2.80 1099"? Was that intended to be zstd?
Andreas
On 11.04.2019 16:18, Andreas Karlsson wrote:
On 4/11/19 10:46 AM, Konstantin Knizhnik wrote:
This my results of compressing pbench data using different compressors:
Configuration Size (Gb) Time (sec) no compression 15.31 92 zlib (default level) 2.37 284 zlib (best speed) 2.43 191 postgres internal lz 3.89 214 lz4 4.12 95 snappy 5.18 99 lzfse 2.80 1099 (apple) 2.80 1099 1.69 125
You see that zstd provides almost 2 times better compression ration and almost at the same speed.
What is "(apple) 2.80 1099"? Was that intended to be zstd?
Andreas
Ugh... Cut and paste problems.
The whole document can be found here: http://garret.ru/PageLevelCompression.pdf
lzfse (apple) 2.80 1099
zstd (facebook) 1.69 125
zstd is a compression algorithm proposed by Facebook: https://github.com/facebook/zstd
Looks like it provides the best speed/compression ratio trade-off.
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Thu, Apr 11, 2019 at 3:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 11/04/2019 16:12, Rafia Sabih wrote:
> On Tue, 9 Apr 2019 at 20:29, Robert Haas <robertmhaas@gmail.com
> <mailto:robertmhaas@gmail.com>> wrote:
>
> BTW, can I express a small measure of disappointment that the name for
> the thing under discussion on this thread chose to be called
> "zedstore"? That seems to invite confusion with "zheap", especially
> in parts of the world where the last letter of the alphabet is
> pronounced "zed," where people are going to say zed-heap and
> zed-store. Brr.
>
> +1 on Brr. Looks like Thomas and your thought on having 'z' makes
> things popular/stylish, etc. is after all true, I was skeptical back then.
BrrStore works for me, too ;-).
Also works as a reference to the Finnish climate?
(Sorry, couldn't help myself)
On Tue, 9 Apr 2019 at 02:27, Ashwin Agrawal <aagrawal@pivotal.io> wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to follow along.
The objective is to gather feedback on design and approach to the
same. The implementation has core basic pieces working but not close
to complete.
Big thank you to Andres, Haribabu and team for the table access method
API's. Leveraged the API's for implementing zedstore, and proves API
to be in very good shape. Had to enhance the same minimally but
in-general didn't had to touch executor much.
Motivations / Objectives
* Performance improvement for queries selecting subset of columns
(reduced IO).
* Reduced on-disk footprint compared to heap table. Shorter tuple
headers and also leveraging compression of similar type data
* Be first-class citizen in the Postgres architecture (tables data can
just independently live in columnar storage)
* Fully MVCC compliant
* All Indexes supported
* Hybrid row-column store, where some columns are stored together, and
others separately. Provide flexibility of granularity on how to
divide the columns. Columns accessed together can be stored
together.
* Provide better control over bloat (similar to zheap)
* Eliminate need for separate toast tables
* Faster add / drop column or changing data type of column by avoiding
full rewrite of the table.
High-level Design - B-trees for the win!
========================================
To start simple, let's ignore column store aspect for a moment and
consider it as compressed row store. The column store is natural
extension of this concept, explained in next section.
The basic on-disk data structure leveraged is a B-tree, indexed by
TID. BTree being a great data structure, fast and versatile. Note this
is not referring to existing Btree indexes, but instead net new
separate BTree for table data storage.
TID - logical row identifier:
TID is just a 48-bit row identifier. The traditional division into
block and offset numbers is meaningless. In order to find a tuple with
a given TID, one must always descend the B-tree. Having logical TID
provides flexibility to move the tuples around different pages on page
splits or page merges can be performed.
In my understanding, these TIDs will follow the datatype of the current ones. Then my question is: will TIDs be reusable here, and how will the reusable range of TIDs be determined? If not, wouldn't that become a hard limit on the number of insertions performed on a table?
The internal pages of the B-tree are super simple and boring. Each
internal page just stores an array of TID and downlink pairs. Let's
focus on the leaf level. Leaf blocks have short uncompressed header,
followed by btree items. Two kinds of items exist:
- plain item, holds one tuple or one datum, uncompressed payload
- a "container item", holds multiple plain items, compressed payload
+-----------------------------
| Fixed-size page header:
|
| LSN
| TID low and hi key (for Lehman & Yao B-tree operations)
| left and right page pointers
|
| Items:
|
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | undo pointer | payload (plain item)
| TID | size | flags | undo pointer | payload (plain item)
| ...
|
+----------------------------
Row store
---------
The tuples are stored one after another, sorted by TID. For each
tuple, we store its 48-bit TID, a undo record pointer, and the actual
tuple data uncompressed.
In uncompressed form, the page can be arbitrarily large. But after
compression, it must fit into a physical 8k block. If on insert or
update of a tuple, the page cannot be compressed below 8k anymore, the
page is split. Note that because TIDs are logical rather than physical
identifiers, we can freely move tuples from one physical page to
another during page split. A tuple's TID never changes.
The buffer cache caches compressed blocks. Likewise, WAL-logging,
full-page images etc. work on compressed blocks. Uncompression is done
on-the-fly, as and when needed in backend-private memory, when
reading. For some compressions like rel encoding or delta encoding
tuples can be constructed directly from compressed data.
Column store
------------
A column store uses the same structure but we have *multiple* B-trees,
one for each column, all indexed by TID. The B-trees for all columns
are stored in the same physical file.
A metapage at block 0, has links to the roots of the B-trees. Leaf
pages look the same, but instead of storing the whole tuple, stores
just a single attribute. To reconstruct a row with given TID, scan
descends down the B-trees for all the columns using that TID, and
fetches all attributes. Likewise, a sequential scan walks all the
B-trees in lockstep.
So, in summary can imagine Zedstore as forest of B-trees, one for each
column, all indexed by TIDs.
This way of laying out the data also easily allows for hybrid
row-column store, where some columns are stored together, and others
have a dedicated B-tree. Need to have user facing syntax to allow
specifying how to group the columns.
Main reasons for storing data this way
--------------------------------------
* Lay out the data/tuples in a mapped fashion instead of keeping the
logical-to-physical mapping separate from the actual data. That is, keep
the metadata and data logically in a single stream per file, avoiding the
need for separate forks/files to store metadata and data.
* Stick to fixed-size physical blocks. Variable-size blocks would require
more logical-to-physical mapping maintenance, plus restrictions on
concurrency of writes and reads to files. Hence, adopt compression to fit
fixed-size blocks rather than the other way round.
MVCC
----
MVCC works very similarly to zheap in zedstore. Undo record pointers
are used to implement MVCC. Transaction information is not directly
stored with the data. In zheap, there's a small, fixed number of
"transaction slots" on each page, but zedstore has an undo pointer with
each item directly; in normal cases, the compression squeezes this
down to almost nothing.
How about using a separate BTree for undo also?
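Purely to illustrate the undo-pointer idea described above, a grossly
simplified per-item visibility check could look roughly like the sketch
below. All ZS* names here are invented for illustration and are not the
patch's API; the real rules (deletes, updates, subtransactions, tuple
locks) are certainly more involved.

/* Hypothetical undo record and accessors, assumed for this sketch. */
typedef struct ZSUndoRec { TransactionId xid; } ZSUndoRec;
extern bool ZSUndoPtrIsValid(uint64 undo_ptr);
extern ZSUndoRec *zs_fetch_undo_record(uint64 undo_ptr);

static bool
zs_item_visible_sketch(uint64 undo_ptr, Snapshot snapshot)
{
    ZSUndoRec  *undorec;

    /*
     * If the undo record has already been discarded, the inserting
     * transaction committed long ago and the item is visible to everyone.
     */
    if (!ZSUndoPtrIsValid(undo_ptr))
        return true;

    undorec = zs_fetch_undo_record(undo_ptr);

    /* Visible only if the inserter committed and is visible to our snapshot. */
    return TransactionIdDidCommit(undorec->xid) &&
        !XidInMVCCSnapshot(undorec->xid, snapshot);
}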
Implementation
==============
Insert:
Inserting a new row splits the row into datums. For the first column, we
decide which block to insert it into, pick a TID for it, and write an
undo record. The rest of the columns are inserted using that same TID and
point to the same undo position.
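A rough pseudocode sketch of that insert flow follows; the function names
are placeholders for illustration, not the patch's actual API.

/* Hypothetical per-column insert helpers assumed by this sketch. */
extern zstid zsbt_insert_first_column(Relation rel, AttrNumber attno,
                                      Datum value, bool isnull,
                                      uint64 *undo_ptr);
extern void zsbt_insert_column(Relation rel, AttrNumber attno, zstid tid,
                               Datum value, bool isnull, uint64 undo_ptr);

static zstid
zs_insert_row_sketch(Relation rel, Datum *values, bool *isnull, int natts)
{
    zstid       tid;
    uint64      undo_ptr;

    /* First column: choose a leaf block, allocate the next TID there, and
     * write a single insert undo record for the whole row. */
    tid = zsbt_insert_first_column(rel, 1, values[0], isnull[0], &undo_ptr);

    /* Remaining columns: insert into each column's own B-tree with the
     * same TID, pointing at the same undo record. */
    for (AttrNumber attno = 2; attno <= natts; attno++)
        zsbt_insert_column(rel, attno, tid, values[attno - 1],
                           isnull[attno - 1], undo_ptr);

    return tid;
}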
Compression:
Items are added to the B-tree in uncompressed form. If a page is full and
the new item can't be added, compression kicks in. The existing
uncompressed items (plain items) on the page are passed to the
compressor; already-compressed items are added back as-is. The page is
rewritten with the compressed data and the new item added to it. If the
item still doesn't fit even after compression, the page is split.
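In pseudocode, that repack-on-page-full logic reads roughly like the
following sketch. The helper names are invented for illustration.

/* Hypothetical helpers assumed by this sketch. */
typedef struct ZSItem ZSItem;
extern bool zs_page_has_room(Page page, ZSItem *item);
extern void zs_page_append(Page page, ZSItem *item);
extern void zs_recompress_page(Page page);   /* plain items -> container */

static bool
zs_add_item_sketch(Page page, ZSItem *newitem)
{
    if (zs_page_has_room(page, newitem))
    {
        zs_page_append(page, newitem);   /* added uncompressed */
        return true;
    }

    /* Page full: compress the existing plain items into a container item,
     * keep already-compressed containers as they are, rewrite the page. */
    zs_recompress_page(page);

    if (zs_page_has_room(page, newitem))
    {
        zs_page_append(page, newitem);
        return true;
    }

    return false;    /* still no room: the caller must split the page */
}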
Toast:
When an overly large datum is stored, it is divided into chunks, and
each chunk is stored on a dedicated toast page within the same
physical file. The toast pages of a datum form a list; each page has
next/prev pointers.
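A minimal sketch of what such a toast page chain might carry in its
special area is shown below; this is purely illustrative, and the actual
page layout in the patch may differ.

/* Hypothetical special-area layout for a zedstore toast page. */
typedef struct ZSToastPageOpaqueSketch
{
    zstid       zs_tid;         /* TID of the datum these chunks belong to */
    AttrNumber  zs_attno;       /* column the datum belongs to */
    uint32      zs_chunk_size;  /* bytes of the datum stored on this page */
    BlockNumber zs_prev;        /* previous page in chain, or InvalidBlockNumber */
    BlockNumber zs_next;        /* next page in chain, or InvalidBlockNumber */
} ZSToastPageOpaqueSketch;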
Select:
A property is added to the table AM API to convey whether the AM
leverages column projection for scans. When scanning a table whose AM
advertises this property, the executor parses the plan, using the target
list and quals to find the columns required by the query. This list is
passed down to the AM at beginscan. Zedstore uses this column projection
list to pull data only from the selected columns. A virtual tuple table
slot is used to pass back the datums for the subset of columns.
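Conceptually, the executor-side setup might look something like this
sketch. The projection-aware beginscan callback name is an assumption for
illustration, not the actual API addition in the patch; pull_varattnos()
is the existing planner helper.

/* Hypothetical projection-aware beginscan, assumed for this sketch. */
extern TableScanDesc table_beginscan_with_column_projection(Relation rel,
                                                            Snapshot snapshot,
                                                            Bitmapset *attrs);

static TableScanDesc
zs_begin_projected_scan_sketch(Relation rel, Snapshot snapshot, Scan *scanplan)
{
    Bitmapset  *attrs_needed = NULL;

    /* Collect the attnos referenced by the target list and the quals
     * (offset by FirstLowInvalidHeapAttributeNumber, as pull_varattnos
     * always does). */
    pull_varattnos((Node *) scanplan->plan.targetlist,
                   scanplan->scanrelid, &attrs_needed);
    pull_varattnos((Node *) scanplan->plan.qual,
                   scanplan->scanrelid, &attrs_needed);

    /* The AM then walks only the B-trees of the requested columns and
     * returns the datums in a virtual tuple table slot. */
    return table_beginscan_with_column_projection(rel, snapshot,
                                                  attrs_needed);
}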
I am curious about how delete works here. Will the TID entries just be
marked as deleted, as in the current heap, or will they actually be
removed and the whole B-tree restructured (if required)?
Similarly, for updates: will they just be delete+insert, or will
something cleverer happen there?
Will there be in-place updates, and in what scenarios will they be
possible? Nothing is mentioned in this direction; however, the use of
undo files suggests to me that there must be some in-place updates
somewhere.
Enhancements to design:
=======================
Instead of compressing all the tuples on a page in one batch, we could
store a small "dictionary", e.g. in page header or meta-page, and use
it to compress each tuple separately. That could make random reads and
updates of individual tuples faster.
When adding a column, we just need to create a new B-tree for the newly
added column and link it to the metapage. No existing content needs to
be rewritten.
When a column is dropped, we can scan the B-tree of that column and
immediately mark all its pages as free in the FSM. But we don't
actually have to scan the leaf level: all leaf tuples have a downlink
in the parent, so we can scan just the internal pages. Unless the
column is very wide, that's only a small fraction of the data. That
makes the space immediately reusable for new insertions, but it won't
return the space to the Operating System. In order to do that, we'd
still need to defragment, moving pages from the end of the file closer
to the beginning, and truncate the file.
In this design, we only cache compressed pages in the page cache. If
we want to cache uncompressed pages instead, or in addition to that,
we need to invent a whole new kind of a buffer cache that can deal
with the variable-size blocks.
If you do a lot of updates, the file can get fragmented, with lots of
unused space on pages. Losing the correlation between TIDs and
physical order is also bad, because it will make SeqScans slow, as
they're not actually doing sequential I/O anymore. We can write a
defragmenter to fix things up. Half-empty pages can be merged, and
pages can be moved to restore TID/physical correlation. This format
doesn't have the same MVCC problems with moving tuples around that the
Postgres heap does, so it can fairly easily be done on-line.
Min-max values can be stored per block, to easily skip scanning a block
if the column values don't fall in the requested range.
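For example, a hypothetical per-block summary (not something the patch
stores today) and the corresponding skip test could look like:

/* Hypothetical per-block min/max summary for one column. */
typedef struct ZSBlockMinMaxSketch
{
    Datum       minval;
    Datum       maxval;
    bool        has_nulls;
} ZSBlockMinMaxSketch;

/* Skip the block if [minval, maxval] cannot satisfy "col >= lo AND col <= hi";
 * cmpfn is the column type's btree comparison function. */
static bool
zs_block_can_skip_sketch(ZSBlockMinMaxSketch *summary, Datum lo, Datum hi,
                         FmgrInfo *cmpfn)
{
    if (DatumGetInt32(FunctionCall2(cmpfn, summary->maxval, lo)) < 0)
        return true;    /* whole block below the requested range */
    if (DatumGetInt32(FunctionCall2(cmpfn, summary->minval, hi)) > 0)
        return true;    /* whole block above the requested range */
    return false;
}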
Notes about current patch
=========================
Basic (core) functionality is implemented to showcase and play with.
Two compression algorithms are supported: Postgres' pg_lzcompress and
LZ4. Compiling the server with --with-lz4 enables LZ4 compression for
zedstore; otherwise pg_lzcompress is the default. LZ4 is definitely super
fast at compressing and decompressing.
Not all of the table AM APIs are implemented. Functionality not yet
implemented will ERROR out as not supported. A zedstore table can be
created using the command:
CREATE TABLE <name> (column listing) USING zedstore;
Bulk load can be performed using COPY. INSERT, SELECT, UPDATE and
DELETE work. Btree indexes can be created. Btree and bitmap index
scans work. The test in src/test/regress/sql/zedstore.sql showcases all
the functionality working currently. Updates are currently implemented
as "cold", meaning they always create new items and are never performed
in-place.
TIDs currently can't leverage the full 48-bit range but instead are
limited to values that are considered valid ItemPointers. Also,
MaxHeapTuplesPerPage poses restrictions on the values they can currently
have. Refer to [7] for details.
Extremely basic UNDO logging has been implemented, just for the MVCC
perspective. MVCC is missing tuple locks right now, and no undo is
actually applied yet. No WAL logging exists currently, so it is not
crash safe either.
Helper functions are provided to find how many pages of each type are
present in a zedstore table and to find its compression ratio.
The test mentioned in the thread "Column lookup in a row performance"
[6] is a good example query for zedstore; run locally on a laptop using
LZ4 it shows:
postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; -- heap
avg
---------------------
500000.500000000000
(1 row)
Time: 4679.026 ms (00:04.679)
postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; -- zedstore
avg
---------------------
500000.500000000000
(1 row)
Time: 379.710 ms
Important note:
---------------
The planner has not been modified yet to leverage the columnar
storage. Hence, plans using the "physical tlist" optimization and
similar tricks that are good for a row store currently fail to leverage
the columnar nature. That is why the subquery with OFFSET 0 is needed
above, to disable the optimization and scan only the required column.
The current proposal and discussion are focused on the AM-layer work
first. Hence, it intentionally skips discussing planner or executor
"feature" enhancements such as vectorized execution and that family of
features.
Previous discussions and implementations of column stores (Vertical
cluster index [2], in-core columnar storage [3] and [4], cstore_fdw [5])
were referred to, to distill the objectives and come up with a design and
implementation that avoids the concerns raised earlier. Learnings from
the Greenplum Database column store were also leveraged while designing
and implementing this.
Credit: the design is mostly the brainchild of Heikki, or actually his
epiphany to be exact. I acted as an idea bouncing board and contributed
enhancements to it. We are both having a lot of fun writing the code
for this.
References
[1] https://github.com/greenplum-db/postgres/tree/zedstore
[2] https://www.postgresql.org/message-id/flat/CAJrrPGfaC7WC9NK6PTTy6YN-NN%2BhCy8xOLAh2doYhVg5d6HsAA%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/20150611230316.GM133018%40postgresql.org
[4] https://www.postgresql.org/message-id/flat/20150831225328.GM2912%40alvherre.pgsql
[5] https://github.com/citusdata/cstore_fdw
[6] https://www.postgresql.org/message-id/flat/CAOykqKfko-n5YiBJtk-ocVdp%2Bj92Apu5MJBwbGGh4awRY5NCuQ%40mail.gmail.com
[7] https://www.postgresql.org/message-id/d0fc97bd-7ec8-2388-e4a6-0fda86d71a43%40iki.fi
Regards,
Rafia Sabih
Ashwin Agrawal <aagrawal@pivotal.io> writes:
> Thank you for trying it out. Yes, noticed for certain patterns
> pg_lzcompress() actually requires much larger output buffers. Like for
> one 86 len source it required 2296 len output buffer. Current zedstore
> code doesn't handle this case and errors out. LZ4 for same patterns
> works fine, would highly recommend using LZ4 only, as anyways speed is
> very fast as well with it.

You realize of course that *every* compression method has some inputs that
it makes bigger. If your code assumes that compression always produces a
smaller string, that's a bug in your code, not the compression algorithm.

regards, tom lane
On 11/04/2019 17:54, Tom Lane wrote:
> Ashwin Agrawal <aagrawal@pivotal.io> writes:
>> Thank you for trying it out. Yes, noticed for certain patterns
>> pg_lzcompress() actually requires much larger output buffers. Like for
>> one 86 len source it required 2296 len output buffer. Current zedstore
>> code doesn't handle this case and errors out. LZ4 for same patterns
>> works fine, would highly recommend using LZ4 only, as anyways speed is
>> very fast as well with it.
>
> You realize of course that *every* compression method has some inputs that
> it makes bigger. If your code assumes that compression always produces a
> smaller string, that's a bug in your code, not the compression algorithm.

Of course. The code is not making that assumption, although clearly there
is a bug there somewhere because it throws that error. It's early days..

In practice it's easy to weasel out of that, by storing the data
uncompressed, if compression would make it longer. Then you need an extra
flag somewhere to indicate whether it's compressed or not. It doesn't
break the theoretical limit because the actual stored length is then
original length + 1 bit, but it's usually not hard to find a place for one
extra bit.

- Heikki
On Thu, Apr 11, 2019 at 6:06 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> Reading about it reminds me of this work -- TAG column storage
> (http://www09.sigmod.org/sigmod/record/issues/0703/03.article-graefe.pdf).
> Isn't this storage system inspired from there, with TID as the TAG?
>
> It is not referenced here so made me wonder.

I don't think they're particularly similar, because that paper describes
an architecture based on using purely logical row identifiers, which is
not what a TID is. TID is a hybrid physical/logical identifier, sometimes
called a "physiological" identifier, which will have significant overhead.
Ashwin said that ZedStore TIDs are logical identifiers, but I don't see
how that's compatible with a hybrid row/column design (unless you map heap
TID to logical row identifier using a separate B-Tree).

The big idea with Graefe's TAG design is that there is practically no
storage overhead for these logical identifiers, because each entry's
identifier is calculated by adding its slot number to the page's tag/low
key. The ZedStore design, in contrast, explicitly stores TID for every
entry. ZedStore seems more flexible for that reason, but at the same time
the per-datum overhead seems very high to me. Maybe prefix compression
could help here, which a low key and high key can do rather well.

--
Peter Geoghegan
On Tue, Apr 09, 2019 at 02:29:09PM -0400, Robert Haas wrote: >On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> This is not surprising, considering that columnar store is precisely the >> reason for starting the work on table AMs. >> >> We should certainly look into integrating some sort of columnar storage >> in mainline. Not sure which of zedstore or VOPS is the best candidate, >> or maybe we'll have some other proposal. My feeling is that having more >> than one is not useful; if there are optimizations to one that can be >> borrowed from the other, let's do that instead of duplicating effort. > >I think that conclusion may be premature. There seem to be a bunch of >different ways of doing columnar storage, so I don't know how we can >be sure that one size will fit all, or that the first thing we accept >will be the best thing. > >Of course, we probably do not want to accept a ton of storage manager >implementations is core. I think if people propose implementations >that are poor quality, or missing important features, or don't have >significantly different use cases from the ones we've already got, >it's reasonable to reject those. But I wouldn't be prepared to say >that if we have two significantly different column store that are both >awesome code with a complete feature set and significantly disjoint >use cases, we should reject the second one just because it is also a >column store. I think that won't get out of control because few >people will be able to produce really high-quality implementations. > >This stuff is hard, which I think is also why we only have 6.5 index >AMs in core after many, many years. And our standards have gone up >over the years - not all of those would pass muster if they were >proposed today. > It's not clear to me whether you're arguing for not having any such implementation in core, or having multiple ones? I think we should aim to have at least one in-core implementation, even if it's not the best possible one for all sizes. It's not like our rowstore is the best possible implementation for all cases either. I think having a colstore in core is important not just for adoption, but also for testing and development of the executor / planner bits. If we have multiple candidates with sufficient code quality, then we may consider including both. I don't think it's very likely to happen in the same release, considering how much work it will require. And I have no idea if zedstore or VOPS are / will be the only candidates - it's way too early at this point. FWIW I personally plan to focus primarily on the features that aim to be included in core, and that applies to colstores too. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 09, 2019 at 02:03:09PM -0700, Ashwin Agrawal wrote: > On Tue, Apr 9, 2019 at 9:13 AM Konstantin Knizhnik > <k.knizhnik@postgrespro.ru> wrote: > > On 09.04.2019 18:51, Alvaro Herrera wrote: > > On 2019-Apr-09, Konstantin Knizhnik wrote: > > > >> On 09.04.2019 3:27, Ashwin Agrawal wrote: > >>> Heikki and I have been hacking recently for few weeks to implement > >>> in-core columnar storage for PostgreSQL. Here's the design and > initial > >>> implementation of Zedstore, compressed in-core columnar storage > (table > >>> access method). Attaching the patch and link to github branch [1] to > >>> follow along. > >> Thank you for publishing this patch. IMHO Postgres is really missing > normal > >> support of columnar store > > Yep. > > > >> and table access method API is the best way of integrating it. > > This is not surprising, considering that columnar store is precisely > the > > reason for starting the work on table AMs. > > > > We should certainly look into integrating some sort of columnar > storage > > in mainline. Not sure which of zedstore or VOPS is the best > candidate, > > or maybe we'll have some other proposal. My feeling is that having > more > > than one is not useful; if there are optimizations to one that can be > > borrowed from the other, let's do that instead of duplicating effort. > > > There are two different aspects: > 1. Store format. > 2. Vector execution. > > 1. VOPS is using mixed format, something similar with Apache parquet. > Tuples are stored vertically, but only inside one page. > It tries to minimize trade-offs between true horizontal and true > vertical storage: > first is most optimal for selecting all rows, while second - for > selecting small subset of rows. > To make this approach more efficient, it is better to use large page > size - default Postgres 8k pages is not enough. > > From my point of view such format is better than pure vertical storage > which will be very inefficient if query access larger number of columns. > This problem can be somehow addressed by creating projections: grouping > several columns together. But it requires more space for storing > multiple projections. > > Right, storing all the columns in single page doens't give any savings on > IO. > Yeah, although you could save some I/O thanks to compression even in that case. > 2. Doesn't matter which format we choose, to take all advantages of > vertical representation we need to use vector operations. > And Postgres executor doesn't support them now. This is why VOPS is > using some hacks, which is definitely not good and not working in all > cases. > zedstore is not using such hacks and ... this is why it never can reach > VOPS performance. > > Vectorized execution is orthogonal to storage format. It can be even > applied to row store and performance gained. Similarly column store > without vectorized execution also gives performance gain better > compression rations and such benefits. Column store clubbed with > vecotorized execution makes it lot more performant agree. Zedstore > currently is focused to have AM piece in place, which fits the postgres > ecosystem and supports all the features heap does. Not sure it's quite orthogonal. Sure, you can apply it to rowstores too, but I'd say column stores are naturally better suited for it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 11, 2019 at 04:52:33PM +0300, Konstantin Knizhnik wrote:
> On 11.04.2019 16:18, Andreas Karlsson wrote:
> > On 4/11/19 10:46 AM, Konstantin Knizhnik wrote:
> > > This my results of compressing pbench data using different compressors:
> > >
> > > +-------------------------------------------------------------+
> > > |Configuration              |Size (Gb)       |Time (sec)      |
> > > |---------------------------+----------------+----------------|
> > > |no compression             |15.31           |92              |
> > > |---------------------------+----------------+----------------|
> > > |zlib (default level)       |2.37            |284             |
> > > |---------------------------+----------------+----------------|
> > > |zlib (best speed)          |2.43            |191             |
> > > |---------------------------+----------------+----------------|
> > > |postgres internal lz       |3.89            |214             |
> > > |---------------------------+----------------+----------------|
> > > |lz4                        |4.12            |95              |
> > > |---------------------------+----------------+----------------|
> > > |snappy                     |5.18            |99              |
> > > |---------------------------+----------------+----------------|
> > > |lzfse                      |2.80            |1099            |
> > > |---------------------------+----------------+----------------|
> > > |(apple) 2.80 1099          |1.69            |125             |
> > > +-------------------------------------------------------------+
> > >
> > > You see that zstd provides almost 2 times better compression ration
> > > and almost at the same speed.
> >
> > What is "(apple) 2.80 1099"? Was that intended to be zstd?
> >
> > Andreas
>
> Ugh...
> Cut and paste problems.
> The whole document can be found here:
> http://garret.ru/PageLevelCompression.pdf
>
> lzfse (apple)     2.80    1099
> zstd (facebook)   1.69    125
>
> ztsd is compression algorithm proposed by facebook:
> https://github.com/facebook/zstd
> Looks like it provides the best speed/compress ratio result.

I think those comparisons are cute and we did a fair amount of them when
considering a drop-in replacement for pglz, but ultimately it might be a
bit pointless because:

(a) it very much depends on the dataset (one algorithm may work great on
one type of data, suck on another)

(b) different systems may require different trade-offs (high ingestion
rate vs. best compression ratio)

(c) decompression speed may be much more important

What I'm trying to say is that we shouldn't obsess about picking one
particular algorithm too much, because it's entirely pointless. Instead,
we should probably design the system to support different compression
algorithms, ideally at column level.

Also, while these general purpose algorithms are nice, what I think will
be important in later stages of colstore development will be compression
algorithms allowing execution directly on the compressed data (like RLE,
dictionary and similar approaches).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:
>On 11/04/2019 17:54, Tom Lane wrote:
>>Ashwin Agrawal <aagrawal@pivotal.io> writes:
>>>Thank you for trying it out. Yes, noticed for certain patterns
>>>pg_lzcompress() actually requires much larger output buffers. Like for
>>>one 86 len source it required 2296 len output buffer. Current zedstore
>>>code doesn't handle this case and errors out. LZ4 for same patterns
>>>works fine, would highly recommend using LZ4 only, as anyways speed is
>>>very fast as well with it.
>>
>>You realize of course that *every* compression method has some inputs that
>>it makes bigger. If your code assumes that compression always produces a
>>smaller string, that's a bug in your code, not the compression algorithm.
>
>Of course. The code is not making that assumption, although clearly
>there is a bug there somewhere because it throws that error. It's
>early days..
>
>In practice it's easy to weasel out of that, by storing the data
>uncompressed, if compression would make it longer. Then you need an
>extra flag somewhere to indicate whether it's compressed or not. It
>doesn't break the theoretical limit because the actual stored length
>is then original length + 1 bit, but it's usually not hard to find a
>place for one extra bit.
>

Don't we already have that flag, though? I see ZSCompressedBtreeItem has
t_flags, and there's ZSBT_COMPRESSED, but maybe it's more complicated.

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2019-04-14 18:36:18 +0200, Tomas Vondra wrote:
> I think those comparisons are cute and we did a fair amount of them when
> considering a drop-in replacement for pglz, but ultimately it might be a
> bit pointless because:
>
> (a) it very much depends on the dataset (one algorithm may work great on
> one type of data, suck on another)
>
> (b) different systems may require different trade-offs (high ingestion
> rate vs. best compression ratio)
>
> (c) decompression speed may be much more important
>
> What I'm trying to say is that we shouldn't obsess about picking one
> particular algorithm too much, because it's entirely pointless. Instead,
> we should probably design the system to support different compression
> algorithms, ideally at column level.

I think we still need to pick a default algorithm, and realistically
that's going to be used by like 95% of the users.

Greetings,

Andres Freund
On Mon, Apr 08, 2019 at 05:27:05PM -0700, Ashwin Agrawal wrote: > Heikki and I have been hacking recently for few weeks to implement > in-core columnar storage for PostgreSQL. Here's the design and initial > implementation of Zedstore, compressed in-core columnar storage (table > access method). Attaching the patch and link to github branch [1] to > follow along. > > The objective is to gather feedback on design and approach to the > same. The implementation has core basic pieces working but not close > to complete. > > Big thank you to Andres, Haribabu and team for the table access method > API's. Leveraged the API's for implementing zedstore, and proves API > to be in very good shape. Had to enhance the same minimally but > in-general didn't had to touch executor much. > > Motivations / Objectives > > * Performance improvement for queries selecting subset of columns > (reduced IO). > * Reduced on-disk footprint compared to heap table. Shorter tuple > headers and also leveraging compression of similar type data > * Be first-class citizen in the Postgres architecture (tables data can > just independently live in columnar storage) > * Fully MVCC compliant > * All Indexes supported > * Hybrid row-column store, where some columns are stored together, and > others separately. Provide flexibility of granularity on how to > divide the columns. Columns accessed together can be stored > together. > * Provide better control over bloat (similar to zheap) > * Eliminate need for separate toast tables > * Faster add / drop column or changing data type of column by avoiding > full rewrite of the table. > Cool. Me gusta. > High-level Design - B-trees for the win! > ======================================== > > To start simple, let's ignore column store aspect for a moment and > consider it as compressed row store. The column store is natural > extension of this concept, explained in next section. > > The basic on-disk data structure leveraged is a B-tree, indexed by > TID. BTree being a great data structure, fast and versatile. Note this > is not referring to existing Btree indexes, but instead net new > separate BTree for table data storage. > > TID - logical row identifier: > TID is just a 48-bit row identifier. The traditional division into > block and offset numbers is meaningless. In order to find a tuple with > a given TID, one must always descend the B-tree. Having logical TID > provides flexibility to move the tuples around different pages on page > splits or page merges can be performed. > So if TIDs are redefined this way, how does affect BRIN indexes? I mean, that's a lightweight indexing scheme which however assumes TIDs encode certain amount of locality - so this probably makes them (and Bitmap Heap Scans in general) much less eficient. That's a bit unfortunate, although I don't see a way around it :-( > The internal pages of the B-tree are super simple and boring. Each > internal page just stores an array of TID and downlink pairs. Let's > focus on the leaf level. Leaf blocks have short uncompressed header, > followed by btree items. 
Two kinds of items exist: > > - plain item, holds one tuple or one datum, uncompressed payload > - a "container item", holds multiple plain items, compressed payload > > +----------------------------- > | Fixed-size page header: > | > | LSN > | TID low and hi key (for Lehman & Yao B-tree operations) > | left and right page pointers > | > | Items: > | > | TID | size | flags | uncompressed size | lastTID | payload (container > item) > | TID | size | flags | uncompressed size | lastTID | payload (container > item) > | TID | size | flags | undo pointer | payload (plain item) > | TID | size | flags | undo pointer | payload (plain item) > | ... > | > +---------------------------- > So if I understand it correctly, ZSUncompressedBtreeItem is the "plain" item and ZSCompressedBtreeItem is the container one. Correct? I find it a bit confusing, and I too ran into the issue with data that can't be compressed, so I think the "container" should support both compressed and uncompressed data. Heikki already mentioned that, so I suppose it's just not implemented yet. That however means the name of the "compressed" struct gets confusing, so I suggest to rename to: ZSUncompressedBtreeItem -> ZSPlainBtreeItem ZSCompressedBtreeItem -> ZSContainerBtreeItem where the container supports both compressed and uncompressed mode. Also, maybe we don't need to put "Btree" into every damn name ;-) Looking at the ZSCompressedBtreeItem, I see it stores just first/last TID for the compressed data. Won't that be insufficient when there are some gaps due to deletions or something? Or perhaps I just don't understand how it works. Another thing is that with uncompressed size being stored as uint16, won't that be insufficient for highly compressible data / large pages? I mean, we can have pages up to 32kB, which is not that far. > Column store > ------------ > > A column store uses the same structure but we have *multiple* B-trees, > one for each column, all indexed by TID. The B-trees for all columns > are stored in the same physical file. > > A metapage at block 0, has links to the roots of the B-trees. Leaf > pages look the same, but instead of storing the whole tuple, stores > just a single attribute. To reconstruct a row with given TID, scan > descends down the B-trees for all the columns using that TID, and > fetches all attributes. Likewise, a sequential scan walks all the > B-trees in lockstep. > OK, so data for all the columns are stored in separate btrees, but in the same physical file. Wouldn't it be more convenient to have one relfilenode per column? That would also mean the 32TB limit applies to individual columns, not the whole table. Of course, it'd be more complicated and partitioning allows us to work around that limit. > So, in summary can imagine Zedstore as forest of B-trees, one for each > column, all indexed by TIDs. > > This way of laying out the data also easily allows for hybrid > row-column store, where some columns are stored together, and others > have a dedicated B-tree. Need to have user facing syntax to allow > specifying how to group the columns. > OK, makes sense. Do you also envision supporting per-column / per-group compression etc? > Main reasons for storing data this way > -------------------------------------- > > * Layout the data/tuples in mapped fashion instead of keeping the > logical to physical mapping separate from actual data. So, keep the > meta-data and data logically in single stream of file, avoiding the > need for separate forks/files to store meta-data and data. 
> > * Stick to fixed size physical blocks. Variable size blocks pose need > for increased logical to physical mapping maintenance, plus > restrictions on concurrency of writes and reads to files. Hence > adopt compression to fit fixed size blocks instead of other way > round. > > MVCC > ---- > MVCC works very similar to zheap for zedstore. Undo record pointers > are used to implement MVCC. Transaction information if not directly > stored with the data. In zheap, there's a small, fixed, number of > "transaction slots" on each page, but zedstore has undo pointer with > each item directly; in normal cases, the compression squeezes this > down to almost nothing. > > Implementation > ============== > > Insert: > Inserting a new row, splits the row into datums. Then for first column > decide which block to insert the same to, and pick a TID for it, and > write undo record for the same. Rest of the columns are inserted using > that same TID and point to same undo position. > What about deletes? How do these work? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Apr 14, 2019 at 09:45:10AM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-04-14 18:36:18 +0200, Tomas Vondra wrote:
>> I think those comparisons are cute and we did a fair amount of them when
>> considering a drop-in replacement for pglz, but ultimately it might be a
>> bit pointless because:
>>
>> (a) it very much depends on the dataset (one algorithm may work great on
>> one type of data, suck on another)
>>
>> (b) different systems may require different trade-offs (high ingestion
>> rate vs. best compression ratio)
>>
>> (c) decompression speed may be much more important
>>
>> What I'm trying to say is that we shouldn't obsess about picking one
>> particular algorithm too much, because it's entirely pointless. Instead,
>> we should probably design the system to support different compression
>> algorithms, ideally at column level.
>
>I think we still need to pick a default algorithm, and realistically
>that's going to be used by like 95% of the users.
>

True. Do you expect it to be specific to the column store, or should be
set per-instance default (even for regular heap)?

FWIW I think the conclusion from past dev meetings was we're unlikely to
find anything better than lz4. I doubt that changed very much.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On Tue, Apr 09, 2019 at 02:29:09PM -0400, Robert Haas wrote: > >On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > >>This is not surprising, considering that columnar store is precisely the > >>reason for starting the work on table AMs. > >> > >>We should certainly look into integrating some sort of columnar storage > >>in mainline. Not sure which of zedstore or VOPS is the best candidate, > >>or maybe we'll have some other proposal. My feeling is that having more > >>than one is not useful; if there are optimizations to one that can be > >>borrowed from the other, let's do that instead of duplicating effort. > > > >I think that conclusion may be premature. There seem to be a bunch of > >different ways of doing columnar storage, so I don't know how we can > >be sure that one size will fit all, or that the first thing we accept > >will be the best thing. > > > >Of course, we probably do not want to accept a ton of storage manager > >implementations is core. I think if people propose implementations > >that are poor quality, or missing important features, or don't have > >significantly different use cases from the ones we've already got, > >it's reasonable to reject those. But I wouldn't be prepared to say > >that if we have two significantly different column store that are both > >awesome code with a complete feature set and significantly disjoint > >use cases, we should reject the second one just because it is also a > >column store. I think that won't get out of control because few > >people will be able to produce really high-quality implementations. > > > >This stuff is hard, which I think is also why we only have 6.5 index > >AMs in core after many, many years. And our standards have gone up > >over the years - not all of those would pass muster if they were > >proposed today. > > It's not clear to me whether you're arguing for not having any such > implementation in core, or having multiple ones? I think we should aim > to have at least one in-core implementation, even if it's not the best > possible one for all sizes. It's not like our rowstore is the best > possible implementation for all cases either. > > I think having a colstore in core is important not just for adoption, > but also for testing and development of the executor / planner bits. Agreed. > If we have multiple candidates with sufficient code quality, then we may > consider including both. I don't think it's very likely to happen in the > same release, considering how much work it will require. And I have no > idea if zedstore or VOPS are / will be the only candidates - it's way > too early at this point. Definitely, but having as many different indexes as we have is certainly a good thing and we should be looking to a future where we have multiple in-core options for row and column-oriented storage. > FWIW I personally plan to focus primarily on the features that aim to > be included in core, and that applies to colstores too. Yeah, same here. Thanks! Stephen
On Sun, Apr 14, 2019 at 06:39:47PM +0200, Tomas Vondra wrote: >On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote: >>On 11/04/2019 17:54, Tom Lane wrote: >>>Ashwin Agrawal <aagrawal@pivotal.io> writes: >>>>Thank you for trying it out. Yes, noticed for certain patterns >>>>pg_lzcompress() actually requires much larger output buffers. Like >>>>for one 86 len source it required 2296 len output buffer. Current >>>>zedstore code doesn’t handle this case and errors out. LZ4 for same >>>>patterns works fine, would highly recommend using LZ4 only, as >>>>anyways speed is very fast as well with it. >>> >>>You realize of course that *every* compression method has some inputs >>>that it makes bigger. If your code assumes that compression always >>>produces a smaller string, that's a bug in your code, not the >>>compression algorithm. >> >>Of course. The code is not making that assumption, although clearly >>there is a bug there somewhere because it throws that error. It's >>early days.. >> >>In practice it's easy to weasel out of that, by storing the data >>uncompressed, if compression would make it longer. Then you need an >>extra flag somewhere to indicate whether it's compressed or not. It >>doesn't break the theoretical limit because the actual stored length >>is then original length + 1 bit, but it's usually not hard to find a >>place for one extra bit. >> > >Don't we already have that flag, though? I see ZSCompressedBtreeItem >has t_flags, and there's ZSBT_COMPRESSED, but maybe it's more >complicated. > After thinking about this a bit more, I think a simple flag may not be enough. It might be better to have some sort of ID of the compression algorithm in each item, which would allow switching algorithm for new data (which may be useful e.g after we add new stuff in core, or when the initial choice was not the best one). Of course, those are just wild thoughts at this point, it's not something the current PoC has to solve right away. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Stephen Frost <sfrost@snowman.net> writes:
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> I think having a colstore in core is important not just for adoption,
>> but also for testing and development of the executor / planner bits.

> Agreed.

TBH, I thought the reason we were expending so much effort on a tableam
API was exactly so we *wouldn't* have to include such stuff in core.

There is a finite limit to how much stuff we can maintain as part of core.
We should embrace the notion that Postgres is an extensible system, rather
than build all the tooling for extension and then proceed to dump stuff
into core anyway.

>> If we have multiple candidates with sufficient code quality, then we may
>> consider including both.

Dear god, no.

regards, tom lane
On Sat, Apr 13, 2019 at 4:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Apr 11, 2019 at 6:06 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> Reading about it reminds me of this work -- TAG column storage( https://urldefense.proofpoint.com/v2/url?u=http-3A__www09.sigmod.org_sigmod_record_issues_0703_03.article-2Dgraefe.pdf&d=DwIBaQ&c=lnl9vOaLMzsy2niBC8-h_K-7QJuNJEsFrzdndhuJ3Sw&r=gxIaqms7ncm0pvqXLI_xjkgwSStxAET2rnZQpzba2KM&m=H2hOVqCm9svWVOW1xh7FhoURKEP-WWpWso6lKD1fLoM&s=KNOse_VUg9-BW7SyDXt1vw92n6x_B92N9SJHZKrdoIo&e= ).
> Isn't this storage system inspired from there, with TID as the TAG?
>
> It is not referenced here so made me wonder.
I don't think they're particularly similar, because that paper
describes an architecture based on using purely logical row
identifiers, which is not what a TID is. TID is a hybrid
physical/logical identifier, sometimes called a "physiological"
identifier, which will have significant overhead.
The storage system wasn't inspired by that paper, but yes, it seems it
also talks about laying out column data in B-trees, which is good to see.
But as pointed out by Peter, the main aspect the paper focuses on, saving
space for the TAG, isn't something zedstore plans to leverage, it being
more restrictive. As discussed below, we can use other alternatives to
save space.
Ashwin said that
ZedStore TIDs are logical identifiers, but I don't see how that's
compatible with a hybrid row/column design (unless you map heap TID to
logical row identifier using a separate B-Tree).
Would like to know more specifics on this, Peter. We may have different
context on the hybrid row/column design. When we said the design supports
hybrid row/column families, it did not mean mixing within the same table;
one table cannot have some data stored in row form and some in column
form. For a given table, the structure will be homogeneous. But it can
easily support storing all the columns together, or a subset of columns
together, or each column separately, all connected by TID.
The big idea with Graefe's TAG design is that there is practically no
storage overhead for these logical identifiers, because each entry's
identifier is calculated by adding its slot number to the page's
tag/low key. The ZedStore design, in contrast, explicitly stores TID
for every entry. ZedStore seems more flexible for that reason, but at
the same time the per-datum overhead seems very high to me. Maybe
prefix compression could help here, which a low key and high key can
do rather well.
Yes, the plan is to optimize away the TID space per datum, either by
prefix compression or delta compression or some other trick.
On Sun, Apr 14, 2019 at 9:40 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:
>On 11/04/2019 17:54, Tom Lane wrote:
>>Ashwin Agrawal <aagrawal@pivotal.io> writes:
>>>Thank you for trying it out. Yes, noticed for certain patterns pg_lzcompress() actually requires much larger output buffers. Like for one 86 len source it required 2296 len output buffer. Current zedstore code doesn’t handle this case and errors out. LZ4 for same patterns works fine, would highly recommend using LZ4 only, as anyways speed is very fast as well with it.
>>
>>You realize of course that *every* compression method has some inputs that
>>it makes bigger. If your code assumes that compression always produces a
>>smaller string, that's a bug in your code, not the compression algorithm.
>
>Of course. The code is not making that assumption, although clearly
>there is a bug there somewhere because it throws that error. It's
>early days..
>
>In practice it's easy to weasel out of that, by storing the data
>uncompressed, if compression would make it longer. Then you need an
>extra flag somewhere to indicate whether it's compressed or not. It
>doesn't break the theoretical limit because the actual stored length
>is then original length + 1 bit, but it's usually not hard to find a
>place for one extra bit.
>
Don't we already have that flag, though? I see ZSCompressedBtreeItem has
t_flags, and there's ZSBT_COMPRESSED, but maybe it's more complicated.
The flag ZSBT_COMPRESSED differentiates between a container (compressed)
item and a plain (uncompressed) item. The current code is written such
that within a container (compressed) item, all the data is compressed. If
a need exists to store some uncompressed data inside a container item,
then this additional flag would be required to indicate that; hence it is
different from ZSBT_COMPRESSED. I am thinking one of the ways could be to
just not store such a datum in a container item if it can't be compressed,
and instead store it as a plain item with uncompressed data; then this
additional flag won't be required. Will know more once I write the code
for this.
On Mon, Apr 15, 2019 at 09:29:37AM -0700, Ashwin Agrawal wrote: > On Sun, Apr 14, 2019 at 9:40 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote: > >On 11/04/2019 17:54, Tom Lane wrote: > >>Ashwin Agrawal <aagrawal@pivotal.io> writes: > >>>Thank you for trying it out. Yes, noticed for certain patterns > pg_lzcompress() actually requires much larger output buffers. Like for > one 86 len source it required 2296 len output buffer. Current zedstore > code doesn’t handle this case and errors out. LZ4 for same patterns > works fine, would highly recommend using LZ4 only, as anyways speed is > very fast as well with it. > >> > >>You realize of course that *every* compression method has some inputs > that > >>it makes bigger. If your code assumes that compression always > produces a > >>smaller string, that's a bug in your code, not the compression > algorithm. > > > >Of course. The code is not making that assumption, although clearly > >there is a bug there somewhere because it throws that error. It's > >early days.. > > > >In practice it's easy to weasel out of that, by storing the data > >uncompressed, if compression would make it longer. Then you need an > >extra flag somewhere to indicate whether it's compressed or not. It > >doesn't break the theoretical limit because the actual stored length > >is then original length + 1 bit, but it's usually not hard to find a > >place for one extra bit. > > > > Don't we already have that flag, though? I see ZSCompressedBtreeItem has > t_flags, and there's ZSBT_COMPRESSED, but maybe it's more complicated. > > The flag ZSBT_COMPRESSED differentiates between container (compressed) > item and plain (uncompressed item). Current code is writtten such that > within container (compressed) item, all the data is compressed. If need > exists to store some part of uncompressed data inside container item, then > this additional flag would be required to indicate the same. Hence its > different than ZSBT_COMPRESSED. I am thinking one of the ways could be to > just not store this datum in container item if can't be compressed and > just store it as plain item with uncompressed data, this additional flag > won't be required. Will know more once write code for this. I see. Perhaps it'd be better to call the flag ZSBT_CONTAINER, when it means "this is a container". And then have another flag to track whether the container is compressed or not. But as I suggested elsewhere in this thread, I think it might be better to store some ID of the compression algorithm used instead of a simple flag. FWIW when I had to deal with incremental compression (adding data into already compressed buffers), which is what seems to be happening here, I found it very useful/efficient to allow partially compressed buffers and only trigger recompressin when absolutely needed. Applied to this case, the container would first store compressed chunk, followed by raw (uncompressed) data. Say, like this: ZSContainerData { // header etc. int nbytes; /* total bytes in data */ int ncompressed; /* ncompressed <= nbytes, fully compressed when * (ncompressed == nbytes) */ char data[FLEXIBLE_ARRAY_MEMBER]; } When adding a value to the buffer, it'd be simply appended to the data array. When the container would grow too much (can't fit on the page or something), recompression is triggered. cheers -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 15, 2019 at 10:33 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Mon, Apr 15, 2019 at 09:29:37AM -0700, Ashwin Agrawal wrote:
> On Sun, Apr 14, 2019 at 9:40 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:
> >On 11/04/2019 17:54, Tom Lane wrote:
> >>Ashwin Agrawal <aagrawal@pivotal.io> writes:
> >>>Thank you for trying it out. Yes, noticed for certain patterns
> pg_lzcompress() actually requires much larger output buffers. Like for
> one 86 len source it required 2296 len output buffer. Current zedstore
> code doesn’t handle this case and errors out. LZ4 for same patterns
> works fine, would highly recommend using LZ4 only, as anyways speed is
> very fast as well with it.
> >>
> >>You realize of course that *every* compression method has some inputs
> that
> >>it makes bigger. If your code assumes that compression always
> produces a
> >>smaller string, that's a bug in your code, not the compression
> algorithm.
> >
> >Of course. The code is not making that assumption, although clearly
> >there is a bug there somewhere because it throws that error. It's
> >early days..
> >
> >In practice it's easy to weasel out of that, by storing the data
> >uncompressed, if compression would make it longer. Then you need an
> >extra flag somewhere to indicate whether it's compressed or not. It
> >doesn't break the theoretical limit because the actual stored length
> >is then original length + 1 bit, but it's usually not hard to find a
> >place for one extra bit.
> >
>
> Don't we already have that flag, though? I see ZSCompressedBtreeItem has
> t_flags, and there's ZSBT_COMPRESSED, but maybe it's more complicated.
>
> The flag ZSBT_COMPRESSED differentiates between container (compressed)
> item and plain (uncompressed item). Current code is writtten such that
> within container (compressed) item, all the data is compressed. If need
> exists to store some part of uncompressed data inside container item, then
> this additional flag would be required to indicate the same. Hence its
> different than ZSBT_COMPRESSED. I am thinking one of the ways could be to
> just not store this datum in container item if can't be compressed and
> just store it as plain item with uncompressed data, this additional flag
> won't be required. Will know more once write code for this.
I see. Perhaps it'd be better to call the flag ZSBT_CONTAINER, when it
means "this is a container". And then have another flag to track whether
the container is compressed or not. But as I suggested elsewhere in this
thread, I think it might be better to store some ID of the compression
algorithm used instead of a simple flag.
FWIW when I had to deal with incremental compression (adding data into
already compressed buffers), which is what seems to be happening here, I
found it very useful/efficient to allow partially compressed buffers and
only trigger recompressin when absolutely needed.
Applied to this case, the container would first store compressed chunk,
followed by raw (uncompressed) data. Say, like this:
ZSContainerData {
// header etc.
int nbytes; /* total bytes in data */
int ncompressed; /* ncompressed <= nbytes, fully compressed when
* (ncompressed == nbytes) */
char data[FLEXIBLE_ARRAY_MEMBER];
}
When adding a value to the buffer, it'd be simply appended to the data
array. When the container would grow too much (can't fit on the page or
something), recompression is triggered.
On Sun, Apr 14, 2019 at 12:22 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > It's not clear to me whether you're arguing for not having any such > implementation in core, or having multiple ones? I think we should aim > to have at least one in-core implementation, even if it's not the best > possible one for all sizes. It's not like our rowstore is the best > possible implementation for all cases either. I'm mostly arguing that it's too early to decide anything at this point. I'm definitely not opposed to having a column store in core. > I think having a colstore in core is important not just for adoption, > but also for testing and development of the executor / planner bits. > > If we have multiple candidates with sufficient code quality, then we may > consider including both. I don't think it's very likely to happen in the > same release, considering how much work it will require. And I have no > idea if zedstore or VOPS are / will be the only candidates - it's way > too early at this point. > > FWIW I personally plan to focus primarily on the features that aim to > be included in core, and that applies to colstores too. I agree with all of that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 15, 2019 at 11:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > TBH, I thought the reason we were expending so much effort on a tableam > API was exactly so we *wouldn't* have to include such stuff in core. > > There is a finite limit to how much stuff we can maintain as part of core. > We should embrace the notion that Postgres is an extensible system, rather > than build all the tooling for extension and then proceed to dump stuff > into core anyway. I don't agree with that at all. I expect, and hope, that there will be some table AMs maintained outside of core, and I think that's great. At the same time, it's not like we have had any great success with out-of-core index AMs, and I don't see that table AMs are likely to be any different in that regard; indeed, they may be quite a bit worse. Up until now an index has only had to worry about one kind of a table, but now a table is going to have to worry about every kind of index. Furthermore, different table AMs are going to have different needs. It has already been remarked by both Andres and on this thread that for columnar storage to really zip along, the executor is going to need to be much smarter about deciding which columns to request. Presumably there will be a market for planner/executor optimizations that postpone fetching columns for as long as possible. It's not going to be maintainable to build that kind of infrastructure in core and then have no in-core user of it. But even if it were, it would be foolish from an adoption perspective to drive away people who are trying to contribute that kind of technology to PostgreSQL. Columnar storage is a big deal. Very significant numbers of people who won't consider PostgreSQL today because the performance characteristics are not good enough for what they need will consider it if it's got something like what Ashwin and Heikki are building built in. Some of those people may be determined enough that even if the facility is out-of-core they'll be willing to download an extension and compile it, but others won't. It's already a problem that people have to go get pgbouncer and/or pgpool to do something that they kinda think the database should just handle. Columnar storage, like JSON, is not some fringe thing where we can say that the handful of people who want it can go get it: people expect that to be a standard offering, and they wonder why PostgreSQL hasn't got it yet. > >> If we have multiple candidates with sufficient code quality, then we may > >> consider including both. > > Dear god, no. I hate to pick on any particular part of the tree, but it seems entirely plausible to me that a second columnar storage implementation could deliver more incremental value than spgist, an index AM you committed. We should not move the goal posts into the stratosphere here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 15, 2019 at 10:50:21AM -0700, Ashwin Agrawal wrote: > On Mon, Apr 15, 2019 at 10:33 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > ... > > I see. Perhaps it'd be better to call the flag ZSBT_CONTAINER, when it > means "this is a container". And then have another flag to track whether > the container is compressed or not. But as I suggested elsewhere in this > thread, I think it might be better to store some ID of the compression > algorithm used instead of a simple flag. > > FWIW when I had to deal with incremental compression (adding data into > already compressed buffers), which is what seems to be happening here, I > found it very useful/efficient to allow partially compressed buffers and > only trigger recompressin when absolutely needed. > > Applied to this case, the container would first store compressed chunk, > followed by raw (uncompressed) data. Say, like this: > > ZSContainerData { > > // header etc. > > int nbytes; /* total bytes in data */ > int ncompressed; /* ncompressed <= nbytes, fully compressed when > * (ncompressed == nbytes) */ > > char data[FLEXIBLE_ARRAY_MEMBER]; > } > > When adding a value to the buffer, it'd be simply appended to the data > array. When the container would grow too much (can't fit on the page or > something), recompression is triggered. > > I think what you suggested here is exactly how its handled currently, just > the mechanics are little different. Plain items are added to page as > insertions are performed. Then when page becomes full, compression is > triggerred container item is created for them to store the compressed > data. Then new insertions are stored as plain items, once again when page > becomes full, they are compressed and container item created for it. So, > never, compressed data is attempted to be compressed again. So, on page > plain items are acting as data section you mentioned above. A page can > have mix of n plain and n container items. Maybe. I'm not going to pretend I fully understand the internals. Does that mean the container contains ZSUncompressedBtreeItem as elements? Or just the plain Datum values? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2019-04-15 11:10:38 -0400, Tom Lane wrote: > Stephen Frost <sfrost@snowman.net> writes: > > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > >> I think having a colstore in core is important not just for adoption, > >> but also for testing and development of the executor / planner bits. > > > Agreed. > > TBH, I thought the reason we were expending so much effort on a tableam > API was exactly so we *wouldn't* have to include such stuff in core. I think it's mostly orthogonal. We need something like tableam to have multiple types of storage options for tables - independent of whether they are in core. And sure, we could have maybe reduced the effort a bit here and there by e.g. not allowing AMs to be dynamlically loaded, or writing fewer comments or such. > There is a finite limit to how much stuff we can maintain as part of core. > We should embrace the notion that Postgres is an extensible system, rather > than build all the tooling for extension and then proceed to dump stuff > into core anyway. I totally agree that that's something we should continue to focus on. I personally think we *already* *have* embraced that - pretty heavily so. And sometimes to the detriment of our users. I think there's a pretty good case for e.g. *one* column store in-core. For one there is a huge portion of existing postgres workloads that benefit from them (often not for all tables, but some). Relatedly, it's also one of the more frequent reasons why users can't migrate to postgres / have to migrate off. And from a different angle, there's plenty planner and executor work to be done to make column stores fast - and that can't really be done nicely outside of core; and doing the improvements in core without a user there is both harder, less likely to be accepted, and more likely to regress. > >> If we have multiple candidates with sufficient code quality, then we may > >> consider including both. > > Dear god, no. Yea, I don't see much point in that. Unless there's a pretty fundamental reason why one columnar AM can't fullfill two different workloads (e.g. by having options that define how things are laid out / compressed / whatnot), I think that'd be a *terrible* idea. By that logic we'd just get a lot of AMs with a few differences in some workloads, and our users would be unable to choose one, and all of them would suck. I think one such fundamental difference is e.g. the visibility management for an in-line mvcc approach like heap, and an undo-based mvcc row-store (like zheap) - it's very hard to imagine meaningful code savings by having those combined into one AM. I'm sure we can find similar large architectural issues for some types of columnar AMs - but I'm far far from convinced that there's enough distinctive need for two different approaches in postgres. Without having heap historically and the desire for on-disk compat, I can't quite see being convinced that we should e.g. add a store like heap if we already had zheap. I think it's perfectly reasonable to have in-core AMs try to optimize ~80% for a lot of different [sets of] workloads, even if a very specialized AM could be optimized for it much further. Greetings, Andres Freund
Hi, On 2019-04-15 14:11:02 -0400, Robert Haas wrote: > Furthermore, different table AMs are going to have different > needs. It has already been remarked by both Andres and on this thread > that for columnar storage to really zip along, the executor is going > to need to be much smarter about deciding which columns to request. > Presumably there will be a market for planner/executor optimizations > that postpone fetching columns for as long as possible. It's not > going to be maintainable to build that kind of infrastructure in core > and then have no in-core user of it. Right. Two notes on that: A lot of that infrastructure needed for fast query execution (both plan time and execution time) is also going to be useful for a row store like heap, even though it won't have the ~order-of-magnitude impacts it can have for column stores. Secondly, even without those, the storage density alone can make column stores worthwhile, even without query execution speedups (or even slowdowns). Greetings, Andres Freund
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Apr 15, 2019 at 11:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> There is a finite limit to how much stuff we can maintain as part of core. > I don't agree with that at all. Really? Let's have a discussion of how thermodynamics applies to software management sometime. >>> If we have multiple candidates with sufficient code quality, then we may >>> consider including both. >> Dear god, no. > I hate to pick on any particular part of the tree, but it seems > entirely plausible to me that a second columnar storage implementation > could deliver more incremental value than spgist, an index AM you > committed. Yeah, and that's something I've regretted more than once; I think SP-GiST is a sterling example of something that isn't nearly useful enough in the real world to justify the amount of maintenance effort we've been forced to expend on it. You might trawl the commit logs to get a sense of the amount of my own personal time --- not that of the original submitters --- that's gone into that one module. Then ask yourself how much that model will scale, and what other more-useful things I could've accomplished with that time. We do need to limit what we accept into core PG. I do not buy your argument that users expect everything to be in core. Or more accurately, the people who do think that way won't be using PG anyway --- they'll be using MSSQL because it comes from their OS vendor. regards, tom lane
Hi, On 2019-04-15 14:11:02 -0400, Robert Haas wrote: > I hate to pick on any particular part of the tree, but it seems > entirely plausible to me that a second columnar storage implementation > could deliver more incremental value than spgist, an index AM you > committed. We should not move the goal posts into the stratosphere > here. Oh, I forgot: I agree that we don't need to be absurdly picky - but I also think that table storage is much more crucial to get right than index storage, which is already plenty crucial. Especially when that type of index is not commonly usable for constraints. It really sucks to get wrong query results due to a corrupted index / wrong index implementation - but if you have table AM level corruption, you're *really* in a dark place. There's no way to just REINDEX and potentially recover most information with a bit of surgery. Sure there can be app level consequences to wrong query results that can be really bad, and lead to very permanent data loss. On-disk compat is also much more important for table level data - it's annoying to have to reindex indexes after an upgrade, but at least it can be done concurrently after the most important indexes are operational. Greetings, Andres Freund
On Mon, Apr 15, 2019 at 11:18 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Maybe. I'm not going to pretend I fully understand the internals. Does
that mean the container contains ZSUncompressedBtreeItem as elements? Or
just the plain Datum values?
The container contains ZSUncompressedBtreeItem as elements. Each item has to store meta-data like size, undo pointer and such info, and we don't wish to restrict compression to items from the same insertion session only. Hence, no, it doesn't just store Datum values. We want to treat these as tuple-level operations, with per-tuple meta-data, and be able to work at tuple-level granularity rather than block level.
Definitely many more tricks can be, and need to be, applied to optimize the storage format; for example, for fixed-width columns there is no need to store the size in every item. "Keep it simple" is the theme we have been trying to maintain. Compression should ideally handle duplicated data pretty easily and efficiently as well, but we will try to optimize as much as we can without relying on it.
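To make the two item kinds a bit more concrete, here is a minimal C sketch of what a plain item and a container item could look like, plus the page-full compression flow described above. All struct and field names here are illustrative assumptions, not the actual zedstore definitions.

#include <stdint.h>

typedef uint64_t zstid;            /* 48-bit logical row identifier, held in 64 bits */

/* Plain item: one uncompressed datum with its own meta-data (size, undo, ...). */
typedef struct ZSPlainItemSketch
{
    zstid       tid;               /* logical row identifier */
    uint16_t    size;              /* total item size in bytes */
    uint16_t    flags;             /* item kind, NULL flag, ... */
    uint64_t    undo_ptr;          /* visibility information lives in UNDO */
    char        payload[];         /* the datum bytes */
} ZSPlainItemSketch;

/* Container item: a compressed run of plain items. */
typedef struct ZSContainerItemSketch
{
    zstid       tid;               /* first TID covered by the container */
    zstid       last_tid;          /* last TID covered by the container */
    uint16_t    size;              /* compressed size on the page */
    uint16_t    flags;             /* "container" bit plus, say, a compression-algorithm id */
    uint32_t    uncompressed_size;
    char        payload[];         /* compressed stream of plain items */
} ZSContainerItemSketch;

/*
 * Insertion flow described above, in pseudo-code:
 *   - new rows are appended to the leaf page as plain items;
 *   - when the page fills up, the accumulated plain items are compressed
 *     into a single new container item; existing containers are left
 *     untouched, so data is never compressed twice;
 *   - a leaf page can therefore hold a mix of container and plain items.
 */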
On Mon, Apr 15, 2019 at 11:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > We do need to limit what we accept into core PG. I do not buy your > argument that users expect everything to be in core. Or more accurately, > the people who do think that way won't be using PG anyway --- they'll > be using MSSQL because it comes from their OS vendor. I am also concerned by the broad scope of ZedStore, and I tend to agree that it will be difficult to maintain in core. At the same time, I think that Andres and Robert are probably right about the difficulty of maintaining it outside of core -- that would be difficult to impossible as a practical matter. Unfortunately, you're both right. I don't know where that leaves us. -- Peter Geoghegan
On Mon, Apr 15, 2019 at 2:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Mon, Apr 15, 2019 at 11:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> There is a finite limit to how much stuff we can maintain as part of core. > > > I don't agree with that at all. > > Really? Let's have a discussion of how thermodynamics applies to > software management sometime. Sounds like an interesting discussion, perhaps for PGCon, but what I was actually disagreeing with was the idea that we should add a table AM interface and then not accept any new table AMs, which I think would be silly. And if we're going to accept any, a columnar one seems like a strong candidate. > Yeah, and that's something I've regretted more than once; I think SP-GiST > is a sterling example of something that isn't nearly useful enough in the > real world to justify the amount of maintenance effort we've been forced > to expend on it. You might trawl the commit logs to get a sense of the > amount of my own personal time --- not that of the original submitters --- > that's gone into that one module. Then ask yourself how much that model > will scale, and what other more-useful things I could've accomplished > with that time. Yep, that's fair. > We do need to limit what we accept into core PG. I do not buy your > argument that users expect everything to be in core. Or more accurately, > the people who do think that way won't be using PG anyway --- they'll > be using MSSQL because it comes from their OS vendor. I agree that we need to be judicious in what we accept, but I don't agree that we should therefore accept nothing. There are lots of things that we could put in core and users would like it that I'm glad we haven't put in core. I think you might be surprised at the number of people who normally want everything from a single source but are still willing to consider PostgreSQL; vendors like my employer help to smooth the road for such people. Still, I don't think there is any major database product other than PostgreSQL that ships only a single table storage format and just expects that it will be good enough for everyone. Like 640kB, it just isn't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Peter Geoghegan <pg@bowt.ie> writes: > I am also concerned by the broad scope of ZedStore, and I tend to > agree that it will be difficult to maintain in core. At the same time, > I think that Andres and Robert are probably right about the difficulty > of maintaining it outside of core -- that would be difficult to > impossible as a practical matter. Perhaps, but we won't know if we don't try. I think we should try, and be willing to add hooks and flexibility to core as needed to make it possible. Adding such flexibility would be good for other outside projects that have no chance of (or perhaps no interest in) getting into core, even if we end up deciding that ZedStore or some other specific implementation is so useful that it does belong in core. regards, tom lane
Hi, On 2019-04-15 14:35:43 -0400, Tom Lane wrote: > Yeah, and that's something I've regretted more than once; I think SP-GiST > is a sterling example of something that isn't nearly useful enough in the > real world to justify the amount of maintenance effort we've been forced > to expend on it. You might trawl the commit logs to get a sense of the > amount of my own personal time --- not that of the original submitters --- > that's gone into that one module. Then ask yourself how much that model > will scale, and what other more-useful things I could've accomplished > with that time. I do agree that the [group of] contributor's history of maintaining such work should play a role. And I think that's doubly so with a piece as crucial as a table AM. But: > We do need to limit what we accept into core PG. I do not buy your > argument that users expect everything to be in core. Or more accurately, > the people who do think that way won't be using PG anyway --- they'll > be using MSSQL because it comes from their OS vendor. I don't think anybody disagrees with that, actually. Including Robert. But I don't think it follows that we shouldn't provide things that are either much more reasonably done in core like a pooler (authentication / encryption; infrastructure for managing state like prepared statements, GUCs; avoiding issues of explosion of connection counts with pooling in other places), are required by a very significant portion of our users (imo the case for a columnar store or a row store without the architectural issues of heap), or where it's hard to provide the necessary infrastructure without an in-core user (imo also the case with columnar, due to the necessary planner / executor improvements for fast query execution). We also have at times pretty explicitly resisted making crucial pieces of infrastructure usable outside of core. E.g. because it's legitimately hard (grammar extensibility), or because we'd some concerns around stability and the exact approach (WAL - the generic stuff is usable for anything that wants to even be somewhat efficient, some xlog integration). So there's several types of extensions that one realistically cannot do out of core, by our choice. Greetings, Andres Freund
On Mon, Apr 15, 2019 at 12:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Perhaps, but we won't know if we don't try. I think we should try, > and be willing to add hooks and flexibility to core as needed to make > it possible. We could approach it without taking a firm position on inclusion in core until the project begins to mature. I have little faith in our ability to predict which approach will be the least painful at this early stage. -- Peter Geoghegan
Andres Freund <andres@anarazel.de> writes: > We also have at times pretty explicitly resisted making crucial pieces > of infrastructure usable outside of core. E.g. because it's legitimately > hard (grammar extensibility), or because we'd some concerns around > stability and the exact approach (WAL - the generic stuff is usable for > anything that wants to even be somewhat efficient, some xlog > integration). So there's several types of extensions that one > realistically cannot do out of core, by our choice. Well, the grammar issue comes from a pretty specific technical problem: bison grammars don't cope with run-time extension, and moving off of bison would cost a lot of work, and probably more than just work (i.e., probable loss of ability to detect grammar ambiguity). WAL extensibility likewise has some technical issues that are hard to surmount (how do you find the code for replaying an extension WAL record, when you can't read catalogs). I think we could fix the latter, it's just that no one has yet troubled to expend the effort. Similarly, things like the planner's hard-wired handling of physical-tlist optimization are certainly a problem for column stores, but I think the way to solve that is to provide an actual extension capability, not merely replace one hard-wired behavior with two. As a counterpoint to my gripe about SP-GiST being a time sink, I do not think I'll regret the time I spent a few months ago on implementing "planner support function" hooks. I'm all in favor of adding flexibility like that. regards, tom lane
Hi, On 2019-04-15 15:19:41 -0400, Tom Lane wrote: > Peter Geoghegan <pg@bowt.ie> writes: > > I am also concerned by the broad scope of ZedStore, and I tend to > > agree that it will be difficult to maintain in core. At the same time, > > I think that Andres and Robert are probably right about the difficulty > > of maintaining it outside of core -- that would be difficult to > > impossible as a practical matter. > > Perhaps, but we won't know if we don't try. I think we should try, > and be willing to add hooks and flexibility to core as needed to make > it possible. Adding such flexibility would be good for other outside > projects that have no chance of (or perhaps no interest in) getting into > core, even if we end up deciding that ZedStore or some other specific > implementation is so useful that it does belong in core. I don't think anybody argued against providing that flexibility. I think we should absolutely do so - but that's imo not an argument against integrating something like a hypothetical well developed columnstore to core. I worked on tableam, which certainly provides a lot of new extensibility, because it was the sane architecture to able to integrate zheap. The current set of UNDO patches (developed for zheap), while requiring core integration for xact.c etc, co-initiated improvements to make the checkpointer fsync being closer to extensible and UNDO as currently developed would be extensible if WAL was extensible as it's tied to rmgrlist.h. And the improvements necessary to make query executions for in-core columnar AM faster, would largely also be applicable for out-of-core columnar AMs, and I'm sure we'd try to make the necessary decisions not hardcoded if reasonable. I think it's actually really hard to actually make something non-trivial extensible without there being a proper in-core user of most of that infrastructure. Greetings, Andres Freund
On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote: > Would like to know more specifics on this Peter. We may be having different context on hybrid row/column design. I'm confused about how close your idea of a TID is to the traditional definition from heapam (and even zheap). If it's a purely logical identifier, then why would it have two components like a TID? Is that just a short-term convenience or something? > Yes, the plan to optimize out TID space per datum, either by prefix compression or delta compression or some other trick. It would be easier to do this if you knew for sure that the TID behaves almost the same as a bigserial column -- a purely logical monotonically increasing identifier. That's why I'm interested in what exactly you mean by TID, the stability of a TID value, etc. If a leaf page almost always stores a range covering no more than few hundred contiguous logical values, you can justify aggressively compressing the representation in the B-Tree entries. Compression would still be based on prefix compression, but the representation itself can be specialized. -- Peter Geoghegan
Hi, On 2019-04-15 12:50:14 -0700, Peter Geoghegan wrote: > On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote: > > Would like to know more specifics on this Peter. We may be having different context on hybrid row/column design. > > I'm confused about how close your idea of a TID is to the traditional > definition from heapam (and even zheap). If it's a purely logical > identifier, then why would it have two components like a TID? Is that > just a short-term convenience or something? There's not much of an alternative currently. Indexes require tid looking things, and as a consequence (and some other comparatively small changes that'd be required) tableam does too. And there's a few places that imbue additional meaning into the higher bits of ip_posid too, so not all of them are valid (It can't currently be zero - or ItemPointerIsValid fails, it can't be larger than MaxOffsetNumber - that's used to allocate things in e.g. indexes, tidbmap.c etc). That's one of the reasons why I've been trying to get you to get on board with allowing different leaf-level "item pointer equivalents" widths inside nbtree... Greetings, Andres Freund
On Mon, Apr 15, 2019 at 1:02 PM Andres Freund <andres@anarazel.de> wrote: > There's not much of an alternative currently. Indexes require tid > looking things, and as a consequence (and some other comparatively small > changes that'd be required) tableam does too. I'm trying to establish whether or not that's the only reason. It might be okay to use the same item pointer struct as the representation of a integer-like logical identifier. Even if it isn't, I'm still interested in just how logical the TIDs are, because it's an important part of the overall design. > That's one of the reasons why I've been trying to get you to get on > board with allowing different leaf-level "item pointer equivalents" > widths inside nbtree... Getting me to agree that that would be nice and getting me to do the work are two very different things. ;-) -- Peter Geoghegan
On Mon, Apr 15, 2019 at 11:57:49AM -0700, Ashwin Agrawal wrote: > On Mon, Apr 15, 2019 at 11:18 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > Maybe. I'm not going to pretend I fully understand the internals. Does > that mean the container contains ZSUncompressedBtreeItem as elements? Or > just the plain Datum values? > > First, your reading of code and all the comments/questions so far have > been highly encouraging. Thanks a lot for the same. ;-) > Container contains ZSUncompressedBtreeItem as elements. As for Item will > have to store meta-data like size, undo and such info. We don't wish to > restrict compressing only items from same insertion sessions only. Hence, > yes doens't just store Datum values. Wish to consider it more tuple level > operations and have meta-data for it and able to work with tuple level > granularity than block level. OK, thanks for the clarification, that somewhat explains my confusion. So if I understand it correctly, ZSCompressedBtreeItem is essentially a sequence of ZSUncompressedBtreeItem(s) stored one after another, along with some additional top-level metadata. > Definitely many more tricks can be and need to be applied to optimize > storage format, like for fixed width columns no need to store the size in > every item. Keep it simple is theme have been trying to maintain. > Compression ideally should compress duplicate data pretty easily and > efficiently as well, but we will try to optimize as much we can without > the same. I think there's plenty of room for improvement. The main problem I see is that it mixes different types of data, which is bad for compression and vectorized execution. I think we'll end up with a very different representation of the container, essentially decomposing the items into arrays of values of the same type - array of TIDs, array of undo pointers, buffer of serialized values, etc. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
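As a rough illustration of the decomposed representation suggested above (fields of the same type grouped into parallel arrays), here is a hypothetical sketch; the struct name, field names and layout are assumptions for illustration only.

#include <stdint.h>

typedef uint64_t zstid;

/*
 * Decomposed container: rather than a compressed stream of whole items,
 * each per-tuple field is stored in its own array.  Same-typed data
 * compresses better and can be handed to vectorized execution directly.
 */
typedef struct ZSDecomposedContainerSketch
{
    zstid       first_tid;      /* base for delta/prefix-compressed TIDs */
    uint32_t    nitems;

    /*
     * The flexible data area conceptually holds, back to back:
     *   uint16  tid_deltas[nitems];     gaps from first_tid
     *   uint64  undo_ptrs[nitems];      or one shared pointer if all equal
     *   uint32  value_offsets[nitems];  offsets into the value buffer
     *   char    values[];               serialized datums of one column
     */
    char        data[];
} ZSDecomposedContainerSketch;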
On Mon, Apr 15, 2019 at 12:50 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
> Would like to know more specifics on this Peter. We may be having different context on hybrid row/column design.
I'm confused about how close your idea of a TID is to the traditional
definition from heapam (and even zheap). If it's a purely logical
identifier, then why would it have two components like a TID? Is that
just a short-term convenience or something?
The TID is a purely logical identifier. Hence, as stated in the initial email, for a Zedstore TID the block-number/offset split carries no meaning at all. It's purely a 48-bit integer assigned to the datum of the first column during insertion, based on where it gets inserted in the B-tree. The rest of the column datums are inserted using this assigned TID value. Only the rest-of-system restrictions discussed by Heikki and Andres on the table AM thread limit the values it can currently carry; otherwise, from the Zedstore design perspective, it is just an integer.
> Yes, the plan to optimize out TID space per datum, either by prefix compression or delta compression or some other trick.
It would be easier to do this if you knew for sure that the TID
behaves almost the same as a bigserial column -- a purely logical
monotonically increasing identifier. That's why I'm interested in what
exactly you mean by TID, the stability of a TID value, etc. If a leaf
page almost always stores a range covering no more than few hundred
contiguous logical values, you can justify aggressively compressing
the representation in the B-Tree entries. Compression would still be
based on prefix compression, but the representation itself can be
specialized.
Yes, it's definitely a logical, increasing number. With only inserts, the number is monotonically increasing. With deletes and updates, an insert could reuse previously freed TID values. Since the TID is logical, datums can easily be moved around to split or merge pages as required.
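For illustration only, here is one way a purely logical 48-bit TID could be packed into the (block, offset) shape that indexes and tableam currently expect, skipping offset zero as mentioned upthread. The encoding, the names and the 65535 limit are assumptions; the real constraints (MaxOffsetNumber and other users of the high posid bits) are tighter.

#include <assert.h>
#include <stdint.h>

typedef uint64_t zstid;                              /* 48-bit logical row id */

/* Simplified stand-in for the BlockNumber/OffsetNumber pair in ItemPointerData. */
typedef struct { uint32_t block; uint16_t offset; } FakeItemPointer;

#define SKETCH_MAX_OFFSET 65535u    /* real systems allow less (MaxOffsetNumber etc.) */

/* Map a logical TID to a fake (block, offset) pair; offset is never zero. */
FakeItemPointer
zstid_to_itemptr(zstid tid)
{
    FakeItemPointer ip;

    /* roughly the 48-bit range, kept small enough that block fits in 32 bits */
    assert(tid != 0 && tid <= ((zstid) SKETCH_MAX_OFFSET << 32));
    ip.block  = (uint32_t) ((tid - 1) / SKETCH_MAX_OFFSET);
    ip.offset = (uint16_t) ((tid - 1) % SKETCH_MAX_OFFSET + 1);
    return ip;
}

/* Inverse mapping: recover the logical TID from the fake item pointer. */
zstid
itemptr_to_zstid(FakeItemPointer ip)
{
    return (zstid) ip.block * SKETCH_MAX_OFFSET + ip.offset;
}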
On Mon, Apr 15, 2019 at 10:45:51PM -0700, Ashwin Agrawal wrote: >On Mon, Apr 15, 2019 at 12:50 PM Peter Geoghegan <pg@bowt.ie> wrote: > >> On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal <aagrawal@pivotal.io> >> wrote: >> > Would like to know more specifics on this Peter. We may be having >> different context on hybrid row/column design. >> >> I'm confused about how close your idea of a TID is to the traditional >> definition from heapam (and even zheap). If it's a purely logical >> identifier, then why would it have two components like a TID? Is that >> just a short-term convenience or something? >> > >TID is purely logical identifier. Hence, stated in initial email that for >Zedstore TID, block number and offset split carries no meaning at all. It's >purely 48 bit integer entity assigned to datum of first column during >insertion, based on where in BTree it gets inserted. Rest of the column >datums are inserted using this assigned TID value. Just due to rest to >system restrictions discussed by Heikki and Andres on table am thread poses >limitations of value it can carry currently otherwise from zedstore design >perspective it just integer number. > I'm not sure it's that clear cut, actually. Sure, it's not the usual (block,item) pair so it's not possible to jump to the exact location, so it's not the raw physical identifier as regular TID. But the data are organized in a btree, with the TID as a key, so it does actually provide some information about the location. I've asked about BRIN indexes elsewhere in this thread, which I think is related to this question, because that index type relies on TID providing sufficient information about location. And I think BRIN indexes are going to be rather important for colstores (and formats like ORC have something very similar built-in). But maybe all we'll have to do is define the ranges differently - instead of "number of pages" we may define them as "number of rows" and it might be working. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 16, 2019 at 9:15 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
I'm not sure it's that clear cut, actually. Sure, it's not the usual
(block,item) pair so it's not possible to jump to the exact location, so
it's not the raw physical identifier as regular TID. But the data are
organized in a btree, with the TID as a key, so it does actually provide
some information about the location.
From a representation perspective it's a logical identifier. But yes,
since it is used as the key to lay out the datums, there is a pretty
good correlation between TIDs and physical location. One can consider
the data as clustered based on TID.
I've asked about BRIN indexes elsewhere in this thread, which I think is
related to this question, because that index type relies on TID providing
sufficient information about location. And I think BRIN indexes are going
to be rather important for colstores (and formats like ORC have something
very similar built-in).
But maybe all we'll have to do is define the ranges differently - instead
of "number of pages" we may define them as "number of rows" and it might
be working.
A BRIN block range would be just a range of TIDs in zedstore, as
pointed out above. When one converts a zstid to an ItemPointer, you
get a "block number" from the ItemPointer, like from a normal heap
TID. It doesn't denote the direct physical location of the row in
zedstore, but that's fine.
It might be sub-optimal in some cases. For example, if one zedstore
page contains TIDs 1-1000 and another 1000-2000, and the entry in the
BRIN index covers TIDs 500-1500, we have to access both zedstore
pages. It would be better if the cutoff points in the BRIN index
matched the physical pages of the zedstore, but it still works, and
is probably fine in practice.
The plan is to add an integrated BRIN-like index in zedstore, meaning
min-max values are kept for appropriate columns within the page. This
will not help to eliminate the IO the way an external BRIN index
does, but it helps to skip decompression, visibility checks, etc. for
blocks not matching the conditions.
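A hypothetical sketch of what such integrated per-page min/max metadata could look like and how a scan might consult it before decompressing a page; the structure and function here are assumptions, not the actual design.

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-page "zone map" entry for one integer column, kept in
 * the page's metadata and checked before decompressing container items
 * or doing visibility checks for that page.
 */
typedef struct ZSPageMinMaxSketch
{
    bool    has_values;     /* false if the page holds only NULLs */
    int64_t min_val;
    int64_t max_val;
} ZSPageMinMaxSketch;

/* Can a qual "col BETWEEN lo AND hi" possibly match anything on this page? */
bool
zs_page_may_match(const ZSPageMinMaxSketch *mm, int64_t lo, int64_t hi)
{
    if (!mm->has_values)
        return false;       /* a page of NULLs never matches a range qual */
    return mm->max_val >= lo && mm->min_val <= hi;
}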
Just to showcase that BRIN works for zedstore, I played with the
hands-on example mentioned in [1].
With btree index on zedstore
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4351.50..4351.51 rows=1 width=32) (actual time=1267.140..1267.140 rows=1 loops=1)
-> Index Scan using idx_ztemperature_log_log_timestamp on ztemperature_log (cost=0.56..4122.28 rows=91686 width=4) (actual time=0.117..1244.112 rows=86400 loops=1)
Index Cond: ((log_timestamp >= '2016-04-04 00:00:00'::timestamp without time zone) AND (log_timestamp < '2016-04-05 00:00:00'::timestamp without time zone))
Planning Time: 0.240 ms
Execution Time: 1269.016 ms
(5 rows)
With brin index on zedstore.
Note: the bitmap scan on zedstore currently scans all the columns.
Scanning only the columns required for the query is yet to be
implemented.
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=217538.85..217538.86 rows=1 width=32) (actual time=54.167..54.167 rows=1 loops=1)
-> Gather (cost=217538.63..217538.84 rows=2 width=32) (actual time=53.967..55.184 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=216538.63..216538.64 rows=1 width=32) (actual time=42.956..42.957 rows=1 loops=3)
-> Parallel Bitmap Heap Scan on ztemperature_log (cost=59.19..216446.98 rows=36660 width=4) (actual time=3.571..35.904 rows=28800 loops=3)
Recheck Cond: ((log_timestamp >= '2016-04-04 00:00:00'::timestamp without time zone) AND (log_timestamp < '2016-04-05 00:00:00'::timestamp without time zone))
Rows Removed by Index Recheck: 3968
Heap Blocks: lossy=381
-> Bitmap Index Scan on idx_ztemperature_log_log_timestamp (cost=0.00..37.19 rows=98270 width=0) (actual time=1.201..1.201 rows=7680 loops=1)
Index Cond: ((log_timestamp >= '2016-04-04 00:00:00'::timestamp without time zone) AND (log_timestamp < '2016-04-05 00:00:00'::timestamp without time zone))
Planning Time: 0.240 ms
Execution Time: 55.341 ms
(13 rows)
schema_name | index_name | index_ratio | index_size | table_size
-------------+------------------------------------+-------------+------------+------------
public | idx_ztemperature_log_log_timestamp | 0 | 80 kB | 1235 MB
(1 row)
1] https://www.postgresql.fastware.com/blog/brin-indexes-what-are-they-and-how-do-you-use-them
On 15/04/2019 22:32, Peter Geoghegan wrote: > On Mon, Apr 15, 2019 at 12:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Perhaps, but we won't know if we don't try. I think we should try, >> and be willing to add hooks and flexibility to core as needed to make >> it possible. > > We could approach it without taking a firm position on inclusion in > core until the project begins to mature. I have little faith in our > ability to predict which approach will be the least painful at this > early stage. When we started hacking on this, we went in with the assumption that this would have to be in core, because WAL-logging, and also because a column-store will probably need some changes to the planner and executor to make it shine. And also because a lot of people would like to have a column store in PostgreSQL (although a "column store" could mean many different things with different tradeoffs). But if we just have all the necessary hooks in core, sure, this could be an extension, too. But as you said, we don't need to decide that yet. Let's wait and see, as this matures. - Heikki
We (Heikki, me and Melanie) are continuing to build Zedstore. Wish to
share the recent additions and modifications. Attaching a patch
with the latest code. Link to github branch [1] to follow
along. The approach we have been leaning towards is to build required
functionality, get passing the test and then continue to iterate to
optimize the same. It's still work-in-progress.
Sharing the details now, as have reached our next milestone for
Zedstore. All table AM API's are implemented for Zedstore (except
compute_xid_horizon_for_tuples, seems need test for it first).
Current State:
- A new type of item added to Zedstore "Array item", to boost
compression and performance. Based on Konstantin's performance
experiments [2] and inputs from Tomas Vondra [3], this is
added. Array item holds multiple datums, with consecutive TIDs and
the same visibility information. An array item saves space compared
to multiple single items, by leaving out repetitive UNDO and TID
fields. An array item cannot mix NULLs and non-NULLs. So, those
experiments should result in improved performance now. Inserting
data via COPY creates array items currently. Code for insert has not
been modified from last time. Making singleton inserts or insert
into select, performant is still on the todo list.
- Now we have a separate and dedicated meta-column btree alongside
rest of the data column btrees. This special or first btree for
meta-column is used to assign TIDs for tuples, track the UNDO
location which provides visibility information. Also, this special
btree, which always exists, helps to support zero-column tables
(which can be a result of ADD COLUMN DROP COLUMN actions as
well). Plus, having meta-data stored separately from data, helps to
get better compression ratios. And also helps to further simplify
the overall design/implementation as for deletes just need to edit
the meta-column and avoid touching the actual data btrees. Index
scans can just perform visibility checks based on this meta-column
and fetch required datums only for visible tuples. For tuple locks
also just need to access this meta-column only. Previously, every
column btree used to carry the same undo pointer. Thus visibility
check could be potentially performed, with the past layout, using
any column. But considering overall simplification new layout
provides it's fine to give up on that aspect. Having dedicated
meta-column highly simplified handling for add columns with default
and null values, as this column deterministically provides all the
TIDs present in the table, which can't be said for any other data
columns due to default or null values during add column.
- Free Page Map implemented. The Free Page Map keeps track of unused
pages in the relation. The FPM is also a b-tree, indexed by physical
block number. To be more compact, it stores "extents", i.e. block
ranges, rather than just blocks, when possible. An interesting paper [4] on
how modern filesystems manage space acted as a good source for ideas.
- Tuple locks implemented
- Serializable isolation handled
- With "default_table_access_method=zedstore"
- 31 out of 194 failing regress tests
- 10 out of 86 failing isolation tests
Many of the current failing tests are due to plan differences, like
Index scans selected for zedstore over IndexOnly scans, as zedstore
doesn't yet have visibility map. I am yet to give a thought on
index-only scans. Or plan diffs due to table size differences between
heap and zedstore.
Next few milestones we wish to hit for Zedstore:
- Make check regress green
- Make check isolation green
- Zedstore crash safe (means also replication safe). Implement WAL
logs
- Performance profiling and optimizations for Insert, Selects, Index
Scans, etc...
- Once UNDO framework lands in Upstream, Zedstore leverages it instead
of its own version of UNDO
Open questions / discussion items:
- how best to get "column projection list" from planner? (currently,
we walk plan and find the columns required for the query in
the executor, refer GetNeededColumnsForNode())
- how to pass the "column projection list" to table AM? (as stated in
initial email, currently we have modified table am API to pass the
projection to AM)
- TID treated as (block, offset) in current indexing code
- Physical tlist optimization? (currently, we disabled it for
zedstore)
Team:
Melanie joined Heikki and me to write code for zedstore. Majority of
the code continues to be contributed by Heikki. We are continuing to
have fun building column store implementation and iterate
aggressively.
References:
1] https://github.com/greenplum-db/postgres/tree/zedstore
2] https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
3] https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
4] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf
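To illustrate the meta-column / data-column split and the new array items described in the status above, here is a rough, hypothetical C sketch; the field names are assumptions and the real zedstore structs will differ.

#include <stdint.h>

typedef uint64_t zstid;

/*
 * Meta-column leaf entry: owns TID allocation and the UNDO pointer, i.e.
 * all visibility information.  Deletes and tuple locks touch only this
 * B-tree; the data-column B-trees are left alone.
 */
typedef struct ZSMetaItemSketch
{
    zstid       first_tid;
    uint32_t    ntids;          /* consecutive TIDs covered by this entry */
    uint64_t    undo_ptr;       /* shared visibility information */
} ZSMetaItemSketch;

/*
 * Data-column "array item": datums of one column for consecutive TIDs with
 * the same visibility, avoiding a repeated UNDO/TID field per datum.
 * NULLs and non-NULLs are not mixed within one item.
 */
typedef struct ZSArrayItemSketch
{
    zstid       first_tid;
    uint32_t    ndatums;
    uint16_t    flags;          /* e.g. an "all values are NULL" bit */
    char        datums[];       /* packed datums of one column */
} ZSArrayItemSketch;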
Hi Ashwin,
- how to pass the "column projection list" to table AM? (as stated in
initial email, currently we have modified table am API to pass the
projection to AM)
We were working on a similar columnar storage using pluggable APIs; one idea that we thought of was to modify the scan slot based on the targetlist to have only the relevant columns in the scan descriptor. This way the table AMs are passed a slot with only relevant columns in the descriptor. Today we do something similar to the result slot using ExecInitResultTypeTL(), now do it to the scan tuple slot as well. So somewhere after creating the scan slot using ExecInitScanTupleSlot(), call a table am handler API to modify the scan tuple slot based on the targetlist, a probable name for the new table am handler would be: exec_init_scan_slot_tl(PlanState *planstate, TupleTableSlot *slot).
So this way the scan am handlers like getnextslot is passed a slot only having the relevant columns in the scan descriptor. One issue though is that the beginscan is not passed the slot, so if some memory allocation needs to be done based on the column list, it can't be done in beginscan. Let me know what you think.
regards,
Ajin Cherian
Fujitsu Australia
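A minimal sketch of the callback shape proposed above; the callback name comes from the proposal itself, while the stand-in types and everything else are assumptions for illustration.

/*
 * Stand-ins so the sketch is self-contained; in PostgreSQL these would be
 * PlanState, TupleTableSlot and the table AM's callback struct.
 */
typedef struct PlanStateSketch PlanStateSketch;
typedef struct TupleTableSlotSketch TupleTableSlotSketch;

typedef struct TableAmRoutineSketch
{
    /*
     * Proposed hook: called after ExecInitScanTupleSlot(), it may rebuild
     * the scan slot's tuple descriptor so that it contains only the
     * columns referenced by the node's target list and quals, so that the
     * AM's getnextslot() only has to fill those columns.
     */
    void        (*exec_init_scan_slot_tl) (PlanStateSketch *planstate,
                                           TupleTableSlotSketch *slot);
} TableAmRoutineSketch;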
On Thu, May 23, 2019 at 7:30 PM Ajin Cherian <itsajin@gmail.com> wrote:
Hi Ashwin,
- how to pass the "column projection list" to table AM? (as stated in
initial email, currently we have modified table am API to pass the
projection to AM)
We were working on a similar columnar storage using pluggable APIs; one idea that we thought of was to modify the scan slot based on the targetlist to have only the relevant columns in the scan descriptor. This way the table AMs are passed a slot with only relevant columns in the descriptor. Today we do something similar to the result slot using ExecInitResultTypeTL(), now do it to the scan tuple slot as well. So somewhere after creating the scan slot using ExecInitScanTupleSlot(), call a table am handler API to modify the scan tuple slot based on the targetlist, a probable name for the new table am handler would be: exec_init_scan_slot_tl(PlanState *planstate, TupleTableSlot *slot).
Interesting.
Though this reads as a hacky and not-so-clean approach to me. Reasons:
- The memory allocation and initialization for the slot descriptor is
done in ExecInitScanTupleSlot(); exec_init_scan_slot_tl() would redo
a lot of that work. ExecInitScanTupleSlot() ideally just points to
the tupleDesc from the Relation object, but exec_init_scan_slot_tl()
would have to free the existing tupleDesc and allocate a fresh one.
Plus, it can't point to the Relation's tuple desc and essentially
needs to craft one.
- As discussed in thread [1], several places want to use different
slots for the same scan, which means the descriptor would have to be
modified on each such occasion even if it remains the same throughout
the scan. Some extra code could be added to keep the old tuple
descriptor around and reuse it for the next slot, but that again adds
code complexity.
- The AM needs to know which attnums to scan in terms of the
relation's attribute numbers. How would the tupledesc convey that?
TupleDescData's attrs currently carries the info for attnum at
attrs[attnum - 1]. If the TupleDesc needs to convey an arbitrary
subset of attributes to scan, it seems this relationship has to be
broken: attrs[offset] would describe some attribute of the relation
with offset != (attnum - 1). I am not sure how many places in the
code rely on that relationship to get information.
- The tupledesc provides a lot of information, not just the attribute
numbers to scan. For example, it provides information in TupleConstr
about the default value for a column. If the AM layer has to modify
the existing slot's tupledesc, it would have to copy over such
information as well. Today this information is fetched using attnum
as the offset into the constr->missing array. If this information is
to be retained, how will the constr array be constructed? Will the
array contain only values for the columns to scan, or the constr
array as-is from the Relation's tuple descriptor, as it does today?
It seems constructing the constr array fresh would be an overhead,
and not constructing it fresh would leave a mismatch between natts
and the array elements.
It seems that with the proposed exec_init_scan_slot_tl() API, one
would have to call it after beginscan and before calling getnextslot,
to provide the column projection list to the AM. The special
dedicated API we have for Zedstore to pass down the column projection
list needs the same calling convention, which is the reason I don't
like it and am trying to find an alternative. But at least the API we
added for Zedstore seems much simpler, more generic and flexible in
comparison, as it lets the AM decide what it wishes to do with the
list: the AM can fiddle with the slot's TupleDescriptor if it wishes,
or handle the column projection some other way.
So this way the scan am handlers like getnextslot is passed a slot only having the relevant columns in the scan descriptor. One issue though is that the beginscan is not passed the slot, so if some memory allocation needs to be done based on the column list, it can't be done in beginscan. Let me know what you think.
Yes, ideally we would like to have this information available at
beginscan if possible. But if it can't be, then it seems fine to
delay such allocations to the first calls of getnextslot and friends;
that's how we do it today for Zedstore.
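For comparison, a hypothetical sketch of the alternative calling convention discussed here: handing the AM an explicit column projection list up front, at beginscan time, and letting the AM decide how to use it. All type and function names are illustrative assumptions; the API in the actual zedstore patch may differ.

#include <stdint.h>

/* Stand-ins for AttrNumber, Relation, Snapshot and TableScanDesc. */
typedef int16_t AttrNumberSketch;
typedef struct RelationSketch RelationSketch;
typedef struct SnapshotSketch SnapshotSketch;
typedef struct TableScanDescSketch TableScanDescSketch;

/*
 * Hypothetical projection-aware scan start: the executor collects the
 * attribute numbers the plan actually needs (e.g. via something like
 * GetNeededColumnsForNode()) and passes them down when the scan begins.
 * A column store can then open only those column B-trees and size its
 * per-scan state accordingly; a row store is free to ignore the list.
 */
TableScanDescSketch *
table_beginscan_with_projection(RelationSketch *rel,
                                SnapshotSketch *snapshot,
                                int nproj_atts,
                                const AttrNumberSketch *proj_atts);

Passing the list at beginscan would also address the concern above that per-column allocations otherwise have to be delayed until the first getnextslot() call.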
On 23/05/19 12:07 PM, Ashwin Agrawal wrote:
> We (Heikki, me and Melanie) are continuing to build Zedstore. Wish to share the recent additions and modifications. Attaching a patch with the latest code. Link to github branch [1] to follow along. The approach we have been leaning towards is to build required functionality, get passing the test and then continue to iterate to optimize the same. It's still work-in-progress.
>
> Sharing the details now, as have reached our next milestone for Zedstore. All table AM API's are implemented for Zedstore (except compute_xid_horizon_for_tuples, seems need test for it first).
>
> Current State:
>
> - A new type of item added to Zedstore "Array item", to boost compression and performance. Based on Konstantin's performance experiments [2] and inputs from Tomas Vondra [3], this is added. Array item holds multiple datums, with consecutive TIDs and the same visibility information. An array item saves space compared to multiple single items, by leaving out repetitive UNDO and TID fields. An array item cannot mix NULLs and non-NULLs. So, those experiments should result in improved performance now. Inserting data via COPY creates array items currently. Code for insert has not been modified from last time. Making singleton inserts or insert into select performant is still on the todo list.
>
> - Now we have a separate and dedicated meta-column btree alongside rest of the data column btrees. This special or first btree for meta-column is used to assign TIDs for tuples, track the UNDO location which provides visibility information. Also, this special btree, which always exists, helps to support zero-column tables (which can be a result of ADD COLUMN DROP COLUMN actions as well). Plus, having meta-data stored separately from data helps to get better compression ratios. And also helps to further simplify the overall design/implementation as for deletes just need to edit the meta-column and avoid touching the actual data btrees. Index scans can just perform visibility checks based on this meta-column and fetch required datums only for visible tuples. For tuple locks also just need to access this meta-column only. Previously, every column btree used to carry the same undo pointer. Thus visibility check could be potentially performed, with the past layout, using any column. But considering overall simplification new layout provides it's fine to give up on that aspect. Having dedicated meta-column highly simplified handling for add columns with default and null values, as this column deterministically provides all the TIDs present in the table, which can't be said for any other data columns due to default or null values during add column.
>
> - Free Page Map implemented. The Free Page Map keeps track of unused pages in the relation. The FPM is also a b-tree, indexed by physical block number. To be more compact, it stores "extents", i.e. block ranges, rather than just blocks, when possible. An interesting paper [4] on how modern filesystems manage space acted as a good source for ideas.
>
> - Tuple locks implemented
>
> - Serializable isolation handled
>
> - With "default_table_access_method=zedstore"
>   - 31 out of 194 failing regress tests
>   - 10 out of 86 failing isolation tests
>   Many of the current failing tests are due to plan differences, like Index scans selected for zedstore over IndexOnly scans, as zedstore doesn't yet have visibility map. I am yet to give a thought on index-only scans. Or plan diffs due to table size differences between heap and zedstore.
>
> Next few milestones we wish to hit for Zedstore:
> - Make check regress green
> - Make check isolation green
> - Zedstore crash safe (means also replication safe). Implement WAL logs
> - Performance profiling and optimizations for Insert, Selects, Index Scans, etc...
> - Once UNDO framework lands in Upstream, Zedstore leverages it instead of its own version of UNDO
>
> Open questions / discussion items:
>
> - how best to get "column projection list" from planner? (currently, we walk plan and find the columns required for the query in the executor, refer GetNeededColumnsForNode())
>
> - how to pass the "column projection list" to table AM? (as stated in initial email, currently we have modified table am API to pass the projection to AM)
>
> - TID treated as (block, offset) in current indexing code
>
> - Physical tlist optimization? (currently, we disabled it for zedstore)
>
> Team:
> Melanie joined Heikki and me to write code for zedstore. Majority of the code continues to be contributed by Heikki. We are continuing to have fun building column store implementation and iterate aggressively.
>
> References:
> 1] https://github.com/greenplum-db/postgres/tree/zedstore
> 2] https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
> 3] https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
> 4] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf

FWIW - building this against latest 12 beta1: loading and examining the standard pgbench schema (with the old names, sorry) in v10 (standard heap) and v12 (zedstore).

v10:

bench=# \i load.sql
COPY 100
Time: 16.335 ms
COPY 1000
Time: 16.748 ms
COPY 10000000
Time: 50276.230 ms (00:50.276)

bench=# \dt+
                       List of relations
 Schema |   Name   | Type  |  Owner   |    Size    | Description
--------+----------+-------+----------+------------+-------------
 public | accounts | table | postgres | 1281 MB    |
 public | branches | table | postgres | 8192 bytes |
 public | history  | table | postgres | 0 bytes    |
 public | tellers  | table | postgres | 72 kB      |

v12+zedstore:

bench=# \i load.sql
COPY 100
Time: 0.656 ms
COPY 1000
Time: 3.573 ms
COPY 10000000
Time: 26244.832 ms (00:26.245)

bench=# \dt+
                     List of relations
 Schema |   Name   | Type  |  Owner   |  Size   | Description
--------+----------+-------+----------+---------+-------------
 public | accounts | table | postgres | 264 MB  |
 public | branches | table | postgres | 56 kB   |
 public | history  | table | postgres | 0 bytes |
 public | tellers  | table | postgres | 64 kB   |

So a good improvement in load times and on-disk footprint! Also note that I did not build with lz4, so it looks like you guys have fixed the quirks with compression making things bigger.

regards
Mark
it's really cool and very good progress,

I'm interested in whether SIMD/JIT will be supported

best wishes

TY

On 2019/5/23 08:07, Ashwin Agrawal wrote:
> We (Heikki, me and Melanie) are continuing to build Zedstore. Wish to share the recent additions and modifications. [...]
From: Ashwin Agrawal [mailto:aagrawal@pivotal.io]
> The objective is to gather feedback on design and approach to the same.
> The implementation has core basic pieces working but not close to complete.

Thank you for proposing a very interesting topic. Are you thinking of including this in PostgreSQL 13 if possible?

> * All Indexes supported
...
> work. Btree indexes can be created. Btree and bitmap index scans work.

Does Zedstore allow to create indexes of existing types on the table (btree, GIN, BRIN, etc.) and perform index scans (point query, range query, etc.)?

> * Hybrid row-column store, where some columns are stored together, and
>   others separately. Provide flexibility of granularity on how to
>   divide the columns. Columns accessed together can be stored together.
...
> This way of laying out the data also easily allows for hybrid row-column
> store, where some columns are stored together, and others have a dedicated
> B-tree. Need to have user facing syntax to allow specifying how to group
> the columns.
...
> Zedstore Table can be created using command:
>
> CREATE TABLE <name> (column listing) USING zedstore;

Are you aiming to enable Zedstore to be used for HTAP, i.e. the same table can be accessed simultaneously for both OLTP and analytics with the minimal performance impact on OLTP? (I got that impression from the word "hybrid".)

If yes, is the assumption that only a limited number of columns are to be stored in columnar format (for efficient scanning), and many other columns are to be stored in row format for efficient tuple access? Are those row-formatted columns stored in the same file as the column-formatted columns, or in a separate file?

Regarding the column grouping, can I imagine HBase and Cassandra? How could the current CREATE TABLE syntax support column grouping? (I guess CREATE TABLE needs a syntax for columnar store, and Zedstore needs to be incorporated in core, not as an extension...)

> A column store uses the same structure but we have *multiple* B-trees, one
> for each column, all indexed by TID. The B-trees for all columns are stored
> in the same physical file.

Did you think that it's not a good idea to have a different file for each group of columns? Is that because we can't expect physical adjacency of data blocks on disk even if we separate a column in a separate file?

I thought a separate file for each group of columns would be easier and less error-prone to implement and debug. Adding and dropping the column group would also be very easy and fast.

Regards
Takayuki Tsunakawa
On Sun, Jun 30, 2019 at 7:59 PM Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
From: Ashwin Agrawal [mailto:aagrawal@pivotal.io]
> The objective is to gather feedback on design and approach to the same.
> The implementation has core basic pieces working but not close to complete.
Thank you for proposing a very interesting topic. Are you thinking of including this in PostgreSQL 13 if possible?
> * All Indexes supported
...
> work. Btree indexes can be created. Btree and bitmap index scans work.
Does Zedstore allow to create indexes of existing types on the table (btree, GIN, BRIN, etc.) and perform index scans (point query, range query, etc.)?
Yes, all index types work for zedstore and allow point or range queries.
> * Hybrid row-column store, where some columns are stored together, and
> others separately. Provide flexibility of granularity on how to
> divide the columns. Columns accessed together can be stored
> together.
...
> This way of laying out the data also easily allows for hybrid row-column
> store, where some columns are stored together, and others have a dedicated
> B-tree. Need to have user facing syntax to allow specifying how to group
> the columns.
...
> Zedstore Table can be
> created using command:
>
> CREATE TABLE <name> (column listing) USING zedstore;
Are you aiming to enable Zedstore to be used for HTAP, i.e. the same table can be accessed simultaneously for both OLTP and analytics with the minimal performance impact on OLTP? (I got that impression from the word "hybrid".)
Well "hybrid" is more to convey compressed row and column store can be supported with same design. It really wasn't referring to HTAP. In general the goal we are moving towards is column store to be extremely efficient at analytics but still should be able to support all the OLTP operations (with minimal performance or storage size impact) Like when making trade-offs between different design choices and if both can't be meet, preference if towards analytics.
If yes, is the assumption that only a limited number of columns are to be stored in columnar format (for efficient scanning), and many other columns are to be stored in row format for efficient tuple access?
Yes; if it's known that certain columns are always accessed together, it's better to store them together and avoid the tuple formation cost. Though it's still to be seen whether compression plays a role here: storing each column individually and compressing it may still be a winner compared to compressing different columns together as a blob, i.e. whether the saving on IO cost offsets the tuple formation cost or not.
Are those row-formatted columns stored in the same file as the column-formatted columns, or in a separate file?
Currently, we are focused on just getting the pure column store working and hence have not coded anything for a hybrid layout yet. But at least right now the thought is it would be in the same file.
Regarding the column grouping, can I imagine HBase and Cassandra?
How could the current CREATE TABLE syntax support column grouping? (I guess CREATE TABLE needs a syntax for columnar store, and Zedstore need to be incorporated in core, not as an extension...)
When column grouping comes up, yes, we will need to modify the CREATE TABLE syntax; we are yet to reach that point in development.
> A column store uses the same structure but we have *multiple* B-trees, one
> for each column, all indexed by TID. The B-trees for all columns are stored
> in the same physical file.
Did you think that it's not a good idea to have a different file for each group of columns? Is that because we can't expect physical adjacency of data blocks on disk even if we separate a column in a separate file?
I thought a separate file for each group of columns would be easier and less error-prone to implement and debug. Adding and dropping the column group would also be very easy and fast.
Currently, each group is a single column (until we have column families), and having a file for each column definitely seems like a bad idea, as it just explodes the number of files. A separate file may have its advantage from a pre-fetching point of view, but yes, we can't expect physical adjacency of data blocks, plus the access pattern will anyway involve reading multiple files (if each column is stored in a separate file).
I doubt storing each group in its own file makes it any easier to implement or debug; I feel it's actually the reverse. Storing everything in a single file but in separate blocks keeps the logic contained inside the AM layer, and we don't have to write special code, for example for DROP TABLE to delete the files for all the groups, or when moving a table to a different tablespace, and other such complications.
Adding and dropping a column group can, regardless, be made easy and fast by adding blocks for that group, or marking them for reuse, within the same file.
Thank you for the questions.
On Thu, May 30, 2019 at 8:07 AM DEV_OPS <devops@ww-it.cn> wrote:
it's really cool and very good progress,
I'm interested in whether SIMD/JIT will be supported
That's outside the scope of the Zedstore work directly, at least for now. The intent is to work with the current executor code, or enhance it only wherever needed. If the current executor code supports something, that would work for Zedstore too, but any other enhancements to the executor will be a separate undertaking.
Hi Ashwin,

I tried playing around with the zedstore code a bit today and there are a couple of questions that came to my mind.

1) Can zedstore tables be vacuumed? If yes, does VACUUM on a zedstore table set the VM bits associated with it?

2) Is there a chance that IndexOnlyScan would ever be required for zedstore tables, considering the design approach taken for it?

Further, I tried creating a zedstore table with a btree index on one of its columns and loaded around 50 lacs (5 million) records into the table. When the indexed column was scanned (with the enable_seqscan flag set to off), it went for an IndexOnlyScan and that took around 15-20 times more than an IndexOnly Scan on a heap table would take, just because IndexOnlyScan in zedstore always goes to the table as the visibility check fails. However, the seqscan on the zedstore table is quite a bit faster than the seqscan on the heap table because the time taken for I/O is much less in the case of zedstore.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Tue, Jul 2, 2019 at 12:45 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
> That's something outside of Zedstore work directly at least now. The intent is to work with current executor code or enhance it only wherever needed. [...]
On Wed, Aug 14, 2019 at 2:51 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hi Ashwin,
I tried playing around with the zedstore code a bit today and there
are couple questions that came into my mind.
Great! Thank You.
1) Can zedstore tables be vacuumed? If yes, does VACUUM on zedstore
table set the VM bits associated with it.
Zedstore tables can be vacuumed. On vacuum, though, minimal work is performed
compared to heap. The full table is not scanned. Only the UNDO log is
truncated/discarded based on RecentGlobalXmin. Plus, only the TidTree or
meta-column is scanned to find dead tuples, and the index entries for them
are cleaned up based on the same.

Currently, zedstore does not use the VM at all, so it doesn't touch it
during any operation.
2) Is there a chance that IndexOnlyScan would ever be required for
zedstore tables considering the design approach taken for it?
We have not given much thought to IndexOnlyScans so far. But I think
IndexOnlyScan definitely would be beneficial for zedstore as well. Even for
normal index scans, fetching as many columns as possible from the index
itself and only getting the rest of the required columns from the table
would be good for zedstore. It would help to further cut down IO. Ideally,
for visibility checking only the TidTree needs to be scanned and visibility
checked with the same, so the cost of checking is much lower compared to
heap (if the VM can't be consulted), but it is still a cost. Also, with
vacuum, if the UNDO log gets trimmed, the visibility checks are pretty
cheap. Still, given all that, having a VM-type structure to optimize this
further would help.
Further, I tried creating a zedstore table with btree index on one of
it's column and loaded around 50 lacs record into the table. When the
indexed column was scanned (with enable_seqscan flag set to off), it
went for IndexOnlyScan and that took around 15-20 times more than it
would take for IndexOnly Scan on heap table just because IndexOnlyScan
in zedstore always goes to heap as the visibility check fails.
However, the seqscan on zedstore table is quite faster than seqscan on
heap table because the time taken for I/O is quite less in case for
zedstore.
Given no VM exists, IndexOnlyScans for zedstore currently behave more or
less like IndexScans. The planner picks IndexOnlyScans for zedstore mostly
due to off values for reltuples, relpages, and relallvisible.
We have been focused on implementing and optimizing the AM pieces, so not
much work has been done on planner estimates and tuning yet. As a first
step towards that, getting the needed columns in the planner instead of the
executor is proposed in [1]. Once that bakes, we will use it to improve the
planner estimates. Also, analyze needs work to properly reflect reltuples
and relpages to influence the planner correctly.
On 14/08/2019 20:32, Ashwin Agrawal wrote:
> On Wed, Aug 14, 2019 at 2:51 AM Ashutosh Sharma wrote:
>> 2) Is there a chance that IndexOnlyScan would ever be required for
>> zedstore tables considering the design approach taken for it?
>
> We have not given much thought to IndexOnlyScans so far. But I think
> IndexOnlyScan definitely would be beneficial for zedstore as well. [...]

Hmm, yeah. An index-only scan on a zedstore table could perform the "VM checks" by checking the TID tree in the zedstore. It's not as compact as the 2 bits per TID in the heapam's visibility map, but it's pretty good.

>> Further, I tried creating a zedstore table with btree index on one of
>> it's column and loaded around 50 lacs record into the table. When the
>> indexed column was scanned (with enable_seqscan flag set to off), it
>> went for IndexOnlyScan and that took around 15-20 times more than it
>> would take for IndexOnly Scan on heap table just because IndexOnlyScan
>> in zedstore always goes to heap as the visibility check fails.

Currently, an index-only scan on zedstore should be pretty much the same speed as a regular index scan. All the visibility checks will fail, and you end up fetching every row from the table, just like a regular index scan. So I think what you're seeing is that the index fetches on a zedstore table are much slower than on heap.

Ideally, on a column store the index fetches would only fetch the needed columns, but I don't think that's been implemented yet, so all the columns are fetched. That can make a big difference if you have a wide table with lots of columns but only actually need a few of them. Was your test case something like that?

We haven't spent much effort on optimizing index fetches yet, so I hope there are many other little tweaks there as well that we can do to make it faster.

- Heikki
We've continued hacking on Zedstore, here's a new patch version against current PostgreSQL master (commit f1bf619acdf). If you want to follow the development in real-time, we're working on this branch:
https://github.com/greenplum-db/postgres/tree/zedstore

If you want to do performance testing with this, make sure you configure with the --with-lz4 option. Otherwise, you'll get pglz compression, which is *much* slower.

Major TODOs:

* Make it crash-safe, by WAL-logging.
* Teach the planner and executor to pass down the list of columns needed. Currently, many plans will unnecessarily fetch columns that are not needed.
* Make visibility checks against the TID tree in index-only scans.
* Zedstore-toast pages are currently leaked, so you'll get a lot of bloat if you delete/update rows with large datums.
* Use the UNDO framework that's been discussed on another thread. There's UNDO-logging built into zedstore at the moment, but it's not very optimized.
* Improve free space management. Pages that become empty are currently recycled, but space on pages that are not completely empty is not reused, and half-empty pages are not merged.
* Implement TID recycling. Currently, TIDs are allocated in increasing order, and after all 2^48 TIDs have been used, no more tuples can be inserted, even if the rows have been deleted since.

- Heikki
On Thu, Aug 15, 2019 at 01:05:49PM +0300, Heikki Linnakangas wrote:
> We've continued hacking on Zedstore, here's a new patch version against
> current PostgreSQL master (commit f1bf619acdf). If you want to follow the
> development in real-time, we're working on this branch:
> https://github.com/greenplum-db/postgres/tree/zedstore

Thanks for pursuing this. It's an exciting development and I started looking at how we'd put it to use. I imagine we'd use it in favour of ZFS tablespaces, which I hope to retire. I've just done a very brief experiment so far. Some thoughts:

. I was missing a way to check for compression ratio; it looks like zedstore with lz4 gets ~4.6x for our largest customer's largest table. zfs using compress=gzip-1 gives 6x compression across all their partitioned tables, and I'm surprised it beats zedstore.

. What do you think about pg_restore --no-tableam; similar to --no-tablespaces, it would allow restoring a table to a different AM:
PGOPTIONS='-c default_table_access_method=zedstore' pg_restore --no-tableam ./pg_dump.dat -d postgres
Otherwise, the dump says "SET default_table_access_method=heap", which overrides any value from PGOPTIONS and precludes restoring to a new AM.

. It occurred to me that indices won't be compressed. That's no fault of zedstore, but it may mean that some sites would need to retain their ZFS tablespace, and suggests the possibility of a separate, future project (I wonder if there's some way a new meta-AM could "enable" compression of other index AMs, to avoid the need to implement zbtree, zhash, zgin, ...).

. It'd be nice if there was an ALTER TABLE SET ACCESS METHOD, to allow migrating data. Otherwise I think the alternative is:
begin;
lock t;
CREATE TABLE new_t LIKE (t INCLUDING ALL) USING (zedstore);
INSERT INTO new_t SELECT * FROM t;
for index; do CREATE INDEX...; done
DROP t;
RENAME new_t (and all its indices). attach/inherit, etc.
commit;

. Speaking of which, I think LIKE needs a new option for ACCESS METHOD, which is otherwise lost.

Cheers,
Justin
On Sun, Aug 18, 2019 at 12:35 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
. I was missing a way to check for compression ratio;
Here are the ways to check compression ratio for zedstore:
Table level:
select sum(uncompressedsz::numeric) / sum(totalsz) as compratio from
pg_zs_btree_pages(<tablename>);
Per column level:
select attno, count(*), sum(uncompressedsz::numeric) / sum(totalsz) as
compratio from pg_zs_btree_pages(<tablename>) group by attno order by attno;
it looks like zedstore
with lz4 gets ~4.6x for our largest customer's largest table. zfs using
compress=gzip-1 gives 6x compression across all their partitioned tables,
and I'm surprised it beats zedstore .
What kind of tables did you use? Is it possible to give us the schema of
the table? Did you perform 'INSERT INTO ... SELECT' or COPY?

Currently COPY gives better compression ratios than single-row INSERTs
because it generates fewer pages of metadata. The per-column compression
ratio query above will show which columns have a lower compression ratio.
We plan to add other compression methods like RLE and delta encoding,
which, along with LZ4, should give better compression ratios for a
column store.
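To illustrate the kind of encoding being considered (this is not Zedstore's actual implementation), here is a minimal C sketch of delta encoding followed by run-length encoding over a sorted integer column; the struct and function names are made up for the example.

/* Minimal sketch of delta + run-length encoding for a sorted integer
 * column. Illustrative only; not Zedstore's on-disk format. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int64_t delta; uint32_t count; } RleRun;

/* Encode values[] as runs of identical deltas; returns the number of runs. */
static size_t
delta_rle_encode(const int64_t *values, size_t n, RleRun *out)
{
    size_t nruns = 0;

    for (size_t i = 1; i < n; i++)
    {
        int64_t d = values[i] - values[i - 1];

        if (nruns > 0 && out[nruns - 1].delta == d)
            out[nruns - 1].count++;          /* extend the current run */
        else
            out[nruns++] = (RleRun) { d, 1 }; /* start a new run */
    }
    return nruns;
}

int
main(void)
{
    /* Consecutive TIDs (or a constant column) collapse to a single run. */
    int64_t vals[1000];
    for (int i = 0; i < 1000; i++)
        vals[i] = 42 + i;

    RleRun *runs = malloc(sizeof(RleRun) * 1000);
    size_t  nruns = delta_rle_encode(vals, 1000, runs);

    printf("1000 values -> %zu run(s): delta=%lld count=%u\n",
           nruns, (long long) runs[0].delta, runs[0].count);
    free(runs);
    return 0;
}

A thousand consecutive values reduce to one (delta, count) pair here, which is the intuition behind why array items with consecutive TIDs and low-cardinality columns compress so well under this family of encodings.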
On Mon, Aug 19, 2019 at 04:15:30PM -0700, Alexandra Wang wrote:
> On Sun, Aug 18, 2019 at 12:35 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > . I was missing a way to check for compression ratio;
>
> Here are the ways to check compression ratio for zedstore:
>
> Table level:
> SELECT sum(uncompressedsz::numeric)/sum(totalsz) AS compratio FROM pg_zs_btree_pages(<tablename>);

postgres=# SELECT sum(uncompressedsz::numeric)/sum(totalsz) AS compratio FROM pg_zs_btree_pages('child.cdrs_huawei_pgwrecord_2019_07_01');
compratio | 4.2730304163521529

For a fair test, I created a separate ZFS tablespace for storing just a copy of that table.

ts=# CREATE TABLE test TABLESPACE testcomp AS SELECT * FROM child.cdrs_huawei_pgwrecord_2019_07_01;
SELECT 39933381
Time: 882417.775 ms (14:42.418)

zfs/testJTP20190819  compressratio  6.01x  -
zfs/testJTP20190819  compression    gzip-1 inherited from zfs

> Per column level:
> select attno, count(*), sum(uncompressedsz::numeric)/sum(totalsz) as compratio from pg_zs_btree_pages(<tablename>) group by attno order by attno;

Order by 3; I see we have SOME highly compressed columns. It's still surprising to me that's as low as it is, given their content: phone numbers and IPv4 addresses in text form, using characters limited to [[:digit:].] (I realize we can probably save space using inet type.)

  0 |  4743 |  1.00000000000000000000
 32 | 21912 |  1.05953637381493823513
 80 | 36441 |  1.2416446300175039
  4 | 45059 |  1.3184106811322728
 83 | 45059 |  1.3184106811322728
 52 | 39208 |  1.3900788061770992
...
 74 |  3464 | 10.8258665101057364
 17 |  3535 | 10.8776086243096534
  3 |  7092 | 11.0388009154683678
 11 |  3518 | 11.4396055611832109
 65 |  3333 | 14.6594723104237634
 35 | 14077 | 15.1642131499381887
...
 43 |  1601 | 21.4200106784573211
 79 |  1599 | 21.4487670806076829
 89 |  1934 | 23.6292134031933401
 33 |  1934 | 23.6292134031933401

It seems clear the columns with high n_distinct have low compress ratio, and columns with high compress ratio are those with n_distinct=1...

CREATE TEMP TABLE zs AS
SELECT zs.*, n_distinct, avg_width, a.attname
FROM (SELECT 'child.cdrs_huawei_pgwrecord_2019_07_01'::regclass t)t,
     LATERAL (SELECT attno, count(*), sum(uncompressedsz::numeric)/sum(totalsz) AS compratio
              FROM pg_zs_btree_pages(t) GROUP BY attno)zs,
     pg_attribute a, pg_class c, pg_stats s
WHERE a.attrelid=t AND a.attnum=zs.attno AND c.oid=a.attrelid
  AND c.relname=s.tablename AND s.attname=a.attname;

 n_distinct | compratio
------------+------------------------
     217141 | 1.2416446300175039
     154829 | 1.5306062496764190
     144486 | 1.3900788061770992
     128334 | 1.5395022739568842
     121324 | 1.4005533187886683
      86341 | 1.6262709389296389
      84073 | 4.4379336418590519
      65413 | 5.1890181028038757
      63703 | 5.5029855093836425
      63637 | 5.3648468796642262
      46450 | 1.3184106811322728
      46450 | 1.3184106811322728
      43029 | 1.8003513772661308
      39363 | 1.5845730687475706
      36720 | 1.4751147557399539
      36445 | 1.8403087513759131
      36445 | 1.5453935268318613
      11455 | 1.05953637381493823513
       2862 | 9.8649823666870671
       2625 | 2.3573614181847621
       1376 | 1.7895024285340428
       1335 | 2.2812551964262787
        807 | 7.1192324141359373
        610 | 7.9373623460089360
         16 | 11.4396055611832109
         10 | 5.5429763442365557
          7 | 5.0440578041440675
          7 | 5.2000132813261135
          4 | 6.9741514753325536
          4 | 4.2872818036896340
          3 | 1.9080838412634827
          3 | 2.9915954457453485
          3 | 2.3056387009407882
          2 | 10.8776086243096534
          2 | 5.5950929307378287
          2 | 18.5796576388128741
          2 | 10.8258665101057364
          2 | 9.1112820658021406
          2 | 3.4986057630739795
          2 | 4.6250999234025238
          2 | 11.0388009154683678
          1 | 15.1642131499381887
          1 | 2.8855860118178798
          1 | 23.6292134031933401
          1 | 21.4200106784573211
[...]

> > it looks like zedstore
> > with lz4 gets ~4.6x for our largest customer's largest table. zfs using
> > compress=gzip-1 gives 6x compression across all their partitioned tables,
> > and I'm surprised it beats zedstore.
>
> What kind of tables did you use? Is it possible to give us the schema
> of the table? Did you perform 'INSERT INTO ... SELECT' or COPY?

I did this:

|time ~/src/postgresql.bin/bin/pg_restore /srv/cdrperfbackup/ts/final/child.cdrs_huawei_pgwrecord_2019_07_01 -f- |PGOPTIONS='-c default_table_access_method=zedstore' psql --port 5678 postgres --host /tmp
...
COPY 39933381
...
real    100m25.764s

 child | cdrs_huawei_pgwrecord_2019_07_01 | table | pryzbyj | permanent | 8277 MB |

postgres=# SELECT array_to_string(array_agg(format_type(atttypid, atttypmod) || CASE WHEN attnotnull THEN ' not null' ELSE '' END ORDER BY attnum), ',') FROM pg_attribute WHERE attrelid='child.cdrs_huawei_pgwrecord_2019_07_01'::regclass AND attnum>0;
array_to_string | text not null,text,text not null,text not null,text not null,text,text,text,boolean,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,timestamp without time zone not null,bigint not null,text not null,text,text,text,text,text,text,text,text,text,text not null,text,boolean,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,timestamp with time zone,timestamp with time zone,text,text,boolean,text,text,boolean,boolean,text not null,text not null
On 20/08/2019 05:04, Justin Pryzby wrote:
>>> it looks like zedstore
>>> with lz4 gets ~4.6x for our largest customer's largest table. zfs using
>>> compress=gzip-1 gives 6x compression across all their partitioned tables,
>>> and I'm surprised it beats zedstore.

I did a quick test, with 10 million random IP addresses, in text format. I loaded it into a zedstore table ("create table ips (ip text) using zedstore"), and poked around a little bit to see how the space is used.

postgres=# select lokey, nitems, ncompressed, totalsz, uncompressedsz, freespace from pg_zs_btree_pages('ips') where attno=1 and level=0 limit 10;
 lokey | nitems | ncompressed | totalsz | uncompressedsz | freespace
-------+--------+-------------+---------+----------------+-----------
     1 |      4 |           4 |    6785 |           7885 |      1320
   537 |      5 |           5 |    7608 |           8818 |       492
  1136 |      4 |           4 |    6762 |           7888 |      1344
  1673 |      5 |           5 |    7548 |           8776 |       540
  2269 |      4 |           4 |    6841 |           7895 |      1256
  2807 |      5 |           5 |    7555 |           8784 |       540
  3405 |      5 |           5 |    7567 |           8772 |       524
  4001 |      4 |           4 |    6791 |           7899 |      1320
  4538 |      5 |           5 |    7596 |           8776 |       500
  5136 |      4 |           4 |    6750 |           7875 |      1360
(10 rows)

There's on average about 10% of free space on the pages. We're losing quite a bit to ZFS compression right there. I'm sure there's some free space on the heap pages as well, but ZFS compression will squeeze it out.

The compression ratio is indeed not very good. I think one reason is that zedstore does LZ4 in relatively small chunks, while ZFS surely compresses large blocks in one go. Looking at the above, there are on average about 125 datums packed into each "item" (avg(hikey-lokey) / nitems). I did a quick test with the "lz4" command-line utility, compressing flat files containing random IP addresses.

$ lz4 /tmp/125-ips.txt
Compressed filename will be : /tmp/125-ips.txt.lz4
Compressed 1808 bytes into 1519 bytes ==> 84.02%
$ lz4 /tmp/550-ips.txt
Compressed filename will be : /tmp/550-ips.txt.lz4
Compressed 7863 bytes into 6020 bytes ==> 76.56%
$ lz4 /tmp/750-ips.txt
Compressed filename will be : /tmp/750-ips.txt.lz4
Compressed 10646 bytes into 8035 bytes ==> 75.47%

The first case is roughly what we do in zedstore currently: we compress about 125 datums as one chunk. The second case is roughly what we would get if we collected 8k worth of datums and compressed them all as one chunk. And the third case simulates the case where we would allow the input to be larger than 8k, so that the compressed chunk just fits on an 8k page.

Not too much difference between the second and third case, but it's pretty clear that we're being hurt by splitting the input into such small chunks. The downside of using a larger compression chunk size is that random access becomes more expensive. Need to give the on-disk format some more thought. Although I actually don't feel too bad about the current compression ratio; perfect can be the enemy of good.

- Heikki
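The chunk-size effect measured above with the lz4 command-line tool can also be reproduced programmatically. The sketch below uses the standard liblz4 API (LZ4_compressBound / LZ4_compress_default); the generated IP data and the chosen chunk sizes are stand-ins, so the exact ratios will differ from the numbers quoted.

/* Compress the same random-IP data with different chunk sizes and compare
 * the totals. Build: cc lz4chunks.c -llz4 */
#include <lz4.h>
#include <stdio.h>
#include <stdlib.h>

#define NIPS 100000

static char  *ips;       /* newline-separated random IPv4 addresses */
static size_t ips_len;

static void
generate_ips(void)
{
    ips = malloc(NIPS * 16 + 1);
    ips_len = 0;
    for (int i = 0; i < NIPS; i++)
        ips_len += sprintf(ips + ips_len, "%d.%d.%d.%d\n",
                           rand() % 256, rand() % 256,
                           rand() % 256, rand() % 256);
}

/* Compress the data in fixed-size chunks; return total compressed bytes. */
static size_t
compress_in_chunks(size_t chunksz)
{
    char   *dst = malloc(LZ4_compressBound((int) chunksz));
    size_t  total = 0;

    for (size_t off = 0; off < ips_len; off += chunksz)
    {
        int srcsz = (int) (ips_len - off < chunksz ? ips_len - off : chunksz);

        total += LZ4_compress_default(ips + off, dst, srcsz,
                                      LZ4_compressBound(srcsz));
    }
    free(dst);
    return total;
}

int
main(void)
{
    generate_ips();

    /* roughly: one 125-datum item, one 8 kB chunk, one >8 kB chunk */
    size_t sizes[] = {1800, 8192, 10600};

    for (int i = 0; i < 3; i++)
    {
        size_t c = compress_in_chunks(sizes[i]);

        printf("chunk %6zu bytes: %zu -> %zu (%.1f%%)\n",
               sizes[i], ips_len, c, 100.0 * c / ips_len);
    }
    return 0;
}

Running something along these lines shows the same trend as the shell test: the smaller the independent compression chunk, the worse the overall ratio, because each chunk has to rebuild its own match history.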
Thanks Ashwin and Heikki for your responses. I've one more query here.

If a BTree index is created on a zedstore table, the t_tid field of the index tuple contains a physical tid that is not actually pointing to the data block; instead it contains something from which the logical tid can be derived. So, when an IndexScan is performed on a zedstore table, it fetches the physical tid from the index page, derives the logical tid out of it, and then retrieves the data corresponding to this logical tid from the zedstore table. For that, it kind of performs a SeqScan on the zedstore table for the given tid. From this it appears to me as if the Index Scan is as good as a SeqScan for a zedstore table. If that is true, will we be able to get the benefit of IndexScan on zedstore tables? Please let me know if I am missing something here.

AFAIU, the following user level query on a zedstore table

select * from zed_tab where a = 3;

gets internally converted to

select * from zed_tab where tid = 3; -- assuming that index is created on column 'a' and the logical tid associated with a = 3 is 3.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Thu, Aug 15, 2019 at 3:08 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Currently, an index-only scan on zedstore should be pretty much the same
> speed as a regular index scan. All the visibility checks will fail, and
> you end up fetching every row from the table, just like a regular index
> scan. [...]
On Mon, Aug 26, 2019 at 5:36 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Thanks Ashwin and Heikki for your responses. I've one more query here,
If BTree index is created on a zedstore table, the t_tid field of
Index tuple contains the physical tid that is not actually pointing to
the data block instead it contains something from which the logical
tid can be derived. So, when IndexScan is performed on a zedstore
table, it fetches the physical tid from the index page and derives the
logical tid out of it and then retrieves the data corresponding to
this logical tid from the zedstore table. For that, it kind of
performs SeqScan on the zedstore table for the given tid.
Nope, it won't perform a seqscan. As zedstore is laid out as a btree itself with the logical TID as its key, it can quickly find which page the logical TID belongs to and access only that page. It doesn't need to perform a seqscan for that. That's one of the rationales for laying things out in btree fashion: to easily connect the logical to the physical world without keeping any external mapping.
AFAIU, the following user level query on zedstore table
select * from zed_tab where a = 3;
gets internally converted to
select * from zed_tab where tid = 3; -- assuming that index is created
on column 'a' and the logical tid associated with a = 3 is 3.
So, for this it will first access only the TID btree and find the leaf page with tid=3. It performs the visibility checks for the tuple, and only if the tuple is visible will it fetch all the columns for that TID, again using the btrees for those columns to fetch only the leaf page containing that logical tid.
Hope that helps to clarify the confusion.
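A toy model of the lookup path described above, with sorted arrays and bsearch() standing in for the TID/meta btree and a single column btree; every struct and name here is hypothetical, and the point is only that a point fetch by logical TID is a descent plus a per-column lookup, not a sequential scan.

/* Toy model of a zedstore-style point fetch by logical TID: one lookup in
 * the TID/meta structure for visibility, then one lookup per projected
 * column. Not the actual zedstore code. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t zstid;

typedef struct { zstid tid; bool visible; } MetaEntry;  /* "TID tree" */
typedef struct { zstid tid; int value;   } ColEntry;    /* one column "tree" */

/* tid is the first member of both structs, so comparing through a zstid
 * pointer works for either array. */
static int
cmp_tid(const void *key, const void *elem)
{
    zstid k = *(const zstid *) key;
    zstid e = *(const zstid *) elem;

    return (k > e) - (k < e);
}

int
main(void)
{
    MetaEntry meta[]  = { {1, true}, {2, false}, {3, true} };
    ColEntry  col_a[] = { {1, 10}, {2, 20}, {3, 30} };
    zstid     tid = 3;                 /* derived from the index entry */

    /* Step 1: visibility check against the meta/TID structure only. */
    MetaEntry *m = bsearch(&tid, meta, 3, sizeof(MetaEntry), cmp_tid);
    if (m == NULL || !m->visible)
    {
        printf("tid %llu not visible\n", (unsigned long long) tid);
        return 0;
    }

    /* Step 2: fetch only the projected column(s) for that TID. */
    ColEntry *c = bsearch(&tid, col_a, 3, sizeof(ColEntry), cmp_tid);
    printf("tid %llu -> a = %d\n", (unsigned long long) tid, c->value);
    return 0;
}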
On Tue, Aug 27, 2019 at 6:03 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
> Hope that helps to clarify the confusion.
Thanks for the explanation. Yes, it does clarify my doubt to some extent.
My point is, once we find the leaf page containing the given tid, we go through each item in the page until we find the data corresponding to the given tid which means we kind of perform a sequential scan at the page level. I'm referring to the below loop in zsbt_attr_scan_fetch_array().
for (off = FirstOffsetNumber; off <= maxoff; off++)
{
ItemId iid = PageGetItemId(page, off);
ZSAttributeArrayItem *item = (ZSAttributeArrayItem *) PageGetItem(page, iid);
if (item->t_endtid <= nexttid)
continue;
if (item->t_firsttid > nexttid)
break;
But that's not true for IndexScan in case of heap table because there the index tuple contains the exact physical location of tuple in the heap. So, there is no need to scan the entire page.
Further, here are some minor comments that I could find while doing a quick code walkthrough.
1) In zsundo_insert_finish(), there is a double call to BufferGetPage(undobuf); Is that required ?
2) In zedstoream_fetch_row(), why is zsbt_tid_begin_scan() being called twice? I'm referring to the below code.
if (fetch_proj->num_proj_atts == 0)
{
....
....
zsbt_tid_begin_scan(rel, tid, tid + 1,
snapshot,
&fetch_proj->tid_scan);
fetch_proj->tid_scan.serializable = true;
for (int i = 1; i < fetch_proj->num_proj_atts; i++)
{
int attno = fetch_proj->proj_atts[i];
zsbt_attr_begin_scan(rel, reldesc, attno,
&fetch_proj->attr_scans[i - 1]);
}
MemoryContextSwitchTo(oldcontext);
zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, &fetch_proj->tid_scan);
}
Also, for all types of update operation (be it a key or non-key update) we create a new tid for the new version of the tuple. Can't we use the tid associated with the old tuple in cases where there are no concurrent transactions to which the old tuple is still visible?
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Tue, Aug 27, 2019 at 12:03 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
My point is, once we find the leaf page containing the given tid, we go through each item in the page until we find the data corresponding to the given tid which means we kind of perform a sequential scan at the page level. I'm referring to the below loop in zsbt_attr_scan_fetch_array().
for (off = FirstOffsetNumber; off <= maxoff; off++)
{
ItemId iid = PageGetItemId(page, off);
ZSAttributeArrayItem *item = (ZSAttributeArrayItem *) PageGetItem(page, iid);
if (item->t_endtid <= nexttid)
continue;
if (item->t_firsttid > nexttid)
break;
But that's not true for IndexScan in case of heap table because there the index tuple contains the exact physical location of tuple in the heap. So, there is no need to scan the entire page.
You are correct that we currently go through each item in the leaf page that
contains the given tid. Specifically, the logic to retrieve all the attribute
items inside a ZSAttStream is now moved to decode_attstream() in the latest
code, and then in zsbt_attr_fetch() we again loop through each item we
previously retrieved from decode_attstream() and look for the given tid. One
optimization we can do is to tell decode_attstream() to stop decoding at the
tid we are interested in. We can also apply other tricks to speed up the
lookups in the page: for fixed-length attributes, it is easy to do a binary
search instead of a linear search, and for variable-length attributes, we can
probably try something that we didn't think of yet.
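For the fixed-length case mentioned above, the binary search could look roughly like the sketch below; the ItemRange struct merely mirrors the t_firsttid/t_endtid fields from the quoted loop and is not the actual zedstore page format.

/* Sketch of binary search over fixed-length attribute items: probe the
 * sorted, non-overlapping (t_firsttid, t_endtid) ranges instead of
 * scanning every item on the page linearly. */
#include <stdio.h>
#include <stdint.h>

typedef uint64_t zstid;

typedef struct
{
    zstid t_firsttid;   /* first TID covered by this item */
    zstid t_endtid;     /* one past the last TID covered */
} ItemRange;

/* Return the index of the item whose [firsttid, endtid) range contains
 * 'tid', or -1 if no item covers it. */
static int
find_item(const ItemRange *items, int nitems, zstid tid)
{
    int lo = 0, hi = nitems - 1;

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (tid < items[mid].t_firsttid)
            hi = mid - 1;
        else if (tid >= items[mid].t_endtid)
            lo = mid + 1;
        else
            return mid;             /* firsttid <= tid < endtid */
    }
    return -1;
}

int
main(void)
{
    ItemRange items[] = { {1, 130}, {130, 260}, {260, 400}, {400, 537} };

    printf("tid 300 is in item %d\n", find_item(items, 4, 300));  /* 2 */
    printf("tid 999 is in item %d\n", find_item(items, 4, 999));  /* -1 */
    return 0;
}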
1) In zsundo_insert_finish(), there is a double call to BufferGetPage(undobuf); Is that required ?
Fixed, thanks!
2) In zedstoream_fetch_row(), why is zsbt_tid_begin_scan() being called twice? I'm referring to the below code.
if (fetch_proj->num_proj_atts == 0)
{
....
....
zsbt_tid_begin_scan(rel, tid, tid + 1,
snapshot,
&fetch_proj->tid_scan);
fetch_proj->tid_scan.serializable = true;
for (int i = 1; i < fetch_proj->num_proj_atts; i++)
{
int attno = fetch_proj->proj_atts[i];
zsbt_attr_begin_scan(rel, reldesc, attno,
&fetch_proj->attr_scans[i - 1]);
}
MemoryContextSwitchTo(oldcontext);
zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, &fetch_proj->tid_scan);
}
I removed the second call, thanks!
Also, for all types of update operation (be it key or non-key update) we create a new tid for the new version of tuple. Can't we use the tid associated with the old tuple for the cases where there is no concurrent transactions to whom the old tuple is still visible.
Zedstore currently implements update as delete+insert, hence the old tid is
not reused. We don't store the tuple in our UNDO log; we only store the
transaction information there. Reusing the tid of the old tuple means
putting the old tuple in the UNDO log, which we have not implemented yet.

Thanks for reporting, this is very helpful! Patches are welcome as well!
On Wed, Aug 28, 2019 at 5:30 AM Alexandra Wang <lewang@pivotal.io> wrote:
You are correct that we currently go through each item in the leaf page that
contains the given tid, specifically, the logic to retrieve all the attribute
items inside a ZSAttStream is now moved to decode_attstream() in the latest
code, and then in zsbt_attr_fetch() we again loop through each item we
previously retrieved from decode_attstream() and look for the given tid.
Okay. Any idea why this new way of storing attribute data as streams (lowerstream and upperstream) has been chosen just for the attributes but not for tids? Are only attribute blocks compressed but not the tid blocks?
One
optimization we can to is to tell decode_attstream() to stop decoding at the
tid we are interested in. We can also apply other tricks to speed up the
lookups in the page, for fixed length attribute, it is easy to do binary search
instead of linear search, and for variable length attribute, we can probably
try something that we didn't think of yet.
I think we can probably ask decode_attstream() to stop once it has found the tid that we are searching for but then we only need to do that for Index Scans.
Zedstore currently implement update as delete+insert, hence the old tid is not
reused. We don't store the tuple in our UNDO log, and we only store the
transaction information in the UNDO log. Reusing the tid of the old tuple means
putting the old tuple in the UNDO log, which we have not implemented yet.
Okay, so that means performing an update on a non-key attribute would also require changes in the index table. In short, HOT update is currently not possible with a zedstore table. Am I right?
On 29/08/2019 14:30, Ashutosh Sharma wrote:
> Okay. Any idea why this new way of storing attribute data as streams
> (lowerstream and upperstream) has been chosen just for the attributes
> but not for tids. Are only attribute blocks compressed but not the tid
> blocks?

Right, only attribute blocks are currently compressed. Tid blocks need to be modified when there are UPDATEs or DELETEs, so I think having to decompress and recompress them would be more costly. Also, there is no user data on the TID tree, and the Simple-8b encoded codewords used to represent the TIDs are already pretty compact. I'm not sure how much gain you would get from passing it through a general purpose compressor.

I could be wrong though. We could certainly try it out, and see how it performs.

> I think we can probably ask decode_attstream() to stop once it has found
> the tid that we are searching for but then we only need to do that for
> Index Scans.

I've been thinking that we should add a few "bookmarks" on long streams, so that you could skip e.g. to the midpoint in a stream. It's a tradeoff though; when you add more information for random access, it makes the representation less compact.

> Okay, so that means performing update on a non-key attribute would also
> require changes in the index table. In short, HOT update is currently
> not possible with zedstore table. Am I right?

That's right. There's a lot of potential gain for doing HOT updates. For example, if you UPDATE one column on every row of a table, ideally you would only modify the attribute tree containing that column. But that hasn't been implemented.

- Heikki
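As a rough illustration of the "bookmarks" idea (not the actual attstream format), the sketch below keeps a (tid, position) checkpoint every few hundred entries of a delta-encoded TID stream, so a reader can jump close to a target TID and decode only a short tail instead of the whole stream; all names and sizes are made up for the example.

/* Bookmarked delta stream: jump to the nearest checkpoint at or before the
 * target TID, then decode forward from there. Illustrative only. */
#include <stdio.h>
#include <stdint.h>

#define NTIDS          10000
#define BOOKMARK_EVERY 256

typedef uint64_t zstid;

typedef struct { zstid tid; int pos; } Bookmark;

static uint8_t  deltas[NTIDS];     /* toy "stream": one-byte deltas */
static Bookmark bookmarks[NTIDS / BOOKMARK_EVERY + 1];
static int      nbookmarks;

int
main(void)
{
    zstid tid = 1;
    int   decoded = 0;

    /* Build the stream: TIDs 1,3,5,... stored as deltas of 2, with a
     * bookmark remembered every BOOKMARK_EVERY entries. */
    for (int i = 0; i < NTIDS; i++)
    {
        if (i % BOOKMARK_EVERY == 0)
            bookmarks[nbookmarks++] = (Bookmark) { tid, i };
        deltas[i] = 2;
        tid += 2;
    }

    /* Look up a target TID: pick the last bookmark at or before it ... */
    zstid target = 9001;
    int   b = 0;
    while (b + 1 < nbookmarks && bookmarks[b + 1].tid <= target)
        b++;

    /* ... then decode forward from the bookmark instead of from the start. */
    zstid cur = bookmarks[b].tid;
    for (int i = bookmarks[b].pos; i < NTIDS && cur < target; i++)
    {
        cur += deltas[i];
        decoded++;
    }

    printf("found tid %llu after decoding %d deltas (instead of up to %d)\n",
           (unsigned long long) cur, decoded, NTIDS);
    return 0;
}

The tradeoff Heikki mentions is visible here: each bookmark costs extra space in the stream, but lets a reader decode only a fraction of it for a random probe.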
On Thu, Aug 29, 2019 at 5:39 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> That's right. There's a lot of potential gain for doing HOT updates. For
> example, if you UPDATE one column on every row of a table, ideally you
> would only modify the attribute tree containing that column. But that
> hasn't been implemented. [...]

Thanks Heikki for your reply. After quite some time today I got a chance to look back into the code. I could see that you have changed the tuple insertion and update mechanism a bit. As per the latest changes, all the tuples being inserted/updated in a transaction are spooled into a hash table and then flushed at the time of transaction commit, and probably due to this change I could see that the server crashes when trying to perform an UPDATE operation on a zedstore table having 10 lacs (1 million) records. See below example,

create table t1(a int, b int) using zedstore;
insert into t1 select i, i+10 from generate_series(1, 1000000) i;

postgres=# update t1 set b = 200;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The above update statement crashed due to some extensive memory leak.

Further, the UPDATE operation on a zedstore table is very slow. I think that's because in the case of a zedstore table we have to update all the btree data structures even if only one column is updated, and that really sucks. Please let me know if there is some other reason for it.

I also found some typos when going through the writeup in zedstore_internal.h and thought of correcting those. Attached is the patch with the changes.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 17, 2019 at 4:15 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> create table t1(a int, b int) using zedstore;
> insert into t1 select i, i+10 from generate_series(1, 1000000) i;
> postgres=# update t1 set b = 200;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> Above update statement crashed due to some extensive memory leak.

Thank you for reporting! We have located the memory leak and also
noticed some other memory related bugs. We are working on the fixes,
please stay tuned!

> I also found some typos when going through the writeup in
> zedstore_internal.h and thought of correcting those. Attached is the
> patch with the changes.
Applied. Thank you!
On Thu, Sep 19, 2019 at 8:10 AM Alexandra Wang <lewang@pivotal.io> wrote:
>
> Thank you for reporting! We have located the memory leak and also
> noticed some other memory related bugs. We are working on the fixes,
> please stay tuned!

Cool. As I suspected earlier, it's basically the "ZedstoreAMTupleBuffers"
context that is completely exhausting the memory, and it is being used to
spool the tuples.

> > I also found some typos when going through the writeup in
> > zedstore_internal.h and thought of correcting those. Attached is the
> > patch with the changes.
>
> Applied. Thank you!

Thanks for that.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Thu, Sep 19, 2019 at 11:35 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Cool. As I suspected earlier, it's basically the "ZedstoreAMTupleBuffers"
> context that is completely exhausting the memory, and it is being used to
> spool the tuples.

Some more updates on top of this:

When doing an update operation, for each tuple being modified,
tuplebuffers_insert() says that there is no entry for the relation being
modified in the hash table, although it was already added when the first
tuple in the table was updated. Why is that? I mean, if I have added an
entry in the *tuplebuffers* hash table for, let's say, table t1, then why
should the subsequent call to tuplebuffers_insert() say that there is no
entry for table t1 in *tuplebuffers*? Shouldn't that only happen once you
have flushed all the tuples in tupbuffer->attbuffers? Because of this, a
new tupbuffer->attbuffers is allocated for each tuple, resulting in a lot
of memory consumption. OTOH, if an insert is performed on the same table,
tuplebuffers_insert() says that there is no entry for table t1 in the hash
only for the first tuple; from the second tuple onwards that doesn't
happen. I think this is why the memory leak is happening in the case of
the update operation. Please let me know if I'm missing something here; I
didn't get a chance to spend much time on this. Thank you.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
> When doing an update operation, for each tuple being modified,
> tuplebuffers_insert() says that there is no entry for the relation
> being modified in the hash table, although it was already added when
> the first tuple in the table was updated. Why is that?

Currently, when doing an update, it will actually flush the tuple
buffers every time we update a tuple. As a result, we only ever spool
up one tuple at a time. This is a good place to put in an optimization
like was implemented for insert, but I haven't gotten around to
looking into that yet.

The memory leak is actually happening because it isn't freeing the
attbuffers after flushing. Alexandra Wang and I have a working
branch[1] where we tried to plug the leak by freeing the attbuffers,
but it has exposed an issue with triggers that I need to understand
before I push the fix into the main zedstore branch.

I don't like our solution of freeing the buffers either, because they
could easily be reused. I'm going to take a stab at making that better
before merging in the fix.

[1] https://github.com/l-wang/postgres-1/tree/zedstore-fix-memory-issues
On Fri, Sep 20, 2019 at 5:48 AM Taylor Vesely <tvesely@pivotal.io> wrote:
>
> Currently, when doing an update, it will actually flush the tuple
> buffers every time we update a tuple. As a result, we only ever spool
> up one tuple at a time.

Okay. So that's the root cause. Spooling just one tuple where at least 60
tuples could be spooled, and then not freeing it at all, is altogether the
reason for this extensive memory leak.

> I don't like our solution of freeing the buffers either, because they
> could easily be reused. I'm going to take a stab at making that better
> before merging in the fix.

That's right; why do we need to free the memory after flushing the data in
attbuffers? We can simply reuse it for the next set of data to be updated.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
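To illustrate the reuse idea being discussed here, below is a minimal
standalone C sketch. It is not zedstore code; the names (attbuffer,
spool_insert, spool_flush) and the capacity of 60 datums are invented for
the illustration. The point is simply that the spool buffer is reset and
reused after every flush instead of being freed and reallocated for each
updated tuple:

/*
 * Minimal sketch of the "spool and reuse" idea discussed above.  This is
 * NOT zedstore code; the names are invented.  The buffer is reset and
 * reused after every flush instead of being freed and reallocated for
 * each tuple.
 */
#include <stdio.h>

#define SPOOL_CAPACITY 60              /* e.g. one chunk's worth of datums */

typedef struct attbuffer
{
    int     datums[SPOOL_CAPACITY];
    int     ndatums;
    long    nflushes;
} attbuffer;

/* Pretend to write the spooled datums into the attribute btree. */
static void
spool_flush(attbuffer *buf)
{
    if (buf->ndatums == 0)
        return;
    buf->nflushes++;
    buf->ndatums = 0;                  /* reuse the same memory for the next batch */
}

/* Spool one datum, flushing only when the buffer is actually full. */
static void
spool_insert(attbuffer *buf, int datum)
{
    if (buf->ndatums == SPOOL_CAPACITY)
        spool_flush(buf);
    buf->datums[buf->ndatums++] = datum;
}

int
main(void)
{
    attbuffer   buf = {{0}, 0, 0};

    for (int i = 0; i < 1000000; i++)
        spool_insert(&buf, i);
    spool_flush(&buf);                 /* final flush, e.g. at end of statement */

    printf("1000000 datums spooled with %ld flushes and no reallocations\n",
           buf.nflushes);
    return 0;
}

Applied per relation (for example keyed by relation OID in the tuplebuffers
hash the code already has), the same pattern would let UPDATE spool more
than one tuple per flush without leaking the attbuffers.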
Hi Alexandra,

On Tue, Sep 17, 2019 at 4:45 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Further, the UPDATE operation on zedstore table is very slow. I think
> that's because in case of zedstore table we have to update all the
> btree data structures even if one column is updated and that really
> sucks. Please let me know if there is some other reason for it.

There was no answer for this in your previous reply. It seems like you
missed it. As I said earlier, I tried performing the UPDATE operation
with an optimised build and found that to update around 10 lacs
(1 million) records in a zedstore table takes around 24k ms, whereas for
a normal heap table it takes 2k ms. Is that because in case of a
zedstore table we have to update all the btree data structures even if
one column is updated, or is there some other reason for it? If yes,
could you please let us know. FYI, I'm trying to update a table with
just two columns.

Further, in the latest code I'm getting this warning message when it
is compiled using the -O2 optimisation flag:

zedstore_tidpage.c: In function ‘zsbt_collect_dead_tids’:
zedstore_tidpage.c:978:10: warning: ‘page’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
opaque = ZSBtreePageGetOpaque(page);
^
Attached is the patch that fixes it.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi Ashutosh,
Sorry I indeed missed your question, thanks for the reminder!
On Wed, Sep 25, 2019 at 4:10 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Further, the UPDATE operation on zedstore table is very slow. I think
> that's because in case of zedstore table we have to update all the
> btree data structures even if one column is updated and that really
> sucks. Please let me know if there is some other reason for it.
>
> There was no answer for this in your previous reply. It seems like you
> missed it. As I said earlier, I tried performing the UPDATE operation
> with an optimised build and found that to update around 10 lacs
> (1 million) records in a zedstore table takes around 24k ms, whereas for
> a normal heap table it takes 2k ms. Is that because in case of a
> zedstore table we have to update all the btree data structures even if
> one column is updated, or is there some other reason for it? If yes,
> could you please let us know. FYI, I'm trying to update a table with
> just two columns.
The zedstore UPDATE operation currently fetches the old rows, updates the
undo pointers stored in the tid btree, and inserts new rows into all
the attribute btrees with the new tids. So updating one column performs
no differently from updating all the columns. That said, the wider the
table is, the longer an update takes, regardless of whether you update
one column or all of them.
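To make that cost concrete, here is a toy standalone C model (purely
illustrative, not zedstore code) of the per-row tree work implied by the
description above: one visit to the tid btree plus one insert per
attribute btree, independent of how many columns the UPDATE statement
actually changes:

/*
 * Toy cost model for the UPDATE path described above (not real zedstore
 * code).  Every updated row touches the tid btree once (undo pointer) and
 * every attribute btree once (new datum at the new tid), no matter how
 * many columns the UPDATE actually changes.
 */
#include <stdio.h>

static long
update_tree_ops(long nrows, int ncolumns)
{
    long    ops = 0;

    for (long row = 0; row < nrows; row++)
    {
        ops += 1;                      /* update undo pointer in the tid btree */
        ops += ncolumns;               /* insert new datum into each attribute btree */
    }
    return ops;
}

int
main(void)
{
    long    nrows = 1000000;

    /* one-column UPDATE against a 2-column and a 24-column table */
    printf("2 columns:  %ld tree operations\n", update_tree_ops(nrows, 2));
    printf("24 columns: %ld tree operations\n", update_tree_ops(nrows, 24));
    return 0;
}

Under this model, a one-column UPDATE of a 24-column table does roughly
eight times the tree operations of the same UPDATE on a 2-column table,
which matches the point that table width, not the SET list, drives the
cost.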
However, since your test table only has two columns, and we also
tested the same on a one-column table and got results similar to
yours, there is definitely room for optimizations. The attached file
zedstore_update_flames_lz4_first_update.svg shows the profiling results
for the update query on a one-column table with 1M records. Most of the
time is spent in zedstoream_fetch_row() and zsbt_tid_update(). For
zedstoream_fetch_row(), Taylor and I had some interesting findings
which I'm going to talk about next; I haven't dived into
zsbt_tid_update() yet and need to think about it more.
To understand what slows down zedstore UPDATE, Taylor and I did the
following test and profiling on a zedstore table with only one column.
postgres=# create table onecol(a int) using zedstore;
postgres=# insert into onecol select i from generate_series(1, 1000000) i;
-- Create view to count zedstore pages group by page types
postgres=# CREATE VIEW pg_zs_page_counts AS
SELECT
c.relnamespace::regnamespace,
c.oid,
c.relname,
pg_zs_page_type(c.oid, generate_series(0, c.relpages - 1)),
count(*)
FROM pg_am am
JOIN pg_class c ON (c.relam = am.oid)
WHERE am.amname='zedstore'
GROUP BY 1,2,3,4;
postgres=# select * from pg_zs_page_counts;
relnamespace | oid | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
public | 32768 | onecol | BTREE | 640
public | 32768 | onecol | FREE | 90
public | 32768 | onecol | META | 1
(3 rows)
-- Run update query the first time
postgres=# update onecol set a = 200; -- profiling attached in zedstore_update_flames_lz4_first_update.svg
Time: 28760.199 ms (00:28.760)
postgres=# select * from pg_zs_page_counts;
relnamespace | oid | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
public | 32768 | onecol | BTREE | 6254
public | 32768 | onecol | FREE | 26915
public | 32768 | onecol | META | 1
(6 rows)
postgres=# select count(*) from pg_zs_btree_pages('onecol') where attno = 0;
count
-------
5740
(1 row)
postgres=# select count(*) from pg_zs_btree_pages('onecol') where attno = 1;
count
-------
514
(1 row)
postgres=# select * from pg_zs_btree_pages('onecol') where attno = 1 and totalsz > 0;
blkno | nextblk | attno | level | lokey | hikey | nitems | ncompressed | totalsz | uncompressedsz | freespace
-------+------------+-------+-------+---------+-----------------+--------+-------------+---------+----------------+-----------
730 | 6580 | 1 | 0 | 999901 | 1182451 | 1 | 1 | 3156 | 778480 | 4980
6580 | 13030 | 1 | 0 | 1182451 | 1380771 | 2 | 1 | 8125 | 859104 | 11
13030 | 19478 | 1 | 0 | 1380771 | 1579091 | 2 | 1 | 8125 | 859104 | 11
19478 | 25931 | 1 | 0 | 1579091 | 1777411 | 2 | 1 | 8125 | 859104 | 11
25931 | 32380 | 1 | 0 | 1777411 | 1975731 | 2 | 1 | 8125 | 859104 | 11
32380 | 4294967295 | 1 | 0 | 1975731 | 281474976645120 | 2 | 1 | 2033 | 105016 | 6103
(6 rows)
-- Run update query the second time
postgres=# update onecol set a = 200; -- profiling attached in zedstore_update_flames_lz4_second_update.svg
Time: 267135.703 ms (04:27.136)
As you can see, it took 28s to run the update query the first time,
which was slow but expected. However, when we ran the same update query
again it took 4 min 27 s, almost 10x slower than the first run. The
profiling result of the second update is attached; it shows that 57% of
the time is spent in decode_chunk_fixed(), which decodes a chunk in an
attstream so that we can confirm whether the tid of interest is in that
chunk and fetch it if so. Right now, each chunk contains at most 60 tids
for fixed-length attributes and at most 30 tids for varlena attributes,
and we decode all the tids each chunk contains one by one.
Going back to our test: between before and after the first UPDATE, the
BTREE page count increased from 640 to 6254; however, only 6 out of the
514 attribute btree pages actually store data. It seems like a bug that
we left behind 508 empty btree pages; we should fix it, but let's put it
aside as a separate problem. With 6 pages we stored 1M rows, so each page
contains as many as 198,320 tids. This is the reason why the second
UPDATE spent so much time decoding chunks. The btree structure only
helps us locate the page for a given tid; once we get to the page, the
better the compression, the more chunks we can pack into one page, and
the more calls per page to decode_chunk(). Even worse, unlike INSERT,
UPDATE currently initializes a new fetcher every time it fetches a new
row, which means it doesn't remember the last position the decoder was
at in the attstream. So every time it fetches a new row, the decoder
starts all over from the beginning of the attstream, and we are talking
about an attstream that could have 198,320 records. We also haven't done
any optimization inside decode_chunk() itself, like checking the first
and last tid, stopping decoding once the tid is found, or doing a binary
search for fixed-length attributes.
So, I think the things that slow down the second UPDATE are also part of
the reason why the first UPDATE is slow. We haven't done any
optimization for UPDATE so far, probably because we didn't expect it
to be better than heap, but we should try to make it not too much
worse.
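To put a number on the restart-from-the-beginning behaviour described
above, here is a small standalone C simulation. It is illustrative only
(the real attstream and chunk layout are more involved); it just counts
how many chunks get decoded when fetching all 198,320 tids of one
attstream in ascending order, once with a fetcher that restarts at the
beginning for every row and once with a fetcher that remembers the last
chunk it decoded:

/*
 * Illustrative simulation only, not zedstore code.  198,320 records packed
 * into chunks of 60, fetched in ascending tid order.  A fetcher that
 * restarts from the beginning of the attstream for every row decodes on
 * the order of n^2 / chunksize chunks; one that remembers its position
 * decodes each chunk exactly once.
 */
#include <stdio.h>

#define NRECORDS  198320
#define CHUNKSIZE 60

int
main(void)
{
    long long   decodes_restart = 0;
    long long   decodes_cached = 0;
    int         cached_chunk = -1;

    for (int tid = 0; tid < NRECORDS; tid++)
    {
        int     target_chunk = tid / CHUNKSIZE;

        /* restart-from-scratch fetcher: decodes chunks 0 .. target_chunk */
        decodes_restart += target_chunk + 1;

        /* position-remembering fetcher: decodes a chunk only when it moves */
        if (target_chunk != cached_chunk)
        {
            decodes_cached++;
            cached_chunk = target_chunk;
        }
    }

    printf("chunks decoded, restarting every row:  %lld\n", decodes_restart);
    printf("chunks decoded, remembering position:  %lld\n", decodes_cached);
    return 0;
}

The restarting fetcher decodes on the order of 3.3e8 chunks for a single
attstream, while the position-remembering one decodes each of the ~3,300
chunks exactly once; a binary search within fixed-length chunks, as
suggested above, would cut the remaining per-chunk cost further.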
> Further, in the latest code I'm getting this warning message when it
> is compiled using the -O2 optimisation flag:
>
> zedstore_tidpage.c: In function ‘zsbt_collect_dead_tids’:
> zedstore_tidpage.c:978:10: warning: ‘page’ may be used uninitialized
> in this function [-Wmaybe-uninitialized]
> opaque = ZSBtreePageGetOpaque(page);
> ^
> Attached is the patch that fixes it.
Applied. Thanks!
Of course I forgot to attach the files.
Attachment
On Fri, Sep 27, 2019 at 3:09 PM Alexandra Wang <lewang@pivotal.io> wrote:
>
> postgres=# select * from pg_zs_page_counts;
> relnamespace | oid | relname | pg_zs_page_type | count
> --------------+-------+---------+-----------------+-------
> public | 32768 | onecol | BTREE | 6254
> public | 32768 | onecol | FREE | 26915
> public | 32768 | onecol | META | 1
> (6 rows)

Oops, the first UPDATE created a lot of free pages.

Just FYI, when the second update was run, it took around 5 mins (which is
almost 10-12 times more than what the 1st UPDATE took), but this time no
more free pages were added; instead, the already available free pages were
used. Here are the stats observed before and after the second update:

before:
======
postgres=# select * from pg_zs_page_counts;
 relnamespace | oid   | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
 public       | 16390 | t1      | FREE            | 26915
 public       | 16390 | t1      | BTREE           | 7277
 public       | 16390 | t1      | META            | 1
(3 rows)

after:
=====
postgres=# select * from pg_zs_page_counts;
 relnamespace | oid   | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
 public       | 16390 | t1      | FREE            | 26370
 public       | 16390 | t1      | BTREE           | 7822
 public       | 16390 | t1      | META            | 1
(3 rows)

You may see that around 545 pages got added this time, and they were all
taken from the free pages list.

> postgres=# select count(*) from pg_zs_btree_pages('onecol') where attno = 0;
> count
> -------
> 5740
> (1 row)

This could be because currently tid blocks are not compressed, unlike the
other attribute blocks.

> So, I think the things that slow down the second UPDATE are also part of
> the reason why the first UPDATE is slow. We haven't done any
> optimization for UPDATE so far, probably because we didn't expect it
> to be better than heap, but we should try to make it not too much
> worse.

That's right; if the situation is too much worse, it would be difficult to
compromise. So, some fix is certainly required here.

> > Attached is the patch that fixes it.
>
> Applied. Thanks!

Thanks for that, and for sharing the detailed information on why the
update operation on a zedstore table is so slow.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Hi,

I got a chance to spend some time looking into the recent changes done
in the zedstore code, basically the functions for packing datums into
the attribute streams and handling attribute leaf pages. I didn't find
any issues, but there are some minor comments that I noted while
reviewing. I have worked on those and attached is the patch with the
changes. See if the changes look meaningful to you.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Attachment
On 15/10/2019 13:49, Ashutosh Sharma wrote:
> I got a chance to spend some time looking into the recent changes done
> in the zedstore code, basically the functions for packing datums into
> the attribute streams and handling attribute leaf pages. I didn't find
> any issues, but there are some minor comments that I noted while
> reviewing. I have worked on those and attached is the patch with the
> changes. See if the changes look meaningful to you.

Thanks for looking! Applied to the development repository
(https://github.com/greenplum-db/postgres/tree/zedstore/)

- Heikki
On Thu, Oct 17, 2019 at 2:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Thanks for looking! Applied to the development repository

Thank you. Here is one more observation:

When a zedstore table is queried using an *invalid* ctid, the server
crashes due to an assertion failure. See below:

postgres=# select * from t1 where ctid = '(0, 0)';
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I believe the above should have either returned 0 rows or failed with
some user-friendly error.

Further, when the same table is queried using some non-existing ctid,
the query returns 0 rows. See below:

postgres=# select count(*) from t1;
 count
-------
     2
(1 row)

postgres=# select * from t1 where ctid = '(0, 2)';
 a |  b
---+------
 2 | str2
(1 row)

postgres=# select * from t1 where ctid = '(0, 3)';
 a | b
---+---
(0 rows)

postgres=# select * from t1 where ctid = '(0, 4)';
 a | b
---+---
(0 rows)

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
> When a zedstore table is queried using *invalid* ctid, the server
> crashes due to assertion failure. See below,
>
> postgres=# select * from t2 where ctid = '(0, 0)';
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
Thank you for pointing that out! I will look into fixing that some
time this week. If we run without assertions the query still fails
with this error because zedstoream_tuple_tid_valid incorrectly reports
the TID as valid:
ERROR: arrived at incorrect block 2 while descending zedstore btree
> I believe above should have either returned 1 rows or failed with some
> user friendly error.
Agreed. I think it should match the behavior of heap as closely as
possible.
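For what it's worth, the shape of such a check is just a bounds test
before descending the btree. The sketch below is illustrative only; it
is not the actual zedstoream_tuple_tid_valid() fix, and tid_is_fetchable()
and last_allocated_tid are invented names. The idea is that an offset of
0, or a tid beyond anything ever allocated, should be answered with "no
such row" the way heap behaves, rather than being passed down to the
btree descent where it trips an assertion:

/*
 * Illustrative sketch, not the real zedstore fix.  A user-supplied ctid
 * such as '(0,0)' or one past the last allocated tid is simply reported
 * as "no row" instead of being looked up in the btree.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned long long zstid;      /* zedstore tids are 48-bit logical ids */

#define InvalidZSTid 0

static bool
tid_is_fetchable(zstid tid, zstid last_allocated_tid)
{
    if (tid == InvalidZSTid)           /* e.g. ctid '(0, 0)' */
        return false;
    if (tid > last_allocated_tid)      /* beyond anything ever inserted */
        return false;
    return true;                       /* may still turn out deleted/invisible */
}

int
main(void)
{
    zstid   last = 1000000;            /* pretend 1M rows were inserted */

    printf("tid 0:       %s\n", tid_is_fetchable(0, last) ? "look up" : "no row");
    printf("tid 42:      %s\n", tid_is_fetchable(42, last) ? "look up" : "no row");
    printf("tid 1000001: %s\n", tid_is_fetchable(1000001, last) ? "look up" : "no row");
    return 0;
}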
Alex Wang and I have been doing some performance analysis of the most
recent version of the zedstore branch, and have some interesting
statistics to share.
We specifically focused on TPC-DS query 2, because it plays to what
should be a strength of zedstore: it does a full table scan of only a
subset of columns. I've attached the explain verbose output for
reference.
We scan two columns of 'catalog_sales', and two columns of 'web_sales'.
-> Parallel Append
-> Parallel Seq Scan on tpcds.catalog_sales
Output: catalog_sales.cs_ext_sales_price, catalog_sales.cs_sold_date_sk
-> Parallel Seq Scan on tpcds.web_sales
Output: web_sales.ws_ext_sales_price, web_sales.ws_sold_date_sk
For heap, this query needs to do a full table scan of both tables, so we
need to read both tables in their entirety into memory. For our dataset,
that totals around 119GB of data.
***HEAP***
tpcds=# select pg_size_pretty(pg_relation_size('web_sales'));
pg_size_pretty
----------------
39 GB
(1 row)
tpcds=# select pg_size_pretty(pg_relation_size('catalog_sales'));
pg_size_pretty
----------------
80 GB
(1 row)
***/HEAP***
With zedstore, the total relation size is smaller because of
compression. When scanning a table, we only scan the blocks containing
the data we are interested in and leave the rest alone, so the total
size we need to scan for these tables comes to around 4GB:
***ZEDSTORE***
zedstore=# select pg_size_pretty(pg_relation_size('web_sales'));
pg_size_pretty
----------------
20 GB
(1 row)
zedstore=# select pg_size_pretty(pg_relation_size('catalog_sales'));
pg_size_pretty
----------------
40 GB
(1 row)
zedstore=# with zedstore_tables as (select d.oid, f.*
zedstore(# from (select c.oid
zedstore(# from pg_am am
zedstore(# join pg_class c on (c.relam = am.oid)
zedstore(# where am.amname = 'zedstore') d,
zedstore(# pg_zs_btree_pages(d.oid) f)
zedstore-# select zs.attno, att.attname, zs.oid::regclass, count(zs.attno) as pages,
zedstore-# pg_size_pretty(count(zs.attno) * 8 * 1024) from zedstore_tables zs
zedstore-# left join pg_attribute att on zs.attno = att.attnum
zedstore-# and zs.oid = att.attrelid
zedstore-# where zs.oid in ('catalog_sales'::regclass, 'web_sales'::regclass)
zedstore-# and (att.attname in ('cs_ext_sales_price','cs_sold_date_sk','ws_ext_sales_price','ws_sold_date_sk')
zedstore(# or zs.attno = 0)
zedstore-# group by zs.attno, att.attname, zs.oid
zedstore-# order by zs.oid , zs.attno;
attno | attname | oid | pages | pg_size_pretty
-------+--------------------+---------------+--------+----------------
0 | | catalog_sales | 39549 | 309 MB
1 | cs_sold_date_sk | catalog_sales | 2441 | 19 MB
24 | cs_ext_sales_price | catalog_sales | 289158 | 2259 MB
0 | | web_sales | 20013 | 156 MB
1 | ws_sold_date_sk | web_sales | 17578 | 137 MB
24 | ws_ext_sales_price | web_sales | 144860 | 1132 MB
***/ZEDSTORE ***
On our test machine, the tables were stored on a single spinning disk,
so our read speed was pretty abysmal for this query. The query is
I/O bound for us, so read speed was the single largest factor. With heap,
the tables are scanned sequentially, and we can therefore read around
150MB of table data per second:
***HEAP***
avg-cpu: %user %nice %system %iowait %steal %idle
8.54 0.00 1.85 11.62 0.00 77.98
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdd 1685.33 0.00 157069.33 0.00 18.67 0.00 1.10 0.00 1.56 0.00 2.62 93.20 0.00 0.59 100.00
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdd 1655.33 0.00 154910.67 0.00 21.33 0.00 1.27 0.00 1.62 0.00 2.68 93.58 0.00 0.60 100.13
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdd 1746.33 0.00 155121.33 0.00 28.00 0.00 1.58 0.00 1.48 0.00 2.61 88.83 0.00 0.57 100.00
***/HEAP***
Because zedstore's access pattern resembled random I/O, read speed was
significantly hindered on our single disk. As a result, we saw ~150x
slower read speeds:
***ZEDSTORE***
avg-cpu: %user %nice %system %iowait %steal %idle
6.24 0.00 1.22 6.34 0.00 86.20
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdb 129.33 0.00 1034.67 0.00 0.00 0.00 0.00 0.00 15.89 0.00 2.05 8.00 0.00 7.67 99.20
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdb 120.67 0.00 965.33 0.00 0.00 0.00 0.00 0.00 16.51 0.00 1.99 8.00 0.00 8.21 99.07
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdb 121.00 0.00 968.00 0.00 0.00 0.00 0.00 0.00 16.76 0.00 2.02 8.00 0.00 8.19 99.07
***/ZEDSTORE***
The total query time:
***HEAP***
Execution Time: 758807.571 ms
***/HEAP***
***ZEDSTORE***
Execution Time: 2111576.259 ms
***/ZEDSTORE***
Every attribute in zedstore is stored in a btree with the TID as a
key. Unlike heap, the TID is a logical address, and not a physical
one. The pages of one attribute are interspersed with the pages of all
other attributes. When you do a sequential scan on zedstore the pages
are, therefore, not stored in sequential order, so the access pattern
can resemble random I/O.
On our system, query time for zedstore was around 3x slower than heap
for this query. If your storage does not handle semi-random reads very
well, then zedstore can be very slow. This setup was a worst case
scenario because random read was 150x slower than with sequential
read. On hardware with better random I/O zedstore would really shine.
On a side note, a second run of this query with zedstore finished
in around 57 seconds, because the ~4GB of column data was already
cached in memory. The cached data is smaller because we only keep the
relevant columns in memory, and the datums are compressed and
encoded. Conversely, subsequently running the same query with heap
still takes around 750 seconds, because our system cannot hold 119GB
of relation data in the buffer/system caches.
Our main takeaway with this is that anything we can do to group
together data that is accessed together can help zedstore to have
larger, more frequent sequential reads.
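A quick back-of-the-envelope check of those numbers, using only the
figures quoted above (~119GB read sequentially at ~150MB/s for heap,
versus ~4GB read at the ~1MB/s effective rate the iostat output shows for
zedstore's semi-random I/O). Standalone C; the inputs are assumptions
taken from this mail, not new measurements:

/*
 * Back-of-the-envelope arithmetic with the figures quoted above; the
 * inputs are assumptions taken from this mail, not new measurements.
 */
#include <stdio.h>

int
main(void)
{
    double  heap_gb = 119.0, heap_mb_per_s = 150.0;     /* sequential scan */
    double  zed_gb = 4.0,    zed_mb_per_s = 1.0;        /* semi-random reads */

    printf("heap:     ~%.0f s to read %.0f GB sequentially\n",
           heap_gb * 1024.0 / heap_mb_per_s, heap_gb);
    printf("zedstore: ~%.0f s to read %.0f GB semi-randomly\n",
           zed_gb * 1024.0 / zed_mb_per_s, zed_gb);
    return 0;
}

That predicts roughly 810 s for heap and 4,100 s for zedstore; the
measured 759 s and 2,112 s are in the same ballpark (the effective
zedstore read rate was evidently closer to 2MB/s), which supports the
conclusion that the access pattern, not the data volume, dominates on
this hardware.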
On Mon, Oct 28, 2019 at 3:22 PM Taylor Vesely <tvesely@pivotal.io> wrote:
> When a zedstore table is queried using *invalid* ctid, the server
> crashes due to assertion failure. See below,
>
> postgres=# select * from t2 where ctid = '(0, 0)';
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
Thank you for pointing that out! I will look into fixing that some
time this week. If we run without assertions the query still fails
with this error because zedstoream_tuple_tid_valid incorrectly reports
the TID as valid:
ERROR: arrived at incorrect block 2 while descending zedstore btree
> I believe above should have either returned 1 rows or failed with some
> user friendly error.
Agreed. I think it should match the behavior of heap as closely as
possible.
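For illustration, the kind of range check that would give heap-like behavior
might look something like this (a rough sketch only; zs_tid_from_itemptr() and
zs_last_allocated_tid() are hypothetical stand-ins for the real zedstore
internals, not actual functions from the patch):

    /*
     * Sketch only, not the actual zedstore code: reject a TID with an
     * invalid offset, or one beyond anything the table has allocated,
     * before descending the btree, so a bogus ctid qual returns zero
     * rows instead of erroring out.
     */
    static bool
    zedstoream_tuple_tid_valid(TableScanDesc scan, ItemPointer tid)
    {
        uint64      ztid = zs_tid_from_itemptr(tid);    /* hypothetical helper */

        if (ztid == 0)          /* offset 0 is never a valid TID */
            return false;

        return ztid <= zs_last_allocated_tid(scan->rs_rd);     /* hypothetical helper */
    }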
> When a zedstore table is queried using *invalid* ctid, the server
> crashes due to assertion failure. See below,
>
> postgres=# select * from t1 where ctid = '(0, 0)';
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> I believe above should have either returned 0 rows or failed with some
> user friendly error.
We pushed a fix for this today. It now returns zero rows, like the
equivalent query with heap. Thanks for reporting!
Hello,
We (David and I) recently observed that a Zedstore table can be considerably
bloated when we load data into it with concurrent copies. Also, we found that
concurrent insert performance was less than desirable. This is a detailed
analysis of the extent of the problem, the cause of the problem and how we
fixed it. This has input from much of our team: Alex, Ashwin, David, Heikki,
Melanie, Taylor and myself.
An example of the bloat that we observed:
TPC-DS scale = 270:
Table        heap   zed(serial)   zed(16 parallel COPYs)
web_sales    39G    19G           39G
We found that it was caused by inefficient page splits resulting from
out-of-TID-order inserts into full (or nearly full) attribute tree leaf pages.
The degree of under-utilization was significant: attribute tree leaves
produced by a serial data load held 6-8x more datums than the leaves produced
by a parallel load with 16 sessions.
Consider the scenario below:
Assumptions:
1. Let us consider two concurrent copy commands executing (sessions S1 and S2).
2. The table has only one (fixed-length for sake of argument) attribute 'a'.
3. For attribute 'a', a full attribute tree leaf page can accommodate 1500 datums.
TID allocations:
S1: 1-1000
S2: 1001-2000, 2001-3000
Order of operations:
1. S2 writes datums for tids 1001-2000, 2001-3000.
The resulting leaves are:
L1:
lokey = 1 hikey = 2500
firsttid = 1001 lasttid = 2500
L2:
lokey = 2501 hikey = MaxZSTid
firsttid = 2501 lasttid = 3000
2. S1 now writes datums for tids 1-1000.
We have to split L1 into L1' and L1''.
L1':
lokey = 1 hikey = 1500
firsttid = 1 lasttid = 1500
L1'': [under-utilized page]
lokey = 1501 hikey = 2000
firsttid = 1501 lasttid = 2000
L2:
lokey = 2501 hikey = MaxZSTid
firsttid = 2501 lasttid = 3000
Note: The lokeys/hikeys reflect the ranges of what CAN be inserted, whereas
firsttid and lasttid reflect what actually has been inserted.
L1'' will be an under-utilized page that is never going to be filled again,
because it inherits the tight hikey from L1. In this example, the space
wastage in L1'' is 66%, but it could very easily be close to 100%, especially
under concurrent workloads that mix single and multi-inserts, or even
unequally sized multi-inserts.
Solution (kudos to Ashwin!):
For every multi-insert (and only multi-insert, not for singleton inserts),
allocate N times more tids. Each session will keep these extra tids in a
buffer. Subsequent calls to multi-insert would use these buffered tids. If at
any time a tid allocation request cannot be met by the remaining buffered tids,
a new batch of N times the number of tids requested will again be allocated.
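To make the scheme concrete, here is a rough sketch of the per-session
buffering (the names and structure are ours, not the actual patch;
zsbt_tid_reserve_range() stands in for the real tid-tree allocation call):

    /* Illustrative only: per-backend buffer of pre-reserved tids. */
    typedef struct ZSTidBufferSketch
    {
        uint64      next;       /* next unused tid in the reserved range, 0 if empty */
        uint64      last;       /* last tid in the reserved range */
    } ZSTidBufferSketch;

    #define MULTI_INSERT_TID_RESERVATION_FACTOR 10      /* N */

    /*
     * Hand out 'ntuples' consecutive tids for a multi-insert.  Whenever the
     * local buffer cannot satisfy the request, reserve N times as many tids
     * from the tid tree in one go, so the tid tree is touched far less often.
     */
    static uint64
    zsbt_buffered_tid_alloc(Relation rel, ZSTidBufferSketch *buf, int ntuples)
    {
        uint64      first;

        if (buf->next == 0 || buf->next + ntuples - 1 > buf->last)
        {
            int         nreserve = ntuples * MULTI_INSERT_TID_RESERVATION_FACTOR;

            buf->next = zsbt_tid_reserve_range(rel, nreserve);  /* hypothetical */
            buf->last = buf->next + nreserve - 1;
        }

        first = buf->next;
        buf->next += ntuples;
        return first;
    }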
If we take the same example above and say we allocate N=5 times the number of
tids upon the first request for 1000 tids:
TID allocations:
S1: 1-5000
S2: 5001-10000
Order of operations:
1. S2 writes datums for tids 5001-6000, 6001-7000.
The resulting leaves are:
L1:
lokey = 1 hikey = 6500
firsttid = 5001 lasttid = 6500
L2:
lokey = 6501 hikey = MaxZSTid
firsttid = 6501 lasttid = 7000
2. S1 writes datums for tids 1-1000.
L1 will be split into L1' and L1''.
L1':
lokey = 1 hikey = 5500
firsttid = 1 lasttid = 1000
L1'' [under-utilized page]:
lokey = 5501 hikey = 6500
firsttid = 5501 lasttid = 6500
L2:
lokey = 6501 hikey = MaxZSTid
firsttid = 6501 lasttid = 7000
Subsequent inserts by S1 will land on L1', whose hikey isn't restrictive.
However, we do end up with the inefficient page L1''. With a high enough value
of N, we reduce the frequency of such pages. We could further reduce this
wastage by incorporating a special left split: since L1 was already full, we
don't change it at all - we simply update its lokey, so L1 becomes L1'', and
we fork off a new leaf to its left, L1'. This would look like:
L1':
lokey = 1 hikey = 5000
firsttid = 1 lasttid = 1000
L1'':
lokey = 5001 hikey = 6500
firsttid = 5001 lasttid = 6500
We found that with a high enough value of N, we did not get significant space
benefits from the left split. Thus, we decided to only incorporate N.
Results: [TPC-DS scale = 50, 16 concurrent copies]

Table              zed    N=10   N=100   N=1000   heap   zed(serial)
catalog_sales      15G    9.1G   7.7G    7.5G     15G    8.0G
catalog_returns    1.5G   0.9G   0.7G    0.7G     1.2G   0.8G
store_returns      2.1G   1.2G   1.1G    1.1G     1.9G   1.2G
store_sales        17G    11G    10.1G   10.1G    21G    10G

Load time:
N=10      30min
N=100     10min
N=1000     7min
zed      100min
heap       8min
'zed' refers to the zedstore branch without our fix. We see that with N = 10, we
get closer to what we get with serial inserts. For N = 100, we even beat serial
insert.
We can attribute the differences in runtime to the fact that by lowering the
number of tid range requests, we reduce the contention on the tid tree - which
is a bottleneck for concurrent loads. A significant win!
How N relates to the other parameters in play:
Let S be the number of concurrent sessions.
Let T be the average number of rows that a session wants to write, in t-sized
multi-insert batches.
Let A be the number of attributes.
Number of times a session multi-inserts into the tid tree without buffered
allocation = T/t
Number of times a session multi-inserts into the tid tree with buffered
allocation = T/(N*t)
Total number of multi-inserts into the tid tree = Mt = S*T/(N*t)
Also, the total number of adverse insert cases where we could have bloat ∝ Mt.
So, bloat ∝ Mt.
Run-time of a parallel data load ∝ Mt * A.
So the guidance would be to increase N as S or T increases (t will be
relatively constant for a given table - it is constrained by the size of a row
and the copy buffer), and also if the table is significantly wide.
We can see that it is difficult to provide a default for N; it really should
be a GUC. Also, SINGLE_INSERT_TID_RESERVATION_THRESHOLD and
SINGLE_INSERT_TID_RESERVATION_SIZE should be turned into GUCs. In our
implementation, we treat MULTI_INSERT_TID_RESERVATION_FACTOR = N. We leave the
GUC implementation for later.
Cost of killing the extra unused tids not consumed by multi-inserts:
The maximum number of tids that can be wasted (W) is capped at (t*N - 1) * S.
This is the worst case, where the last tid allocation request of every session
used only 1 tid out of the t*N tids it received and buffered.
So the average case is ~ (t*N/2) * S wasted tids. The number of times the tid
tree has to be accessed to delete these (t*N/2) * S tids is S (one pass per
session). So, taking tid wastage into account, the average number of accesses
to the tid tree = Mt + S = S*T/(N*t) + S.
Thus the additional cost is only S, and the cost of tid killing is not really
significant.
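Putting that model into a one-liner (our own sketch, using the S, T, t and N
defined above):

    /*
     * Sketch: average number of tid-tree accesses for a parallel load with
     * buffered allocation.  The "+ S" term is the one extra access per
     * session needed to kill its leftover buffered tids.
     */
    static uint64
    zs_tid_tree_accesses(uint64 S, uint64 T, uint64 t, uint64 N)
    {
        return (S * T) / (N * t) + S;
    }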
Regards,
Soumyadeep & David
Hello,
On Wed, Oct 30, 2019 at 3:34 PM Taylor Vesely <tvesely@pivotal.io> wrote:
> Because zedstore resembled random I/O, the read speed was
> significantly hindered on our single disk. As a result, we saw ~150x
> slower read speeds.
Deep and I have committed a fix for this. The root cause of this problem is
that attribute tree (and tid tree) pages are not contiguous enough on disk,
especially if we are loading data concurrently into the same table. The effect
of the non-contiguity is especially felt on rotational disks and is magnified
by increasing the number of sessions that load the table concurrently.
Since a base requirement for a column store is that blocks for a single column
be physically adjacent (or nearly adjacent), we sought to optimize for this
requirement.
What we have done is to introduce attribute-level free page maps (FPMs) and to
batch up relation extension requests. We have introduced a new reloption
zedstore_rel_extension_factor: whenever we want to extend the relation by a
single page for the tid/attribute tree, we extend it by
zedstore_rel_extension_factor number of blocks. We return the one block
requested and prepend the extra blocks to the attribute-level FPM. This makes
the blocks available to other concurrent backends and thus, in spite of
backend-interleaved flushes into the same attribute tree, we see more
contiguity of leaf blocks.
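A condensed sketch of that extension path (the names are ours and locking and
page initialization are omitted; treat it as an illustration of the batching,
not the actual patch):

    /*
     * Illustrative only: extend the relation by 'extension_factor' blocks at
     * once, return the first new block to the caller, and prepend the rest to
     * the per-tree free page map so that concurrent backends flushing the same
     * attribute tend to pick up adjacent blocks.
     */
    static BlockNumber
    zspage_extend_batched(Relation rel, AttrNumber attno, int extension_factor)
    {
        BlockNumber first = InvalidBlockNumber;

        for (int i = 0; i < extension_factor; i++)
        {
            Buffer      buf = ReadBuffer(rel, P_NEW);   /* extends by one block */

            if (i == 0)
                first = BufferGetBlockNumber(buf);
            else
                zspage_fpm_prepend(rel, attno, BufferGetBlockNumber(buf));  /* hypothetical */

            ReleaseBuffer(buf);
        }

        return first;
    }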
We reason about the contiguity of blocks by making some major assumptions:
we consider two blocks to be near each other if their block numbers are close
to each other (refer to BufferGetBlockNumber()), and we assume that two
successive relation extension requests yield blocks with block numbers that
are close to each other.
Recycling of pages for the attribute and tid trees is now also done at the
per-tree level.
Experiment results and methodology:
Metric used to measure performance: the I/O read time reported by the
"I/O Timings" field in EXPLAIN (ANALYZE, BUFFERS, TIMING, VERBOSE) output,
with the track_io_timing GUC on. Before every EXPLAIN run, we restart the
database to flush shared buffers and clear the OS page cache.
Experiment parameters: TPC-DS Scale = 270, table = store_sales, opt_level = -O2
#parallel COPY sessions loading store_sales = 16.
N = zedstore_rel_extension_factor
GUCs used:
shared_buffers: 10GB
max_wal_size: 1GB
checkpoint_flush_after: 1MB
max_parallel_workers: 8
max_parallel_maintenance_workers: 8
maintenance_work_mem: 4GB
log_statement: all
effective_cache_size: 32GB
track_io_timing: on
For rotational disks:
Query: select ss_sold_date_sk from store_sales;
Heap: Table size = 112G. I/O time = 115s. Total exec time = 212s
Zed (w/o fix): Table size = 59G. I/O time = 634s. Total exec time = 730s
Zed (N=32): Table size = 59G. I/O time = 91s. Total exec time = 175s
Zed (N=512): Table size = 59G. I/O time = 7s. Total exec time = 87s
Zed (N=4096): Table size = 59G. I/O time = 2.5s. Total exec time = 82s
Query: select * from store_sales;
Heap: Table size = 112G. I/O time = 130s. Total exec time = 214s
Zed (w/o fix): Table size = 59G. I/O time = 2401s. Total exec time = 2813s
Zed (N=32): Table size = 59G. I/O time = 929s. Total exec time = 1300s
Zed (N=512): Table size = 59G. I/O time = 485s. Total exec time = 847s
Zed (N=4096): Table size = 59G. I/O time = 354s. Total exec time = 716s
We also saw discernible differences in I/O time for scale = 50, table size = 10G
for Zedstore and 21G for heap. Results not reported for brevity.
Our fix doesn't impact COPY performance, so we saw no difference in the time
taken to load the data into store_sales.
For NVMe SSDs:
We see no discernible differences in I/O times with and without the fix
(performance for select * was slightly worse for N=4096). Here
are some of the results:
Query: select ss_sold_date_sk from store_sales;
Heap: Table size = 112G. I/O time = 59s. Total exec time = 123s
Zed (w/o fix): Table size = 59G. I/O time = 20s. Total exec time = 79s
Zed (N=4096): Table size = 59G. I/O time = 21s. Total exec time = 87s
Query: select * from store_sales;
Heap: Table size = 112G. I/O time = 64s. Total exec time = 127s
Zed (w/o fix): Table size = 61G. I/O time = 449s. Total exec time = 757s
Zed (N=4096): Table size = 61G. I/O time = 487s. Total exec time = 812s
Analysis of the fix:
The following query inspects the block distance (the absolute difference
between the block numbers of two logically adjacent leaf blocks) for the
ss_sold_date_sk attribute of store_sales. It shows us the distribution of
block distances in the ss_sold_date_sk attribute tree. Output is limited for
brevity.
with blk_dist(dist) as (
    select abs(nextblk - blkno) as dist
    from pg_zs_btree_pages('store_sales'::regclass)
    where attno = 1 and level = 0 and nextblk != 4294967295
)
select dist, count(dist) as cnt
from blk_dist
group by dist
order by cnt desc
limit 5;
W/o fix: #parallel_copies=16
W/ fix:  #parallel_copies=16, extension_factor=16

    W/o fix             W/ fix
 dist | cnt          dist | cnt
------+-----        ------+------
   25 |  89             1 | 3228
   26 |  83             2 | 3192
   23 |  78             3 | 2664
    1 |  75             4 | 2218
   29 |  74             5 | 1866
We can see that by increasing zedstore_rel_extension_factor, we end up with
many more small block distances.
Implications of fix:
1. We have to keep track of the FPM heads for the attribute/tid trees in the
meta-page, and since we don't have an extensible meta-page yet, we further limit
the number of columns Zedstore can support. We will get around to it eventually.
2. Worst-case extra space wasted on disk from free pages that could linger
after a bulk load = zedstore_rel_extension_factor * (number of trees) * 8192
bytes, where the number of trees is the number of attributes plus the tid tree.
For zedstore_rel_extension_factor = 16 and 23 attributes (24 trees):
wastage = 16*24*8192/1024/1024 = 3MB
For zedstore_rel_extension_factor = 4096 and 23 attributes (24 trees):
wastage = 4096*24*8192/1024/1024 = 768MB
Note: The free pages left behind can, of course, be used by subsequent
operations on the table.
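Expressed as a quick formula (our own helper; 'ntrees' is the number of
attribute trees plus the tid tree, and BLCKSZ is the standard 8192-byte
PostgreSQL block size):

    /* Sketch: worst-case on-disk overhead of leftover free pages after a bulk load. */
    #define ZS_EXTENSION_WASTE_BYTES(extension_factor, ntrees) \
        ((uint64) (extension_factor) * (ntrees) * BLCKSZ)

    /* e.g. ZS_EXTENSION_WASTE_BYTES(16, 24)   is about   3MB,
     *      ZS_EXTENSION_WASTE_BYTES(4096, 24) is about 768MB. */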
In conclusion, increasing zedstore_rel_extension_factor for a wide table may
lead to bloating of the relfile. The percentage of bloat would also be magnified
if the table doesn't have a lot of data.
3. The amount of extra WAL written (since we are placing/removing the extra
blocks on the FPMs, something we never did without this fix) is independent of
zedstore_rel_extension_factor; we found that we wrote approximately 14MB of
extra WAL for every 1GB of relation file.
Guidance on setting zedstore_rel_extension_factor:
Users should set a high zedstore_rel_extension_factor when loading a
significant amount of data on rotational disks, with or without a high degree
of load concurrency.
Attached is a patch with our changes: [1]
Also attached is a rebased version of Zedstore on latest PG master. [2]
Github branch for Zedstore: [3]
[1] 0001-Attribute-level-FPMs-and-rel-extension-batching.patch
[2] v4-zedstore.patch
[3] https://github.com/greenplum-db/postgres/tree/zedstore
--
Alex & Deep
Hello,

We (Jacob and I) have an update for this thread.

1. We recently made some improvements to the table AM APIs for fetching a
single row (tuple_fetch_row_version()) and locking (and fetching) a tuple
(tuple_lock()), such that they can take a set of columns. We extract these
columns at plan time and, in some cases, at executor time. The changes are in
the same spirit as some column-oriented changes that are already a part of
Zedstore - namely the ability to pass a set of columns to sequential and index
scans, among other operations.

We observed that the two table AM functions are called in contexts which don't
need the entire set of columns to be populated in the output TupleTableSlots
associated with these APIs. For instance, in DELETE RETURNING, we don't need
to fetch all of the columns, just the ones in the RETURNING clause.

We saw improvements (see results attached) for a variety of tests - we added a
bunch of tests in our storageperf test suite to cover these cases. We don't
see a performance improvement for UPSERT and ON CONFLICT DO NOTHING, as there
is an index lookup pulling in the entire row preceding the call to
table_tuple_lock() in both these cases. We do see significant improvements
(~3x) for DELETE RETURNING and row-level locking, and around a ~25x
improvement in TidScan runtime. Please refer to src/test/storageperf for the
storageperf test suite.

2. We absorbed the scanCols patch [1], replacing some of the existing
executor-level column extraction for scans with the scanCols populated during
planning as in [1].

3. We also merged Zedstore up to PG 14 commit efc5dcfd8a.

PFA the latest version of the Zedstore patch.

Regards,
Jacob and Soumyadeep

[1] https://www.postgresql.org/message-id/flat/CAAKRu_YxyYOCCO2e83UmHb51sky1hXgeRzQw-PoqT1iHj2ZKVg%40mail.gmail.com#681a254981e915805aec2aea9ea9caf4
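To illustrate the shape of the change described in point 1 above (the extra
column-set argument is only what the proposed patch adds, not upstream
PostgreSQL, and the attribute number used here is made up):

    /*
     * Sketch only: a DELETE ... RETURNING b only needs column 'b' back, so the
     * caller builds a column set and passes it down to the table AM.  The final
     * Bitmapset argument is the proposed extension of tuple_fetch_row_version(),
     * not something that exists in upstream PostgreSQL today.
     */
    Bitmapset  *ret_cols = NULL;

    ret_cols = bms_add_member(ret_cols, 2);     /* attno of the RETURNING column */

    table_tuple_fetch_row_version(rel, &tid, snapshot, slot, ret_cols);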
Hi,

Thanks for the updated patch. It's a quite massive amount of code - I don't
think we had many 2MB patches in the past, so this is by no means a full
review.

1) the psql_1.out is missing a bit of expected output (due to 098fb0079)

2) I'm getting crashes in intarray contrib, due to hitting this error in
lwlock.c (backtrace attached):

    /* Ensure we will have room to remember the lock */
    if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
        elog(ERROR, "too many LWLocks taken");

I haven't investigated this too much, but it's a regular build with asserts
and TAP tests, so it should be simple to reproduce using "make check-world"
I guess.

3) I did a very simple benchmark, loading TPC-H data (75GB), followed by
pg_dump, and the duration (in seconds) looks like this:

             master    zedstore/pglz    zedstore/lz4
   -------------------------------------------------
   copy        1855            68092            2131
   dump         751              905             811

And the size of the lineitem table (as shown by \d+) is:

   master:          64GB
   zedstore/pglz:   51GB
   zedstore/lz4:    20GB

It's mostly expected lz4 beats pglz in performance and compression ratio, but
this seems a bit too extreme I guess. Per past benchmarks (e.g. [1] and [2])
the difference in compression/decompression time should be maybe 1-2x or
something like that, not 35x like here.

[1] https://www.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de
[2] https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de

Furthermore, the pglz compression is not consuming the most CPU, at least
that's what perf says:

    24.82%  postgres      [.] encode_chunk_varlen
    20.49%  postgres      [.] decode_chunk
    13.01%  postgres      [.] merge_attstream_guts.isra.0
    12.68%  libc-2.32.so  [.] __memmove_avx_unaligned_erms
     8.72%  postgres      [.] encode_chunk_fixed
     6.16%  postgres      [.] pglz_compress
     4.36%  postgres      [.] decode_attstream_cont
     2.27%  postgres      [.] 0x00000000000baff0
     1.84%  postgres      [.] AllocSetAlloc
     0.79%  postgres      [.] append_attstream
     0.70%  postgres      [.] palloc

So I wonder if this is a sign of a deeper issue - maybe the lower compression
ratio (for pglz) triggers some sort of feedback loop in zedstore, or something
like that? Not sure, but this seems strange.

4) I looked at some of the code, like merge_attstream etc. and I wonder if
this might be related to some of the FIXME comments. For example this bit in
merge_attstream seems interesting:

    * FIXME: we don't actually pay attention to the compression anymore.
    * We never repack.
    * FIXME: this is backwords, the normal fast path is if (firsttid1 > lasttid2)

But I suppose that should affect both pglz and lz4, and I'm not sure how up to
date those comments actually are.

BTW the comments in general need updating and tidying up, to make reviews
easier. For example the merge_attstream comment references attstream1 and
attstream2, but those are not the current parameters of the function.

5) IMHO there should be a #define specifying the maximum number of items per
chunk (60). Currently there are literal constants used in various places,
sometimes 60, sometimes 59 etc., which makes it harder to understand the code.
FWIW 60 seems a bit low, but maybe it's OK.

6) I do think ZSAttStream should track which compression is used by the
stream, for two main reasons. Firstly, there's another patch to support
"custom compression" methods, which (also) allows multiple compression methods
per column. It'd be a bit strange to support that for varlena columns in a
heap table, and not here, I guess. Secondly, I think one of the interesting
columnstore features down the road will be execution on compressed data, which
however requires a compression method designed for that purpose, and it's
often datatype-specific (delta encoding, ...).

I don't think we need to go as far as supporting "custom" compression methods
here, but I think we should allow different built-in compression methods for
different attstreams.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
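For illustration, the kind of change suggested in point 6 could look roughly
like this (the layout is purely illustrative and deliberately not the real
ZSAttStream definition):

    /* Illustrative only: carry the compression method (and room for a format
     * version) in the stream header, so different attstreams can use different
     * built-in algorithms. */
    typedef enum ZSCompressionIdSketch
    {
        ZS_COMPRESSION_NONE = 0,
        ZS_COMPRESSION_PGLZ,
        ZS_COMPRESSION_LZ4
    } ZSCompressionIdSketch;

    typedef struct ZSAttStreamSketch
    {
        uint32      t_size;         /* physical size of the stream */
        uint16      t_flags;
        uint8       t_compression;  /* ZSCompressionIdSketch */
        uint8       t_version;      /* reserved for future format changes */
        char        t_payload[FLEXIBLE_ARRAY_MEMBER];
    } ZSAttStreamSketch;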
On Nov 12, 2020, at 2:40 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> Thanks for the updated patch. It's a quite massive amount of code - I I
> don't think we had many 2MB patches in the past, so this is by no means
> a full review.

Thanks for taking a look! You're not kidding about the patch size.

FYI, the tableam changes made recently have been extracted into their
own patch, which is up at [1].

> 1) the psql_1.out is missing a bit of expected output (due to 098fb0079)

Yeah, this patch was rebased as of efc5dcfd8a.

> 2) I'm getting crashes in intarray contrib, due to hitting this error in
> lwlock.c (backtrace attached):
>
>     /* Ensure we will have room to remember the lock */
>     if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
>         elog(ERROR, "too many LWLocks taken");
>
> I haven't investigates this too much, but it's regular build with
> asserts and TAP tests, so it should be simple to reproduce using "make
> check-world" I guess.

I've only seen this intermittently in installcheck, and I'm not able to
reproduce with the intarray tests on my machine (macOS). Definitely
something we need to look into. What OS are you testing on?

> It's mostly expected lz4 beats pglz in performance and compression
> ratio, but this seems a bit too extreme I guess. Per past benchmarks
> (e.g. [1] and [2]) the difference in compression/decompression time
> should be maybe 1-2x or something like that, not 35x like here.

Yeah, something seems off about that. We'll take a look.

> BTW the comments in general need updating and tidying up, to make
> reviews easier. For example the merge_attstream comment references
> attstream1 and attstream2, but those are not the current parameters of
> the function.

Agreed.

> 5) IHMO there should be a #define specifying the maximum number of items
> per chunk (60). Currently there are literal constants used in various
> places, sometimes 60, sometimes 59 etc. which makes it harder to
> understand the code. FWIW 60 seems a bit low, but maybe it's OK.

Yeah, that seems like a good idea.

I think the value 60 comes from the use of simple-8b encoding -- see the
comment at the top of zedstore_attstream.c.

> 6) I do think ZSAttStream should track which compression is used by the
> stream, for two main reasons. Firstly, there's another patch to support
> "custom compression" methods, which (also) allows multiple compression
> methods per column. It'd be a bit strange to support that for varlena
> columns in heap table, and not here, I guess. Secondly, I think one of
> the interesting columnstore features down the road will be execution on
> compressed data, which however requires compression method designed for
> that purpose, and it's often datatype-specific (delta encoding, ...).
>
> I don't think we need to go as far as supporting "custom" compression
> methods here, but I think we should allow different built-in compression
> methods for different attstreams.

Interesting. We'll need to read/grok that ML thread.

Thanks again for the review!

--Jacob

[1] https://www.postgresql.org/message-id/CAE-ML%2B9RmTNzKCNTZPQf8O3b-UjHWGFbSoXpQa3Wvuc8YBbEQw%40mail.gmail.com
On 11/13/20 8:07 PM, Jacob Champion wrote:
> On Nov 12, 2020, at 2:40 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> Thanks for the updated patch. It's a quite massive amount of code - I I
>> don't think we had many 2MB patches in the past, so this is by no means
>> a full review.
>
> Thanks for taking a look! You're not kidding about the patch size.
>
> FYI, the tableam changes made recently have been extracted into their
> own patch, which is up at [1].
>
>> 1) the psql_1.out is missing a bit of expected output (due to 098fb0079)
>
> Yeah, this patch was rebased as of efc5dcfd8a.
>
>> 2) I'm getting crashes in intarray contrib, due to hitting this error in
>> lwlock.c (backtrace attached):
>>
>>     /* Ensure we will have room to remember the lock */
>>     if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
>>         elog(ERROR, "too many LWLocks taken");
>>
>> I haven't investigates this too much, but it's regular build with
>> asserts and TAP tests, so it should be simple to reproduce using "make
>> check-world" I guess.
>
> I've only seen this intermittently in installcheck, and I'm not able to
> reproduce with the intarray tests on my machine (macOS). Definitely
> something we need to look into. What OS are you testing on?
>

Fedora 32, nothing special. I'm not sure if I ran the tests with pglz or
lz4, maybe there's some dependence on that, but it does fail for me
quite reliably with this:

    ./configure --enable-debug --enable-cassert --enable-tap-tests
    --with-lz4 && make -s clean && make -s -j4 && make check-world

>> It's mostly expected lz4 beats pglz in performance and compression
>> ratio, but this seems a bit too extreme I guess. Per past benchmarks
>> (e.g. [1] and [2]) the difference in compression/decompression time
>> should be maybe 1-2x or something like that, not 35x like here.
>
> Yeah, something seems off about that. We'll take a look.
>
>> BTW the comments in general need updating and tidying up, to make
>> reviews easier. For example the merge_attstream comment references
>> attstream1 and attstream2, but those are not the current parameters of
>> the function.
>
> Agreed.
>
>> 5) IHMO there should be a #define specifying the maximum number of items
>> per chunk (60). Currently there are literal constants used in various
>> places, sometimes 60, sometimes 59 etc. which makes it harder to
>> understand the code. FWIW 60 seems a bit low, but maybe it's OK.
>
> Yeah, that seems like a good idea.
>
> I think the value 60 comes from the use of simple-8b encoding -- see the
> comment at the top of zedstore_attstream.c.
>

Yeah, I understand where it comes from. I'm just saying that when you
see 59 hardcoded, it may not be obvious where it came from, and
something like ITEMS_PER_CHUNK would be better.

I wonder how complicated it would be to allow larger chunks, e.g. by
using one bit to say "there's another 64-bit codeword". Not sure if it's
worth the extra complexity, though - it's just that 60 feels a bit low.

>> 6) I do think ZSAttStream should track which compression is used by the
>> stream, for two main reasons. Firstly, there's another patch to support
>> "custom compression" methods, which (also) allows multiple compression
>> methods per column. It'd be a bit strange to support that for varlena
>> columns in heap table, and not here, I guess. Secondly, I think one of
>> the interesting columnstore features down the road will be execution on
>> compressed data, which however requires compression method designed for
>> that purpose, and it's often datatype-specific (delta encoding, ...).
>>
>> I don't think we need to go as far as supporting "custom" compression
>> methods here, but I think we should allow different built-in compression
>> methods for different attstreams.
>
> Interesting. We'll need to read/grok that ML thread.
>

That thread is a bit long - not sure it's worth reading as a whole
unless you want to work on that feature. The gist is that to seamlessly
support multiple compression algorithms we need to store an ID of the
algorithm somewhere. For TOAST that's not too difficult, we can do that
in the TOAST pointer - the main challenge is in doing it in a
backwards-compatible way. For zedstore we can actually design it from
the start.

I wonder if we should track a version of the format somewhere, to allow
future improvements. So that if/when we decide to change something in
the future, we don't have to scavenge bits etc. Or perhaps just a
"uint32 flags" field, unused/reserved for future use.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
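As a tiny illustration of the remark about the chunk-size constant above (the
name is ours, not from the patch):

    /* Simple-8b packs at most 60 TIDs into one 64-bit codeword, so a chunk
     * holds at most 60 items; the places that currently hardcode 60 (or 59,
     * as "one less than the maximum") could refer to this instead. */
    #define ZS_MAX_ITEMS_PER_CHUNK  60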
On Thu, Nov 12, 2020 at 4:40 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>              master    zedstore/pglz    zedstore/lz4
> -------------------------------------------------
> copy          1855            68092            2131
> dump           751              905             811
>
> And the size of the lineitem table (as shown by \d+) is:
>
> master: 64GB
> zedstore/pglz: 51GB
> zedstore/lz4: 20GB
>
> It's mostly expected lz4 beats pglz in performance and compression
> ratio, but this seems a bit too extreme I guess. Per past benchmarks
> (e.g. [1] and [2]) the difference in compression/decompression time
> should be maybe 1-2x or something like that, not 35x like here.

I can't speak to the ratio, but in basic backup/restore scenarios pglz
is absolutely killing me; Performance is just awful; we are cpubound
in backups throughout the department. Installations defaulting to
plgz will make this feature show very poorly.

merlin
On 11/16/20 1:59 PM, Merlin Moncure wrote:
> On Thu, Nov 12, 2020 at 4:40 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>              master    zedstore/pglz    zedstore/lz4
>> -------------------------------------------------
>> copy          1855            68092            2131
>> dump           751              905             811
>>
>> And the size of the lineitem table (as shown by \d+) is:
>>
>> master: 64GB
>> zedstore/pglz: 51GB
>> zedstore/lz4: 20GB
>>
>> It's mostly expected lz4 beats pglz in performance and compression
>> ratio, but this seems a bit too extreme I guess. Per past benchmarks
>> (e.g. [1] and [2]) the difference in compression/decompression time
>> should be maybe 1-2x or something like that, not 35x like here.
>
> I can't speak to the ratio, but in basic backup/restore scenarios pglz
> is absolutely killing me; Performance is just awful; we are cpubound
> in backups throughout the department. Installations defaulting to
> plgz will make this feature show very poorly.
>

Maybe. I'm not disputing that pglz is considerably slower than lz4, but
judging by previous benchmarks I'd expect the compression to be slower
maybe by a factor of ~2x. So the 30x difference is suspicious. Similarly
for the compression ratio - lz4 is great, but it seems strange it's 1/2
the size of pglz. Which is why I'm speculating that something else is
going on.

As for the "plgz will make this feature show very poorly" I think that
depends. I think we may end up with pglz doing pretty well (compared to
heap), but lz4 will probably outperform that. OTOH for various use cases
it may be more efficient to use something else with worse compression
ratio, but allowing execution on compressed data, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 16, 2020 at 10:07 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/16/20 1:59 PM, Merlin Moncure wrote:
> > On Thu, Nov 12, 2020 at 4:40 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>              master    zedstore/pglz    zedstore/lz4
> >> -------------------------------------------------
> >> copy          1855            68092            2131
> >> dump           751              905             811
> >>
> >> And the size of the lineitem table (as shown by \d+) is:
> >>
> >> master: 64GB
> >> zedstore/pglz: 51GB
> >> zedstore/lz4: 20GB
> >>
> >> It's mostly expected lz4 beats pglz in performance and compression
> >> ratio, but this seems a bit too extreme I guess. Per past benchmarks
> >> (e.g. [1] and [2]) the difference in compression/decompression time
> >> should be maybe 1-2x or something like that, not 35x like here.
> >
> > I can't speak to the ratio, but in basic backup/restore scenarios pglz
> > is absolutely killing me; Performance is just awful; we are cpubound
> > in backups throughout the department. Installations defaulting to
> > plgz will make this feature show very poorly.
>
> Maybe. I'm not disputing that pglz is considerably slower than lz4, but
> judging by previous benchmarks I'd expect the compression to be slower
> maybe by a factor of ~2x. So the 30x difference is suspicious. Similarly
> for the compression ratio - lz4 is great, but it seems strange it's 1/2
> the size of pglz. Which is why I'm speculating that something else is
> going on.
>
> As for the "plgz will make this feature show very poorly" I think that
> depends. I think we may end up with pglz doing pretty well (compared to
> heap), but lz4 will probably outperform that. OTOH for various use cases
> it may be more efficient to use something else with worse compression
> ratio, but allowing execution on compressed data, etc.

hm, you might be right. Doing some number crunching, I'm getting about
23MB/sec compression on a 600GB backup image on a pretty typical AWS
server. That's obviously not great, but your numbers are much worse than
that, so maybe something else might be going on.

> I think we may end up with pglz doing pretty well (compared to heap)

I *don't* think so, or at least I'm skeptical as long as insertion times
are part of the overall performance measurement. Naturally, with column
stores, insertion times are often very peripheral to the overall
performance picture, but for cases that aren't I suspect the results are
not going to be pleasant, and advise planning accordingly.

Aside, I am very interested in this work. I may be able to support
testing in an enterprise environment; lmk if interested.

thank you

merlin
On Nov 13, 2020, at 2:00 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Fedora 32, nothing special. I'm not sure if I ran the tests with pglz or
> lz4, maybe there's some dependence on that, but it does fail for me
> quite reliably with this:
>
>     ./configure --enable-debug --enable-cassert --enable-tap-tests
>     --with-lz4 && make -s clean && make -s -j4 && make check-world

I'm not sure what I messed up the first time, but I am able to reproduce
reliably now, with and without lz4.

It looks like we have a workaround in place that significantly increases
the number of simultaneous locks acquired during indexing:

    #define XLR_MAX_BLOCK_ID 199

So that's in need of resolution. I'd expect gin and gist to be pretty
flaky until we fix that.

--Jacob
On Wed, 18 Nov 2020 at 00:31, Jacob Champion <pchampion@vmware.com> wrote:
>
> So that's in need of resolution. I'd expect gin and gist to be pretty
> flaky until we fix that.

Jacob and Soumyadeep,

Thanks for submitting this. I think a fix is still outstanding? and
the patch fails to apply on HEAD in two places.

Please can you submit the next version?

Do you mind if we add this for review to the Jan CF?

It is a lot of code and I think there is significant difficulty for
the community to accept that as-is, even though it looks to be a very
high quality submission. So I would like to suggest a strategy for
commit: we accept Zedstore as "Beta" or "Experimental" in PG14,
perhaps with a WARNING/Caution similar to the one that used to be
given by Postgres in earlier versions when you created a Hash index.
We keep Zedstore in "Beta" mode until a later release, PG15 or later
when we can declare Zedstore fully safe. That approach allows us to
get this into the repo asap, and then be fixed and improved
incrementally from here.

e.g.
"NOTICE: Caution: Zedstore is an experimental feature in PostgreSQL14
intended for robustness and performance testing only. Your data and/or
query accuracy may be at risk if you rely on this."

--
Simon Riggs http://www.EnterpriseDB.com/
On Dec 31, 2020, at 9:22 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> On Wed, 18 Nov 2020 at 00:31, Jacob Champion <pchampion@vmware.com> wrote:
>> So that's in need of resolution. I'd expect gin and gist to be pretty
>> flaky until we fix that.
>
> Jacob and Soumyadeep,
>
> Thanks for submitting this. I think a fix is still outstanding? and
> the patch fails to apply on HEAD in two places.
>
> Please can you submit the next version?
>
> Do you mind if we add this for review to the Jan CF?
>
> It is a lot of code and I think there is significant difficulty for
> the community to accept that as-is, even though it looks to be a very
> high quality submission. So I would like to suggest a strategy for
> commit: we accept Zedstore as "Beta" or "Experimental" in PG14,
> perhaps with a WARNING/Caution similar to the one that used to be
> given by Postgres in earlier versions when you created a Hash index.
> We keep Zedstore in "Beta" mode until a later release, PG15 or later
> when we can declare Zedstore fully safe. That approach allows us to
> get this into the repo asap, and then be fixed and improved
> incrementally from here.
The goal for Zedstore is to get a column store into Postgres, but not necessarily Zedstore (though Zedstore itself would be nice). When designing Zedstore, success for us would be:
- significantly more performant on OLAP type queries,
- performant enough to not be terrible with OLTP type queries
- must support compression
- cannot be append-only; this was the case initially with the Greenplum Column Store and it was a mistake. Customers want to update and delete
- it needs to be feature complete as compared to HEAP unless it doesn’t make sense
Our initial goal is to get the TableAM and executor molded into a state where the above is possible for anyone wanting a column store implementation.
Given the goal of addressing API/Executor issues generically first, we have been trying to peel off and work on the parts that are not tightly linked to Zedstore. Specifically I don’t think it would be ok to merge Zedstore into core when it might affect the performance of HEAP relations.
Instead of focusing on the larger, more difficult to review Zedstore patch, we are trying to peel off the touch points where Zedstore and the current server interact. Note this isn’t intended to be an exhaustive list, rather a list of the most immediate issues. Some of these issues are critical for Zedstore to work, i.e. column projection, while some of these issues point more towards ensuring the various layers in the code are clean so that folks leveraging the TableAM don’t need to write their own bits from whole cloth but rather can leverage appropriately generic primitives, i.e. DBsize or page inspect.
As such, an incomplete list of things currently on our radar:
1) Column Projection — We have a patch [1] that is a demonstration of what we would like to do. There are several TODOs in the email that can/will be addressed if the general method is acceptable
2) DBSize —Georgios has a patch [2] that begins to make DBSize less HEAP specific
3) Reloptions —Jeff Davis has a patch [3] that begins to make these more flexible, having spoken with him we think additional work needs to be done here
4) PageInspect —needs to be less HEAP specific but no work has been done here that I’m aware of
5) bitmapHeapScan —currently scans both the index and the relation, there are code comments to address this and we need to look into what a fix would mean
6) Bulk insertion —Justin Pryzby has a patch [4] we are following along with.
7) analyze — Denis has a patch which starts to address this [5]
Ideally we can peel out anything that is useful to any column store. Once those have been discussed and committed the general code should be in better shape as well.
— Rob
Greetings. Thanks for the project. I see the code on GitHub has not been
updated for a long time; is it still in active development?

Thanks

--
Best Regards
Andy Fan (https://www.aliyun.com/)