zheap: a new storage format for PostgreSQL - Mailing list pgsql-hackers

From Amit Kapila
Subject zheap: a new storage format for PostgreSQL
Date
Msg-id CAA4eK1+YtM5vxzSM2NZm+pC37MCwyvtkmJrO_yRBQeZDp9Wa2w@mail.gmail.com
Responses Re: zheap: a new storage format for PostgreSQL  (Amit Kapila <amit.kapila16@gmail.com>)
Re: zheap: a new storage format for PostgreSQL  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Re: zheap: a new storage format for PostgreSQL  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Re: zheap: a new storage format for PostgreSQL  (Fabien COELHO <coelho@cri.ensmp.fr>)
RE: zheap: a new storage format for PostgreSQL  ("Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com>)
Re: zheap: a new storage format for PostgreSQL  (Aleksander Alekseev <a.alekseev@postgrespro.ru>)
Re: zheap: a new storage format for PostgreSQL  (Mithun Cy <mithun.cy@enterprisedb.com>)
List pgsql-hackers
Some time back, Robert proposed a solution to reduce the bloat in PostgreSQL [1], which has some other advantages of its own as well.  To recap: in the existing heap, an update always creates a new version of the tuple, which must eventually be removed by periodic vacuuming or by HOT-pruning; even so, in many cases the space is never completely reclaimed.  A similar problem occurs for tuples that are deleted.  This leads to bloat in the database.

At EnterpriseDB, we (some of my colleagues and I) have been working for more than a year on a new storage format in which only the latest version of the data is kept in the main storage and old versions are moved to an undo log.  We call this new storage format "zheap".  To be clear, this proposal is for PG-12.  The purpose of posting this at this stage is that it can serve as an example to be integrated with the pluggable storage API patch and that we can get some early feedback on the design.  The purpose of this email is to introduce the overall project; going forward, I think we need to discuss some of the subsystems (like indexing, tuple locking, vacuum for non-delete-marked indexes, undo log storage, undo workers, etc.) in separate threads.

The three main advantages of this new format are:
1. Provide better control over bloat (a) by allowing in-place updates in common cases and (b) by reusing space as soon as a transaction that has performed a delete or non-in-place-update has committed.  In short, with this new storage, whenever possible, we’ll avoid creating bloat in the first place.

2. Reduce write amplification both by avoiding rewrites of heap pages (for setting hint-bits, freezing, etc.) and by making it possible to do an update that touches indexed columns without updating every index.

3. Reduce the tuple size by (a) shrinking the tuple header and (b) eliminating most alignment padding.
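To illustrate the core idea behind advantage 1, here is a deliberately simplified toy model (my own sketch, not zheap's actual C implementation): an update overwrites the tuple in place and pushes the prior version onto an undo log, instead of adding a new tuple version to the page as the regular heap does.

```python
# Toy model of in-place update with an undo log.  This is a conceptual
# sketch only; the names ZheapPage, insert, update are hypothetical.

class ZheapPage:
    def __init__(self):
        self.tuples = {}      # slot -> current (latest) tuple value
        self.undo_log = []    # list of (slot, old_value) undo records

    def insert(self, slot, value):
        self.tuples[slot] = value

    def update(self, slot, new_value):
        # Record the old version in undo, then overwrite in place.
        self.undo_log.append((slot, self.tuples[slot]))
        self.tuples[slot] = new_value

page = ZheapPage()
page.insert(0, "v1")
page.update(0, "v2")
page.update(0, "v3")

# The page still holds exactly one version of the tuple...
assert len(page.tuples) == 1 and page.tuples[0] == "v3"
# ...while the superseded versions live in the undo log.
assert page.undo_log == [(0, "v1"), (0, "v2")]
```

The point is that the main storage never accumulates dead tuple versions; old versions are segregated into undo, which can be thrown away wholesale once no transaction needs them.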

You can check README.md in the project folder [1] to understand how to use it and what the open issues are.  The detailed design of the project is present in src/backend/access/zheap/README.  The code for this project is being developed in a GitHub repository [1].  You can also read about this project in Robert's recent blog post [2].  I have also added a few notes on integration with the pluggable API on the zheap wiki page [3].

Preliminary performance results
-------------------------------------------

We’ve measured the performance improvement of zheap over heap in a few different pgbench scenarios.  All of these tests were run with data that fits in shared_buffers (32GB) and 16 transaction slots per zheap page.  Scenarios 1 and 2 used synchronous_commit = off; Scenarios 3 and 4 used synchronous_commit = on.

Scenario 1: A 15-minute simple-update pgbench test with scale factor 100 shows a 5.13% TPS improvement with 64 clients. The performance improvement increases with the scale factor; at scale factor 1000, it reaches 11.5% with 64 clients.


              Scale Factor   HEAP      ZHEAP (tables)*   Improvement
Before test   100            1281 MB   1149 MB           -10.3%
              1000           13 GB     11 GB             -15.38%
After test    100            4.08 GB   3 GB              -26.47%
              1000           15 GB     12.6 GB           -16%

* The size of the zheap tables increases because of the insertions into the pgbench_history table.

Scenario 2: To show the effect of bloat, we’ve performed another test similar to the previous scenario, but with a transaction kept open for the first 15 minutes of a 30-minute test. This restricts HOT-pruning for heap and undo-discarding for zheap during the first half of the test.

Scale factor 1000 - 75.86% TPS improvement for zheap with 64 clients.

Scale factor 3000 - 98.18% TPS improvement for zheap with 64 clients.


             Scale Factor   HEAP    ZHEAP (tables)*   Improvement
After test   1000           19 GB   14 GB             -26.3%
             3000           45 GB   37 GB             -17.7%

* The size of the zheap tables increases because of the insertions into the pgbench_history table.

The reason for this huge performance improvement is that when the long-running transaction commits after 900 seconds, autovacuum workers start working and degrade heap's performance for a long time. In addition, the heap tables are bloated by a significant amount. On the other hand, the undo worker discards the undo very quickly, and we don't have any bloat in the zheap relations. In brief, zheap confines the bloat to the undo segments; we just need to determine how much undo can be discarded and remove it, which is cheap.
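The contrast with vacuum can be sketched as follows (a hypothetical toy model, not the real undo worker): because undo records are generated in transaction order, everything older than the oldest transaction still in progress can be dropped in one step by advancing a discard pointer, with no per-tuple scan of table pages.

```python
# Toy sketch of undo discard.  Each record is (xid, payload); names and
# structure here are illustrative assumptions, not zheap's on-disk format.

undo = [
    (100, "old tuple A"),
    (101, "old tuple B"),
    (105, "old tuple C"),
]

def discard(undo_records, oldest_active_xid):
    """Drop every undo record generated before the oldest transaction
    that might still need to see an old tuple version."""
    return [(xid, data) for xid, data in undo_records
            if xid >= oldest_active_xid]

# While a long-running transaction with xid 100 stays open, nothing can
# be discarded:
assert len(discard(undo, 100)) == 3
# Once it finishes and the horizon advances, old undo goes away at once:
assert discard(undo, 105) == [(105, "old tuple C")]
```

In the real system the discard is essentially a pointer advance over undo log segments, which is why it is so much cheaper than vacuuming dead tuples out of heap pages one by one.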

Scenario 3: A 15-minute simple-update pgbench test with scale factor 100 shows a 6% TPS improvement with 64 clients. The performance improvement increases with the scale factor, reaching 11.8% at scale factor 1000 with 64 clients.


              Scale Factor   HEAP      ZHEAP (tables)*   Improvement
Before test   100            1281 MB   1149 MB           -10.3%
              1000           13 GB     11 GB             -15.38%
After test    100            2.88 GB   2.20 GB           -23.61%
              1000           13.9 GB   11.7 GB           -15.8%

* The size of the zheap tables increases because of the insertions into the pgbench_history table.

Scenario 4: To amplify the effect of bloat in Scenario 3, we’ve performed another similar test, but with a transaction kept open for the first 15 minutes of a 30-minute test. This restricts HOT-pruning for heap and undo-discarding for zheap during the first half of the test.


             Scale Factor   HEAP      ZHEAP (tables)*   Improvement
After test   1000           15.5 GB   12.4 GB           -20%
             3000           40.2 GB   35 GB             -12.9%



Pros
--------
1. Zheap has better performance characteristics: it is smaller in size, and its mechanism for discarding undo in the background is cheaper than HOT-pruning.
2. The performance improvement is huge in cases where heap bloats, because zheap confines that bloat to the undo log.
3. We will also see a good performance boost for UPDATE statements that modify only a few of the indexed columns.
4. System slowdowns due to vacuum (or autovacuum) would be reduced to a great extent.
5. Due to fewer rewrites of the heap (no freezing, HOT-pruning, hint-bit setting, etc.), the overall writes and the WAL volume will be lower.

Cons
-----------
1. Deletes can be somewhat expensive.
2. Transaction aborts will be expensive.
3. Updates that update most of the indexed columns can be somewhat expensive.
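Con 2 follows directly from the undo design: rolling back means re-applying the transaction's undo records, newest first, to restore the pre-transaction tuple versions, whereas the regular heap can abort almost for free by simply leaving its new tuple versions behind as dead. A toy illustration (my own sketch, not zheap's rollback code):

```python
# Illustrative sketch of why aborts cost more under zheap: the abort
# must walk the transaction's undo records backwards and restore each
# overwritten version in place.

tuples = {0: "v1"}
undo_log = []

# The transaction makes two in-place updates, logging old versions.
for new in ("v2", "v3"):
    undo_log.append((0, tuples[0]))
    tuples[0] = new
assert tuples[0] == "v3"

# Abort: replay the undo log in reverse, restoring old versions.
for slot, old in reversed(undo_log):
    tuples[slot] = old
undo_log.clear()

assert tuples == {0: "v1"}   # pre-transaction state restored
```

The work done at abort is proportional to the number of changes the transaction made, which is why large aborting transactions are the expensive case here.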

Credits
------------
Robert did much of the basic design work.  The design and development of the various subsystems of zheap have been done by a team comprising me, Dilip Kumar, Kuntal Ghosh, Mithun CY, Ashutosh Sharma, Rafia Sabih, Beena Emerson, and Amit Khandekar.  Thomas Munro wrote the undo storage system.  Marc Linster has provided unfailing management support, and Andres Freund has provided some design input (and criticism).  Neha Sharma and Tushar Ahuja are helping with the testing of this project.

[1] - https://github.com/EnterpriseDB/zheap
[2] - http://rhaas.blogspot.in/2018/01/do-or-undo-there-is-no-vacuum.html

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
