zheap: a new storage format for PostgreSQL - Mailing list pgsql-hackers

From Amit Kapila
Subject zheap: a new storage format for PostgreSQL
Date
Msg-id CAA4eK1+YtM5vxzSM2NZm+pC37MCwyvtkmJrO_yRBQeZDp9Wa2w@mail.gmail.com
Responses Re: zheap: a new storage format for PostgreSQL  (Amit Kapila <amit.kapila16@gmail.com>)
Re: zheap: a new storage format for PostgreSQL  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Re: zheap: a new storage format for PostgreSQL  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Re: zheap: a new storage format for PostgreSQL  (Fabien COELHO <coelho@cri.ensmp.fr>)
RE: zheap: a new storage format for PostgreSQL  ("Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com>)
Re: zheap: a new storage format for PostgreSQL  (Aleksander Alekseev <a.alekseev@postgrespro.ru>)
Re: zheap: a new storage format for PostgreSQL  (Mithun Cy <mithun.cy@enterprisedb.com>)
List pgsql-hackers
Some time back, Robert proposed a solution to reduce the bloat in PostgreSQL [1], which has some other advantages of its own as well.  To recap: in the existing heap, an update always creates a new version of the tuple, which must eventually be removed by periodic vacuuming or by HOT-pruning; even so, in many cases the space is never completely reclaimed.  A similar problem occurs for tuples that are deleted.  This leads to bloat in the database.

At EnterpriseDB, we (some of my colleagues and I) have been working for more than a year on a new storage format in which only the latest version of the data is kept in the main storage and old versions are moved to an undo log.  We call this new storage format "zheap".  To be clear, this proposal is for PG-12.  The purpose of posting this at this stage is that it can serve as an example to be integrated with the pluggable storage API patch and that we can get some early feedback on the design.  The purpose of this email is to introduce the overall project; going forward, I think we need to discuss some of the subsystems (like indexing, tuple locking, vacuum for non-delete-marked indexes, undo log storage, undo workers, etc.) in separate threads.

The three main advantages of this new format are:
1. Provide better control over bloat (a) by allowing in-place updates in common cases and (b) by reusing space as soon as a transaction that has performed a delete or non-in-place-update has committed.  In short, with this new storage, whenever possible, we’ll avoid creating bloat in the first place.

2. Reduce write amplification both by avoiding rewrites of heap pages (for setting hint-bits, freezing, etc.) and by making it possible to do an update that touches indexed columns without updating every index.

3. Reduce the tuple size by (a) shrinking the tuple header and (b) eliminating most alignment padding.
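To illustrate the core idea behind advantage 1, here is a deliberately simplified toy model (my own sketch, not zheap's actual C implementation): an update overwrites the tuple in place and pushes the prior version onto an undo log, instead of adding a new tuple version to the page as the regular heap does.

```python
# Toy model of in-place update with an undo log.  This is a conceptual
# sketch only; the names ZheapPage, insert, update are hypothetical.

class ZheapPage:
    def __init__(self):
        self.tuples = {}      # slot -> current (latest) tuple value
        self.undo_log = []    # list of (slot, old_value) undo records

    def insert(self, slot, value):
        self.tuples[slot] = value

    def update(self, slot, new_value):
        # Record the old version in undo, then overwrite in place.
        self.undo_log.append((slot, self.tuples[slot]))
        self.tuples[slot] = new_value

page = ZheapPage()
page.insert(0, "v1")
page.update(0, "v2")
page.update(0, "v3")

# The page still holds exactly one version of the tuple...
assert len(page.tuples) == 1 and page.tuples[0] == "v3"
# ...while the superseded versions live in the undo log.
assert page.undo_log == [(0, "v1"), (0, "v2")]
```

The point is that the main storage never accumulates dead tuple versions; old versions are segregated into undo, which can be thrown away wholesale once no transaction needs them.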

You can check README.md in the project folder [1] to understand how to use it and what the open issues are.  The detailed design of the project is present in src/backend/access/zheap/README.  The code for this project is being developed in a GitHub repository [1].  You can also read about this project in Robert's recent blog post [2].  I have also added a few notes on integration with the pluggable API on the zheap wiki page [3].

Preliminary performance results
-------------------------------------------

We’ve measured the performance improvement of zheap over heap in a few different pgbench scenarios.  All of these tests were run with data that fits in shared_buffers (32GB) and 16 transaction slots per zheap page.  Scenarios 1 and 2 used synchronous_commit = off; Scenarios 3 and 4 used synchronous_commit = on.

Scenario 1: A 15-minute simple-update pgbench test with scale factor 100 shows a 5.13% TPS improvement with 64 clients. The performance improvement increases with the scale factor; at scale factor 1000, it reaches 11.5% with 64 clients.


              Scale Factor   HEAP      ZHEAP (tables)*   Improvement
Before test   100            1281 MB   1149 MB           -10.3%
              1000           13 GB     11 GB             -15.38%
After test    100            4.08 GB   3 GB              -26.47%
              1000           15 GB     12.6 GB           -16%

* The size of the zheap tables increases because of the insertions into the pgbench_history table.

Scenario 2: To show the effect of bloat, we’ve performed another test similar to the previous scenario, but with a transaction kept open for the first 15 minutes of a 30-minute test. This restricts HOT-pruning for heap and undo-discarding for zheap during the first half of the test.

Scale factor 1000 - 75.86% TPS improvement for zheap with 64 clients.

Scale factor 3000 - 98.18% TPS improvement for zheap with 64 clients.


             Scale Factor   HEAP    ZHEAP (tables)*   Improvement
After test   1000           19 GB   14 GB             -26.3%
             3000           45 GB   37 GB             -17.7%

* The size of the zheap tables increases because of the insertions into the pgbench_history table.

The reason for this huge performance improvement is that when the long-running transaction commits after 900 seconds, autovacuum workers start working and degrade heap's performance for a long time. In addition, the heap tables are bloated by a significant amount. On the other hand, the undo worker discards the undo very quickly, and we don't have any bloat in the zheap relations. In brief, zheap confines the bloat to the undo segments; we just need to determine how much undo can be discarded and remove it, which is cheap.
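The contrast with vacuum can be sketched as follows (a hypothetical toy model, not the real undo worker): because undo records are generated in transaction order, everything older than the oldest transaction still in progress can be dropped in one step by advancing a discard pointer, with no per-tuple scan of table pages.

```python
# Toy sketch of undo discard.  Each record is (xid, payload); names and
# structure here are illustrative assumptions, not zheap's on-disk format.

undo = [
    (100, "old tuple A"),
    (101, "old tuple B"),
    (105, "old tuple C"),
]

def discard(undo_records, oldest_active_xid):
    """Drop every undo record generated before the oldest transaction
    that might still need to see an old tuple version."""
    return [(xid, data) for xid, data in undo_records
            if xid >= oldest_active_xid]

# While a long-running transaction with xid 100 stays open, nothing can
# be discarded:
assert len(discard(undo, 100)) == 3
# Once it finishes and the horizon advances, old undo goes away at once:
assert discard(undo, 105) == [(105, "old tuple C")]
```

In the real system the discard is essentially a pointer advance over undo log segments, which is why it is so much cheaper than vacuuming dead tuples out of heap pages one by one.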

Scenario 3: A 15-minute simple-update pgbench test with scale factor 100 shows a 6% TPS improvement with 64 clients. The performance improvement increases with the scale factor, reaching 11.8% at scale factor 1000 with 64 clients.


              Scale Factor   HEAP      ZHEAP (tables)*   Improvement
Before test   100            1281 MB   1149 MB           -10.3%
              1000           13 GB     11 GB             -15.38%
After test    100            2.88 GB   2.20 GB           -23.61%
              1000           13.9 GB   11.7 GB           -15.8%

* The size of the zheap tables increases because of the insertions into the pgbench_history table.

Scenario 4: To amplify the effect of bloat in Scenario 3, we’ve performed another similar test, but with a transaction kept open for the first 15 minutes of a 30-minute test. This restricts HOT-pruning for heap and undo-discarding for zheap during the first half of the test.


             Scale Factor   HEAP      ZHEAP (tables)*   Improvement
After test   1000           15.5 GB   12.4 GB           -20%
             3000           40.2 GB   35 GB             -12.9%



Pros
--------
1. Zheap has better performance characteristics: it is smaller in size, and its mechanism for discarding undo in the background is cheaper than HOT-pruning.
2. The performance improvement is huge in cases where heap bloats, because zheap confines that bloat to the undo log.
3. We will also see a good performance boost for UPDATE statements that modify only a few of the indexed columns.
4. System slowdowns due to vacuum (or autovacuum) would be reduced to a great extent.
5. Due to fewer rewrites of the heap (no freezing, HOT-pruning, hint-bit setting, etc.), the overall writes and the WAL volume will be lower.

Cons
-----------
1. Deletes can be somewhat expensive.
2. Transaction aborts will be expensive.
3. Updates that update most of the indexed columns can be somewhat expensive.
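Con 2 follows directly from the undo design: rolling back means re-applying the transaction's undo records, newest first, to restore the pre-transaction tuple versions, whereas the regular heap can abort almost for free by simply leaving its new tuple versions behind as dead. A toy illustration (my own sketch, not zheap's rollback code):

```python
# Illustrative sketch of why aborts cost more under zheap: the abort
# must walk the transaction's undo records backwards and restore each
# overwritten version in place.

tuples = {0: "v1"}
undo_log = []

# The transaction makes two in-place updates, logging old versions.
for new in ("v2", "v3"):
    undo_log.append((0, tuples[0]))
    tuples[0] = new
assert tuples[0] == "v3"

# Abort: replay the undo log in reverse, restoring old versions.
for slot, old in reversed(undo_log):
    tuples[slot] = old
undo_log.clear()

assert tuples == {0: "v1"}   # pre-transaction state restored
```

The work done at abort is proportional to the number of changes the transaction made, which is why large aborting transactions are the expensive case here.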

Credits
------------
Robert did much of the basic design work.  The design and development of the various subsystems of zheap have been done by a team comprising me, Dilip Kumar, Kuntal Ghosh, Mithun CY, Ashutosh Sharma, Rafia Sabih, Beena Emerson, and Amit Khandekar.  Thomas Munro wrote the undo storage system.  Marc Linster has provided unfailing management support, and Andres Freund has provided some design input (and criticism).  Neha Sharma and Tushar Ahuja are helping with the testing of this project.

[1] - https://github.com/EnterpriseDB/zheap
[2] - http://rhaas.blogspot.in/2018/01/do-or-undo-there-is-no-vacuum.html

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
