Re: zheap: a new storage format for PostgreSQL - Mailing list pgsql-hackers
From | Mark Kirkwood |
---|---|
Subject | Re: zheap: a new storage format for PostgreSQL |
Date | |
Msg-id | 42ebe808-b40f-dd90-c815-e074398992ee@catalyst.net.nz |
In response to | Re: zheap: a new storage format for PostgreSQL (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-hackers |
On 03/03/18 05:03, Robert Haas wrote:
> On Fri, Mar 2, 2018 at 5:35 AM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
>> I would propose "zero-bloat heap" disambiguation of zheap. Seems like fair
>> enough explanation for me without need to rename :)
> It will be possible to bloat a zheap table in certain usage patterns.
> For example, if you bulk-load the table with a ton of data, commit the
> transaction, delete every other row, and then never insert any more
> rows ever again, the table is bloated: it's twice as large as it
> really needs to be, and we have no provision for shrinking it. In
> general, I think it's very hard to keep bulk deletes from leaving
> bloat in the table, and to the extent that it *is* possible, we're not
> doing it. One could imagine, for example, an index-organized table
> that automatically combines adjacent pages when they're empty enough,
> and that also relocates data to physically lower-numbered pages
> whenever possible. Such a storage engine might automatically shrink
> the on-disk footprint after a large delete, but we have no plans to go
> in that direction.
>
> Rather, our assumption is that the bloat most people care about comes
> from updates. By performing updates in-place as often as possible, we
> hope to avoid bloating both the heap (because we're not adding new row
> versions to it which then have to be removed) and the indexes (because
> if we don't add new row versions at some other TID, then we don't need
> to add index pointers to that new TID either, or remove the old index
> pointers to the old TID). Without delete-marking, we can basically
> optimize the case that is currently handled via HOT updates: no
> indexed columns have changed. However, the in-place update has a
> major advantage that it still works even when the page is completely
> full, provided that the row does not expand. As Amit's results show,
> that can hugely reduce bloat and increase performance in the face of
> long-running concurrent transactions. With delete-marking, we can
> also optimize the case where indexed columns have been changed. We
> don't know exactly how well this will work yet because the code isn't
> written and therefore can't be benchmarked, but am hopeful that
> in-place updates will be a big win here too.
>
> So, I would not describe a zheap table as zero-bloat, but it should
> involve a lot less bloat than our standard heap.
>

For folk doing ETL type data warehousing this should be great, as the
typical workload tends to be like: COPY (or similar) from a foreign data
source, then do several sets of UPDATEs to fix/check/scrub the data...
which tends to result in huge bloat with the current heap design
(despite telling people 'you can do it another way' to avoid bloat - I
guess it seems to be more intuitive to just do it as described).

regards

Mark
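As a rough illustration of the two patterns discussed above (load then several UPDATE passes, and load then delete every other row), here is a toy page-count model in Python. It is only a sketch under stated assumptions, not a description of how PostgreSQL or zheap actually lays out data: `ROWS_PER_PAGE` and all function names are hypothetical, rows are assumed never to grow, and the append-style figure assumes no old row versions can be reclaimed between passes (for instance because a long-running transaction holds back cleanup). Space used elsewhere to keep old versions for in-place updates is not modelled.

```python
# Toy model only: counts table pages under simplifying assumptions.
# It ignores VACUUM, HOT pruning, fillfactor, and any space used
# outside the table to keep old row versions.

ROWS_PER_PAGE = 100  # hypothetical; real capacity depends on row width


def pages_needed(row_versions: int, rows_per_page: int = ROWS_PER_PAGE) -> int:
    """Pages required to hold the given number of row versions (ceiling division)."""
    return -(-row_versions // rows_per_page)


def append_style_pages(rows: int, update_passes: int) -> int:
    """Each UPDATE pass writes a new version of every row.

    Assumption: no old version is reclaimed between passes, so the
    table holds the original rows plus one extra copy per pass.
    """
    return pages_needed(rows * (1 + update_passes))


def in_place_pages(rows: int, update_passes: int) -> int:
    """Each UPDATE pass overwrites rows where they are.

    Assumption: rows never expand, so the page count does not depend
    on the number of passes.
    """
    return pages_needed(rows)


def pages_after_deleting_every_other_row(rows: int) -> int:
    """Deleting alternate rows leaves every page roughly half full.

    No page becomes empty, so the table keeps its pre-delete size
    under either update strategy.
    """
    return pages_needed(rows)


if __name__ == "__main__":
    rows, passes = 1_000_000, 3
    print("append-style pages:", append_style_pages(rows, passes))
    print("in-place pages:    ", in_place_pages(rows, passes))
    print("after bulk delete: ", pages_after_deleting_every_other_row(rows))
    print("ideal after delete:", pages_needed(rows // 2))
```

In this model the scrub-by-UPDATE workload grows the table linearly with the number of passes when new versions are appended, and not at all when updates are in place. The bulk-delete case comes out the same size either way, which matches the point in the quoted text that in-place updates do not address bloat left by deletes.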