Re: zheap: a new storage format for PostgreSQL - Mailing list pgsql-hackers

From Mark Kirkwood
Subject Re: zheap: a new storage format for PostgreSQL
Date
Msg-id 42ebe808-b40f-dd90-c815-e074398992ee@catalyst.net.nz
In response to Re: zheap: a new storage format for PostgreSQL  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers

On 03/03/18 05:03, Robert Haas wrote:
> On Fri, Mar 2, 2018 at 5:35 AM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
>> I would propose "zero-bloat heap" disambiguation of zheap.  Seems like fair
>> enough explanation for me without need to rename :)
> It will be possible to bloat a zheap table in certain usage patterns.
> For example, if you bulk-load the table with a ton of data, commit the
> transaction, delete every other row, and then never insert any more
> rows ever again, the table is bloated: it's twice as large as it
> really needs to be, and we have no provision for shrinking it.  In
> general, I think it's very hard to keep bulk deletes from leaving
> bloat in the table, and to the extent that it *is* possible, we're not
> doing it.  One could imagine, for example, an index-organized table
> that automatically combines adjacent pages when they're empty enough,
> and that also relocates data to physically lower-numbered pages
> whenever possible.  Such a storage engine might automatically shrink
> the on-disk footprint after a large delete, but we have no plans to go
> in that direction.
>
> Rather, our assumption is that the bloat most people care about comes
> from updates.  By performing updates in-place as often as possible, we
> hope to avoid bloating both the heap (because we're not adding new row
> versions to it which then have to be removed) and the indexes (because
> if we don't add new row versions at some other TID, then we don't need
> to add index pointers to that new TID either, or remove the old index
> pointers to the old TID).  Without delete-marking, we can basically
> optimize the case that is currently handled via HOT updates: no
> indexed columns have changed.  However, the in-place update has a
> major advantage that it still works even when the page is completely
> full, provided that the row does not expand.  As Amit's results show,
> that can hugely reduce bloat and increase performance in the face of
> long-running concurrent transactions.  With delete-marking, we can
> also optimize the case where indexed columns have been changed.  We
> don't know exactly how well this will work yet because the code isn't
> written and therefore can't be benchmarked, but I am hopeful that
> in-place updates will be a big win here too.
>
> So, I would not describe a zheap table as zero-bloat, but it should
> involve a lot less bloat than our standard heap.
>

For folk doing ETL-type data warehousing this should be great, as the
typical workload tends to be: COPY (or similar) from a foreign data
source, then do several sets of UPDATEs to fix/check/scrub the
data... which tends to result in huge bloat with the current heap design
(despite telling people 'you can do it another way' to avoid bloat -
I guess it seems more intuitive to just do it as described).
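
To put rough numbers on that pattern, here is a back-of-envelope sketch in
Python. It is a toy model only, not anything taken from the zheap patch: the
function name and parameters are made up for illustration, every scrub pass
is assumed to update every row, and it ignores HOT pruning, fillfactor, page
boundaries, and the undo space that in-place updates need.

```python
def heap_row_slots(rows, update_passes, in_place=False,
                   vacuum_between_passes=False):
    """Toy count of row-version slots a table occupies after a bulk load
    followed by `update_passes` full-table UPDATEs.

    Hypothetical model for illustration only: it counts slots, not pages,
    and assumes updated rows never grow.
    """
    if in_place:
        # In-place updates overwrite the existing version, so the
        # table keeps one slot per row (undo is not modelled here).
        return rows

    live, dead, free = rows, 0, 0
    for _ in range(update_passes):
        if vacuum_between_passes:
            # Vacuum turns dead versions into reusable free space.
            free += dead
            dead = 0
        # New row versions fill free slots first, then extend the table.
        reused = min(free, live)
        free -= reused
        # Every old version becomes dead once its replacement is written.
        dead += live
    return live + dead + free


if __name__ == "__main__":
    n = 1_000_000
    print(heap_row_slots(n, 3))                              # no vacuum
    print(heap_row_slots(n, 3, vacuum_between_passes=True))  # vacuum each pass
    print(heap_row_slots(n, 3, in_place=True))               # in-place updates
```

Under these assumptions, three scrub passes with no vacuum in between leave
four slots per row, vacuuming between passes still leaves two, and in-place
updates leave one.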

regards
Mark


