Physical append-only tables - Mailing list pgsql-hackers

From Magnus Hagander
Subject Physical append-only tables
Date
Msg-id CABUevEwLok4_pSMM-y7gRjWfdXcBZ+4a6+DUb2+rOma+VztUCQ@mail.gmail.com
Whole thread Raw
Responses Re: Physical append-only tables  (Thomas Munro <thomas.munro@enterprisedb.com>)
Re: Physical append-only tables  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: Physical append-only tables  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
Scenario:

Large table, *mostly* insert+select only, either a sequence or a timestamp column set to current time. This is an ideal usecase for a BRIN index. Awesome.

But then consider the same table. Except rows are typically updated once or twice when they are new, and *then* go read only. And we also have a process that at some point deletes *some* old rows (but not all - in fact, only a small portion).

In this case, the next INSERT once VACUUM has run is likely to stick a "new" row somewhere very "far back" in the table, since there is now free space there. This more or less completely ruins the BRIN index usability, as the "old" blocks will now contain a single row from a "new" series.

For a scenario like this, would it make sense to have an option that could be set on an individual table making it physical append only? Basically VACUUM would run as normal and clean up the old space when rows are deleted back in history, but when new space is needed for a row the system would never look at the old blocks, and only append to the end.

Yes, this wastes space in the database. But not as much as trying to track the deleted rows somewhere else and do an anti-join or something like that. And it would obviously not be on by default.

Eventually the lost space might grow to a point where re-CLUSTERing the table might be worthwhile. But given the cost of that, I can see many scenarios where people are just willing to pay that extra overhead on large tables that are more or less never deleted from.

I've run into a number of cases recently where this would've made the BRIN indexes on huge tables *much* more efficient. At least, it seems to me they would :)

Or am I missing something that would make this not work?

--

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Do we need use more meaningful variables to replace 0 in catalog head files?
Next
From: Tom Lane
Date:
Subject: Re: Do we need use more meaningful variables to replace 0 in catalog head files?