Thread: Physical append-only tables

Physical append-only tables

From
Magnus Hagander
Date:
Scenario:

Large table, *mostly* insert+select only, either a sequence or a timestamp column set to current time. This is an ideal usecase for a BRIN index. Awesome.

But then consider the same table. Except rows are typically updated once or twice when they are new, and *then* go read only. And we also have a process that at some point deletes *some* old rows (but not all - in fact, only a small portion).

In this case, the next INSERT once VACUUM has run is likely to stick a "new" row somewhere very "far back" in the table, since there is now free space there. This more or less completely ruins the BRIN index usability, as the "old" blocks will now contain a single row from a "new" series.

For a scenario like this, would it make sense to have an option that could be set on an individual table making it physical append only? Basically VACUUM would run as normal and clean up the old space when rows are deleted back in history, but when new space is needed for a row the system would never look at the old blocks, and only append to the end.

Yes, this wastes space in the database. But not as much as trying to track the deleted rows somewhere else and do an anti-join or something like that. And it would obviously not be on by default.

Eventually the lost space might grow to a point where re-CLUSTERing the table might be worthwhile. But given the cost of that, I can see many scenarios where people are just willing to pay that extra overhead on large tables that are more or less never deleted from.

I've run into a number of cases recently where this would've made the BRIN indexes on huge tables *much* more efficient. At least, it seems to me they would :)

Or am I missing something that would make this not work?

--

Re: Physical append-only tables

From
Thomas Munro
Date:
On Mon, Nov 14, 2016 at 4:45 AM, Magnus Hagander <magnus@hagander.net> wrote:
> For a scenario like this, would it make sense to have an option that could
> be set on an individual table making it physical append only? Basically
> VACUUM would run as normal and clean up the old space when rows are deleted
> back in history, but when new space is needed for a row the system would
> never look at the old blocks, and only append to the end.

One thing I have idly wondered about: if you had an append-only mode
that guarantees that we never write new tuples anywhere but the end of
the heap, you could create a bsearch access method that uses zero
storage and simply checks that every key inserted is >= the previous
high key before allowing the insertion to proceed.  Then it could make
use of the guaranteed correlation with physical order to do scans
using a binary search of the heap.  Maybe that'd be useful for some
kinds of write-only time-series data that needs to be searched by time
range.  On the other hand, BRIN indexes are tiny, should be nearly as
good and are much less restrictive, so I haven't follow this thought
up.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Physical append-only tables

From
Alvaro Herrera
Date:
Magnus Hagander wrote:

> But then consider the same table. Except rows are typically updated once or
> twice when they are new, and *then* go read only. And we also have a
> process that at some point deletes *some* old rows (but not all - in fact,
> only a small portion).
> 
> In this case, the next INSERT once VACUUM has run is likely to stick a
> "new" row somewhere very "far back" in the table, since there is now free
> space there. This more or less completely ruins the BRIN index usability,
> as the "old" blocks will now contain a single row from a "new" series.

Yeah.  When we initially discussed BRIN, there was a mention of allowing
a BRIN index to guide new tuple location -- something like
auto-clustering, if you will.  We haven't discussed the exact details
but I think something along those lines is worth considering.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Physical append-only tables

From
Magnus Hagander
Date:
On Mon, Nov 14, 2016 at 2:35 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Magnus Hagander wrote:

> But then consider the same table. Except rows are typically updated once or
> twice when they are new, and *then* go read only. And we also have a
> process that at some point deletes *some* old rows (but not all - in fact,
> only a small portion).
>
> In this case, the next INSERT once VACUUM has run is likely to stick a
> "new" row somewhere very "far back" in the table, since there is now free
> space there. This more or less completely ruins the BRIN index usability,
> as the "old" blocks will now contain a single row from a "new" series.

Yeah.  When we initially discussed BRIN, there was a mention of allowing
a BRIN index to guide new tuple location -- something like
auto-clustering, if you will.  We haven't discussed the exact details
but I think something along those lines is worth considering.

What I'm talking about is something that would be a lot simpler than auto-clustering. I'm not even talking about trying to detect if the row was about to go into the right place -- this might be expensive and certainly more complicated. I'm only talking about a simple case where we *never* put anything anywhere other than at the end of the table, period. That should make the check both cheap and simple. 

--

Re: Physical append-only tables

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> What I'm talking about is something that would be a lot simpler than
> auto-clustering. I'm not even talking about trying to detect if the row was
> about to go into the right place -- this might be expensive and certainly
> more complicated. I'm only talking about a simple case where we *never* put
> anything anywhere other than at the end of the table, period. That should
> make the check both cheap and simple.

It also makes it so much of a corner case that even a cheap check could be
a net performance degradation, especially for people whose usage pattern
doesn't match this.
        regards, tom lane



Re: Physical append-only tables

From
Magnus Hagander
Date:
On Mon, Nov 14, 2016 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
> What I'm talking about is something that would be a lot simpler than
> auto-clustering. I'm not even talking about trying to detect if the row was
> about to go into the right place -- this might be expensive and certainly
> more complicated. I'm only talking about a simple case where we *never* put
> anything anywhere other than at the end of the table, period. That should
> make the check both cheap and simple.

It also makes it so much of a corner case that even a cheap check could be
a net performance degradation, especially for people whose usage pattern
doesn't match this.


I agree that it definitely solves just one problem. But it seems to be a fairly common problem, particularly for users of BRIN users.

Full auto-clustering would cover many more usecases, but would also be a lot more expensive to maintain.

--

Re: Physical append-only tables

From
Greg Stark
Date:
On Sun, Nov 13, 2016 at 3:45 PM, Magnus Hagander <magnus@hagander.net> wrote:
> For a scenario like this, would it make sense to have an option that could
> be set on an individual table making it physical append only? Basically
> VACUUM would run as normal and clean up the old space when rows are deleted
> back in history, but when new space is needed for a row the system would
> never look at the old blocks, and only append to the end.

I don't think "appending" is the right way to think about this. It
happens to address the problem but only accidentally and only
partially. More generally what you have is two different kinds of data
with two different access patterns and storage requirements in the
same table. They're logically similar but have different practical
requirements.

If there was some way to teach the database that your table is made of
two different types of data and how to distinguish the two types then
when the update occurs it could move the row to the right section of
storage... This might be something the new partitioning could handle
or it might need something more low-level and implicit.

That said, I don't think the "maintain clustering a bit better using
BRIN" is a bad idea. It's just the bit about turning a table
append-only to deal with update-once data that I think is overreach.

-- 
greg



Re: Physical append-only tables

From
Magnus Hagander
Date:
On Mon, Nov 14, 2016 at 9:43 PM, Greg Stark <stark@mit.edu> wrote:
On Sun, Nov 13, 2016 at 3:45 PM, Magnus Hagander <magnus@hagander.net> wrote:
> For a scenario like this, would it make sense to have an option that could
> be set on an individual table making it physical append only? Basically
> VACUUM would run as normal and clean up the old space when rows are deleted
> back in history, but when new space is needed for a row the system would
> never look at the old blocks, and only append to the end.

I don't think "appending" is the right way to think about this. It
happens to address the problem but only accidentally and only
partially. More generally what you have is two different kinds of data
with two different access patterns and storage requirements in the
same table. They're logically similar but have different practical
requirements.

If there was some way to teach the database that your table is made of
two different types of data and how to distinguish the two types then
when the update occurs it could move the row to the right section of
storage... This might be something the new partitioning could handle
or it might need something more low-level and implicit.

Agreed, though in the cases I've looked at this has not been a static thing. Some of it might be driven off dynamic data somewhere else, some of it may be "this data has to be deleted for regulatory reasons". That can show up on a case-by-case basis. I don't think the partitioning can really be *that* flexible, though it might be able to pick up some of the issues.

The problem with the partitioning is also that it only works if you can ensure the partitioning key is in every query, which often doesn't work out against this.

 
That said, I don't think the "maintain clustering a bit better using
BRIN" is a bad idea. It's just the bit about turning a table
append-only to deal with update-once data that I think is overreach.

In the use-cases I've had it's really the DELETE that's the problem. In all those cases UPDATEs only happen on fairly recent data, so it doesn't really screw with the BRIN. It's DELETEs of old data that's the big issue.
 
--

Re: Physical append-only tables

From
Bruce Momjian
Date:
On Mon, Nov 14, 2016 at 08:43:12PM +0000, Greg Stark wrote:
> That said, I don't think the "maintain clustering a bit better using
> BRIN" is a bad idea. It's just the bit about turning a table
> append-only to deal with update-once data that I think is overreach.

What if we used BRIN to find heap pages where the new row was in the
page BRIN min/max range, and the heap page had free space.  Only if that
fails do we put is somewhere else in the heap.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: Physical append-only tables

From
Magnus Hagander
Date:
On Thu, Nov 24, 2016 at 2:26 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Nov 14, 2016 at 08:43:12PM +0000, Greg Stark wrote:
> That said, I don't think the "maintain clustering a bit better using
> BRIN" is a bad idea. It's just the bit about turning a table
> append-only to deal with update-once data that I think is overreach.

What if we used BRIN to find heap pages where the new row was in the
page BRIN min/max range, and the heap page had free space.  Only if that
fails do we put is somewhere else in the heap.

That would certainly be useful. You'd have to figure out what to do in the case of multiple conflicting BRIN indexes (which you shouldn't have in the first place, but that won't keep people from having them), but other than that it would be quite good I think. 


--

Re: Physical append-only tables

From
Bruce Momjian
Date:
On Thu, Nov 24, 2016 at 10:13:30AM +0100, Magnus Hagander wrote:
> On Thu, Nov 24, 2016 at 2:26 AM, Bruce Momjian <bruce@momjian.us> wrote:
> 
>     On Mon, Nov 14, 2016 at 08:43:12PM +0000, Greg Stark wrote:
>     > That said, I don't think the "maintain clustering a bit better using
>     > BRIN" is a bad idea. It's just the bit about turning a table
>     > append-only to deal with update-once data that I think is overreach.
> 
>     What if we used BRIN to find heap pages where the new row was in the
>     page BRIN min/max range, and the heap page had free space.  Only if that
>     fails do we put is somewhere else in the heap.
> 
> 
> That would certainly be useful. You'd have to figure out what to do in the case
> of multiple conflicting BRIN indexes (which you shouldn't have in the first
> place, but that won't keep people from having them), but other than that it
> would be quite good I think. 

This idea is only possible because the BRIN index is so small and easy
to scan, i.e. this wouldn't work for a btree index.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: Physical append-only tables

From
Jim Nasby
Date:
On 11/24/16 8:18 AM, Bruce Momjian wrote:
>>     What if we used BRIN to find heap pages where the new row was in the
>>     page BRIN min/max range, and the heap page had free space.  Only if that
>>     fails do we put is somewhere else in the heap.
>>
>>
>> That would certainly be useful. You'd have to figure out what to do in the case
>> of multiple conflicting BRIN indexes (which you shouldn't have in the first
>> place, but that won't keep people from having them), but other than that it
>> would be quite good I think.
> This idea is only possible because the BRIN index is so small and easy
> to scan, i.e. this wouldn't work for a btree index.

ISTM a prerequisite for any of this is a way to override the default FSM 
behavior. A simple strategy that forces append-only would presumably be 
very cheap and easy to do after that. It could also be used to allow 
better clustering. It would also make it far easier to recover from a 
heavily bloated table that's too large to simply VACUUM FULL or CLUSTER, 
without resorting to the contortions that pg_repack/pg_reorg have to.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: Physical append-only tables

From
Masahiko Sawada
Date:
On Mon, Nov 28, 2016 at 9:05 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 11/24/16 8:18 AM, Bruce Momjian wrote:
>>>
>>>     What if we used BRIN to find heap pages where the new row was in the
>>>     page BRIN min/max range, and the heap page had free space.  Only if
>>> that
>>>     fails do we put is somewhere else in the heap.
>>>
>>>
>>> That would certainly be useful. You'd have to figure out what to do in
>>> the case
>>> of multiple conflicting BRIN indexes (which you shouldn't have in the
>>> first
>>> place, but that won't keep people from having them), but other than that
>>> it
>>> would be quite good I think.
>>
>> This idea is only possible because the BRIN index is so small and easy
>> to scan, i.e. this wouldn't work for a btree index.
>
>
> ISTM a prerequisite for any of this is a way to override the default FSM
> behavior. A simple strategy that forces append-only would presumably be very
> cheap and easy to do after that. It could also be used to allow better
> clustering. It would also make it far easier to recover from a heavily
> bloated table that's too large to simply VACUUM FULL or CLUSTER, without
> resorting to the contortions that pg_repack/pg_reorg have to.

Since building BRIN index doesn't take a long time, it would be good
enough to support the online-clustering (or clustering with minimal
lock like pg_repack/pg_reorg does) in most cases. And I'm not sure
that there are a lot of users who build only BRIN index on the table.
I think that many users want to build other indexes on same table
other columns.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center