Thread: Simple improvements to freespace allocation
Current freespace code gives a new block insert target (NBT) from
anywhere in the table. That isn't very useful with bigger tables and it
would be useful to be able to specify different algorithms for
producing NBTs.

ALTER TABLE foo WITH (freespace = XXXX);

Three simple and useful models come to mind

* CONCURRENT
This is the standard/current model. Naming it like this emphasises
why we pick NBTs in the way we do.

* PACK
We want the table to be smaller, so rather than run a VACUUM FULL we
want to force the table to choose an NBT at the start of the table,
even at the expense of concurrency. By avoiding putting new data at
the top of the table we allow the possibility that VACUUM will shrink
table size. This is the same as current except we always reset the FSM
pointer to zero and re-seek from there. This takes some time to have
an effect, but is much less invasive than VACUUM FULL.

* RECENT
For large tables with an append-mostly use case it would be better to
prefer NBTs from the last two 1GB segments of a table, allowing them
to be more easily cached. This is the same as current except when we
wrap we don't go to block 0, we go to the first block of the
penultimate (max - 1) segment. For tables <= 2 segments this is no
change from the existing algorithm. For larger tables it would focus
updates/inserts into a much reduced and yet still large area and allow
better caching.

These are small patches.

...other possibilities, though more complex, are...

* IN-MEMORY
A large table may only have some of its blocks in memory. It would be
useful to force an NBT to be a block already in shared_buffers IFF a
table is above a certain size (use the same threshold as seq scans,
i.e. 25% of shared_buffers). That may be difficult to achieve in
practice, so I'm not sure about this one. Like it? Any ideas?

We might also allow a custom NBT policy, though allowing user code at
that point could be dangerous.

Thoughts?

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
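[Editorial note: the pointer-reset behaviour behind the three proposed
policies can be sketched as below. This is a hypothetical illustration,
not PostgreSQL source; the enum, the function name, and treating a
segment as 1GB / 8kB = 131072 blocks are assumptions for the sketch.]

```c
#include <assert.h>

/* Illustrative sketch: which block the FSM search could restart from
 * when it wraps, under the three proposed policies. */
#define BLOCKS_PER_SEGMENT 131072u   /* 1GB segment / 8kB blocks */

typedef enum { FS_CONCURRENT, FS_PACK, FS_RECENT } FreespacePolicy;

/* Block number the FSM search wraps back to, for a table of nblocks. */
unsigned wrap_start_block(FreespacePolicy policy, unsigned nblocks)
{
    unsigned nsegments =
        (nblocks + BLOCKS_PER_SEGMENT - 1) / BLOCKS_PER_SEGMENT;

    switch (policy)
    {
        case FS_PACK:
            /* always re-seek from the start of the table */
            return 0;
        case FS_RECENT:
            /* first block of the penultimate segment; for tables of
             * <= 2 segments this is block 0, same as today */
            if (nsegments <= 2)
                return 0;
            return (nsegments - 2) * BLOCKS_PER_SEGMENT;
        case FS_CONCURRENT:
        default:
            /* current behaviour: wrap to block 0 */
            return 0;
    }
}
```

For a table of, say, eight segments, RECENT would wrap to the first
block of segment seven rather than block 0, keeping the working set in
the last 2GB of the file.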
On 01/08/2014 08:56 AM, Simon Riggs wrote:
> Current freespace code gives a new block insert target (NBT) from
> anywhere in the table. That isn't very useful with bigger tables and
> it would be useful to be able to specify different algorithms for
> producing NBTs.

I've actually been surprised how little demand there has been for
alternative algorithms. When I wrote the current FSM implementation, I
expected people to start coming up with all kinds of wishes, but it
didn't happen. There have been very few complaints; everyone seems to
be satisfied with the way it works now. So I'm not convinced there's
much need for this.

> ALTER TABLE foo WITH (freespace = XXXX);
>
> Three simple and useful models come to mind
>
> * CONCURRENT
> This is the standard/current model. Naming it like this emphasises
> why we pick NBTs in the way we do.
>
> * PACK
> We want the table to be smaller, so rather than run a VACUUM FULL we
> want to force the table to choose an NBT at the start of the table,
> even at the expense of concurrency. By avoiding putting new data at
> the top of the table we allow the possibility that VACUUM will shrink
> table size. This is the same as current except we always reset the
> FSM pointer to zero and re-seek from there. This takes some time to
> have an effect, but is much less invasive than VACUUM FULL.

We already reset the FSM pointer to zero on vacuum. Would the above
actually make any difference in practice?

> * RECENT
> For large tables with an append-mostly use case it would be better to
> prefer NBTs from the last two 1GB segments of a table, allowing them
> to be more easily cached. This is the same as current except when we
> wrap we don't go to block 0, we go to the first block of the
> penultimate (max - 1) segment. For tables <= 2 segments this is no
> change from the existing algorithm. For larger tables it would focus
> updates/inserts into a much reduced and yet still large area and
> allow better caching.

Umm, wouldn't that bloat the table with no limit? Putting my
DBA/developer hat on, I don't understand when I would want to use that
setting.

> These are small patches.
>
> ...other possibilities, though more complex, are...
>
> * IN-MEMORY
> A large table may only have some of its blocks in memory. It would be
> useful to force an NBT to be a block already in shared_buffers IFF a
> table is above a certain size (use the same threshold as seq scans,
> i.e. 25% of shared_buffers). That may be difficult to achieve in
> practice, so I'm not sure about this one. Like it? Any ideas?

Yeah, that seems nice, although I have a feeling that it's not worth
the complexity.

There's one policy that I'd like to see: maintaining cluster order.
When inserting a new tuple, try to place it close to other tuples with
similar keys, to keep the table clustered. In practice, CLUSTER
CONCURRENTLY might be more useful, though.

> We might also allow a custom NBT policy, though allowing user code at
> that point could be dangerous.

Yeah, I don't think there's much need for that.

Overall,

- Heikki
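[Editorial note: the cluster-order placement Heikki describes could
work along these lines. This is a speculative sketch, not an existing
PostgreSQL mechanism; the per-block max-key map and the function name
are invented for illustration.]

```c
#include <assert.h>

/* Speculative sketch of keyed placement: assume we keep a coarse,
 * sorted map of the highest clustering key seen on each block, and
 * binary-search it for the first block that should hold the new key. */
unsigned cluster_target_block(const int *block_max_key,
                              unsigned nblocks, int key)
{
    unsigned lo = 0, hi = nblocks;

    while (lo < hi)
    {
        unsigned mid = lo + (hi - lo) / 2;
        if (block_max_key[mid] < key)
            lo = mid + 1;       /* key sorts after this block */
        else
            hi = mid;           /* this block (or earlier) covers it */
    }
    /* if the key is beyond every block's max, target the last block */
    return lo < nblocks ? lo : nblocks - 1;
}
```

The FSM would then be consulted for free space at or near the returned
block rather than at the global search pointer.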
On 8 January 2014 07:43, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 01/08/2014 08:56 AM, Simon Riggs wrote:
>>
>> Current freespace code gives a new block insert target (NBT) from
>> anywhere in the table. That isn't very useful with bigger tables and
>> it would be useful to be able to specify different algorithms for
>> producing NBTs.
>
> I've actually been surprised how little demand there has been for
> alternative algorithms. When I wrote the current FSM implementation,
> I expected people to start coming up with all kinds of wishes, but it
> didn't happen. There have been very few complaints; everyone seems to
> be satisfied with the way it works now. So I'm not convinced there's
> much need for this.

That would require someone to conduct detailed analysis on problems,
which few people are capable of doing at a level that we would accept.
No doubt I will soon be challenged to prove beyond doubt that anything
here is required, which becomes chicken and egg.

For the vast majority of cases, the general approach we have works
well enough - this area has had lots of very useful attention over the
years. The problem is tables have multiple use cases and we support
only one, with no easy way for people to experiment with alternatives
in production.

It's been on my list for years... but it's not been a top priority,
for sure.

>> ALTER TABLE foo WITH (freespace = XXXX);
>>
>> Three simple and useful models come to mind
>>
>> * CONCURRENT
>> This is the standard/current model. Naming it like this emphasises
>> why we pick NBTs in the way we do.
>>
>> * PACK
>> We want the table to be smaller, so rather than run a VACUUM FULL we
>> want to force the table to choose an NBT at the start of the table,
>> even at the expense of concurrency. By avoiding putting new data at
>> the top of the table we allow the possibility that VACUUM will
>> shrink table size. This is the same as current except we always
>> reset the FSM pointer to zero and re-seek from there. This takes
>> some time to have an effect, but is much less invasive than VACUUM
>> FULL.
>
> We already reset the FSM pointer to zero on vacuum. Would the above
> actually make any difference in practice?

The Pack algo would emphasise tight packing over assigning concurrent
blocks. It would be useful if that also included not doing HOT updates
in favour of migrating rows to an earlier block in the table. The
emphasis is on avoiding VACUUM FULL in certain cases, not on general
use.

>> * RECENT
>> For large tables with an append-mostly use case it would be better
>> to prefer NBTs from the last two 1GB segments of a table, allowing
>> them to be more easily cached. This is the same as current except
>> when we wrap we don't go to block 0, we go to the first block of the
>> penultimate (max - 1) segment. For tables <= 2 segments this is no
>> change from the existing algorithm. For larger tables it would focus
>> updates/inserts into a much reduced and yet still large area and
>> allow better caching.
>
> Umm, wouldn't that bloat the table with no limit? Putting my
> DBA/developer hat on, I don't understand when I would want to use
> that setting.

That would depend on the use case; no algo suggested works for all or
even general cases.

If you have a large table, allocating freespace in blocks unlikely to
be accessed by queries means we introduce additional cache pressure.
If we allocate NBTs from newer blocks we will more likely find them in
cache. Allowing older data to become write-seldom allows us to
consider things like compressing particular segments or moving them
onto cheaper storage.

(Another suggestion might be to use the VM so that we tend to add data
to already-dirty blocks.)

> There's one policy that I'd like to see: maintaining cluster order.
> When inserting a new tuple, try to place it close to other tuples
> with similar keys, to keep the table clustered.

Agreed. Do you have a particular algorithm in mind? I can think of a
few.
> In practice, CLUSTER CONCURRENTLY might be more useful, though.

I think we want both wholesale and retail. Certainly in the absence of
the former, the latter seems a good addition.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 01/08/2014 08:56 AM, Simon Riggs wrote:
>> * IN-MEMORY
>> A large table may only have some of its blocks in memory. It would
>> be useful to force an NBT to be a block already in shared_buffers
>> IFF a table is above a certain size (use the same threshold as seq
>> scans, i.e. 25% of shared_buffers). That may be difficult to achieve
>> in practice, so I'm not sure about this one. Like it? Any ideas?

> Yeah, that seems nice, although I have a feeling that it's not worth
> the complexity.

Not only would that be rather expensive to do, but I think it would be
self-defeating. Pages that are in memory would be particularly likely
to have been modified by someone else recently, so that the FSM's info
about their available space is stale, and thus once you actually got
to the page it'd be more likely not to have the space you need. The
existing FSM algorithm is intentionally designed to hand out pages
that nobody else has tried to insert into lately, with one goal being
to minimize the number of retries needed because of stale info. (Or
at least, it worked that way originally, and I don't think Heikki's
rewrite changed that aspect.)

I'm concerned that the alternatives Simon proposes would lead to more
processes ganging up on the same pages, with not only a direct cost in
concurrency but an indirect cost in repeated FSM searches due to
believing stale available-space data. Indeed, a naive implementation
could easily get into infinite loops of handing back the same page.
I realize that the point is exactly to sacrifice some insertion
performance in hopes of getting better table packing, but it's not
clear to me that there's an easy fix that makes packing better without
a very large hit on the other side.

Anyway, these fears could be proven or disproven with some
benchmarking of a trial patch, and I have no objection if Simon wants
to do that experimentation.
But I'd be hesitant to see such a feature committed in advance of
experimental proof that it's actually useful.

			regards, tom lane
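[Editorial note: Tom's point about stale FSM data and infinite loops
can be seen in a toy retry loop. This is purely illustrative C, not
PostgreSQL code; the array of actual free space and all names are
invented.]

```c
#include <assert.h>

/* If the FSM's free-space info is stale, an insert must retry with a
 * different page; a policy that hands back the same page would loop
 * forever, so the candidate must always advance. */
unsigned insert_with_retry(const int *actual_free, unsigned nblocks,
                           int need, int max_tries)
{
    unsigned blk = 0;                  /* first candidate from the FSM */
    for (int i = 0; i < max_tries; i++)
    {
        if (actual_free[blk] >= need)  /* the page really had room */
            return blk;
        blk = (blk + 1) % nblocks;     /* stale info: try another page */
    }
    return nblocks;                    /* gave up after max_tries */
}
```

The packing-oriented policies concentrate candidates at the front of
the table, which is exactly where the free-space info is most likely
to be stale under concurrent inserts.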
On 1/8/14, 1:43 AM, Heikki Linnakangas wrote:

I've wanted the cluster case for a long time. I also see the use for
the RECENT scenario, especially if we had CLUSTER CONCURRENT that let
you shrink the head of the table as needed.

I suspect the in-memory case would only be useful if it could look
into the OS cache as well, at least until we can recommend you give
Postgres 90% of memory instead of 25%. Even then, I'm not sure how
useful it would ultimately be...

>> * PACK
>> We want the table to be smaller, so rather than run a VACUUM FULL we
>> want to force the table to choose an NBT at the start of the table,
>> even at the expense of concurrency. By avoiding putting new data at
>> the top of the table we allow the possibility that VACUUM will
>> shrink table size. This is the same as current except we always
>> reset the FSM pointer to zero and re-seek from there. This takes
>> some time to have an effect, but is much less invasive than VACUUM
>> FULL.
>
> We already reset the FSM pointer to zero on vacuum. Would the above
> actually make any difference in practice?

What if your first request is for a large chunk of free space? You
could skip a lot of blocks, even if the FSM is bucketized.

But there's probably a more important point to this one: for you to
have any chance of packing you MUST get everything out of the tail of
the table. Resetting to zero on every request is one possible way to
do that, though it might be better to do something like reset only
once the pointer goes past block X. The other thing you'd want is a
way to force tuples off the last X pages. Due to a lack of ctid
operators that was already hard, and HOT makes that even harder.
(BTW, related to this you'd ideally want HOT to continue to operate on
the front of the table, but not the back.)

All that said, I've definitely wanted the ability to shrink tables in
the past, though TBH I've wanted that more for indexes.
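[Editorial note: the "reset only once the pointer goes past block X"
variant amounts to something like the sketch below. Hypothetical code;
the function name and the reset rule are assumptions.]

```c
#include <assert.h>

/* Advance the FSM search pointer, but instead of resetting to zero on
 * every request, reset only once it has moved past threshold block X. */
unsigned advance_fsm_pointer(unsigned pointer, unsigned reset_threshold)
{
    pointer++;                      /* next candidate block */
    if (pointer > reset_threshold)  /* past block X: repack the front */
        pointer = 0;
    return pointer;
}
```

This keeps the concurrency benefit of a moving pointer within the
front of the table while still starving the tail of new insertions.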
Ultimately, what I really want on this front is:

PACK TABLE blah BACKGROUND;
CLUSTER TABLE blah BACKGROUND;
REINDEX INDEX blah BACKGROUND;

where BACKGROUND would respect a throttle setting. (While I'm
dreaming, it'd be nice to have DATABASE/TABLESPACE/SCHEMA alternate
specifications too...)

-- 
Jim C. Nasby, Data Architect       jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
On 9 January 2014 01:38, Jim Nasby <jim@nasby.net> wrote:
> On 1/8/14, 1:43 AM, Heikki Linnakangas wrote:
>
> I've wanted the cluster case for a long time. I also see the use for
> the RECENT scenario, especially if we had CLUSTER CONCURRENT that let
> you shrink the head of the table as needed.
>
> But there's probably a more important point to this one: for you to
> have any chance of packing you MUST get everything out of the tail of
> the table. Resetting to zero on every request is one possible way to
> do that, though it might be better to do something like reset only
> once the pointer goes past block X. The other thing you'd want is a
> way to force tuples off the last X pages. Due to a lack of ctid
> operators that was already hard, and HOT makes that even harder

Agreed

> (BTW, related to this you'd ideally want HOT to continue to operate
> on the front of the table, but not the back.)

That's a good idea.

> All that said, I've definitely wanted the ability to shrink tables in
> the past, though TBH I've wanted that more for indexes.
>
> Ultimately, what I really want on this front is:
>
> PACK TABLE blah BACKGROUND;
> CLUSTER TABLE blah BACKGROUND;
> REINDEX INDEX blah BACKGROUND;
>
> where BACKGROUND would respect a throttle setting. (While I'm
> dreaming, it'd be nice to have DATABASE/TABLESPACE/SCHEMA alternate
> specifications too...)

I like the idea of declarative deprioritisation.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services