Thread: Re: Lowering the ever-growing heap->pd_lower

Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Wed, Mar 10, 2021 at 6:01 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> > The case I was concerned about back when is that there are various bits of
> > code that may visit a page with a predetermined TID in mind to look at.
> > An index lookup is an obvious example, and another one is chasing an
> > update chain's t_ctid link.  You might argue that if the tuple was dead
> > enough to be removed, there should be no such in-flight references to
> > worry about, but I think such an assumption is unsafe.  There is not, for
> > example, any interlock that ensures that a process that has obtained a TID
> > from an index will have completed its heap visit before a VACUUM that
> > subsequently removed that index entry also removes the heap tuple.
>
> I am aware of this problem. I will admit that I did not detect all
> potentially problematic accesses, so I'll show you my work.

Attached is a trivial rebase of your v3, which I've called v4. I am
interested in getting this patch into Postgres 14.

> In my search for problematic accesses, I make the following assumptions:
> * PageRepairFragmentation as found in bufpage is only applicable to
> heap pages; this function is not applied to other pages in core
> postgres. So, any problems that occur are due to access with an
> OffsetNumber > PageGetMaxOffsetNumber.
> * Items [on heap pages] are only extracted after using PageGetItemId
> for that item on the page, whilst holding a lock.

> The 3 problem cases were classified based on the origin of the
> potentially invalid pointer.
>
> Problem case 1: table scan; heapam.c lines 678 and 811, in heapgettup

I think that it boils down to this: 100% of the cases where this could
be a problem involve either old TIDs or old line pointers -- in
principle these could be invalidated in some way, like your backwards
scan example. But that's it. Bear in mind that we always call
PageRepairFragmentation() with a super-exclusive lock.

> This is in the replay of transaction logs, in heap_xlog_freeze_page.
> As I am unaware whether or not pages to which these transaction logs
> are applied can contain changes from the xlog-generating instance, I
> flagged this as a potential problem. The application of the xlogs is
> probably safe (it assumes the existence of a HeapTupleHeader for that
> ItemId), but my limited knowledge put this on the 'potential problem'
> list.
>
> Please advise on this; I cannot find the right documentation

Are you aware of wal_consistency_checking?

I think that this should be fine. There are differences that are
possible between a replica and primary, but they're very minor and
don't seem relevant.

> Problem case 3: invalid redirect pointers; pruneheap.c line 949, in
> heap_get_root_tuples
>
> The heap pruning mechanism currently assumes that all redirects are
> valid. Currently, if a HOT root points to !ItemIdIsNormal, it breaks
> out of the loop, but doesn't actually fail. This code is already
> potentially problematic because it has no bounds or sanity checking
> for the nextoffnum, but with shrinking pd_linp it would also add the
> failure mode of HOT tuples pointing into what is now arbitrary data.

heap_prune_chain() is less trusting than heap_get_root_tuples(),
though -- it doesn't trust redirects (because there is a generic
offnum sanity check at the start of its loop). I think that the
inconsistency between these two functions probably isn't hugely
significant.

Ideally it would be 100% clear which of the defenses in code like this
is merely extra hardening. The assumptions should be formalized. There
is nothing wrong with hardening, but we should know it when we see it.

> This failure mode is now accompanied by an Assert, which fails when
> the redirect is to an invalid OffsetNumber. This is not enough to
> prevent arbitrary behaviour when accessing corrupted data with
> non-assert builds, but I think that that is fine; we already do not
> have a guaranteed behaviour for this failure mode.

What about my "set would-be LP_UNUSED space to all-0x7F bytes in
CLOBBER_FREED_MEMORY builds" idea? Did you think about that?

--
Peter Geoghegan

Attachment

Re: Lowering the ever-growing heap->pd_lower

From
Matthias van de Meent
Date:
On Wed, 31 Mar 2021 at 05:35, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Mar 10, 2021 at 6:01 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > > The case I was concerned about back when is that there are various bits of
> > > code that may visit a page with a predetermined TID in mind to look at.
> > > An index lookup is an obvious example, and another one is chasing an
> > > update chain's t_ctid link.  You might argue that if the tuple was dead
> > > enough to be removed, there should be no such in-flight references to
> > > worry about, but I think such an assumption is unsafe.  There is not, for
> > > example, any interlock that ensures that a process that has obtained a TID
> > > from an index will have completed its heap visit before a VACUUM that
> > > subsequently removed that index entry also removes the heap tuple.
> >
> > I am aware of this problem. I will admit that I did not detect all
> > potentially problematic accesses, so I'll show you my work.
>
> Attached is a trivial rebase of your v3, which I've called v4. I am
> interested in getting this patch into Postgres 14.

Thanks, and that'd be great! PFA v5, which fixes one issue mentioned
later, and adds some comments on existing checks that are now in a
critical path.

> > In my search for problematic accesses, I make the following assumptions:
> > * PageRepairFragmentation as found in bufpage is only applicable to
> > heap pages; this function is not applied to other pages in core
> > postgres. So, any problems that occur are due to access with an
> > OffsetNumber > PageGetMaxOffsetNumber.
> > * Items [on heap pages] are only extracted after using PageGetItemId
> > for that item on the page, whilst holding a lock.
>
> > The 3 problem cases were classified based on the origin of the
> > potentially invalid pointer.
> >
> > Problem case 1: table scan; heapam.c lines 678 and 811, in heapgettup
>
> I think that it boils down to this: 100% of the cases where this could
> be a problem involve either old TIDs or old line pointers -- in
> principle these could be invalidated in some way, like your backwards
> scan example. But that's it. Bear in mind that we always call
> PageRepairFragmentation() with a super-exclusive lock.

Yeah, that's the gist of what I found out. All accesses using old line
pointers need revalidation, and there were some cases in which this
was not yet done correctly.

> > This is in the replay of transaction logs, in heap_xlog_freeze_page.
> > As I am unaware whether or not pages to which these transaction logs
> > are applied can contain changes from the xlog-generating instance, I
> > flagged this as a potential problem. The application of the xlogs is
> > probably safe (it assumes the existence of a HeapTupleHeader for that
> > ItemId), but my limited knowledge put this on the 'potential problem'
> > list.
> >
> > Please advise on this; I cannot find the right documentation
>
> Are you aware of wal_consistency_checking?

I was vaguely aware that an option with that name exists, but that was
about the extent of it. Thanks for pointing me in that direction.

> I think that this should be fine. There are differences that are
> possible between a replica and primary, but they're very minor and
> don't seem relevant.

OK, then I'll assume that WAL replay shouldn't cause problems here.

> > Problem case 3: invalid redirect pointers; pruneheap.c line 949, in
> > heap_get_root_tuples
> >
> > The heap pruning mechanism currently assumes that all redirects are
> > valid. Currently, if a HOT root points to !ItemIdIsNormal, it breaks
> > out of the loop, but doesn't actually fail. This code is already
> > potentially problematic because it has no bounds or sanity checking
> > for the nextoffnum, but with shrinking pd_linp it would also add the
> > failure mode of HOT tuples pointing into what is now arbitrary data.
>
> heap_prune_chain() is less trusting than heap_get_root_tuples(),
> though -- it doesn't trust redirects (because there is a generic
> offnum sanity check at the start of its loop). I think that the
> inconsistency between these two functions probably isn't hugely
> significant.
>
> Ideally it would be 100% clear which of the defenses in code like this
> is merely extra hardening. The assumptions should be formalized. There
> is nothing wrong with hardening, but we should know it when we see it.

I realized one of my Assert()s was incorrectly failing on an actually
valid page state, so I've updated and documented that case.

> > This failure mode is now accompanied by an Assert, which fails when
> > the redirect is to an invalid OffsetNumber. This is not enough to
> > prevent arbitrary behaviour when accessing corrupted data with
> > non-assert builds, but I think that that is fine; we already do not
> > have a guaranteed behaviour for this failure mode.
>
> What about my "set would-be LP_UNUSED space to all-0x7F bytes in
> CLOBBER_FREED_MEMORY builds" idea? Did you think about that?

I had implemented it locally, but was waiting for some more feedback
before posting it, and then got busy with other things; it's now
attached.

I've also played around with marking the free space on the page as
undefined for valgrind, but later realized that that would make the
test for defined memory in PageAddItemExtended fail. This is
documented in the commit message of the attached patch 0002.
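
To sketch what I mean (this is only a standalone model of the
clobbering idea, not the attached patch itself; the page layout
constants are simplified stand-ins): after the line pointer array is
truncated, the bytes between the new and the old pd_lower get
overwritten with 0x7F, so any code still holding a stale offset reads
obvious garbage.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLCKSZ 8192
    #define PAGE_HEADER 24          /* stand-in for SizeOfPageHeaderData */
    #define ITEMID 4                /* stand-in for sizeof(ItemIdData) */

    int main(void)
    {
        uint8_t page[BLCKSZ] = {0};

        /* model: the line pointer array shrinks from 10 entries to 3 */
        size_t old_pd_lower = PAGE_HEADER + 10 * ITEMID;
        size_t new_pd_lower = PAGE_HEADER + 3 * ITEMID;

        /*
         * The idea under discussion: in CLOBBER_FREED_MEMORY builds, fill the
         * space vacated by truncated line pointers with 0x7F so that code
         * still holding an old offset reads obvious garbage instead of
         * plausible-looking stale data.
         */
        memset(page + new_pd_lower, 0x7F, old_pd_lower - new_pd_lower);

        printf("clobbered %zu bytes between new and old pd_lower\n",
               old_pd_lower - new_pd_lower);
        return 0;
    }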


With regards,

Matthias van de Meent

Attachment

Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Wed, Mar 31, 2021 at 2:49 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I had implemented it locally, but was waiting for some more feedback
> before posting it, and then got busy with other things; it's now
> attached.
>
> I've also played around with marking the free space on the page as
> undefined for valgrind, but later realized that that would make the
> test for defined memory in PageAddItemExtended fail. This is
> documented in the commit message of the attached patch 0002.

I would like to deal with this work within the scope of the project
we're discussing over on the "New IndexAM API controlling index vacuum
strategies" thread. The latest revision of that patch series includes
a modified version of your patch:

https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com

Please take discussion around this project over to that other thread.
There are a variety of issues that can only really be discussed in
that context.

Note that I've revised the patch so that it runs during VACUUM's
second heap pass only -- not during pruning/defragmentation. This
means that the line pointer array truncation mechanism will only ever
kick in during a VACUUM operation.

--
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
John Naylor
Date:
On Sat, Apr 3, 2021 at 10:07 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I would like to deal with this work within the scope of the project
> we're discussing over on the "New IndexAM API controlling index vacuum
> strategies" thread. The latest revision of that patch series includes
> a modified version of your patch:
>
> https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com
>
> Please take discussion around this project over to that other thread.
> There are a variety of issues that can only really be discussed in
> that context.

Since that work has been committed as of 3c3b8a4b2689, I've marked this CF entry as committed.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Lowering the ever-growing heap->pd_lower

From
Matthias van de Meent
Date:
On Mon, 3 May 2021 at 16:26, John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Sat, Apr 3, 2021 at 10:07 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I would like to deal with this work within the scope of the project
> > we're discussing over on the "New IndexAM API controlling index vacuum
> > strategies" thread. The latest revision of that patch series includes
> > a modified version of your patch:
> >
> > https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com
> >
> > Please take discussion around this project over to that other thread.
> > There are a variety of issues that can only really be discussed in
> > that context.
>
> Since that work has been committed as of 3c3b8a4b2689, I've marked this CF entry as committed.

I disagree that this work has been fully committed. A derivative was
committed that solves part of the problem, but it doesn't cover all
problem cases. I believe that I voiced such concern in the other
thread as well. As such, I am planning on fixing this patch sometime
before the next commit fest so that we can truncate the LP array
during HOT pruning as well, instead of only doing so in the 2nd VACUUM
pass. This is especially relevant on pages where HOT is highly
effective, but vacuum can't keep up and many unused (former HOT) line
pointers now exist on the page.

With regards,

Matthias van de Meent



Re: Lowering the ever-growing heap->pd_lower

From
Matthias van de Meent
Date:
On Mon, 3 May 2021 at 16:39, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I am planning on fixing this patch sometime
> before the next commit fest so that we can truncate the LP array
> during hot pruning as well, instead of only doing so in the 2nd VACUUM
> pass.

PFA the updated version of this patch. Apart from adding line pointer
truncation in PageRepairFragmentation (as in the earlier patches), I
also altered PageTruncateLinePointerArray to clean up all trailing
line pointers, even if it was the last item on the page.

This means that for 32-bit systems, pages that have once had tuples
(but have been cleared since) can now be used again for
MaxHeapTupleSize insertions. Without this patch, an emptied page would
always have at least one line pointer left, which still leaves
MaxHeapTupleSize of actual free space, but PageGetFreeSpace always
subtracts sizeof(ItemIdData), so the free space reported to the FSM
stays below MaxHeapTupleSize whenever the page has any line pointers.

For 64-bit systems, this is not as much of a problem, because
MaxHeapTupleSize is 4 bytes smaller on those systems, which leaves us
with 1 line pointer as margin for the FSM to recognise the page as
free enough for one MaxHeapTupleSize item.
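
To make that arithmetic concrete, here is a standalone sketch (the
constants are assumptions for a default build: BLCKSZ = 8192, a
24-byte page header, 4-byte line pointers, and PageGetFreeSpace()
reserving one more ItemIdData for the would-be insertion):

    #include <stdio.h>

    #define BLCKSZ 8192
    #define PAGE_HEADER 24          /* stand-in for SizeOfPageHeaderData */
    #define ITEMID 4                /* stand-in for sizeof(ItemIdData) */

    /* MAXALIGN() for the two maximum-alignment cases discussed above */
    static long max_align(long len, long align)
    {
        return (len + align - 1) & ~(align - 1);
    }

    int main(void)
    {
        for (long align = 4; align <= 8; align += 4)
        {
            /* MaxHeapTupleSize = BLCKSZ - MAXALIGN(header + one ItemId) */
            long max_tuple = BLCKSZ - max_align(PAGE_HEADER + ITEMID, align);

            /*
             * Emptied page with one leftover line pointer: pd_upper - pd_lower
             * gives 8164 bytes, and PageGetFreeSpace() subtracts one more
             * ItemId for the would-be insertion, reporting 8160.
             */
            long perceived_free = BLCKSZ - (PAGE_HEADER + ITEMID) - ITEMID;

            printf("MAXALIGN=%ld: MaxHeapTupleSize=%ld, reported free with 1 lp=%ld\n",
                   align, max_tuple, perceived_free);
        }
        return 0;
    }

With MAXALIGN = 4 the reported free space (8160) falls short of
MaxHeapTupleSize (8164); with MAXALIGN = 8 the two are equal, which is
the one-line-pointer margin described above.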

With regards,

Matthias van de Meent

Attachment

Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Tue, May 18, 2021 at 12:29 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> PFA the updated version of this patch. Apart from adding line pointer
> truncation in PageRepairFragmentation (as in the earlier patches), I
> also altered PageTruncateLinePointerArray to clean up all trailing
> line pointers, even if it was the last item on the page.

Can you show a practical benefit to this patch, such as an improvement
in throughput or in efficiency for a given workload?

It was easy to see that having something was better than having
nothing at all. But things are of course different now that we have
PageTruncateLinePointerArray().

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Simon Riggs
Date:
On Tue, 18 May 2021 at 20:33, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, May 18, 2021 at 12:29 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > PFA the updated version of this patch. Apart from adding line pointer
> > truncation in PageRepairFragmentation (as in the earlier patches), I
> > also altered PageTruncateLinePointerArray to clean up all trailing
> > line pointers, even if it was the last item on the page.
>
> Can you show a practical benefit to this patch, such as an improvement
> in throughput or in efficiency for a given workload?
>
> It was easy to see that having something was better than having
> nothing at all. But things are of course different now that we have
> PageTruncateLinePointerArray().

There does seem to be utility in Matthias' patch, which currently does
two things:
1. Allow same thing as PageTruncateLinePointerArray() during HOT cleanup
That is going to have a clear benefit for HOT workloads, which by
their nature will use a lot of line pointers.
Many applications are updated much more frequently than they are vacuumed.
Peter - what is your concern about doing this more frequently? Why
would we *not* do this?

2. Reduce number of line pointers to 0 in some cases.
Matthias - I don't think you've made a full case for doing this, nor
looked at the implications.
The comment clearly says "it seems like a good idea to avoid leaving a
PageIsEmpty() page behind".
So I would be inclined to remove that from the patch and consider that
as a separate issue, or close this.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Lowering the ever-growing heap->pd_lower

From
Matthias van de Meent
Date:
On Tue, 3 Aug 2021 at 08:57, Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>
> On Tue, 18 May 2021 at 20:33, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Tue, May 18, 2021 at 12:29 PM Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > > PFA the updated version of this patch. Apart from adding line pointer
> > > truncation in PageRepairFragmentation (as in the earlier patches), I
> > > also altered PageTruncateLinePointerArray to clean up all trailing
> > > line pointers, even if it was the last item on the page.
> >
> > Can you show a practical benefit to this patch, such as an improvement
> > in throughput or in efficiency for a given workload?
> >
> > It was easy to see that having something was better than having
> > nothing at all. But things are of course different now that we have
> > PageTruncateLinePointerArray().
>
> There does seem to be utility in Matthias' patch, which currently does
> two things:
> 1. Allow same thing as PageTruncateLinePointerArray() during HOT cleanup
> That is going to have a clear benefit for HOT workloads, which by
> their nature will use a lot of line pointers.
> Many applications are updated much more frequently than they are vacuumed.
> Peter - what is your concern about doing this more frequently? Why
> would we *not* do this?

One clear reason why we _do_ want this is that the current shrinking
only happens in the second phase of vacuum. Shrinking the LP-array in
heap_page_prune decreases the chance that a tuple which would now fit
on the page (thanks to removed HOT chain items) fails to fit simply
because vacuum hasn't gotten to the page yet, whilst adding only a
little overhead. Additionally, heap_page_prune is also executed when
more free space on the page is required for a new tuple that currently
doesn't fit, and in such cases I think clearing as much space as
possible is useful.

> 2. Reduce number of line pointers to 0 in some cases.
> Matthias - I don't think you've made a full case for doing this, nor
> looked at the implications.

I have looked at the implications (see upthread), and I haven't found
any other than those mentioned below.

> The comment clearly says "it seems like a good idea to avoid leaving a
> PageIsEmpty() page behind".

Do note that that comment is based on (to the best of my knowledge)
unmeasured, but somewhat informed, guesswork ('it seems like a good
idea'), which I also commented on in the thread discussing the patch
that resulted in that commit [0].

If I recall correctly, the decision to keep at least 1 line pointer on
the page was because this feature was to be committed late in the
development cycle of pg14, and as such there would be little time to
check the impact of fully clearing pages. To go forward with the
feature in pg14 at that point, it was safer to not completely empty
pages, so that we'd not be changing the paths we were hitting during
e.g. vacuum too significantly, reducing the chances of significant
bugs that would require the patch to be reverted [1].


I agreed at that point that that was a safer bet, but right now it's
early in the pg15 development cycle, and I've had the time to get more
experience around the vacuum and line pointer machinery. That being
the case, I consider this a revisit of the topic 'is it OK to
truncate the LP-array to 0', where previously the answer was 'we don't
know, and it's late in the release cycle'; after looking through the
code base now, I argue that the answer is Yes.

One more point for going to 0 is that for 32-bit systems, a single
line pointer is enough to block a page from being 'empty' enough to
fit a MaxHeapTupleSize-sized tuple (when requesting pages through the
FSM).

Additionally, there are some other optimizations we can only apply to
empty pages:

- vacuum (with disable_page_skipping = on) will process these empty
pages faster, as it won't need to do any pruning on that page. With
page skipping enabled this won't matter because empty pages are
all_visible and therefore vacuum won't access that page.
- the pgstattuple contrib extension processes empty pages (slightly)
faster in pgstattuple_approx
- various loops won't need to check that the remaining item is unused,
saving some cycles when the page is accessed.

and further future optimizations might include

- Full-page WAL logging of empty pages produced in the checkpointer
could potentially be optimized to only log 'it's an empty page'
instead of writing out the full 8kb page, which would help in reducing
WAL volume. Previously this optimization would never be hit on
heapam-pages because pages could not become empty again, but right now
this has real potential for applying an optimization.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/CAEze2Wh-nXjkp0bLN_vQwgHttC8CRH%3D1ewcrWk%2B7RX5B93YQPQ%40mail.gmail.com
[1] https://www.postgresql.org/message-id/CAH2-WznCxtWL4B995y2KJWj-%2BjrjahH4n6gD2R74SyQJo6Y63w%40mail.gmail.com



Re: Lowering the ever-growing heap->pd_lower

From
Simon Riggs
Date:
On Tue, 3 Aug 2021 at 17:15, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

> and further future optimizations might include
>
> - Full-page WAL logging of empty pages produced in the checkpointer
> could potentially be optimized to only log 'it's an empty page'
> instead of writing out the full 8kb page, which would help in reducing
> WAL volume. Previously this optimization would never be hit on
> heapam-pages because pages could not become empty again, but right now
> this has real potential for applying an optimization.

So what you are saying is your small change will cause multiple
additional FPIs in WAL. I don't call that a future optimization, I
call that essential additional work.

+1 on committing the first part of the patch, -1 on the second. I
suggest you split the patch and investigate the second part further.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Lowering the ever-growing heap->pd_lower

From
Matthias van de Meent
Date:
On Tue, 3 Aug 2021 at 20:37, Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>
> On Tue, 3 Aug 2021 at 17:15, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>
> > and further future optimizations might include
> >
> > - Full-page WAL logging of empty pages produced in the checkpointer
> > could potentially be optimized to only log 'it's an empty page'
> > instead of writing out the full 8kb page, which would help in reducing
> > WAL volume. Previously this optimization would never be hit on
> > heapam-pages because pages could not become empty again, but right now
> > this has real potential for applying an optimization.
>
> So what you are saying is your small change will cause multiple
> additional FPIs in WAL. I don't call that a future optimization, I
> call that essential additional work.

I think you misunderstood my statement. The patch does not change more
pages than before. The patch only ensures that empty heapam pages are
truly empty according to the relevant PageIsEmpty() macro, which
hypothetically allows for optimizations in the checkpointer process
that currently (as in, since its inception) writes all changed pages
as full page writes (unless turned off).

This change makes it easier and more worthwhile to implement a further
optimization for the checkpointer and/or buffer manager to determine
that 1.) this page is now empty, and that 2.) we can therefore write a
specialized WAL record specifically tuned for empty pages instead of
FPI records. No additional pages are changed, because each time the
line pointer array is shrunk, we've already either marked dead tuples
as unused (2nd phase vacuum) or removed HOT line pointers / truncated
dead tuples to lp_dead line pointers (heap_page_prune).

If, instead, you are suggesting that this checkpointer FPI
optimization should be part of the patch just because the potential is
there, then I disagree. Please pardon me if this was not your intended
message, but "you suggested this might be possible after your patch,
thus you must implement this" seems like a great way to burn
contributor goodwill.

The scope of the checkpointer is effectively PostgreSQL's WAL, plus
the page formats of all access methods that use the Page-based storage
manager (not just table access methods, but also those of indexes).
I'm not yet comfortable with hacking in those (sub)systems, nor do I
expect to have the time/effort capacity soon to go through all the
logic of these access methods to validate the hypothesis that such an
optimization can be both correctly implemented and worthwhile.


Patch 2, as I see it, just clears up some leftover stuff from the end
of the pg14 release cycle with new insights and research I didn't have
at that point in time. As it is a behaviour change for WAL-logged
actions, it cannot realistically be backported; therefore it is
included as an improvement for pg15.

Kind regards,

Matthias van de Meent



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Mon, Aug 2, 2021 at 11:57 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> 1. Allow same thing as PageTruncateLinePointerArray() during HOT cleanup
> That is going to have a clear benefit for HOT workloads, which by
> their nature will use a lot of line pointers.

Why do you say that?

> Many applications are updated much more frequently than they are vacuumed.
> Peter - what is your concern about doing this more frequently? Why
> would we *not* do this?

What I meant before was that this argument worked back when we limited
the technique to VACUUM's second heap pass. Doing line pointer array
truncation at that point alone meant that it only ever happened
inside VACUUM proper. Prior to that point we literally did nothing
about LP_UNUSED items at the end of each line pointer array, so we
were going from doing nothing to doing something.

This time it's quite different: we're truncating the line pointer
array during pruning. Pruning often occurs opportunistically, during
regular query processing. In fact I'd say that it's far more common
than pruning by VACUUM. So the chances of line pointer array
truncation hurting rather than helping seems higher. Plus now we might
break things like DDL, that would naturally not have been affected
before because we were only doing line pointer truncation during
VACUUM proper.

Of course it might well be true that there is a significant benefit to
this patch. I don't think that we should assume that that's the case,
though. We have yet to see a test case showing any benefit. Maybe
that's an easy thing to produce, and maybe Matthias has assumed that I
must already know what to look at. But I don't. It's not obvious to me
how to assess this patch now.

I don't claim to have any insight about what we should or should not
do. At least not right now. When I committed the original (commit
3c3b8a4b), I did so because I thought that it was very likely to
improve certain cases and very unlikely to hurt any other cases.
Nothing more.

> 2. Reduce number of line pointers to 0 in some cases.
> Matthias - I don't think you've made a full case for doing this, nor
> looked at the implications.
> > The comment clearly says "it seems like a good idea to avoid leaving a
> > PageIsEmpty() page behind".
> So I would be inclined to remove that from the patch and consider that
> as a separate issue, or close this.

This was part of that earlier commit because of sheer paranoia;
nothing more. I actually think that it's highly unlikely to protect us
from bugs in practice. Though I am, in a certain sense, likely to be
wrong about "PageIsEmpty() defensiveness", it does not bother me in
the slightest. It seems like the right approach in the absence of new
information about a significant downside. If my paranoia really did
turn out to be justified, then I would expect that there'd be a
subtle, nasty bug. That possibility is what I was really thinking of.
And so it almost doesn't matter to me how unlikely we might think such
a bug is now, unless and until somebody can demonstrate a real
practical downside to my defensive approach.

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Tue, Aug 3, 2021 at 12:27 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> This change makes it easier and more worthwhile to implement a further
> optimization for the checkpointer and/or buffer manager to determine
> that 1.) this page is now empty, and that 2.) we can therefore write a
> specialized WAL record specifically tuned for empty pages instead of
> FPI records. No additional pages are changed, because each time the
> line pointer array is shrunk, we've already either marked dead tuples
> as unused (2nd phase vacuum) or removed HOT line pointers / truncated
> dead tuples to lp_dead line pointers (heap_page_prune).

We generate an FPI the first time a page is modified after a
checkpoint. The FPI consists of the page *after* it has been modified.
Presumably this optimization would need the heap page to be 100%
empty, so we're left with what seems to me to be a very narrow target
for optimization; something that is naturally rare.

A fully-empty page seems quite unlikely in the case of xl_heap_vacuum
records, and impossible in the case of xl_heap_prune records, even
with all the patches working together. Have I missed something?

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Matthias van de Meent
Date:
On Wed, 4 Aug 2021 at 03:51, Peter Geoghegan <pg@bowt.ie> wrote:
>
> We generate an FPI the first time a page is modified after a
> checkpoint. The FPI consists of the page *after* it has been modified.

In that case, I misremembered when FPIs are written in relation to
checkpoints. Thanks for reminding me.

The point about non-FPIs as a replacement could still be valid, except
that the point below makes this yet more unlikely.

> Presumably this optimization would need the heap page to be 100%
> empty, so we're left with what seems to me to be a very narrow target
> for optimization; something that is naturally rare.

Yes.

> A fully-empty page seems quite unlikely in the case of xl_heap_vacuum
> records, and impossible in the case of xl_heap_prune records, even
> with all the patches working together. Have I missed something?

No, you're correct. xl_heap_prune cannot ever empty pages, as it
leaves at least 1 dead, or 1 redirect + 1 normal, line pointer on the
page.

Furthermore, it is indeed quite unlikely that the 2nd pass of vacuum
will be the first page modification after a checkpoint; it is quite a
bit more likely that the page was first modified by the 1st vacuum
pass. The chance that the first modification since a checkpoint is
made by the second pass is somewhat increased by indexless tables
(however unlikely, they exist in practice) and by the autovacuum index
cleanup delay mechanism, which lets pages with only dead item pointers
remain that way for more than just one vacuum run, but the chances of
fully clearing the page are indeed very, very slim.

Kind regards,

Matthias van de Meent



Re: Lowering the ever-growing heap->pd_lower

From
Matthias van de Meent
Date:
On Wed, 4 Aug 2021 at 02:43, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Aug 2, 2021 at 11:57 PM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> > 2. Reduce number of line pointers to 0 in some cases.
> > Matthias - I don't think you've made a full case for doing this, nor
> > looked at the implications.
> > The comment clearly says "it seems like a good idea to avoid leaving a
> > PageIsEmpty() page behind".
> > So I would be inclined to remove that from the patch and consider that
> > as a separate issue, or close this.
>
> This was part of that earlier commit because of sheer paranoia;
> nothing more. I actually think that it's highly unlikely to protect us
> from bugs in practice. Though I am, in a certain sense, likely to be
> wrong about "PageIsEmpty() defensiveness", it does not bother me in
> the slightest. It seems like the right approach in the absence of new
> information about a significant downside. If my paranoia really did
> turn out to be justified, then I would expect that there'd be a
> subtle, nasty bug. That possibility is what I was really thinking of.
> And so it almost doesn't matter to me how unlikely we might think such
> a bug is now, unless and until somebody can demonstrate a real
> practical downside to my defensive approach.

As I believe I have mentioned before, there is one significant
downside: 32-bit systems cannot reuse pages that contain only a
single unused line pointer for max-sized FSM requests. A fresh page
has 8168 bytes free (8kB - 24B page header), which then becomes 8164
when returned from PageGetFreeSpace (it accounts for space used by the
line pointer when inserting items onto the page).

On 64-bit systems, MaxHeapTupleSize is 8160, and for 32-bit systems
MaxHeapTupleSize is 8164. When we leave one more unused line pointer
on the page, this means the page will have a PageGetFreeSpace of 8160,
4 bytes less than MaxHeapTupleSize on 32-bit systems. As such, there
will never be FSM entries of the largest category for pages that have
had data on those systems, and so those systems will need to add a new
page for each request of the largest category, meaning that all tuples
larger than 8128 bytes (8128 being the largest request that still maps
to the 254 category) will always be put on a new page, regardless of
the actual availability of free space in the table.
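
Modelled on the FSM's category arithmetic (a standalone sketch; the
FSM_CAT_STEP of 32 and the 254/255 clamping are my reading of
freespace.c's fsm_space_avail_to_cat()/fsm_space_needed_to_cat(), with
MaxFSMRequestSize taken as the 32-bit MaxHeapTupleSize of 8164):

    #include <stdio.h>

    #define FSM_CAT_STEP 32          /* assumed: BLCKSZ / FSM_CATEGORIES */
    #define MAX_FSM_REQUEST 8164     /* assumed: MaxHeapTupleSize, 32-bit */

    /*
     * Category a page's free space is advertised under; the top category is
     * reserved for pages with at least MAX_FSM_REQUEST bytes free.
     */
    static int avail_to_cat(int avail)
    {
        if (avail >= MAX_FSM_REQUEST)
            return 255;
        int cat = avail / FSM_CAT_STEP;
        return cat > 254 ? 254 : cat;
    }

    /* Category a request of a given size searches for. */
    static int needed_to_cat(int needed)
    {
        int cat = (needed + FSM_CAT_STEP - 1) / FSM_CAT_STEP;
        return cat > 255 ? 255 : cat;
    }

    int main(void)
    {
        /* page with a single leftover line pointer: 8160 bytes reported free */
        printf("page with one leftover lp -> category %d\n", avail_to_cat(8160));
        /* max-sized 32-bit tuple, and the largest request that still fits 254 */
        printf("request for 8164 bytes    -> category %d\n", needed_to_cat(8164));
        printf("request for 8128 bytes    -> category %d\n", needed_to_cat(8128));
        return 0;
    }

The page's 8160 bytes land in category 254, while any request larger
than 8128 bytes asks for category 255, so such requests can never be
satisfied by a page that keeps one leftover line pointer.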

You might argue that this is a problem in the FSM subsystem, but in
this case it actively hinders us from reusing pages in the largest
category of FSM-requests. If you would argue 'PageGetHeapFreeSpace
should keep free line pointers in mind when calculating free space',
then I would argue 'yes, but isn't it better then to also actually
fully mark that space as unused'.

All in all, I'd just rather remove the distinction between once-used
pages and fresh pages completely by truncating the LP-array to 0 than
to leave this bloating behaviour in the system.

Kind regards,

Matthias van de Meent.



Re: Lowering the ever-growing heap->pd_lower

From
Robert Haas
Date:
On Tue, Aug 3, 2021 at 8:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This time it's quite different: we're truncating the line pointer
> array during pruning. Pruning often occurs opportunistically, during
> regular query processing. In fact I'd say that it's far more common
> than pruning by VACUUM. So the chances of line pointer array
> truncation hurting rather than helping seems higher.

How would it hurt?

It's easy to see the harm caused by not shortening the line pointer
array when it is possible to do so: we're using up space in the page
that could have been made free. It's not so obvious to me what the
downside of shortening it might be. I suppose there is a risk that we
shorten it and get no benefit, or even shorten it and have to lengthen
it again almost immediately. But neither of those things really
matters unless shortening is expensive. If we can piggy-back it on an
existing WAL record, I don't really see what the problem is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Lowering the ever-growing heap->pd_lower

From
Simon Riggs
Date:
On Wed, 4 Aug 2021 at 01:43, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Aug 2, 2021 at 11:57 PM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> > 1. Allow same thing as PageTruncateLinePointerArray() during HOT cleanup
> > That is going to have a clear benefit for HOT workloads, which by
> > their nature will use a lot of line pointers.
>
> Why do you say that?

Truncating line pointers can make extra space on the page, so it could
be the difference between a HOT and a non-HOT update. My understanding
is that these just-in-time actions have a beneficial effect in other
circumstances, so we can do that here also.

If we truncate line pointers more frequently then the root pointers
will tend to be lower in the array, which will make truncation even
more effective.

> > Many applications are updated much more frequently than they are vacuumed.
> > Peter - what is your concern about doing this more frequently? Why
> > would we *not* do this?
>
> What I meant before was that this argument worked back when we limited
> the technique to VACUUM's second heap pass. Doing line pointer array
> truncation at that point alone meant that it only ever happened
> inside VACUUM proper. Prior to that point we literally did nothing
> about LP_UNUSED items at the end of each line pointer array, so we
> were going from doing nothing to doing something.
>
> This time it's quite different: we're truncating the line pointer
> array during pruning. Pruning often occurs opportunistically, during
> regular query processing. In fact I'd say that it's far more common
> than pruning by VACUUM. So the chances of line pointer array
> truncation hurting rather than helping seems higher.

If the only thing we do to a page is truncate line pointers then it
may not be worth it. But dirtying a page to reclaim space is precisely
the time when reclaiming line pointers makes sense. So if the page is
dirtied by cleaning, then that is the time to reclaim line pointers as
well.

Again, why would we reclaim space to avoid bloat but then ignore any
line pointer bloat?

It's not clear why truncating unused line pointers would break DDL.

--
Simon Riggs                http://www.EnterpriseDB.com/



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Wed, Aug 4, 2021 at 12:09 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> Truncating line pointers can make extra space on the page, so it could
> be the difference between a HOT and a non-HOT update. My understanding
> is that these just-in-time actions have a beneficial effect in other
> circumstances, so we can do that here also.

I would prefer it if the arguments in favor were a little more
concrete. Maybe in general they don't have to be. But that would be my
preference, and what I would look for if I were to commit such a patch.

> If the only thing we do to a page is truncate line pointers then it
> may not be worth it. But dirtying a page to reclaim space is also the
> precise time when reclaiming line pointers makes sense also. So if the
> page is dirtied by cleaning, then that is the time to reclaim line
> pointers also.
>
> Again, why would we reclaim space to avoid bloat but then ignore any
> line pointer bloat?

I don't think that my mental model is significantly different to yours
here. Like everybody else, I can easily imagine how this *might* have
value. All I'm really saying is that a burden of proof exists for this
patch (or any performance orientated patch). In my opinion that burden
has yet to be met by this patch. My guess is that Matthias can show a
clear, measurable benefit with a little more work. If that happens
then all of my marginal concerns about the risk of bugs become
relatively unimportant.

It might even be okay to commit the patch on the basis of "what could
the harm be?" if there was some effort to demonstrate empirically that
the performance downside really was zero.

> It's not clear why truncating unused line pointers would break DDL.

I'm just saying that it's obviously not possible now, with the
VACUUM-only PageTruncateLinePointerArray()/lazy_vacuum_heap_page()
code path added to Postgres 14 -- because VACUUM's relation-level lock
makes sure of that. That property has some value. Certainly not enough
value to block progress on a feature that is clearly useful, but
enough to give me pause.

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Wed, Aug 4, 2021 at 7:39 AM Robert Haas <robertmhaas@gmail.com> wrote:
> How would it hurt?
>
> It's easy to see the harm caused by not shortening the line pointer
> array when it is possible to do so: we're using up space in the page
> that could have been made free. It's not so obvious to me what the
> downside of shortening it might be.

I think that that's probably true. That in itself doesn't seem like a
good enough reason to commit the patch.

Maybe this really won't be difficult for Matthias. I just want to see
some concrete testing, maybe with pgbench, or with an artificial test
case. Maybe something synthetic that shows a benefit measurable in
on-disk table size. Or at least the absence of any regressions.
Something.

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Simon Riggs
Date:
On Wed, 4 Aug 2021 at 15:39, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Aug 3, 2021 at 8:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > This time it's quite different: we're truncating the line pointer
> > array during pruning. Pruning often occurs opportunistically, during
> > regular query processing. In fact I'd say that it's far more common
> > than pruning by VACUUM. So the chances of line pointer array
> > truncation hurting rather than helping seems higher.
>
> How would it hurt?
>
> It's easy to see the harm caused by not shortening the line pointer
> array when it is possible to do so: we're using up space in the page
> that could have been made free. It's not so obvious to me what the
> downside of shortening it might be. I suppose there is a risk that we
> shorten it and get no benefit, or even shorten it and have to lengthen
> it again almost immediately. But neither of those things really
> matters unless shortening is expensive. If we can piggy-back it on an
> existing WAL record, I don't really see what the problem is.

Hmm, there is no information in WAL to describe the line pointers
being truncated by PageTruncateLinePointerArray(). We just truncate
every time we see an XLOG_HEAP2_VACUUM record and presume it does the
same thing as the original change.

If that is safe, then we don't need to put the truncation on a WAL
record at all, we just truncate after every XLOG_HEAP2_PRUNE record.

If that is not safe... then we have a PG14 bug.

If we do want to see this in WAL, both xl_heap_vacuum and
xl_heap_prune lend themselves to just adding one more OffsetNumber
onto the existing array, to represent the highest offset after
truncation. So we don't need to change the WAL format.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Thu, Aug 5, 2021 at 6:28 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
> Hmm, there is no information in WAL to describe the line pointers
> being truncated by PageTruncateLinePointerArray(). We just truncate
> every time we see a XLOG_HEAP2_VACUUM record and presume it does the
> same thing as the original change.
>
> If that is safe, then we don't need to put the truncation on a WAL
> record at all, we just truncate after every XLOG_HEAP2_PRUNE record.

I agree that that's how we'd do it. This approach is no different to
assuming that PageRepairFragmentation() reliably produces a final
defragmented page image deterministically when called after we prune.

These days we automatically verify assumptions like this via
wal_consistency_checking. It would absolutely be able to catch any
bugs in PageTruncateLinePointerArray(), since the truncate code path
has plenty of coverage within the regression tests.

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Daniel Gustafsson
Date:
This thread has stalled, and the request for benchmark/test has gone unanswered
so I'm marking this patch Returned with Feedback.  Please feel free to resubmit
this patch if it is picked up again.

--
Daniel Gustafsson        https://vmware.com/




Re: Lowering the ever-growing heap->pd_lower

From
Matthias van de Meent
Date:
On Thu, 2 Dec 2021 at 11:17, Daniel Gustafsson <daniel@yesql.se> wrote:
>
> This thread has stalled, and the request for benchmark/test has gone unanswered
> so I'm marking this patch Returned with Feedback.  Please feel free to resubmit
> this patch if it is picked up again.

Well then, here we go. It took me some time to find the time and
examples, but here we are. Attached is v7 of the patchset, which is a
rebase on top of the latest release, with some updated comments.

Peter Geoghegan asked for good arguments for the two changes
implemented. Below are my arguments detailed, with adversarial loads
that show the problematic behaviour of the line pointer array that is
fixed with the patch.

Kind regards,

Matthias van de Meent


Truncating lp_array to 0 line pointers
===========================

On 32-bit pre-patch systems the heap grows without limit; post-patch
the relation doesn't grow beyond 16kiB (assuming no other sessions in
the database):

> -- setup
> CREATE TABLE tst (filler  text);
> ALTER TABLE tst SET (autovacuum_enabled = off, fillfactor = 10); -- disable autovacuum, and trigger pruning more often
> ALTER TABLE tst ALTER COLUMN filler SET STORAGE PLAIN;
> INSERT INTO tst VALUES ('');
> -- processing load
> VACUUM tst; UPDATE tst SET filler = repeat('1', 8134); -- ~ max size heaptuple in MAXIMUM_ALIGNOF = 4 systems

On 64-bit systems this load is not an issue due to slightly larger
tolerances in the FSM.

Truncating lp_array during pruning
===========================

The following adversarial load grows the heap relation, but with the
patch the relation keeps its size. The point is that HOT updates
can temporarily inflate the LP array significantly, and this patch can
actively combat that issue while we're waiting for the 2nd pass of
vacuum to arrive.

> -- Initialize experiment
> TRUNCATE tst;
> INSERT INTO tst SELECT null;
> UPDATE tst SET filler = null;
> VACUUM tst;
>
> -- start workload by filling all line pointer slots with minsized heap tuples
> do language plpgsql $$
> begin
>     for i in 1..289 loop
>         update tst set filler = null;
>     end loop;
> end;
> $$;
> -- all line pointers on page 0 are now filled with hot updates of 1st line pointer
>
> -- hot-update hits the page; pruning is first applied
> -- All but first and last LP are now empty. new tuple is inserted at
> -- offset=2
> UPDATE tst SET filler = null;
>
> -- Insert large tuple, filling most of the now free space between the end of
> -- the max size line pointer array
> UPDATE tst SET filler = repeat('1', 7918);
>
> -- insert slightly smaller tuple, that is ~ the size of the unused space in
> -- the LP array
> UPDATE tst SET filler = repeat('1', 1144);
>
> -- reset experiment to initialized state
> UPDATE tst SET filler = null;

Attachment

Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Tue, Feb 15, 2022 at 10:48 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Peter Geoghegan asked for good arguments for the two changes
> implemented. Below are my arguments detailed, with adversarial loads
> that show the problematic behaviour of the line pointer array that is
> fixed with the patch.

Why is it okay that lazy_scan_prune() still calls
PageGetMaxOffsetNumber() once for the page, before it ever calls
heap_page_prune()? Won't lazy_scan_prune() need to reestablish maxoff
now, if only so that its scan-page-items loop doesn't get confused
when it goes on to read "former line pointers"? This is certainly
possible with the CLOBBER_FREED_MEMORY stuff in place (which will
memset the truncated line pointer space with a 0x7F7F7F7F pattern).

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Matthias van de Meent
Date:
On Wed, 16 Feb 2022 at 20:54, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Feb 15, 2022 at 10:48 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Peter Geoghegan asked for good arguments for the two changes
> > implemented. Below are my arguments detailed, with adversarial loads
> > that show the problematic behaviour of the line pointer array that is
> > fixed with the patch.
>
> Why is it okay that lazy_scan_prune() still calls
> PageGetMaxOffsetNumber() once for the page, before it ever calls
> heap_page_prune()? Won't lazy_scan_prune() need to reestablish maxoff
> now, if only so that its scan-page-items loop doesn't get confused
> when it goes on to read "former line pointers"? This is certainly
> possible with the CLOBBER_FREED_MEMORY stuff in place (which will
> memset the truncated line pointer space with a 0x7F7F7F7F pattern).

Good catch, it is not. Attached a version that re-establishes maxoff
after each prune operation.

-Matthias

Attachment

Re: Lowering the ever-growing heap->pd_lower

From
Matthias van de Meent
Date:
On Wed, 16 Feb 2022 at 21:14, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Wed, 16 Feb 2022 at 20:54, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Tue, Feb 15, 2022 at 10:48 AM Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > > Peter Geoghegan asked for good arguments for the two changes
> > > implemented. Below are my arguments detailed, with adversarial loads
> > > that show the problematic behaviour of the line pointer array that is
> > > fixed with the patch.
> >
> > Why is it okay that lazy_scan_prune() still calls
> > PageGetMaxOffsetNumber() once for the page, before it ever calls
> > heap_page_prune()? Won't lazy_scan_prune() need to reestablish maxoff
> > now, if only so that its scan-page-items loop doesn't get confused
> > when it goes on to read "former line pointers"? This is certainly
> > possible with the CLOBBER_FREED_MEMORY stuff in place (which will
> > memset the truncated line pointer space with a 0x7F7F7F7F pattern).
>
> Good catch, it is not. Attached a version that re-establishes maxoff
> after each prune operation.

I double-checked the changes, and to me it seems like that was the
only place in the code where PageGetMaxOffsetNumber was not handled
correctly. This was fixed in the latest patch (v8).

Peter, would you have time to further review this patch and/or commit it?

- Matthias



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Thu, Mar 10, 2022 at 5:49 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I double-checked the changes, and to me it seems like that was the
> only place in the code where PageGetMaxOffsetNumber was not handled
> correctly. This was fixed in the latest patch (v8).
>
> Peter, would you have time to further review this patch and/or commit it?

I'll definitely review it some more before too long.

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Tue, Feb 15, 2022 at 10:48 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> # Truncating lp_array during pruning
> ===========================
>
> The following adversarial load grows the heap relation, but with the
> patch the relation keeps its size. The point being that HOT updates
> can temporarily inflate the LP array significantly, and this patch can
> actively combat that issue while we're waiting for the 2nd pass of
> vacuum to arrive.

I am sympathetic to the idea that giving the system a more accurate
picture of how much free space is available on each heap page is an
intrinsic good. This might help us in a few different areas. For
example, the FSM cares about relatively small differences in available
free space *among* heap pages that are "in competition" in
RelationGetBufferForTuple(). Plus we have a heuristic based on
PageGetHeapFreeSpace() in heap_page_prune_opt() to consider.

We should definitely increase MaxHeapTuplesPerPage before too long,
for a variety of reasons that I have talked about in the past. Its
current value is 291 on all mainstream platforms, a value that's
derived from accidental historic details -- which predate HOT.
Obviously an increase in MaxHeapTuplesPerPage is likely to make the
problem that the patch proposes to solve worse. I lean towards
committing the patch now as work in that direction, in fact.

It helps that this patch now seems relatively low risk.

--
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Mon, Apr 4, 2022 at 7:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I am sympathetic to the idea that giving the system a more accurate
> picture of how much free space is available on each heap page is an
> intrinsic good. This might help us in a few different areas. For
> example, the FSM cares about relatively small differences in available
> free space *among* heap pages that are "in competition" in
> RelationGetBufferForTuple(). Plus we have a heuristic based on
> PageGetHeapFreeSpace() in heap_page_prune_opt() to consider.

Pushed a slightly revised version of this just now. Differences:

* Rewrote the comments, and adjusted related comments in vacuumlazy.c.
Mostly just made them shorter.

* I eventually decided that it was fine to just accept the issue with
maxoff in lazy_scan_prune (the pruning path used by VACUUM).

There seemed to be no need to reestablish a maxoff for the page here
following further reflection. I changed my mind.

Setting reclaimed line pointer array space to a pattern of 0x7F bytes
wasn't adding much here. Pruning either runs in VACUUM, or
opportunistically. When it runs in VACUUM things are highly
constrained already. Opportunistic pruning for heap_page_prune_opt()
callers doesn't even require that the caller start out with a buffer
lock. Pruning only goes ahead when we successfully acquire a cleanup
lock -- so callers can't be relying on maxoff not changing.

* Didn't keep the changes to PageTruncateLinePointerArray().

There is at least no reason to tie this question about VACUUM to how
pruning behaves. I still see some small value in avoiding creating a
new path that allows PageIsEmpty() pages to come into existence in a
new way, which is no less true with the patch I pushed.

Thanks
--
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Andres Freund
Date:
Hi,

On 2022-04-04 19:24:22 -0700, Peter Geoghegan wrote:
> We should definitely increase MaxHeapTuplesPerPage before too long,
> for a variety of reasons that I have talked about in the past. Its
> current value is 291 on all mainstream platforms, a value that's
> derived from accidental historic details -- which predate HOT.

I'm on-board with that - but I think we should rewrite a bunch of places that
use MaxHeapTuplesPerPage-sized arrays on the stack first. It's not great using
several KB of stack at the current value already (*), but if it grows
further...

Greetings,

Andres Freund


* lazy_scan_prune() has
    OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
    xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
which already works out to 4074 bytes.



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Thu, Apr 7, 2022 at 4:01 PM Andres Freund <andres@anarazel.de> wrote:
> I'm on-board with that - but I think we should rewrite a bunch of places that
> use MaxHeapTuplesPerPage-sized arrays on the stack first. It's not great using
> several KB of stack at the current value already (*), but if it grows
> further...

No arguments here. There are probably quite a few places that won't
need to be fixed, because it just doesn't matter, but
lazy_scan_prune() will.

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Matthias van de Meent
Date:
On Fri, 8 Apr 2022 at 01:01, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-04-04 19:24:22 -0700, Peter Geoghegan wrote:
> > We should definitely increase MaxHeapTuplesPerPage before too long,
> > for a variety of reasons that I have talked about in the past. Its
> > current value is 291 on all mainstream platforms, a value that's
> > derived from accidental historic details -- which predate HOT.
>
> I'm on-board with that - but I think we should rewrite a bunch of places that
> use MaxHeapTuplesPerPage-sized arrays on the stack first. It's not great using
> several KB of stack at the current value already (*), but if it grows
> further...

Yeah, I think we should definitely support more line pointers on a
heap page, but abusing MaxHeapTuplesPerPage for that is misleading:
the current value is the physical limit for heap tuples, since we have at
most 1 heap tuple per line pointer, and thus MaxHeapTuplesPerPage itself
won't change. A macro MaxHeapLinePointersPerPage would probably be
more useful, which could be as follows (assuming we don't want to
allow filling a page with effectively only dead line pointers):

#define MaxHeapLinePointersPerPage \
   ((int) (((BLCKSZ - SizeOfPageHeaderData) / \
         (MAXALIGN(SizeofHeapTupleHeader) + 2 * sizeof(ItemIdData))) * 2))

This accounts for the worst case of one redirect + one min-sized live
heap tuple, and fills the page with it. Although it's impossible to
actually get a page into that state, it is the worst case for live line
pointers on a page.
For the default BLCKSZ of 8kB, that results in 510 line pointers
used-but-not-dead, an increase of ~ 70% over what's currently
available.
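
(As a quick sanity check of that figure, plugging in the usual 8kB-build
values -- a sketch, not copied from the headers: SizeOfPageHeaderData = 24,
MAXALIGN(SizeofHeapTupleHeader) = 24, sizeof(ItemIdData) = 4 -- the macro
evaluates to:

    (8192 - 24) / (24 + 2 * 4) = 8168 / 32 = 255
    255 * 2                               = 510

i.e. 255 redirect + live-tuple pairs, or 510 line pointers, versus 291
today.)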

-Matthias



Re: Lowering the ever-growing heap->pd_lower

From
Robert Haas
Date:
On Thu, Apr 7, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-04-04 19:24:22 -0700, Peter Geoghegan wrote:
> > We should definitely increase MaxHeapTuplesPerPage before too long,
> > for a variety of reasons that I have talked about in the past. Its
> > current value is 291 on all mainstream platforms, a value that's
> > derived from accidental historic details -- which predate HOT.
>
> I'm on-board with that - but I think we should rewrite a bunch of places that
> use MaxHeapTuplesPerPage-sized arrays on the stack first. It's not great using
> several KB of stack at the current value already (*), but if it grows
> further...

I agree that the value of 291 is pretty much accidental, but it also
seems fairly generous to me. The bigger you make it, the more space
you can waste. I must have missed (or failed to understand) previous
discussions about why raising it would be a good idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Fri, Apr 8, 2022 at 4:38 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Yeah, I think we should definitely support more line pointers on a
> heap page, but abusing MaxHeapTuplesPerPage for that is misleading:
> the current value is the physical limit for heap tuples, since we have at
> most 1 heap tuple per line pointer, and thus MaxHeapTuplesPerPage itself
> won't change. A macro MaxHeapLinePointersPerPage would probably be
> more useful, which could be as follows (assuming we don't want to
> allow filling a page with effectively only dead line pointers):

That's a good point. Sounds like it might be the right approach.

I suppose that it will depend on how much use of MaxHeapTuplesPerPage
remains once it is split in two like this.

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Fri, Apr 8, 2022 at 6:17 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I agree that the value of 291 is pretty much accidental, but it also
> seems fairly generous to me. The bigger you make it, the more space
> you can waste. I must have missed (or failed to understand) previous
> discussions about why raising it would be a good idea.

What do you mean about wasting space? Wasting space on the stack? I
can't imagine you meant wasting space on the page, since being able to
accommodate more items on each heap page seems like it would be
strictly better, barring any unintended weird FSM issues.

As far as I know the only real downside to increasing it is the impact
on tidbitmap.c. Increasing the number of valid distinct TID values
might have a negative impact on performance during bitmap scans, which
will need to be managed. However, I don't think that increased stack
space usage will be a problem, with a little work. It either won't
matter at all (e.g. an array of offset numbers on the stack still
won't be very big), or it can be fixed locally where it turns out to
matter (like in lazy_scan_prune).

We used to routinely use MaxOffsetNumber for arrays of item offset
numbers. I cut down on that in the B-Tree code, reducing it to
MaxIndexTuplesPerPage (which is typically 407) in a few places. So
anything close to our current MaxIndexTuplesPerPage ought to be fine
for most individual arrays stored on the stack.

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Fri, Apr 8, 2022 at 9:44 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Apr 8, 2022 at 4:38 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Yeah, I think we should definitely support more line pointers on a
> > heap page, but abusing MaxHeapTuplesPerPage for that is misleading:
> > the current value is the physical limit for heap tuples, since we have at
> > most 1 heap tuple per line pointer, and thus MaxHeapTuplesPerPage itself
> > won't change. A macro MaxHeapLinePointersPerPage would probably be
> > more useful, which could be as follows (assuming we don't want to
> > allow filling a page with effectively only dead line pointers):
>
> That's a good point. Sounds like it might be the right approach.
>
> I suppose that it will depend on how much use of MaxHeapTuplesPerPage
> remains once it is split in two like this.

Thinking about this some more, I wonder if it would make sense to
split MaxHeapTuplesPerPage into two new constants (a true
MaxHeapTuplesPerPage, plus MaxHeapLinePointersPerPage), for the
reasons discussed, but also as a way of getting a *smaller* effective
MaxHeapTuplesPerPage than 291 in some contexts only.

There are some ways in which the current MaxHeapTuplesPerPage isn't
enough, but also some ways in which it is excessive. It might be
useful if PageGetHeapFreeSpace() usually considered a heap page to
have no free space once the number of tuples with storage (or some cheap
proxy thereof) reaches about 227, which is the largest number of distinct
heap tuples that can *plausibly* ever be stored on an 8KiB page (ignoring
zero-column tables).
callers (including VACUUM) would continue to call that function in the
same way as today, and get this lower limit.

A few of the existing PageGetHeapFreeSpace() callers could store more
line pointers than that (MaxHeapLinePointersPerPage, which might be
510 in practice) -- but only those involved in updates. The overall
idea is to recognize that free space is not interchangeable -- updates
should have some kind of advantage over plain inserts when it comes to
the space on the page of the tuple that they're updating.

We might even want to make our newly defined, lower
MaxHeapTuplesPerPage into a tunable storage param.
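
To illustrate the shape of this (a sketch only -- the function name
PageGetHeapFreeSpaceForUpdate() is made up, and PageGetMaxOffsetNumber()
stands in for whatever cheap proxy of "tuples with storage" we'd settle
on):

    /*
     * Most callers keep using PageGetHeapFreeSpace(), which would report
     * zero free space once the page already holds about
     * MaxHeapTuplesPerPage (the new, lower constant) items.  Update paths
     * would instead call something like this, which tolerates up to
     * MaxHeapLinePointersPerPage line pointers.
     */
    Size
    PageGetHeapFreeSpaceForUpdate(Page page)
    {
        Size        space = PageGetFreeSpace(page);

        if (PageGetMaxOffsetNumber(page) >= MaxHeapLinePointersPerPage)
            space = 0;

        return space;
    }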

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Robert Haas
Date:
On Fri, Apr 8, 2022 at 12:57 PM Peter Geoghegan <pg@bowt.ie> wrote:
> What do you mean about wasting space? Wasting space on the stack? I
> can't imagine you meant wasting space on the page, since being able to
> accommodate more items on each heap page seems like it would be
> strictly better, barring any unintended weird FSM issues.

I meant wasting space in the page. I think that's a real concern.
Imagine you allowed 1000 line pointers per page. Each one consumes 2
bytes. So now you could have ~25% of each page in the table storing
dead line pointers. That sounds awful, and just running VACUUM won't
fix it once it's happened, because the still-live line pointers are
likely to be at the end of the line pointer array and thus truncating
it won't necessarily be possible.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Fri, Apr 8, 2022 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I meant wasting space in the page. I think that's a real concern.
> Imagine you allowed 1000 line pointers per page. Each one consumes 2
> bytes. So now you could have ~25% of each page in the table storing
> dead line pointers. That sounds awful, and just running VACUUM won't
> fix it once it's happened, because the still-live line pointers are
> likely to be at the end of the line pointer array and thus truncating
> it won't necessarily be possible.

I see. That's a legitimate concern, though one that I believe can be
addressed. I have learned to dread any kind of bloat that's
irreversible, no matter how minor it might seem when seen as an
isolated event, so I'm certainly sympathetic to these concerns. You
can make a similar argument in favor of a higher
MaxHeapLinePointersPerPage limit, though -- and that's why I believe
an increase of some kind makes sense. The argument goes like this:

What if we miss the opportunity to systematically keep successor
versions of a given logical row on the same heap page over time, due
only to the current low MaxHeapLinePointersPerPage limit of 291? If we
had only been able to "absorb" just a few extra versions in the short
term, we would have had stability (in the sense of being able to
preserve locality among related logical rows) in the long term. We
could have kept everything together, if only we didn't overreact to
what were actually short term, rare perturbations.

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Robert Haas
Date:
On Fri, Apr 8, 2022 at 3:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> What if we miss the opportunity to systematically keep successor
> versions of a given logical row on the same heap page over time, due
> only to the current low MaxHeapLinePointersPerPage limit of 291? If we
> had only been able to "absorb" just a few extra versions in the short
> term, we would have had stability (in the sense of being able to
> preserve locality among related logical rows) in the long term. We
> could have kept everything together, if only we didn't overreact to
> what were actually short term, rare perturbations.

Hmm. I wonder if we could teach the system to figure out which of
those things is happening. In the case that I'm worried about, when
we're considering growing the line pointer array, either the line
pointers will be dead or the line pointers will be used but the tuples
to which they point will be dead. In the case you describe here, there
should be very few dead tuples or line pointers in the page. Maybe
when the number of line pointers starts to get big, we refuse to add
more without checking the number of dead tuples and dead line pointers
and verifying that those numbers are still small. Or, uh, something.
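
Something along these lines, maybe (a rough sketch -- the function name,
the 291 cutoff, and the "mostly live" test are all illustrative, not a
concrete proposal):

    /*
     * Only let the line pointer array grow past today's cap while the
     * page is overwhelmingly live, i.e. the extra line pointers are being
     * spent on useful tuples rather than on dead ones.
     */
    static bool
    may_grow_line_pointer_array(int nline, int ndeaditems, int ndeadtuples)
    {
        if (nline < 291)        /* within today's limit: always OK */
            return true;

        /* beyond it: insist that the page is at least ~90% live */
        return (ndeaditems + ndeadtuples) * 10 < nline;
    }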

One fly in the ointment is that if we refuse to expand the line
pointer array, we might extend the relation instead, which is another
kind of bloat and thus not great. But if the line pointer array is
somehow filling up with tons of dead tuples, we're going to have to
extend the relation anyway. I suspect that in some circumstances it's
better to just accept that outcome and hope that it leads to some
pages becoming empty, thus allowing their line pointer arrays to be
reset.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Lowering the ever-growing heap->pd_lower

From
Andres Freund
Date:
Hi,

On 2022-04-08 09:17:40 -0400, Robert Haas wrote:
> I agree that the value of 291 is pretty much accidental, but it also
> seems fairly generous to me. The bigger you make it, the more space
> you can waste. I must have missed (or failed to understand) previous
> discussions about why raising it would be a good idea.

It's not hard to hit scenarios where pages are effectively unusable, because
they have close to 291 dead items, without autovacuum triggering (or
autovacuum just taking a while). You basically just need updates / deletes to
concentrate in a certain range of the table and have indexing that prevents
HOT updates. Because the overall percentage of dead tuples is low, no
autovacuum is triggered, yet a range of the table contains little but dead
items.  At which point you basically waste 7k bytes (1164 bytes for dead items
IIRC) until a vacuum finally kicks in - way more than what you'd waste if
the number of line items were limited at e.g. 2 x MaxHeapTuplesPerPage.

This has become a bit more pronounced with vacuum skipping index cleanup when
there's "just a few" dead items - if all your updates concentrate in a small
region, 2% of the whole relation size isn't actually that small.


I wonder if we could reduce the real-world space wastage of the line pointer
array, if we changed the logic about which OffsetNumbers to use during
inserts / updates and made a few tweaks to pruning.

1) It's kind of OK for heap-only tuples to get a high OffsetNumber - we can
   reclaim them during pruning once they're dead. They don't leave behind a
   dead item that's unreclaimable until the next vacuum with an index cleanup
   pass.

2) Arguably the OffsetNumber of a redirect target can be changed. It might
   break careless uses of WHERE ctid = ... though (which likely are already
   broken, just harder to hit).

This leads me to a few potential improvements:

a) heap_page_prune() should take the number of used items into account
   when deciding whether to prune. Right now we trigger HOT pruning based on
   the number of items only if PageGetMaxOffsetNumber(page) >=
   MaxHeapTuplesPerPage. But because it requires a vacuum to reclaim an ItemId
   used for a root tuple, we should trigger HOT pruning when it might lower
   which OffsetNumbers get used.

b) heap_page_prune() should be triggered in more paths. E.g. when
   inserting / updating, we should prune if it allows us to avoid using a high
   OffsetNumber.

c) What if we left some percentage of ItemIds unused, when looking for the
   OffsetNumber of a new HOT row version? That'd make it more likely for
   non-HOT updates and inserts to fit onto the page, without permanently
   increasing the size of the line pointer array. (See the sketch below,
   after this list.)

d) If we think 2) is acceptable, we could move the targets of redirects to
   make space for new root tuples, without increasing the permanent size of
   the line pointer array.
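
As a rough sketch of (c) -- the function name and the 10% reserve are made
up for illustration, so don't read it as a concrete design:

    /*
     * When placing a new heap-only tuple version, refuse to extend the
     * line pointer array into the last few slots, keeping some headroom
     * free for future non-HOT updates and inserts.
     */
    static bool
    hot_version_may_extend_lp_array(int nline, int lp_limit)
    {
        int         reserved = lp_limit / 10;   /* keep ~10% in reserve */

        return nline < lp_limit - reserved;
    }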

Crazy?

Greetings,

Andres Freund



Re: Lowering the ever-growing heap->pd_lower

From
Andres Freund
Date:
Hi,

On 2022-04-08 15:04:37 -0400, Robert Haas wrote:
> I meant wasting space in the page. I think that's a real concern.
> Imagine you allowed 1000 line pointers per page. Each one consumes 2
> bytes.

It's 4 bytes per line pointer, right?

struct ItemIdData {
        unsigned int               lp_off:15;            /*     0: 0  4 */
        unsigned int               lp_flags:2;           /*     0:15  4 */
        unsigned int               lp_len:15;            /*     0:17  4 */

        /* size: 4, cachelines: 1, members: 3 */
        /* last cacheline: 4 bytes */
};

Or am I confusing myself somehow?


I do wish the length of the tuple weren't in ItemIdData, but part of the
tuple, so we'd not waste this space for dead items (I think it'd also simplify
more code than it'd complicate). But ...

- Andres



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Fri, Apr 8, 2022 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
> It's 4 bytes per line pointer, right?

Yeah, it's 4 bytes in Postgres. Most other DB systems only need 2
bytes, which is implemented in exactly the way that you're imagining.
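
For illustration, a 2-byte line pointer of that sort might look roughly
like the following (purely hypothetical, not PostgreSQL code; it assumes
the tuple length moves into the tuple header, as you describe):

    /*
     * Low 2 bits: flags (unused / normal / redirect / dead, as today).
     * High 14 bits: tuple offset in MAXALIGN (8-byte) quanta, which is
     * enough to address any supported block size.
     */
    typedef uint16 SlimItemIdData;

    #define SlimItemIdGetFlags(lp)      ((lp) & 0x3)
    #define SlimItemIdGetOffset(lp)     (((lp) >> 2) * MAXIMUM_ALIGNOF)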

-- 
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Fri, Apr 8, 2022 at 2:06 PM Andres Freund <andres@anarazel.de> wrote:
> It's not hard to hit scenarios where pages are effectively unusable, because
> they have close to 291 dead items, without autovacuum triggering (or
> autovacuum just taking a while).

I think that this is mostly a problem with HOT updates, and regular
updates to a lesser degree. Deletes seem less troublesome.

I find that it's useful to think in terms of the high watermark number
of versions required for a given logical row over time. It's probably
quite rare for most individual logical rows to truly require more than
2 or 3 versions at the same time to serve queries. Even in
update-heavy tables. And without doing anything fancy with the
definition of HeapTupleSatisfiesVacuum(). There are important
exceptions, certainly, but overall I think that we're still not doing
well enough with these easier cases.

The high watermark number of versions is probably going to be
significantly greater than the typical number of versions for the same
row. So maybe we give up on keeping a row on its original heap block
today, all because of a once-off (or very rare) event where we needed
slightly more space for only a fraction of a second.

The tell-tale sign of these kinds of problems can sometimes be seen
with synthetic, rate-limited benchmarks. If it takes a very long time
for the problem to grow, but nothing about the workload really ever
changes, then that suggests problems that have this quality.  The
probability of any given logical row being moved to another heap block
is very low. And yet it is inevitable that many (even all) will, given
enough time, given enough opportunities to get unlucky.

> This has become a bit more pronounced with vacuum skipping index cleanup when
> there's "just a few" dead items - if all your updates concentrate in a small
> region, 2% of the whole relation size isn't actually that small.

The 2% threshold was chosen based on the observation that it was below
the effective threshold where autovacuum just won't ever launch
anything on a moderate-sized table (unless you set
autovacuum_vacuum_scale_factor to something absurdly low). That's the
real problem, IMV. That's why I think that we need to drive it based
primarily on page-level characteristics, while effectively ignoring
pages that are all-visible when deciding whether enough bloat is present
to necessitate vacuuming.

> 1) It's kind of OK for heap-only tuples to get a high OffsetNumber - we can
>    reclaim them during pruning once they're dead. They don't leave behind a
>    dead item that's unreclaimable until the next vacuum with an index cleanup
>    pass.

I like the general direction here, but this particular idea doesn't
seem like a winner.

> 2) Arguably the OffsetNumber of a redirect target can be changed. It might
>    break careless uses of WHERE ctid = ... though (which likely are already
>    broken, just harder to hit).

That makes perfect sense to me, though.

> a) heap_page_prune() should take the number of used items into account
>    when deciding whether to prune. Right now we trigger HOT pruning based on
>    the number of items only if PageGetMaxOffsetNumber(page) >=
>    MaxHeapTuplesPerPage. But because it requires a vacuum to reclaim an ItemId
>    used for a root tuple, we should trigger HOT pruning when it might lower
>    which OffsetNumbers get used.

Unsure about this.

> b) heap_page_prune() should be triggered in more paths. E.g. when
>    inserting / updating, we should prune if it allows us to avoid using a high
>    OffsetNumber.

Unsure about this too.

I prototyped a design that gives individual backends soft ownership of
heap blocks that were recently allocated, and later prunes the heap
page when it fills [1]. Useful for aborted transactions, where it
preserves locality -- leaving aborted tuples behind makes their space
ultimately reused for unrelated inserts, which is bad. But eager
pruning allows the inserter to leave behind more or less pristine heap
pages, which don't need to be pruned later on.

> c) What if we left some percentage of ItemIds unused, when looking for the
>    OffsetNumber of a new HOT row version? That'd make it more likely for
>    non-HOT updates and inserts to fit onto the page, without permanently
>    increasing the size of the line pointer array.

That sounds promising.

[1] https://postgr.es/m/CAH2-Wzm-VhVeQYTH8hLyYho2wdG8Ecrm0uPQJWjap6BOVfe9Og@mail.gmail.com
--
Peter Geoghegan



Re: Lowering the ever-growing heap->pd_lower

From
Peter Geoghegan
Date:
On Fri, Apr 8, 2022 at 1:29 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm. I wonder if we could teach the system to figure out which of
> those things is happening. In the case that I'm worried about, when
> we're considering growing the line pointer array, either the line
> pointers will be dead or the line pointers will be used but the tuples
> to which they point will be dead. In the case you describe here, there
> should be very few dead tuples or line pointers in the page. Maybe
> when the number of line pointers starts to get big, we refuse to add
> more without checking the number of dead tuples and dead line pointers
> and verifying that those numbers are still small. Or, uh, something.

It seems like the central idea is that we think in terms of "versions
per logical row", even in low level code that traditionally hasn't
made those kinds of distinctions.

Ideally we could structure pruning a little bit more like a B-Tree
page split, where there is an explicit "incoming tuple" that won't fit
(without splitting the page, or maybe doing some kind of deletion). If
the so-called incoming tuple that we'd rather like to fit on the page
is an insert of an unrelated row, don't allow it (don't prune, give
up). But if it's an update (especially a HOT update), be much more
permissive about allowing it, and/or going ahead with pruning in order
to make sure it happens.

I like Andres' idea of altering LP_REDIRECTs just to be able to use up
lower line pointers first. Or preserving a few extra LP_UNUSED items
on insert. Those seem promising to me.

> One fly in the ointment is that if we refuse to expand the line
> pointer array, we might extend the relation instead, which is another
> kind of bloat and thus not great. But if the line pointer array is
> somehow filling up with tons of dead tuples, we're going to have to
> extend the relation anyway. I suspect that in some circumstances it's
> better to just accept that outcome and hope that it leads to some
> pages becoming empty, thus allowing their line pointer arrays to be
> reset.

I agree. Sometimes the problem is that we don't cut our losses when we
should -- sometimes just accepting a limited downside is the right
thing to do. Like with the FSM; we diligently use every last scrap of
free space, without concern for the bigger picture. It's penny-wise,
pound-foolish.

-- 
Peter Geoghegan