Thread: Improve heapgetpage() performance, overhead from serializable

Improve heapgetpage() performance, overhead from serializable

From
Andres Freund
Date:
Hi,

Several loops which are important for query performance, like heapgetpage()'s
loop over all tuples, have to call functions like
HeapCheckForSerializableConflictOut() and PredicateLockTID() in every
iteration.

When serializable is not in use, all those functions do is to to return. But
being situated in a different translation unit, the compiler can't inline
(without LTO at least) the check whether serializability is needed.  It's not
just the function call overhead that's noticable, it's also that registers
have to be spilled to the stack / reloaded from memory etc.

On a freshly loaded pgbench scale 100, with turbo mode disabled, postgres
pinned to one core. Parallel workers disabled to reduce noise.  All times are
the average of 15 executions with pgbench, in a newly started, but prewarmed
postgres.

SELECT * FROM pgbench_accounts OFFSET 10000000;
HEAD:
397.977

removing the HeapCheckForSerializableConflictOut() from heapgetpage()
(incorrect!), to establish the baseline of what serializable costs:
336.695

pulling out CheckForSerializableConflictOutNeeded() from
HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding calling
HeapCheckForSerializableConflictOut() in the loop:
339.742

moving the loop into a static inline function, marked as pg_always_inline,
called with static arguments for always_visible, check_serializable:
326.546

marking the always_visible, !check_serializable case likely():
322.249

removing TestForOldSnapshot() calls, which we pretty much already decided on:
312.987


FWIW, there's more we can do, with some hacky changes I got the time down to
273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
for something as core as this, is imo worth considering on its own.




Now, this just affects the sequential scan case. heap_hot_search_buffer()
shares many of the same pathologies.  I find it a bit harder to improve,
because the compiler's code generation seems to switch between good / bad with
changes that seems unrelated...


I wonder why we haven't used PageIsAllVisible() in heap_hot_search_buffer() so
far?


Greetings,

Andres Freund

Attachment

Re: Improve heapgetpage() performance, overhead from serializable

From
Zhang Mingli
Date:
Hi,

Regards,
Zhang Mingli
On Jul 16, 2023 at 09:57 +0800, Andres Freund <andres@anarazel.de>, wrote:
Hi,

Several loops which are important for query performance, like heapgetpage()'s
loop over all tuples, have to call functions like
HeapCheckForSerializableConflictOut() and PredicateLockTID() in every
iteration.

When serializable is not in use, all those functions do is to to return. But
being situated in a different translation unit, the compiler can't inline
(without LTO at least) the check whether serializability is needed. It's not
just the function call overhead that's noticable, it's also that registers
have to be spilled to the stack / reloaded from memory etc.

On a freshly loaded pgbench scale 100, with turbo mode disabled, postgres
pinned to one core. Parallel workers disabled to reduce noise. All times are
the average of 15 executions with pgbench, in a newly started, but prewarmed
postgres.

SELECT * FROM pgbench_accounts OFFSET 10000000;
HEAD:
397.977

removing the HeapCheckForSerializableConflictOut() from heapgetpage()
(incorrect!), to establish the baseline of what serializable costs:
336.695

pulling out CheckForSerializableConflictOutNeeded() from
HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding calling
HeapCheckForSerializableConflictOut() in the loop:
339.742

moving the loop into a static inline function, marked as pg_always_inline,
called with static arguments for always_visible, check_serializable:
326.546

marking the always_visible, !check_serializable case likely():
322.249

removing TestForOldSnapshot() calls, which we pretty much already decided on:
312.987


FWIW, there's more we can do, with some hacky changes I got the time down to
273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
for something as core as this, is imo worth considering on its own.




Now, this just affects the sequential scan case. heap_hot_search_buffer()
shares many of the same pathologies. I find it a bit harder to improve,
because the compiler's code generation seems to switch between good / bad with
changes that seems unrelated...


I wonder why we haven't used PageIsAllVisible() in heap_hot_search_buffer() so
far?


Greetings,

Andres Freund

LGTM and I have a fool question: 

if (likely(all_visible))
{
if (likely(!check_serializable))
scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
   block, lines, 1, 0);
else
scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
   block, lines, 1, 1);
}
else
{
if (likely(!check_serializable))
scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
   block, lines, 0, 0);
else
scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
   block, lines, 0, 1);


Does it make sense to combine if else condition and put it to the incline function’s param?

Like:
scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
   block, lines, all_visible, check_serializable);











Re: Improve heapgetpage() performance, overhead from serializable

From
Andres Freund
Date:
Hi,

On 2023-07-17 09:55:07 +0800, Zhang Mingli wrote:
> LGTM and I have a fool question:
>
>     if (likely(all_visible))
>     {
>         if (likely(!check_serializable))
>             scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
>                                                    block, lines, 1, 0);
>         else
>             scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
>                                                    block, lines, 1, 1);
>     }
>     else
>     {
>         if (likely(!check_serializable))
>             scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
>                                                    block, lines, 0, 0);
>         else
>             scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
>                                                    block, lines, 0, 1);
>
>
> Does it make sense to combine if else condition and put it to the incline function’s param?
>
> Like:
> scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
>                                                    block, lines, all_visible, check_serializable);

I think that makes it less likely that the compiler actually generates a
constant-folded version for each of the branches. Perhaps worth some
experimentation.

Greetings,

Andres Freund



Re: Improve heapgetpage() performance, overhead from serializable

From
Muhammad Malik
Date:
Hi,

Is there a plan to merge this patch in PG16?

Thanks,
Muhammad


From: Andres Freund <andres@anarazel.de>
Sent: Saturday, July 15, 2023 6:56 PM
To: pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>
Cc: Thomas Munro <thomas.munro@gmail.com>
Subject: Improve heapgetpage() performance, overhead from serializable
 
Hi,

Several loops which are important for query performance, like heapgetpage()'s
loop over all tuples, have to call functions like
HeapCheckForSerializableConflictOut() and PredicateLockTID() in every
iteration.

When serializable is not in use, all those functions do is to to return. But
being situated in a different translation unit, the compiler can't inline
(without LTO at least) the check whether serializability is needed.  It's not
just the function call overhead that's noticable, it's also that registers
have to be spilled to the stack / reloaded from memory etc.

On a freshly loaded pgbench scale 100, with turbo mode disabled, postgres
pinned to one core. Parallel workers disabled to reduce noise.  All times are
the average of 15 executions with pgbench, in a newly started, but prewarmed
postgres.

SELECT * FROM pgbench_accounts OFFSET 10000000;
HEAD:
397.977

removing the HeapCheckForSerializableConflictOut() from heapgetpage()
(incorrect!), to establish the baseline of what serializable costs:
336.695

pulling out CheckForSerializableConflictOutNeeded() from
HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding calling
HeapCheckForSerializableConflictOut() in the loop:
339.742

moving the loop into a static inline function, marked as pg_always_inline,
called with static arguments for always_visible, check_serializable:
326.546

marking the always_visible, !check_serializable case likely():
322.249

removing TestForOldSnapshot() calls, which we pretty much already decided on:
312.987


FWIW, there's more we can do, with some hacky changes I got the time down to
273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
for something as core as this, is imo worth considering on its own.




Now, this just affects the sequential scan case. heap_hot_search_buffer()
shares many of the same pathologies.  I find it a bit harder to improve,
because the compiler's code generation seems to switch between good / bad with
changes that seems unrelated...


I wonder why we haven't used PageIsAllVisible() in heap_hot_search_buffer() so
far?


Greetings,

Andres Freund

Re: Improve heapgetpage() performance, overhead from serializable

From
tender wang
Date:
This thread [1] also Improving the heapgetpage function, and looks like this thread.


Muhammad Malik <muhammad.malik1@hotmail.com> 于2023年9月1日周五 04:04写道:
Hi,

Is there a plan to merge this patch in PG16?

Thanks,
Muhammad


From: Andres Freund <andres@anarazel.de>
Sent: Saturday, July 15, 2023 6:56 PM
To: pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>
Cc: Thomas Munro <thomas.munro@gmail.com>
Subject: Improve heapgetpage() performance, overhead from serializable
 
Hi,

Several loops which are important for query performance, like heapgetpage()'s
loop over all tuples, have to call functions like
HeapCheckForSerializableConflictOut() and PredicateLockTID() in every
iteration.

When serializable is not in use, all those functions do is to to return. But
being situated in a different translation unit, the compiler can't inline
(without LTO at least) the check whether serializability is needed.  It's not
just the function call overhead that's noticable, it's also that registers
have to be spilled to the stack / reloaded from memory etc.

On a freshly loaded pgbench scale 100, with turbo mode disabled, postgres
pinned to one core. Parallel workers disabled to reduce noise.  All times are
the average of 15 executions with pgbench, in a newly started, but prewarmed
postgres.

SELECT * FROM pgbench_accounts OFFSET 10000000;
HEAD:
397.977

removing the HeapCheckForSerializableConflictOut() from heapgetpage()
(incorrect!), to establish the baseline of what serializable costs:
336.695

pulling out CheckForSerializableConflictOutNeeded() from
HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding calling
HeapCheckForSerializableConflictOut() in the loop:
339.742

moving the loop into a static inline function, marked as pg_always_inline,
called with static arguments for always_visible, check_serializable:
326.546

marking the always_visible, !check_serializable case likely():
322.249

removing TestForOldSnapshot() calls, which we pretty much already decided on:
312.987


FWIW, there's more we can do, with some hacky changes I got the time down to
273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
for something as core as this, is imo worth considering on its own.




Now, this just affects the sequential scan case. heap_hot_search_buffer()
shares many of the same pathologies.  I find it a bit harder to improve,
because the compiler's code generation seems to switch between good / bad with
changes that seems unrelated...


I wonder why we haven't used PageIsAllVisible() in heap_hot_search_buffer() so
far?


Greetings,

Andres Freund

Re: Improve heapgetpage() performance, overhead from serializable

From
John Naylor
Date:

On Mon, Jul 17, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:

> FWIW, there's more we can do, with some hacky changes I got the time down to
> 273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
> for something as core as this, is imo worth considering on its own.

Nice!

> On 2023-07-17 09:55:07 +0800, Zhang Mingli wrote:

> > Does it make sense to combine if else condition and put it to the incline function’s param?
> >
> > Like:
> > scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
> >                                                                                                  block, lines, all_visible, check_serializable);
>
> I think that makes it less likely that the compiler actually generates a
> constant-folded version for each of the branches. Perhaps worth some
> experimentation.

Combining this way doesn't do so for me.

Minor style nit:

+ scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
+   block, lines, 0, 1);

I believe we prefer true/false rather than numbers. 

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Improve heapgetpage() performance, overhead from serializable

From
Andres Freund
Date:
Hi,

On 2023-09-05 14:42:57 +0700, John Naylor wrote:
> On Mon, Jul 17, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:
>
> > FWIW, there's more we can do, with some hacky changes I got the time down
> to
> > 273.261, but the tradeoffs start to be a bit more complicated. And
> 397->320ms
> > for something as core as this, is imo worth considering on its own.
>
> Nice!
>
> > On 2023-07-17 09:55:07 +0800, Zhang Mingli wrote:
>
> > > Does it make sense to combine if else condition and put it to the
> incline function’s param?
> > >
> > > Like:
> > > scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
> > >
>                          block, lines, all_visible, check_serializable);
> >
> > I think that makes it less likely that the compiler actually generates a
> > constant-folded version for each of the branches. Perhaps worth some
> > experimentation.
>
> Combining this way doesn't do so for me.

Are you saying that the desired constant folding happened after combining the
branches, or that it didn't happen?

Greetings,

Andres Freund



Re: Improve heapgetpage() performance, overhead from serializable

From
John Naylor
Date:


On Wed, Sep 6, 2023 at 1:38 AM Andres Freund <andres@anarazel.de> wrote:

> > > I think that makes it less likely that the compiler actually generates a
> > > constant-folded version for each of the branches. Perhaps worth some
> > > experimentation.
> >
> > Combining this way doesn't do so for me.
>
> Are you saying that the desired constant folding happened after combining the
> branches, or that it didn't happen?

Constant folding did not happen.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Improve heapgetpage() performance, overhead from serializable

From
John Naylor
Date:

On Fri, Sep 1, 2023 at 1:08 PM tender wang <tndrwang@gmail.com> wrote:
>
> This thread [1] also Improving the heapgetpage function, and looks like this thread.
>
> [1] https://www.postgresql.org/message-id/a9f40066-3d25-a240-4229-ec2fbe94e7a5%40yeah.net

Please don't top-post.

For the archives: That CF entry has been withdrawn, after the author looked at this one and did some testing.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Improve heapgetpage() performance, overhead from serializable

From
John Naylor
Date:
On Mon, Jul 17, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:
> And 397->320ms
> for something as core as this, is imo worth considering on its own.

Hi Andres, this interesting work seems to have fallen off the radar --
are you still planning to move forward with this for v17?



Re: Improve heapgetpage() performance, overhead from serializable

From
Andres Freund
Date:
Hi,

On 2024-01-22 13:01:31 +0700, John Naylor wrote:
> On Mon, Jul 17, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:
> > And 397->320ms
> > for something as core as this, is imo worth considering on its own.
> 
> Hi Andres, this interesting work seems to have fallen off the radar --
> are you still planning to move forward with this for v17?

I had completely forgotten about this patch, but some discussion around
streaming read reminded me of it.  Here's a rebased version, with conflicts
resolved and very light comment polish and a commit message. Given that
there's been no changes otherwise in the last months, I'm inclined to push in
a few hours.

Greetings,

Andres Freund

Attachment

Re: Improve heapgetpage() performance, overhead from serializable

From
John Naylor
Date:
On Sun, Apr 7, 2024 at 11:49 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2024-01-22 13:01:31 +0700, John Naylor wrote:
> > On Mon, Jul 17, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:
> > > And 397->320ms
> > > for something as core as this, is imo worth considering on its own.
> >
> > Hi Andres, this interesting work seems to have fallen off the radar --
> > are you still planning to move forward with this for v17?
>
> I had completely forgotten about this patch, but some discussion around
> streaming read reminded me of it.  Here's a rebased version, with conflicts
> resolved and very light comment polish and a commit message. Given that
> there's been no changes otherwise in the last months, I'm inclined to push in
> a few hours.

Just in time ;-) The commit message should also have "reviewed by
Zhang Mingli" and "tested by Quan Zongliang", who shared results in a
thread for a withrawn CF entry with a similar idea but covering fewer
cases:

https://www.postgresql.org/message-id/2ef7ff1b-3d18-2283-61b1-bbd25fc6c7ce%40yeah.net



Re: Improve heapgetpage() performance, overhead from serializable

From
Andres Freund
Date:
Hi,

On 2024-04-07 12:07:22 +0700, John Naylor wrote:
> Just in time ;-) The commit message should also have "reviewed by
> Zhang Mingli" and "tested by Quan Zongliang", who shared results in a
> thread for a withrawn CF entry with a similar idea but covering fewer
> cases:

Good call. Added and pushed.

Thanks,

Andres



Re: Improve heapgetpage() performance, overhead from serializable

From
David Rowley
Date:
On Sun, 7 Apr 2024 at 19:30, Andres Freund <andres@anarazel.de> wrote:
> Good call. Added and pushed.

I understand you're already aware of the reference in the comment to
heapgetpage(), which no longer exists as of 44086b097.

Melanie and I had discussed the heap_prepare_pagescan() name while I
was reviewing that recent refactor. Aside from fixing the comment, how
about also renaming heapgetpage_collect() to
heap_prepare_pagescan_tuples()?

Patch attached for reference. Not looking for any credit.

I'm also happy to revisit the heap_prepare_pagescan() name and call
heapgetpage_collect() some appropriate derivative of whatever we'd
rename that to.

Copied Melanie as she may want to chime in too.

David

Attachment

Re: Improve heapgetpage() performance, overhead from serializable

From
Andres Freund
Date:
Hi,

On 2024-04-08 14:43:21 +1200, David Rowley wrote:
> On Sun, 7 Apr 2024 at 19:30, Andres Freund <andres@anarazel.de> wrote:
> > Good call. Added and pushed.
> 
> I understand you're already aware of the reference in the comment to
> heapgetpage(), which no longer exists as of 44086b097.

Yea, https://postgr.es/m/20240407172615.cocrsvboqm3ttqe4%40awork3.anarazel.de


> Melanie and I had discussed the heap_prepare_pagescan() name while I
> was reviewing that recent refactor. Aside from fixing the comment, how
> about also renaming heapgetpage_collect() to
> heap_prepare_pagescan_tuples()?

> Patch attached for reference. Not looking for any credit.
> 
> I'm also happy to revisit the heap_prepare_pagescan() name and call
> heapgetpage_collect() some appropriate derivative of whatever we'd
> rename that to.

I kinda don't like heap_prepare_pagescan(), but heapgetpage() is worse. And I
don't have a great alternative suggestion.

Off-list Melanie suggested heap_page_collect_visible_tuples(). Given the
separate callsites (making long names annoying) and the fact that it's really
specific to one caller, I'm somewhat inclined to just go with
collect_visible_tuples() or page_collect_visible(), I think I prefer the
latter.

Greetings,

Andres Freund



Re: Improve heapgetpage() performance, overhead from serializable

From
David Rowley
Date:
On Mon, 8 Apr 2024 at 15:13, Andres Freund <andres@anarazel.de> wrote:
> I kinda don't like heap_prepare_pagescan(), but heapgetpage() is worse. And I
> don't have a great alternative suggestion.

It came around from having nothing better.  I was keen not to have the
name indicate it was only for checking visibility as we're also
checking for serialization conflicts and pruning the page.  The word
"prepare" made it there as it seemed generic enough to not falsely
indicate it was only checking visibility.  Also, it seemed good to
keep it generic as if we one day end up with something new that needs
to be done before scanning a page in page mode then that new code is
more likely to be put in the function with a generic name rather than
a function that appears to be named for some other unrelated task.  It
would be nice not to end up with two functions to call before scanning
a page in page mode.

> Off-list Melanie suggested heap_page_collect_visible_tuples(). Given the
> separate callsites (making long names annoying) and the fact that it's really
> specific to one caller, I'm somewhat inclined to just go with
> collect_visible_tuples() or page_collect_visible(), I think I prefer the
> latter.

I understand wanting to avoid the long name.  I'd rather stay clear of
"visible", but don't feel as strongly about this as it's static.

David



Re: Improve heapgetpage() performance, overhead from serializable

From
Andres Freund
Date:
Hi,

On 2024-04-08 15:43:12 +1200, David Rowley wrote:
> On Mon, 8 Apr 2024 at 15:13, Andres Freund <andres@anarazel.de> wrote:
> > Off-list Melanie suggested heap_page_collect_visible_tuples(). Given the
> > separate callsites (making long names annoying) and the fact that it's really
> > specific to one caller, I'm somewhat inclined to just go with
> > collect_visible_tuples() or page_collect_visible(), I think I prefer the
> > latter.
> 
> I understand wanting to avoid the long name.  I'd rather stay clear of
> "visible", but don't feel as strongly about this as it's static.

I think visible would be ok, the serialization checks are IMO about
visibility too. But if you'd prefer I'd also be ok with something like
page_collect_tuples()?

Greetings,

Andres Freund



Re: Improve heapgetpage() performance, overhead from serializable

From
David Rowley
Date:
On Mon, 8 Apr 2024 at 16:08, Andres Freund <andres@anarazel.de> wrote:
>
> On 2024-04-08 15:43:12 +1200, David Rowley wrote:
> > I understand wanting to avoid the long name.  I'd rather stay clear of
> > "visible", but don't feel as strongly about this as it's static.
>
> I think visible would be ok, the serialization checks are IMO about
> visibility too. But if you'd prefer I'd also be ok with something like
> page_collect_tuples()?

That's ok for me.

David



Re: Improve heapgetpage() performance, overhead from serializable

From
Andres Freund
Date:
Hi,

On 2024-04-08 16:18:21 +1200, David Rowley wrote:
> On Mon, 8 Apr 2024 at 16:08, Andres Freund <andres@anarazel.de> wrote:
> > I think visible would be ok, the serialization checks are IMO about
> > visibility too. But if you'd prefer I'd also be ok with something like
> > page_collect_tuples()?
> 
> That's ok for me.

Cool, pushed that way.

Greetings,

Andres Freund