Thread: Improve heapgetpage() performance, overhead from serializable
Hi,

Several loops which are important for query performance, like heapgetpage()'s
loop over all tuples, have to call functions like
HeapCheckForSerializableConflictOut() and PredicateLockTID() in every
iteration.

When serializable is not in use, all those functions do is return. But
being situated in a different translation unit, the compiler can't inline
(without LTO at least) the check whether serializability is needed. It's not
just the function call overhead that's noticeable, it's also that registers
have to be spilled to the stack / reloaded from memory etc.

On a freshly loaded pgbench scale 100, with turbo mode disabled, postgres
pinned to one core. Parallel workers disabled to reduce noise. All times are
the average of 15 executions with pgbench, in a newly started, but prewarmed
postgres.

SELECT * FROM pgbench_accounts OFFSET 10000000;

HEAD:
397.977

removing the HeapCheckForSerializableConflictOut() from heapgetpage()
(incorrect!), to establish the baseline of what serializable costs:
336.695

pulling out CheckForSerializableConflictOutNeeded() from
HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding calling
HeapCheckForSerializableConflictOut() in the loop:
339.742

moving the loop into a static inline function, marked as pg_always_inline,
called with static arguments for always_visible, check_serializable:
326.546

marking the always_visible, !check_serializable case likely():
322.249

removing TestForOldSnapshot() calls, which we pretty much already decided on:
312.987

FWIW, there's more we can do: with some hacky changes I got the time down to
273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
for something as core as this is, imo, worth considering on its own.

Now, this just affects the sequential scan case. heap_hot_search_buffer()
shares many of the same pathologies. I find it a bit harder to improve,
because the compiler's code generation seems to switch between good / bad with
changes that seem unrelated...

I wonder why we haven't used PageIsAllVisible() in heap_hot_search_buffer() so
far?

Greetings,
Andres Freund
Attachment
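To illustrate the pattern described in the post above (evaluate the cheap "is
a serializable check needed?" test once per page, move the per-tuple loop into
an always-inline helper, and call that helper with constant flags so the
compiler emits a specialized copy of the loop for each case), here is a
minimal, hypothetical C sketch. It is not the actual patch: the names and
signatures are simplified, and pg_always_inline / likely() are approximated
here with GCC/Clang builtins.

#include <stdbool.h>
#include <stddef.h>

/* Approximations of the macros mentioned above (GCC/Clang only). */
#define pg_always_inline __attribute__((always_inline)) inline
#define likely(x) __builtin_expect((x) != 0, 1)

/* Hypothetical stand-ins for the real per-tuple work. */
extern bool tuple_is_visible(size_t off);
extern void check_serializable_conflict(size_t off);

/*
 * Per-tuple loop in an always-inline helper.  Because all_visible and
 * check_serializable are compile-time constants at every call site below,
 * each inlined copy loses the branches it does not need.
 */
static pg_always_inline size_t
collect_tuples(size_t ntuples, bool all_visible, bool check_serializable)
{
    size_t nvis = 0;

    for (size_t off = 0; off < ntuples; off++)
    {
        if (check_serializable)
            check_serializable_conflict(off);

        if (all_visible || tuple_is_visible(off))
            nvis++;
    }

    return nvis;
}

size_t
scan_page(size_t ntuples, bool page_all_visible, bool serializable_needed)
{
    /* The cheap "do we need serializable checks?" test runs once per page, */
    /* not once per tuple, and the common case carries the branch hint.     */
    if (likely(page_all_visible))
    {
        if (likely(!serializable_needed))
            return collect_tuples(ntuples, true, false);
        else
            return collect_tuples(ntuples, true, true);
    }
    else
    {
        if (likely(!serializable_needed))
            return collect_tuples(ntuples, false, false);
        else
            return collect_tuples(ntuples, false, true);
    }
}

With the flags constant at each call site, the compiler can drop the
visibility and serializable branches from the hot loop in each specialized
copy, which is the effect the step-by-step timings above are measuring.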
Hi,
Regards,
Zhang Mingli
On Jul 16, 2023 at 09:57 +0800, Andres Freund <andres@anarazel.de>, wrote:
Hi,
Several loops which are important for query performance, like heapgetpage()'s
loop over all tuples, have to call functions like
HeapCheckForSerializableConflictOut() and PredicateLockTID() in every
iteration.
When serializable is not in use, all those functions do is return. But
being situated in a different translation unit, the compiler can't inline
(without LTO at least) the check whether serializability is needed. It's not
just the function call overhead that's noticeable, it's also that registers
have to be spilled to the stack / reloaded from memory etc.
On a freshly loaded pgbench scale 100, with turbo mode disabled, postgres
pinned to one core. Parallel workers disabled to reduce noise. All times are
the average of 15 executions with pgbench, in a newly started, but prewarmed
postgres.
SELECT * FROM pgbench_accounts OFFSET 10000000;
HEAD:
397.977
removing the HeapCheckForSerializableConflictOut() from heapgetpage()
(incorrect!), to establish the baseline of what serializable costs:
336.695
pulling out CheckForSerializableConflictOutNeeded() from
HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding calling
HeapCheckForSerializableConflictOut() in the loop:
339.742
moving the loop into a static inline function, marked as pg_always_inline,
called with static arguments for always_visible, check_serializable:
326.546
marking the always_visible, !check_serializable case likely():
322.249
removing TestForOldSnapshot() calls, which we pretty much already decided on:
312.987
FWIW, there's more we can do: with some hacky changes I got the time down to
273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
for something as core as this is, imo, worth considering on its own.
Now, this just affects the sequential scan case. heap_hot_search_buffer()
shares many of the same pathologies. I find it a bit harder to improve,
because the compiler's code generation seems to switch between good / bad with
changes that seem unrelated...
I wonder why we haven't used PageIsAllVisible() in heap_hot_search_buffer() so
far?
Greetings,
Andres Freund
LGTM, and I have a silly question:
if (likely(all_visible))
{
    if (likely(!check_serializable))
        scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                               block, lines, 1, 0);
    else
        scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                               block, lines, 1, 1);
}
else
{
    if (likely(!check_serializable))
        scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                               block, lines, 0, 0);
    else
        scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                               block, lines, 0, 1);
}
Does it make sense to combine the if/else conditions and pass them as the inline function's params?

Like:

scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                       block, lines, all_visible, check_serializable);
Hi,

On 2023-07-17 09:55:07 +0800, Zhang Mingli wrote:
> LGTM, and I have a silly question:
>
> if (likely(all_visible))
> {
>     if (likely(!check_serializable))
>         scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
>                                                block, lines, 1, 0);
>     else
>         scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
>                                                block, lines, 1, 1);
> }
> else
> {
>     if (likely(!check_serializable))
>         scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
>                                                block, lines, 0, 0);
>     else
>         scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
>                                                block, lines, 0, 1);
> }
>
> Does it make sense to combine the if/else conditions and pass them as the inline function's params?
>
> Like:
> scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
>                                        block, lines, all_visible, check_serializable);

I think that makes it less likely that the compiler actually generates a
constant-folded version for each of the branches. Perhaps worth some
experimentation.

Greetings,
Andres Freund
Hi,
Is there a plan to merge this patch in PG16?
Thanks,
Muhammad
From: Andres Freund <andres@anarazel.de>
Sent: Saturday, July 15, 2023 6:56 PM
To: pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>
Cc: Thomas Munro <thomas.munro@gmail.com>
Subject: Improve heapgetpage() performance, overhead from serializable
Hi,
Several loops which are important for query performance, like heapgetpage()'s
loop over all tuples, have to call functions like
HeapCheckForSerializableConflictOut() and PredicateLockTID() in every
iteration.
When serializable is not in use, all those functions do is return. But
being situated in a different translation unit, the compiler can't inline
(without LTO at least) the check whether serializability is needed. It's not
just the function call overhead that's noticeable, it's also that registers
have to be spilled to the stack / reloaded from memory etc.
On a freshly loaded pgbench scale 100, with turbo mode disabled, postgres
pinned to one core. Parallel workers disabled to reduce noise. All times are
the average of 15 executions with pgbench, in a newly started, but prewarmed
postgres.
SELECT * FROM pgbench_accounts OFFSET 10000000;
HEAD:
397.977
removing the HeapCheckForSerializableConflictOut() from heapgetpage()
(incorrect!), to establish the baseline of what serializable costs:
336.695
pulling out CheckForSerializableConflictOutNeeded() from
HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding calling
HeapCheckForSerializableConflictOut() in the loop:
339.742
moving the loop into a static inline function, marked as pg_always_inline,
called with static arguments for always_visible, check_serializable:
326.546
marking the always_visible, !check_serializable case likely():
322.249
removing TestForOldSnapshot() calls, which we pretty much already decided on:
312.987
FWIW, there's more we can do: with some hacky changes I got the time down to
273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
for something as core as this is, imo, worth considering on its own.
Now, this just affects the sequential scan case. heap_hot_search_buffer()
shares many of the same pathologies. I find it a bit harder to improve,
because the compiler's code generation seems to switch between good / bad with
changes that seem unrelated...
I wonder why we haven't used PageIsAllVisible() in heap_hot_search_buffer() so
far?
Greetings,
Andres Freund
This thread [1] is also about improving the heapgetpage() function, and looks
similar to this thread.

[1] https://www.postgresql.org/message-id/a9f40066-3d25-a240-4229-ec2fbe94e7a5%40yeah.net
On Fri, Sep 1, 2023 at 04:04, Muhammad Malik <muhammad.malik1@hotmail.com> wrote:
Hi,

Is there a plan to merge this patch in PG16?

Thanks,
Muhammad

From: Andres Freund <andres@anarazel.de>
Sent: Saturday, July 15, 2023 6:56 PM
To: pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>
Cc: Thomas Munro <thomas.munro@gmail.com>
Subject: Improve heapgetpage() performance, overhead from serializable

Hi,
Several loops which are important for query performance, like heapgetpage()'s
loop over all tuples, have to call functions like
HeapCheckForSerializableConflictOut() and PredicateLockTID() in every
iteration.
When serializable is not in use, all those functions do is return. But
being situated in a different translation unit, the compiler can't inline
(without LTO at least) the check whether serializability is needed. It's not
just the function call overhead that's noticeable, it's also that registers
have to be spilled to the stack / reloaded from memory etc.
On a freshly loaded pgbench scale 100, with turbo mode disabled, postgres
pinned to one core. Parallel workers disabled to reduce noise. All times are
the average of 15 executions with pgbench, in a newly started, but prewarmed
postgres.
SELECT * FROM pgbench_accounts OFFSET 10000000;
HEAD:
397.977
removing the HeapCheckForSerializableConflictOut() from heapgetpage()
(incorrect!), to establish the baseline of what serializable costs:
336.695
pulling out CheckForSerializableConflictOutNeeded() from
HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding calling
HeapCheckForSerializableConflictOut() in the loop:
339.742
moving the loop into a static inline function, marked as pg_always_inline,
called with static arguments for always_visible, check_serializable:
326.546
marking the always_visible, !check_serializable case likely():
322.249
removing TestForOldSnapshot() calls, which we pretty much already decided on:
312.987
FWIW, there's more we can do: with some hacky changes I got the time down to
273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
for something as core as this is, imo, worth considering on its own.
Now, this just affects the sequential scan case. heap_hot_search_buffer()
shares many of the same pathologies. I find it a bit harder to improve,
because the compiler's code generation seems to switch between good / bad with
changes that seem unrelated...
I wonder why we haven't used PageIsAllVisible() in heap_hot_search_buffer() so
far?
Greetings,
Andres Freund
On Mon, Jul 17, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:
> FWIW, there's more we can do, with some hacky changes I got the time down to
> 273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
> for something as core as this, is imo worth considering on its own.
Nice!
> On 2023-07-17 09:55:07 +0800, Zhang Mingli wrote:
> > Does it make sense to combine the if/else conditions and pass them as the inline function's params?
> >
> > Like:
> > scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
> > block, lines, all_visible, check_serializable);
>
> I think that makes it less likely that the compiler actually generates a
> constant-folded version for each of the branches. Perhaps worth some
> experimentation.
Combining this way doesn't do so for me.
Minor style nit:
+ scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
+ block, lines, 0, 1);
I believe we prefer true/false rather than numbers.
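For illustration, the call John quotes would then read along these lines; this
is the same call with boolean literals substituted for the 0/1 arguments,
nothing else changed.

scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                       block, lines, false, true);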
Hi,

On 2023-09-05 14:42:57 +0700, John Naylor wrote:
> On Mon, Jul 17, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:
>
> > FWIW, there's more we can do, with some hacky changes I got the time down to
> > 273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
> > for something as core as this, is imo worth considering on its own.
>
> Nice!
>
> > On 2023-07-17 09:55:07 +0800, Zhang Mingli wrote:
> > > Does it make sense to combine the if/else conditions and pass them as the inline function's params?
> > >
> > > Like:
> > > scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
> > >                                        block, lines, all_visible, check_serializable);
> >
> > I think that makes it less likely that the compiler actually generates a
> > constant-folded version for each of the branches. Perhaps worth some
> > experimentation.
>
> Combining this way doesn't do so for me.

Are you saying that the desired constant folding happened after combining the
branches, or that it didn't happen?

Greetings,
Andres Freund
On Wed, Sep 6, 2023 at 1:38 AM Andres Freund <andres@anarazel.de> wrote:
> > > I think that makes it less likely that the compiler actually generates a
> > > constant-folded version for each of the branches. Perhaps worth some
> > > experimentation.
> >
> > Combining this way doesn't do so for me.
>
> Are you saying that the desired constant folding happened after combining the
> branches, or that it didn't happen?
Constant folding did not happen.
--
John Naylor
EDB: http://www.enterprisedb.com
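To make the constant-folding question above concrete, here is a standalone,
hypothetical C sketch (not PostgreSQL source; the helper and variant names are
invented for illustration). With literal flags at each call site, as in
variant_a, every inlined copy of the helper can have its flag tests folded
away; with a single call passing runtime booleans, as in variant_b (the
combined form suggested upthread), the tests generally remain inside the
inlined loop unless the compiler decides to clone the function, which matches
the observation that the constant folding did not happen.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-tuple hooks, declared only so the sketch is self-contained. */
extern bool tuple_is_visible(size_t off);
extern void check_serializable_conflict(size_t off);

static inline __attribute__((always_inline)) size_t
collect_tuples(size_t ntuples, bool all_visible, bool check_serializable)
{
    size_t nvis = 0;

    for (size_t off = 0; off < ntuples; off++)
    {
        if (check_serializable)     /* dead code when the flag is a literal false */
            check_serializable_conflict(off);
        if (all_visible || tuple_is_visible(off))
            nvis++;
    }
    return nvis;
}

/* Variant A: four call sites with literal flags; each inlined copy is specialized. */
size_t
variant_a(size_t ntuples, bool all_visible, bool check_serializable)
{
    if (all_visible)
    {
        if (!check_serializable)
            return collect_tuples(ntuples, true, false);
        else
            return collect_tuples(ntuples, true, true);
    }
    else
    {
        if (!check_serializable)
            return collect_tuples(ntuples, false, false);
        else
            return collect_tuples(ntuples, false, true);
    }
}

/* Variant B: a single call with runtime flags, like the combined form suggested
 * upthread; the flag tests typically stay inside the inlined loop. */
size_t
variant_b(size_t ntuples, bool all_visible, bool check_serializable)
{
    return collect_tuples(ntuples, all_visible, check_serializable);
}

One way to check which of these the compiler actually produced is to compile
both variants at the usual optimization level and inspect the generated code
for the per-tuple flag branches.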
On Fri, Sep 1, 2023 at 1:08 PM tender wang <tndrwang@gmail.com> wrote:
>
> This thread [1] is also about improving the heapgetpage() function, and looks similar to this thread.
>
> [1] https://www.postgresql.org/message-id/a9f40066-3d25-a240-4229-ec2fbe94e7a5%40yeah.net
Please don't top-post.
For the archives: That CF entry has been withdrawn, after the author looked at this one and did some testing.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Jul 17, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:
> And 397->320ms
> for something as core as this, is imo worth considering on its own.

Hi Andres, this interesting work seems to have fallen off the radar --
are you still planning to move forward with this for v17?
Hi,

On 2024-01-22 13:01:31 +0700, John Naylor wrote:
> On Mon, Jul 17, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:
> > And 397->320ms
> > for something as core as this, is imo worth considering on its own.
>
> Hi Andres, this interesting work seems to have fallen off the radar --
> are you still planning to move forward with this for v17?

I had completely forgotten about this patch, but some discussion around
streaming read reminded me of it. Here's a rebased version, with conflicts
resolved, very light comment polish, and a commit message. Given that there
have been no other changes in the last months, I'm inclined to push it in a
few hours.

Greetings,
Andres Freund
Attachment
On Sun, Apr 7, 2024 at 11:49 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2024-01-22 13:01:31 +0700, John Naylor wrote:
> > On Mon, Jul 17, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:
> > > And 397->320ms
> > > for something as core as this, is imo worth considering on its own.
> >
> > Hi Andres, this interesting work seems to have fallen off the radar --
> > are you still planning to move forward with this for v17?
>
> I had completely forgotten about this patch, but some discussion around
> streaming read reminded me of it. Here's a rebased version, with conflicts
> resolved, very light comment polish, and a commit message. Given that there
> have been no other changes in the last months, I'm inclined to push it in a
> few hours.

Just in time ;-) The commit message should also have "reviewed by
Zhang Mingli" and "tested by Quan Zongliang", who shared results in a
thread for a withdrawn CF entry with a similar idea but covering fewer
cases:

https://www.postgresql.org/message-id/2ef7ff1b-3d18-2283-61b1-bbd25fc6c7ce%40yeah.net
Hi,

On 2024-04-07 12:07:22 +0700, John Naylor wrote:
> Just in time ;-) The commit message should also have "reviewed by
> Zhang Mingli" and "tested by Quan Zongliang", who shared results in a
> thread for a withdrawn CF entry with a similar idea but covering fewer
> cases:

Good call. Added and pushed.

Thanks,
Andres
On Sun, 7 Apr 2024 at 19:30, Andres Freund <andres@anarazel.de> wrote:
> Good call. Added and pushed.

I understand you're already aware of the reference in the comment to
heapgetpage(), which no longer exists as of 44086b097.

Melanie and I had discussed the heap_prepare_pagescan() name while I was
reviewing that recent refactor. Aside from fixing the comment, how about also
renaming heapgetpage_collect() to heap_prepare_pagescan_tuples()?

Patch attached for reference. Not looking for any credit.

I'm also happy to revisit the heap_prepare_pagescan() name and call
heapgetpage_collect() some appropriate derivative of whatever we'd rename
that to.

Copied Melanie as she may want to chime in too.

David
Attachment
Hi,

On 2024-04-08 14:43:21 +1200, David Rowley wrote:
> On Sun, 7 Apr 2024 at 19:30, Andres Freund <andres@anarazel.de> wrote:
> > Good call. Added and pushed.
>
> I understand you're already aware of the reference in the comment to
> heapgetpage(), which no longer exists as of 44086b097.

Yea, https://postgr.es/m/20240407172615.cocrsvboqm3ttqe4%40awork3.anarazel.de

> Melanie and I had discussed the heap_prepare_pagescan() name while I
> was reviewing that recent refactor. Aside from fixing the comment, how
> about also renaming heapgetpage_collect() to
> heap_prepare_pagescan_tuples()?
>
> Patch attached for reference. Not looking for any credit.
>
> I'm also happy to revisit the heap_prepare_pagescan() name and call
> heapgetpage_collect() some appropriate derivative of whatever we'd
> rename that to.

I kinda don't like heap_prepare_pagescan(), but heapgetpage() is worse. And I
don't have a great alternative suggestion.

Off-list, Melanie suggested heap_page_collect_visible_tuples(). Given the
separate callsites (making long names annoying) and the fact that it's really
specific to one caller, I'm somewhat inclined to just go with
collect_visible_tuples() or page_collect_visible(); I think I prefer the
latter.

Greetings,
Andres Freund
On Mon, 8 Apr 2024 at 15:13, Andres Freund <andres@anarazel.de> wrote:
> I kinda don't like heap_prepare_pagescan(), but heapgetpage() is worse. And I
> don't have a great alternative suggestion.

It came around from having nothing better. I was keen not to have the name
indicate it was only for checking visibility, as we're also checking for
serialization conflicts and pruning the page. The word "prepare" made it there
as it seemed generic enough to not falsely indicate it was only checking
visibility.

Also, it seemed good to keep it generic: if we one day end up with something
new that needs to be done before scanning a page in page mode, then that new
code is more likely to be put in the function with a generic name rather than
a function that appears to be named for some other unrelated task. It would be
nice not to end up with two functions to call before scanning a page in page
mode.

> Off-list Melanie suggested heap_page_collect_visible_tuples(). Given the
> separate callsites (making long names annoying) and the fact that it's really
> specific to one caller, I'm somewhat inclined to just go with
> collect_visible_tuples() or page_collect_visible(), I think I prefer the
> latter.

I understand wanting to avoid the long name. I'd rather stay clear of
"visible", but don't feel as strongly about this as it's static.

David
Hi,

On 2024-04-08 15:43:12 +1200, David Rowley wrote:
> On Mon, 8 Apr 2024 at 15:13, Andres Freund <andres@anarazel.de> wrote:
> > Off-list Melanie suggested heap_page_collect_visible_tuples(). Given the
> > separate callsites (making long names annoying) and the fact that it's really
> > specific to one caller, I'm somewhat inclined to just go with
> > collect_visible_tuples() or page_collect_visible(), I think I prefer the
> > latter.
>
> I understand wanting to avoid the long name. I'd rather stay clear of
> "visible", but don't feel as strongly about this as it's static.

I think visible would be ok; the serialization checks are IMO about
visibility too. But if you'd prefer, I'd also be ok with something like
page_collect_tuples()?

Greetings,
Andres Freund
On Mon, 8 Apr 2024 at 16:08, Andres Freund <andres@anarazel.de> wrote:
>
> On 2024-04-08 15:43:12 +1200, David Rowley wrote:
> > I understand wanting to avoid the long name. I'd rather stay clear of
> > "visible", but don't feel as strongly about this as it's static.
>
> I think visible would be ok, the serialization checks are IMO about
> visibility too. But if you'd prefer I'd also be ok with something like
> page_collect_tuples()?

That's ok for me.

David
Hi,

On 2024-04-08 16:18:21 +1200, David Rowley wrote:
> On Mon, 8 Apr 2024 at 16:08, Andres Freund <andres@anarazel.de> wrote:
> > I think visible would be ok, the serialization checks are IMO about
> > visibility too. But if you'd prefer I'd also be ok with something like
> > page_collect_tuples()?
>
> That's ok for me.

Cool, pushed that way.

Greetings,
Andres Freund