Thread: Damage control for planner's get_actual_variable_endpoint() runaway


From
Jakub Wartak
Date:
Hi hackers,

the case of the planner's
src/backend/utils/adt/selfuncs.c:get_actual_variable_endpoint()
spending literally seconds seems to be a well-known fact among hackers
(in extreme cases in the wild it can take over an hour on VLDBs). For
those unfamiliar with it: it is a planner estimation step that reads
the real table index (including dead rows) to find the actual min/max
values. It is a black-box mechanism that runs without any warning, is
often hugely affected by the number of dead tuples in the index, and
has no on/off switch or built-in limit on how far it can go. It has
been discussed on the pgsql mailing lists several times [1]-[5]. It
works fine in 99.9% of cases, until it doesn't and blows up big time
on larger systems, and from there the operator doesn't have a lot of
choices [a lot of time having already been wasted on identifying the
planner as the root cause]:
1) one can properly VACUUM (which everybody seems to agree is the
proper way to go, but it is often problematic due to various other
circumstances, especially on big tables without a serious partitioning
strategy) - and this again can be very time consuming
2) one cannot trade the heavy CPU/IO burned during planning (actually
fetching index pages on multi-TB tables) for less accurate plans, and
realistically speaking rewriting queries is often impossible
3) the application might not support prepared statements, and even if
it does, simple queries/reports are still affected
4) there is no visibility into how much activity is spent in btree
index lookups by get_actual_variable_endpoint() alone, so one cannot
estimate the system-wide impact

I would like to trigger the discussion on how to give at least partial
control to the end-user of what the optimizer performs. I was thinking
about several things and each of those has pros and cons:

a) the attached quick patch (by Simon Riggs) that puts a maximum
allowed cost constraint on the index-lookup machinery as a safeguard
(the #define value is debatable; in my testing it reduced the hit from
~850ms to 0.6ms +/- 0.3ms at the current value of 20).
b) I was wondering about creating a new wait class "Planner" with the
event "ReadingMinMaxIndex" (or similar). The obvious drawback is the
naming/categorization as wait events are ... well.. as the name "WAIT"
implies, while those btree lookups could easily be ON-CPU activity.
c) Any other idea, e.g. see [3] or [5] (cache was being proposed).
d) For completeness: a new GUC/knob to completely disable the
functionality (debug_optimizer_minmax_est), but that is effectively a
trimmed-down version of the patch.
e) I was also wondering about a DEBUG/NOTICE elog() when the lookup
takes more than some arbitrary threshold, say 10s, but that could
easily spam the log file.
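To illustrate the core idea behind (a)/(d), here is a tiny standalone simulation (simplified, with invented names - this is not the patch itself): the endpoint scan gives up after visiting a bounded number of heap pages and lets the caller fall back to the stored statistics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented names; a simplified stand-in for the real index/heap machinery. */
typedef struct
{
    unsigned heap_block;    /* heap page this index entry points at */
    bool     visible;       /* would a heap fetch find a live tuple? */
} SimEntry;

#define SIM_PAGE_LIMIT 100  /* give up after this many heap pages */

/*
 * Walk index entries in order; return true and set *endpoint if a visible
 * tuple is found before the page budget runs out, else return false
 * (meaning: fall back to the stored statistics).
 */
static bool
sim_get_endpoint(const SimEntry *entries, size_t n, unsigned *endpoint)
{
    unsigned last_block = (unsigned) -1;
    int      visited_pages = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (entries[i].heap_block != last_block)
        {
            if (++visited_pages > SIM_PAGE_LIMIT)
                return false;       /* budget exhausted: stop the runaway */
            last_block = entries[i].heap_block;
        }
        if (entries[i].visible)
        {
            *endpoint = entries[i].heap_block;
            return true;
        }
    }
    return false;                   /* no visible tuple at all */
}
```

With millions of dead entries spread over thousands of pages, the loop stops after SIM_PAGE_LIMIT pages instead of reading them all, which is the entire point of the safeguard.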

Reproducer on a very small dataset follows. Please note the reproducer
here shows the effect on the first EXPLAIN run; however, in the real
observed situation (a multi-TB unpartitioned table) every consecutive
planner operation (just EXPLAIN) on that index was affected (I don't
know why LP_DEAD/hint-bit cleaning was not kicking in - maybe it was,
but given the scale of the problem it was not helping much).

-Jakub Wartak.

[1] - https://www.postgresql.org/message-id/flat/54446AE2.6080909%40BlueTreble.com#f436bb41cf044b30eeec29472a13631e
[2] - https://www.postgresql.org/message-id/flat/db7111f2-05ef-0ceb-c013-c34adf4f4121%40gmail.com
[3] - https://www.postgresql.org/message-id/flat/05C72CF7-B5F6-4DB9-8A09-5AC897653113%40yandex.ru
(SnapshotAny vs SnapshotDirty discussions between Tom and Robert)
[4] - https://www.postgresql.org/message-id/flat/CAECtzeVPM4Oi6dTdqVQmjoLkDBVChNj7ed3hNs1RGrBbwCJ7Cw%40mail.gmail.com
[5] - https://postgrespro.com/list/thread-id/2436130 (cache)

s1:
=# drop table test;
=# create table test (id bigint primary key) with (autovacuum_enabled = 'off');
=# insert into test select generate_series(1,10000000); -- ~310MB table, ~210MB index

s2/start the long running transaction:
=# begin;
=# select min(id) from test;

s1:
=# delete from test where id>1000000;
=# analyze test;
=# set enable_indexonlyscan = 'off'; -- just in case, to force BitmapHeapScans,
-- which according to backend/access/nbtree/README won't set LP_DEAD,
-- but my bt_page_items() tests indicate that it does (??)
=# set enable_indexscan = 'off';
=# explain (buffers, verbose) select * from test where id > 11000000;
=>  Planning: Buffers: shared hit=9155 read=55276 dirtied=55271 written=53617 / Time: 851.761 ms
=# explain (buffers, verbose) select * from test where id > 11000000;
=>  Planning: Buffers: shared read=24594 / Time: 49.172 ms
=# vacuum (verbose) test; => index scan needed: 39824 pages from table (90.00% of total) had 9000000 dead item identifiers removed
=# explain (buffers, verbose) select * from test where id > 11000000;
=>  Planning: Buffers: shared hit=14 read=3 / Time: 0.550 ms

with patch, the results are:
p=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: / Buffers: shared hit=17 read=6 dirtied=3 written=5 => Time: 0.253 ms
p=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: / Buffers: shared hit=11 read=2 dirtied=2 => Time: 0.576 ms
so there's no dramatic hit.

Attachment

Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Alvaro Herrera
Date:
On 2022-Nov-21, Jakub Wartak wrote:

> b) I was wondering about creating a new wait class "Planner" with the
> event "ReadingMinMaxIndex" (or similar). The obvious drawback is the
> naming/categorization as wait events are ... well.. as the name "WAIT"
> implies, while those btree lookups could easily be ON-CPU activity.

I think we should definitely do this, regardless of any other fixes we
add, so that this condition can be identified more easily.  I wonder if
we can backpatch it safely.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Always assume the user will do much worse than the stupidest thing
you can imagine."                                (Julien PUYDT)



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Robert Haas
Date:
On Mon, Nov 21, 2022 at 7:22 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Nov-21, Jakub Wartak wrote:
> > b) I was wondering about creating a new wait class "Planner" with the
> > event "ReadingMinMaxIndex" (or similar). The obvious drawback is the
> > naming/categorization as wait events are ... well.. as the name "WAIT"
> > implies, while those btree lookups could easily be ON-CPU activity.
>
> I think we should definitely do this, regardless of any other fixes we
> add, so that this condition can be identified more easily.  I wonder if
> we can backpatch it safely.

I don't think this is safe at all. Wait events can only bracket
individual operations, like the reads of the individual index blocks,
not report on which phase of a larger operation is in progress. If we
try to make them do the latter, we will have a hot mess on our hands.
It might not be a bad capability to have, but it's a different system.

But that doesn't mean we can't do anything about this problem, and I
think we should do something about this problem. It's completely crazy
that after this many years of people getting hosed by this, we haven't
taken more than half measures to fix the problem. I think commit
3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd was the last time we poked at
this, and before that there was commit
fccebe421d0c410e6378fb281419442c84759213, but neither of those
prevented us from scanning an unbounded number of index tuples before
finding one that we're willing to use, as the commit messages pretty
much admit.

What we need is a solution that avoids reading an unbounded number of
tuples under any circumstances. I previously suggested using
SnapshotAny here, but Tom didn't like that. I'm not sure if there are
safety issues there or if Tom was just concerned about the results
being misleading. Either way, maybe there's some variant on that theme
that could work. For instance, could we teach the index scan to stop
if the first 100 tuples that it finds are all invisible? Or to reach
at most 1 page, or at most 10 pages, or something? If we don't find a
match, we could either try to use a dead tuple, or we could just
return false which, I think, would end up using the value from
pg_statistic rather than any updated value. That is of course not a
great outcome, but it is WAY WAY BETTER than spending OVER AN HOUR
looking for a more suitable tuple, as Jakub describes having seen on a
production system.

I really can't understand why this is even mildly controversial. What
exactly to do here may be debatable, but the idea that it's OK to
spend an unbounded amount of resources during any phase of planning is
clearly wrong. We can say that at the time we wrote the code we didn't
know it was going to actually ever happen, and that is fine and true.
But there have been multiple reports of this over the years and we
know *for sure* that spending totally unreasonable amounts of time
inside this function is a real-world problem that actually brings down
production systems. Unless and until we find a way of putting a tight
bound on the amount of effort that can be expended here, that's going
to keep happening.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I don't think this is safe at all. Wait events can only bracket
> individual operations, like the reads of the individual index blocks,
> not report on which phase of a larger operation is in progress. If we
> try to make them do the latter, we will have a hot mess on our hands.

Agreed.

> What we need is a solution that avoids reading an unbounded number of
> tuples under any circumstances. I previously suggested using
> SnapshotAny here, but Tom didn't like that. I'm not sure if there are
> safety issues there or if Tom was just concerned about the results
> being misleading. Either way, maybe there's some variant on that theme
> that could work. For instance, could we teach the index scan to stop
> if the first 100 tuples that it finds are all invisible? Or to reach
> at most 1 page, or at most 10 pages, or something?

A hard limit on the number of index pages examined seems like it
might be a good idea.

> If we don't find a
> match, we could either try to use a dead tuple, or we could just
> return false which, I think, would end up using the value from
> pg_statistic rather than any updated value.

Yeah, the latter seems like the best bet.  Returning a definitely-dead
value could be highly misleading.  In the end this is meant to be
an incremental improvement on what we could get from pg_statistic,
so it's reasonable to limit how hard we'll work on it.

If we do install such a thing, should we undo any of the previous
changes that backed off the reliability of the result?

            regards, tom lane



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Simon Riggs
Date:
On Mon, 21 Nov 2022 at 15:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > What we need is a solution that avoids reading an unbounded number of
> > tuples under any circumstances. I previously suggested using
> > SnapshotAny here, but Tom didn't like that. I'm not sure if there are
> > safety issues there or if Tom was just concerned about the results
> > being misleading. Either way, maybe there's some variant on that theme
> > that could work. For instance, could we teach the index scan to stop
> > if the first 100 tuples that it finds are all invisible? Or to reach
> > at most 1 page, or at most 10 pages, or something?
>
> A hard limit on the number of index pages examined seems like it
> might be a good idea.

Good, that is what the patch does.

> > If we don't find a
> > match, we could either try to use a dead tuple, or we could just
> > return false which, I think, would end up using the value from
> > pg_statistic rather than any updated value.
>
> Yeah, the latter seems like the best bet.

Yes, just breaking out of the loop is enough to use the default value.

> If we do install such a thing, should we undo any of the previous
> changes that backed off the reliability of the result?

Not sure.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Robert Haas
Date:
On Mon, Nov 21, 2022 at 10:14 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> > > What we need is a solution that avoids reading an unbounded number of
> > > tuples under any circumstances. I previously suggested using
> > > SnapshotAny here, but Tom didn't like that. I'm not sure if there are
> > > safety issues there or if Tom was just concerned about the results
> > > being misleading. Either way, maybe there's some variant on that theme
> > > that could work. For instance, could we teach the index scan to stop
> > > if the first 100 tuples that it finds are all invisible? Or to reach
> > > at most 1 page, or at most 10 pages, or something?
> >
> > A hard limit on the number of index pages examined seems like it
> > might be a good idea.
>
> Good, that is what the patch does.

<looks at patch>

Oh, that's surprisingly simple. Nice!

Is there any reason to tie this into page costs? I'd be more inclined
to just make it a hard limit on the number of pages. I think that
would be more predictable and less prone to surprising (bad) behavior.
And to be honest I would be inclined to make it quite a small number.
Perhaps 5 or 10. Is there a good argument for going any higher?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Simon Riggs
Date:
On Mon, 21 Nov 2022 at 15:23, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Nov 21, 2022 at 10:14 AM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> > > > What we need is a solution that avoids reading an unbounded number of
> > > > tuples under any circumstances. I previously suggested using
> > > > SnapshotAny here, but Tom didn't like that. I'm not sure if there are
> > > > safety issues there or if Tom was just concerned about the results
> > > > being misleading. Either way, maybe there's some variant on that theme
> > > > that could work. For instance, could we teach the index scan to stop
> > > > if the first 100 tuples that it finds are all invisible? Or to reach
> > > > at most 1 page, or at most 10 pages, or something?
> > >
> > > A hard limit on the number of index pages examined seems like it
> > > might be a good idea.
> >
> > Good, that is what the patch does.
>
> <looks at patch>
>
> Oh, that's surprisingly simple. Nice!
>
> Is there any reason to tie this into page costs? I'd be more inclined
> to just make it a hard limit on the number of pages. I think that
> would be more predictable and less prone to surprising (bad) behavior.
> And to be honest I would be inclined to make it quite a small number.
> Perhaps 5 or 10. Is there a good argument for going any higher?

+1, that makes the patch smaller and the behavior more predictable.

(Just didn't want to do anything too simple, in case it looked like a kluge.)

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Is there any reason to tie this into page costs? I'd be more inclined
> to just make it a hard limit on the number of pages. I think that
> would be more predictable and less prone to surprising (bad) behavior.

Agreed, a simple limit of N pages fetched seems appropriate.

> And to be honest I would be inclined to make it quite a small number.
> Perhaps 5 or 10. Is there a good argument for going any higher?

Sure: people are not complaining until it gets into the thousands.
And you have to remember that the entire mechanism exists only
because of user complaints about inaccurate estimates.  We shouldn't
be too eager to resurrect that problem.

I'd be happy with a limit of 100 pages.

            regards, tom lane



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Robert Haas
Date:
On Mon, Nov 21, 2022 at 10:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Is there any reason to tie this into page costs? I'd be more inclined
> > to just make it a hard limit on the number of pages. I think that
> > would be more predictable and less prone to surprising (bad) behavior.
>
> Agreed, a simple limit of N pages fetched seems appropriate.
>
> > And to be honest I would be inclined to make it quite a small number.
> > Perhaps 5 or 10. Is there a good argument for going any higher?
>
> Sure: people are not complaining until it gets into the thousands.
> And you have to remember that the entire mechanism exists only
> because of user complaints about inaccurate estimates.  We shouldn't
> be too eager to resurrect that problem.
>
> I'd be happy with a limit of 100 pages.

OK.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Jakub Wartak
Date:
Hi,

Draft version of the patch attached (it is based on Simon's).
I would be happier if we could make that #define into a GUC (just in
case), although I do understand the effort to reduce the number of
knobs (their high count causes its own complexity).

-Jakub Wartak.

On Mon, Nov 21, 2022 at 4:35 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Nov 21, 2022 at 10:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > Is there any reason to tie this into page costs? I'd be more inclined
> > > to just make it a hard limit on the number of pages. I think that
> > > would be more predictable and less prone to surprising (bad) behavior.
> >
> > Agreed, a simple limit of N pages fetched seems appropriate.
> >
> > > And to be honest I would be inclined to make it quite a small number.
> > > Perhaps 5 or 10. Is there a good argument for going any higher?
> >
> > Sure: people are not complaining until it gets into the thousands.
> > And you have to remember that the entire mechanism exists only
> > because of user complaints about inaccurate estimates.  We shouldn't
> > be too eager to resurrect that problem.
> >
> > I'd be happy with a limit of 100 pages.
>
> OK.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com

Attachment

Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Andres Freund
Date:
Hi,

On 2022-11-21 17:06:16 +0100, Jakub Wartak wrote:
> @@ -6213,14 +6216,26 @@ get_actual_variable_endpoint(Relation heapRel,
>      /* Fetch first/next tuple in specified direction */
>      while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
>      {
> +        BlockNumber block = ItemPointerGetBlockNumber(tid);
>          if (!VM_ALL_VISIBLE(heapRel,
> -                            ItemPointerGetBlockNumber(tid),
> +                            block,
>                              &vmbuffer))
>          {
>              /* Rats, we have to visit the heap to check visibility */
>              if (!index_fetch_heap(index_scan, tableslot))
>                  continue;        /* no visible tuple, try next index entry */
>  
> +            {
> +                CHECK_FOR_INTERRUPTS();
> +                if (block != last_block)
> +                    visited_pages++;
> +#define VISITED_PAGES_LIMIT 100
> +                if (visited_pages > VISITED_PAGES_LIMIT)
> +                    break;
> +                else
> +                    continue; /* no visible tuple, try next index entry */
> +            }
> +
>              /* We don't actually need the heap tuple for anything */
>              ExecClearTuple(tableslot);
>  
> -- 
> 2.30.2

This can't quite be right - isn't this only applying the limit if we found a
visible tuple?

Greetings,

Andres Freund



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Justin Pryzby
Date:
This patch version runs "continue" unconditionally (rather than
conditionally, like the previous version).

                        if (!index_fetch_heap(index_scan, tableslot))
                                continue;               /* no visible tuple, try next index entry */
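To make the ordering bug concrete, here is a standalone sketch (invented names, a deliberately tiny limit, not the actual patch) contrasting the two placements of the limit check relative to the unconditional continue: with the check after the continue, an all-dead input never hits the limit at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_LIMIT 3     /* deliberately tiny so the difference is visible */

/* Limit check placed AFTER the continue: dead entries skip it entirely,
 * so an all-dead index is scanned to the end (the reported bug). */
static size_t
scan_limit_after_continue(const unsigned *block, const bool *visible, size_t n)
{
    unsigned last = (unsigned) -1;
    int      visited = 0;
    size_t   examined = 0;

    for (size_t i = 0; i < n; i++)
    {
        examined++;
        if (!visible[i])
            continue;               /* dead tuple: the check below is skipped */
        if (block[i] != last)
        {
            visited++;
            last = block[i];
        }
        if (visited > SIM_LIMIT)
            break;
    }
    return examined;
}

/* Limit check placed BEFORE the continue: dead entries count against the
 * page budget, so the scan stops early no matter what. */
static size_t
scan_limit_before_continue(const unsigned *block, const bool *visible, size_t n)
{
    unsigned last = (unsigned) -1;
    int      visited = 0;
    size_t   examined = 0;

    for (size_t i = 0; i < n; i++)
    {
        examined++;
        if (block[i] != last)
        {
            visited++;
            last = block[i];
        }
        if (visited > SIM_LIMIT)
            break;                  /* budget applies to dead tuples too */
        if (!visible[i])
            continue;
        break;                      /* visible tuple found: done */
    }
    return examined;
}
```

Both functions return the number of index entries examined; on an all-dead input with one entry per page, the first scans everything while the second stops right after the budget is exceeded.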




Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Robert Haas
Date:
On Mon, Nov 21, 2022 at 12:30 PM Andres Freund <andres@anarazel.de> wrote:
> This can't quite be right - isn't this only applying the limit if we found a
> visible tuple?

It doesn't look that way to me, but perhaps I'm just too dense to see
the problem?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> This can't quite be right - isn't this only applying the limit if we found a
> visible tuple?

What it's restricting is the number of heap page fetches, which
might be good enough.  We don't have a lot of visibility here
into how many index pages were scanned before returning the next
not-dead index entry, so I'm not sure how hard it'd be to do better.

            regards, tom lane



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Robert Haas
Date:
On Mon, Nov 21, 2022 at 12:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > This can't quite be right - isn't this only applying the limit if we found a
> > visible tuple?
>
> What it's restricting is the number of heap page fetches, which
> might be good enough.  We don't have a lot of visibility here
> into how many index pages were scanned before returning the next
> not-dead index entry, so I'm not sure how hard it'd be to do better.

Oh. That's really sad. Because I think the whole problem here is that
the number of dead index entries can be huge.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 21, 2022 at 12:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What it's restricting is the number of heap page fetches, which
>> might be good enough.  We don't have a lot of visibility here
>> into how many index pages were scanned before returning the next
>> not-dead index entry, so I'm not sure how hard it'd be to do better.

> Oh. That's really sad. Because I think the whole problem here is that
> the number of dead index entries can be huge.

If they're *actually* dead, we have enough mitigations in place I think,
as explained by the long comment in get_actual_variable_endpoint.
The problem here is with still-visible-to-somebody tuples.  At least,
Jakub's test case sets it up that way.

            regards, tom lane



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Andres Freund
Date:
Hi,

On November 21, 2022 9:37:34 AM PST, Robert Haas <robertmhaas@gmail.com> wrote:
>On Mon, Nov 21, 2022 at 12:30 PM Andres Freund <andres@anarazel.de> wrote:
>> This can't quite be right - isn't this only applying the limit if we found a
>> visible tuple?
>
>It doesn't look that way to me, but perhaps I'm just too dense to see
>the problem?

The earlier version didn't have the issue, but the latest seems to only limit after a visible tuple has been found.
Note the continue; when fetching a heap tuple fails.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Simon Riggs
Date:
On Mon, 21 Nov 2022 at 18:17, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On November 21, 2022 9:37:34 AM PST, Robert Haas <robertmhaas@gmail.com> wrote:
> >On Mon, Nov 21, 2022 at 12:30 PM Andres Freund <andres@anarazel.de> wrote:
> >> This can't quite be right - isn't this only applying the limit if we found a
> >> visible tuple?
> >
> >It doesn't look that way to me, but perhaps I'm just too dense to see
> >the problem?
>
> The earlier version didn't have the issue, but the latest seems to only limit after a visible tuple has been found.
> Note the continue; when fetching a heap tuple fails.
 

Agreed, resolved in this version.


Robert, something like this perhaps? limit on both the index and the heap.

-- 
Simon Riggs                http://www.EnterpriseDB.com/

Attachment

Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Andres Freund
Date:
Hi,

On November 21, 2022 10:44:17 AM PST, Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>Robert, something like this perhaps? limit on both the index and the heap.

I don't think we should add additional code / struct members into very common good paths for these limits.

I don't really understand the point of limiting in the index - where would the large number of pages accessed come
from?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On November 21, 2022 10:44:17 AM PST, Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>> Robert, something like this perhaps? limit on both the index and the heap.

> I don't think we should add additional code / struct members into very common good paths for these limits.

Yeah, I don't like that either: for one thing, it seems completely
unsafe to back-patch.

            regards, tom lane



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Robert Haas
Date:
On Mon, Nov 21, 2022 at 1:17 PM Andres Freund <andres@anarazel.de> wrote:
> On November 21, 2022 9:37:34 AM PST, Robert Haas <robertmhaas@gmail.com> wrote:
> >On Mon, Nov 21, 2022 at 12:30 PM Andres Freund <andres@anarazel.de> wrote:
> >> This can't quite be right - isn't this only applying the limit if we found a
> >> visible tuple?
> >
> >It doesn't look that way to me, but perhaps I'm just too dense to see
> >the problem?
>
> The earlier version didn't have the issue, but the latest seems to only limit after a visible tuple has been found.
> Note the continue; when fetching a heap tuple fails.
 

Oh, that line was removed in Simon's patch but not in Jakub's version,
I guess. Jakub's version also leaves out the last_block = block line
which seems pretty critical.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Robert Haas
Date:
On Mon, Nov 21, 2022 at 2:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On November 21, 2022 10:44:17 AM PST, Simon Riggs <simon.riggs@enterprisedb.com> wrote:
> >> Robert, something like this perhaps? limit on both the index and the heap.
>
> > I don't think we should add additional code / struct members into very common good paths for these limits.
>
> Yeah, I don't like that either: for one thing, it seems completely
> unsafe to back-patch.

I have mixed feelings about back-patching this. On the one hand, the
lack of any limit here can have *really* bad consequences. On the
other hand, it's also possible to imagine someone getting hosed by the
fix. No matter what limit we enforce, someone could in theory get a
much better estimate by searching just a little further.

I agree that adding members to IndexScanDescData doesn't seem very
appealing, but I'm still trying to wrap my head around what exposure
that creates. In the patches thus far, we're basically counting the
number of times that we get an index tuple that's not on the same page
as the previous index tuple. That's not quite the same thing as the
number of heap pages, because we might be just ping-ponging back and
forth between 2 pages and it looks the same. I'm not sure that's a
problem, but it's something to note. If the limit is 100 pages, then
we'll try to fetch at least 100 index tuples, if they're all on
different pages, and perhaps as much as two orders of magnitude more,
if they're all on the same page. That doesn't seem too bad, because we
won't really be doing 100 times as much work. Following 10,000 index
tuples that all point at the same 100 heap pages is probably more work
than following 100 index tuples that each point to a separate heap
page, but it's not anywhere near 100x as much work, especially if real
I/O is involved. All in all, my first reaction is to think that it
sounds fairly OK.

The real sticky wicket is that we don't know how dense the index is.
In a non-pathological case, we expect to find quite a few index tuples
in each index page, so if we're fetching 100 heap pages the number of
index pages fetched is probably much less, like 1 or 2. And even if
we're fetching 10000 index tuples to visit 100 heap pages, they should
still be concentrated in a relatively reasonable number of index
pages. But ... what if they're not? Could the index contain a large
number of pages containing just 1 tuple each, or no tuples at all? If
so, maybe we can read ten bazillion index pages trying to find each
heap tuple and still end up in trouble.
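The ping-pong scenario above is easy to demonstrate in isolation: counting block-number transitions (which is what the counting scheme effectively does) can exceed the number of distinct heap pages by an arbitrary factor. A standalone sketch with invented names:

```c
#include <assert.h>
#include <stddef.h>

/* Count the way the patches count: increment whenever the block number
 * differs from the previous one. */
static int
count_transitions(const unsigned *blocks, size_t n)
{
    unsigned last = (unsigned) -1;
    int count = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (blocks[i] != last)
        {
            count++;
            last = blocks[i];
        }
    }
    return count;
}

/* Count distinct blocks, for comparison (naive small-domain version). */
static int
count_distinct(const unsigned *blocks, size_t n, unsigned max_block)
{
    int count = 0;

    for (unsigned b = 0; b <= max_block; b++)
    {
        for (size_t i = 0; i < n; i++)
        {
            if (blocks[i] == b)
            {
                count++;
                break;
            }
        }
    }
    return count;
}
```

For a monotonic endpoint scan the two counts agree; it is only pathological back-and-forth TID orderings that make the transition count overshoot, which is why this is noted as a caveat rather than a blocker.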

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Andres Freund
Date:
Hi,

On 2022-11-21 16:17:56 -0500, Robert Haas wrote:
> But ... what if they're not? Could the index contain a large number of
> pages containing just 1 tuple each, or no tuples at all? If so, maybe
> we can read ten bazillion index pages trying to find each heap tuple
> and still end up in trouble.

ISTM that if you have an index in such a poor condition that a single
value lookup reads thousands of pages inside the index, planner
estimates taking long is going to be the smallest of your worries...

Greetings,

Andres Freund



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-11-21 16:17:56 -0500, Robert Haas wrote:
>> But ... what if they're not? Could the index contain a large number of
>> pages containing just 1 tuple each, or no tuples at all? If so, maybe
>> we can read ten bazillion index pages trying to find each heap tuple
>> and still end up in trouble.

> ISTM that if you have an index in such a poor condition that a single
> value lookup reads thousands of pages inside the index, planner
> estimates taking long is going to be the smallest of your worries...

Yeah, that sort of situation is going to make any operation on the
index slow, not only get_actual_variable_endpoint().

I think we should content ourselves with improving the demonstrated
case, which is where we're forced to do a lot of heap fetches due
to lots of not-all-visible tuples.  Whether we can spend a lot of
time scanning the index without ever finding a tuple at all seems
hypothetical.  Without more evidence of a real problem, I do not
wish to inject warts as horrid as this one into the index AM API.

            regards, tom lane



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Robert Haas
Date:
On Mon, Nov 21, 2022 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think we should content ourselves with improving the demonstrated
> case, which is where we're forced to do a lot of heap fetches due
> to lots of not-all-visible tuples.  Whether we can spend a lot of
> time scanning the index without ever finding a tuple at all seems
> hypothetical.  Without more evidence of a real problem, I do not
> wish to inject warts as horrid as this one into the index AM API.

All right. I've been bitten by this problem enough that I'm a little
gun-shy about accepting anything that doesn't feel like a 100%
solution, but I admit that the scenario I described does seem a little
bit far-fetched.

I won't be completely shocked if somebody finds a way to hit it, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 21, 2022 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think we should content ourselves with improving the demonstrated
>> case, which is where we're forced to do a lot of heap fetches due
>> to lots of not-all-visible tuples.

> All right. I've been bitten by this problem enough that I'm a little
> gun-shy about accepting anything that doesn't feel like a 100%
> solution, but I admit that the scenario I described does seem a little
> bit far-fetched.
> I won't be completely shocked if somebody finds a way to hit it, though.

Well, if we see a case where the time is indeed spent completely
within the index AM, then we'll have to do something more or less
like what Simon sketched.  But I don't want to go there without
evidence that it's a live problem.  API warts are really hard to
get rid of once instituted.

            regards, tom lane



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Robert Haas
Date:
On Mon, Nov 21, 2022 at 6:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> evidence that it's a live problem.  API warts are really hard to
> get rid of once instituted.

Yeah, agreed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Jakub Wartak
Date:
Hi all,

apologies the patch was rushed too quickly - my bad. I'm attaching a
fixed one as v0004 (as it is the 4th patch floating around here).

-Jakub Wartak

On Mon, Nov 21, 2022 at 9:55 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Nov 21, 2022 at 1:17 PM Andres Freund <andres@anarazel.de> wrote:
> > On November 21, 2022 9:37:34 AM PST, Robert Haas <robertmhaas@gmail.com> wrote:
> > >On Mon, Nov 21, 2022 at 12:30 PM Andres Freund <andres@anarazel.de> wrote:
> > >> This can't quite be right - isn't this only applying the limit if we found a
> > >> visible tuple?
> > >
> > >It doesn't look that way to me, but perhaps I'm just too dense to see
> > >the problem?
> >
> > The earlier version didn't have the issue, but the latest seems to
> > only limit after a visible tuple has been found. Note the continue;
> > when fetching a heap tuple fails.
>
> Oh, that line was removed in Simon's patch but not in Jakub's version,
> I guess. Jakub's version also leaves out the last_block = block line
> which seems pretty critical.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com

Attachment

Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Simon Riggs
Date:
On Mon, 21 Nov 2022 at 23:34, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Nov 21, 2022 at 6:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > evidence that it's a live problem.  API warts are really hard to
> > get rid of once instituted.
>
> Yeah, agreed.

Agreed, happy not to; that version was just a WIP to see how it might work.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Simon Riggs
Date:
On Mon, 21 Nov 2022 at 22:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-11-21 16:17:56 -0500, Robert Haas wrote:
> >> But ... what if they're not? Could the index contain a large number of
> >> pages containing just 1 tuple each, or no tuples at all? If so, maybe
> >> we can read ten bazillion index pages trying to find each heap tuple
> >> and still end up in trouble.
>
> > ISTM that if you have an index in such a poor condition that a single
> > value lookup reads thousands of pages inside the index, planner
> > estimates taking long is going to be the smallest of your worries...
>
> Yeah, that sort of situation is going to make any operation on the
> index slow, not only get_actual_variable_endpoint().

That was also my conclusion: this is actually a common antipattern for
our indexes, not anything specific to the planner.

In another recent situation, I saw a very bad case of performance for
a "queue table". In that use case the rows are inserted at head and
removed from tail. Searching for the next item to be removed from the
queue involves an increasingly long tail search, in the case that a
long running transaction prevents us from marking the index entries
killed. Many tables exhibit such usage, for example, the NEW_ORDER
table in TPC-C.

We optimized the case of frequent insertions into the rightmost index
page; now we also need to optimize the case of a long series of
deletions from the leftmost index pages. Not sure how, just framing
the problem.


--
Simon Riggs                http://www.EnterpriseDB.com/



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Simon Riggs
Date:
On Mon, 21 Nov 2022 at 20:55, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Nov 21, 2022 at 1:17 PM Andres Freund <andres@anarazel.de> wrote:
> > On November 21, 2022 9:37:34 AM PST, Robert Haas <robertmhaas@gmail.com> wrote:
> > >On Mon, Nov 21, 2022 at 12:30 PM Andres Freund <andres@anarazel.de> wrote:
> > >> This can't quite be right - isn't this only applying the limit if we found a
> > >> visible tuple?
> > >
> > >It doesn't look that way to me, but perhaps I'm just too dense to see
> > >the problem?
> >
> > The earlier version didn't have the issue, but the latest seems to
> > only limit after a visible tuple has been found. Note the continue;
> > when fetching a heap tuple fails.
>
> Oh, that line was removed in Simon's patch but not in Jakub's version,
> I guess. Jakub's version also leaves out the last_block = block line
> which seems pretty critical.

New patch version reporting for duty, sir. Please take it from here!

-- 
Simon Riggs                http://www.EnterpriseDB.com/

Attachment

Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Tom Lane
Date:
Simon Riggs <simon.riggs@enterprisedb.com> writes:
> New patch version reporting for duty, sir. Please take it from here!

Why the CHECK_FOR_INTERRUPTS?  I'd supposed that there's going to be
one somewhere down inside the index or heap access --- do you have
reason to think there isn't?

Is it appropriate to count distinct pages, rather than just the
number of times we have to visit a heap tuple?  That seems to
complicate the logic a good deal, and I'm not sure it's buying
much, especially since (as you noted) it's imprecise anyway.

            regards, tom lane



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Robert Haas
Date:
On Tue, Nov 22, 2022 at 11:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon.riggs@enterprisedb.com> writes:
> > New patch version reporting for duty, sir. Please take it from here!
>
> Why the CHECK_FOR_INTERRUPTS?  I'd supposed that there's going to be
> one somewhere down inside the index or heap access --- do you have
> reason to think there isn't?
>
> Is it appropriate to count distinct pages, rather than just the
> number of times we have to visit a heap tuple?  That seems to
> complicate the logic a good deal, and I'm not sure it's buying
> much, especially since (as you noted) it's imprecise anyway.

FWIW, the same question also occurred to me. But after mulling it over,
what Simon did seems kinda reasonable to me. Although it's imprecise,
it will generally cause us to stop sooner if we're bouncing all over
the heap and be willing to explore further if we're just hitting the
same heap page. I feel like that's pretty reasonable behavior.
Stopping early could hurt, so if we know that continuing isn't costing
much, why not?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Nov 22, 2022 at 11:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Is it appropriate to count distinct pages, rather than just the
>> number of times we have to visit a heap tuple?  That seems to
>> complicate the logic a good deal, and I'm not sure it's buying
>> much, especially since (as you noted) it's imprecise anyway.

> FWW, the same question also occurred to me. But after mulling it over,
> what Simon did seems kinda reasonable to me. Although it's imprecise,
> it will generally cause us to stop sooner if we're bouncing all over
> the heap and be willing to explore further if we're just hitting the
> same heap page. I feel like that's pretty reasonable behavior.
> Stopping early could hurt, so if we know that continuing isn't costing
> much, why not?

Fair I guess --- and I did say that I wanted it to be based on number
of pages visited not number of tuples.  So objection withdrawn to that
aspect.

Still wondering if there's really no CHECK_FOR_INTERRUPT anywhere
else in this loop.

            regards, tom lane



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Tom Lane
Date:
I wrote:
> Still wondering if there's really no CHECK_FOR_INTERRUPT anywhere
> else in this loop.

I did some experimentation using the test case Jakub presented
to start with, and verified that that loop does respond promptly
to control-C even in HEAD.  So there are CFI(s) in the loop as
I thought, and we don't need another.

What we do need is some more work on nearby comments.  I'll
see about that and push it.

            regards, tom lane



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Robert Haas
Date:
On Tue, Nov 22, 2022 at 1:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > Still wondering if there's really no CHECK_FOR_INTERRUPT anywhere
> > else in this loop.
>
> I did some experimentation using the test case Jakub presented
> to start with, and verified that that loop does respond promptly
> to control-C even in HEAD.  So there are CFI(s) in the loop as
> I thought, and we don't need another.

OK. Although an extra CFI isn't such a bad thing, either.

> What we do need is some more work on nearby comments.  I'll
> see about that and push it.

Great!

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Damage control for planner's get_actual_variable_endpoint() runaway

From
Simon Riggs
Date:
On Tue, 22 Nov 2022 at 18:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > Still wondering if there's really no CHECK_FOR_INTERRUPT anywhere
> > else in this loop.
>
> I did some experimentation using the test case Jakub presented
> to start with, and verified that that loop does respond promptly
> to control-C even in HEAD.  So there are CFI(s) in the loop as
> I thought, and we don't need another.

Thanks for checking. Sorry for not responding earlier.

> What we do need is some more work on nearby comments.  I'll
> see about that and push it.

Thanks; nicely streamlined.

-- 
Simon Riggs                http://www.EnterpriseDB.com/