Thread: Weird planner issue on a standby
Hello,
My customer has a really weird planner issue on a standby.
First, the context. There's a primary and two standbys, all running release 11.8. (I know this isn't the latest release of the v11 branch.) The cluster is 5.5TB. The PostgreSQL configuration is the same on all servers. pg_db_role_setting is almost empty, and there's nothing specific to planning or replication.
Here is the issue: queries on both standbys take a lot more time than on the primary. We eventually found that the queries take a long time to plan, not to execute. For example:
On the primary:
Planning Time: 1806.574 ms
Execution Time: 771.888 ms
On any of the standbys:
Planning Time: 41566.439 ms
Execution Time: 1125.352 ms
A simple EXPLAIN already shows this difference in duration.
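Note that a plain EXPLAIN plans the query without executing it, and since v10 the SUMMARY option makes it print the planning-time line on its own. A minimal, self-contained sketch (the table here is a hypothetical stand-in for the real, anonymized schema):

CREATE TABLE t1 (c1 text, c2 text, c3 date, c4 date);

-- plans only, never executes; SUMMARY ON adds the planning-time line
EXPLAIN (SUMMARY ON)
SELECT * FROM t1
WHERE c1 = 'x' AND c2 = 'y' AND c3 >= '2020-10-10' AND c4 <= '2022-10-10';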
The query in itself isn't really that big: 8 tables (no partitioning) and a few predicates. Nothing fancy, nothing huge to plan. Here it is, a bit anonymized:
select ...
from t1
left outer join t2 on ...
left outer join t3 on ...
left outer join t4 on ...
left outer join t5 on ...
left outer join t6 on ...
left outer join t7 on ...
left outer join t8 on ...
where c1='...' and c2='...'
and c3>='...' and c4<='...' ;
Join conditions are really simple, and no functions are called anywhere.
Plans on the three servers are exactly the same: same nodes, same statistics, same cost estimates.
Moreover, we looked into what the planner was doing with strace, and we were surprised to see massive reads on tables (more than 130,000 calls to read() on 15 files). Most of these files belong to one of the tables in the query. Also, this massive reading happens only on the standbys, not on the primary.
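To confirm that the reads happen during planning rather than execution, one can snapshot the heap I/O counters around a plain EXPLAIN; any increase is attributable to the planner. A sketch reusing the hypothetical t1 above (counters are per-node and include other sessions' activity, so this is best run on a quiet system):

SELECT heap_blks_read, heap_blks_hit
FROM pg_statio_user_tables WHERE relname = 't1';

EXPLAIN SELECT * FROM t1 WHERE c3 >= '2020-10-10';  -- plans only

SELECT heap_blks_read, heap_blks_hit
FROM pg_statio_user_tables WHERE relname = 't1';    -- the diff is planner I/O on the heap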
Our customer took the time to drop and recreate the standbys yesterday night, but the new standbys still have the issue.
And finally, we discovered that promoting a standby brings the old performance back: the promoted standby performs just like the primary.
To be honest, I'm lost, I don't know what to do next. But I have some questions :)
* The planner seems to read tables to generate plans and/or select the right plan. Why does it do that? I thought it only read indexes when needed, but it quite clearly reads tables as well.
* How can replication have an impact on the planner? (I'm not sure I'm asking the right question... I mean, why do I see a difference in behaviour between a standby and that same server once promoted and autonomous?)
* Do you have any idea how to fix this?
Thank you.
Regards.
--
Guillaume.
Guillaume Lelarge <guillaume@lelarge.info> writes:
> * The planner seems to read tables to generate plans and/or select the
> right plan. Why does it do that? I thought it only read indexes when
> needed, but it quite clearly reads tables as well.

I'm guessing to some extent, but it seems plausible that this is an
artifact of the logic that tries to get the current min and/or max column
values to be sure we are estimating mergejoin costs accurately. Normally,
yeah, that is only supposed to consult the extremal values in indexes and
therefore not take very long. However, we've had to go back to the
drawing board several times to fix cases where it made a lot of expensive
table accesses because the extremal index entries were of uncertain
liveness. That behavior could very easily depend on whether you're in a
primary or standby server.

Are there any tables in this query where extremal values of the join
key are likely to be in recently-added or recently-dead rows? Does
VACUUM'ing on the primary help?

And, not to put too fine a point on it, but does updating to a recent
minor release help? In a quick scan of the git history I don't see any
related bugfixes later than 11.5, but I might've missed something.

			regards, tom lane
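In SQL terms, the probe Tom describes behaves roughly like a pair of index-endpoint lookups; a conceptual sketch, reusing the hypothetical t1 from above and assuming an index on the join key (this is not the literal planner code path):

CREATE INDEX ON t1 (c3);
SELECT min(c3) FROM t1;  -- normally answered by the leftmost live index entry
SELECT max(c3) FROM t1;  -- ...and the rightmost; if many entries at either end
                         -- are of uncertain liveness, each one must be checked
                         -- against the heap, which is where the reads come from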
On 2022-Oct-11, Tom Lane wrote:
> Are there any tables in this query where extremal values of the join
> key are likely to be in recently-added or recently-dead rows? Does
> VACUUM'ing on the primary help?

I remember having a hypothesis, upon getting a report of this exact
problem on a customer system once, that it could be due to killtuple not
propagating to standbys except by FPIs. I do not remember if we proved
that true or not. I do not remember observing that tables were being
read, however.

--
Álvaro Herrera        PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera <alvherre@alvh.no-ip.org> a écrit :
> On 2022-Oct-11, Tom Lane wrote:
>> Are there any tables in this query where extremal values of the join
>> key are likely to be in recently-added or recently-dead rows? Does
>> VACUUM'ing on the primary help?
>
> I remember having a hypothesis, upon getting a report of this exact
> problem on a customer system once, that it could be due to killtuple not
> propagating to standbys except by FPIs. I do not remember if we proved
> that true or not. I do not remember observing that tables were being
> read, however.
Thanks for your answers.
The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have no idea of the actual use of c3, but rows with c3 at '2022-10-10' (which is yesterday) were most probably added recently. I can ask my customer if you want, but this looks like a pretty safe bet.
On the VACUUM question, I didn't say, but we're wondering if the database is lacking a recent-enough VACUUM. So they're running a VACUUM tonight on the database (and especially on the 1.6TB table that is part of the query). I'm kind of skeptical, because if a missing VACUUM hurt the standbys, it should have hurt the primary too.
Actually, there are two things that really bug me:
* why the difference between the primary and both standbys?
* why now? (it worked great until this weekend, and the only thing I know happened in between is a batch delete on Sunday... which may be a good-enough reason for things to get screwed up, but once again, why only on both standbys?)
Julien Rouhaud also told me about killtuples, but I have no idea what they are. I suppose this is different from dead tuples. Anyway, if you can enlighten me, I'll be happy :)
Guillaume.
On Tue, Oct 11, 2022 at 07:42:55PM +0200, Guillaume Lelarge wrote:
> Actually, there are two things that really bug me:
> * why the difference between the primary and both standbys?
> * why now? (it worked great until this weekend, and the only thing I know
> happened in between is a batch delete on Sunday... which may be a
> good-enough reason for things to get screwed up, but once again, why only
> on both standbys?)
>
> Julien Rouhaud also told me about killtuples, but I have no idea what they
> are. I suppose this is different from dead tuples. Anyway, if you can
> enlighten me, I'll be happy :)

That's an optimisation where an index scan can mark an index entry as dead
(LP_DEAD) if it tries to fetch some data from the heap that turns out to be
all dead, so further scans won't have to check again (you can grep
kill_prior_tuple in the source for more details). As that's a hint bit, it
may not be replicated unless you enable wal_log_hints or data_checksums (or
write it as an FPI indeed), which could explain the discrepancy between the
primary (after a first slow index scan) and standby nodes.

But since your customer recreated their standbys from scratch *after* that
delete, all the nodes should have those hint bits set (Guillaume confirmed
off-list that they used a fresh BASE_BACKUP). Note that Guillaume also
confirmed off-list that the customer has checksums enabled, which means that
MarkBufferDirtyHint() should be guaranteed to mark the buffers as dirty, so
I'm out of ideas to explain the different behavior on standbys.
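The kill_prior_tuple effect Julien describes is easy to see on a primary with a hypothetical, self-contained demo like this (assuming no long-lived snapshots are held and autovacuum doesn't get there first):

CREATE TABLE demo (id int PRIMARY KEY);
INSERT INTO demo SELECT g FROM generate_series(1, 1000000) g;
DELETE FROM demo WHERE id > 900000;  -- lots of dead tuples at the index's right end

-- the first scan visits the heap, finds the tuples dead to everyone,
-- and sets LP_DEAD on the corresponding index entries...
EXPLAIN (ANALYZE, BUFFERS) SELECT max(id) FROM demo;
-- ...so the second, identical scan reads far fewer buffers
EXPLAIN (ANALYZE, BUFFERS) SELECT max(id) FROM demo;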
On 10/11/22 22:35, Julien Rouhaud wrote:
> But since your customer recreated their standbys from scratch *after* that
> delete, all the nodes should have those hint bits set (Guillaume confirmed
> off-list that they used a fresh BASE_BACKUP). Note that Guillaume also
> confirmed off-list that the customer has checksums enabled, which means that
> MarkBufferDirtyHint() should be guaranteed to mark the buffers as dirty, so
> I'm out of ideas to explain the different behavior on standbys.

Would EXPLAIN (VERBOSE, COSTS, FORMAT JSON) run on both nodes help show any
differences?

--
Angular momentum makes the world go 'round.
Julien Rouhaud <rjuju123@gmail.com> writes:
> But since your customer recreated their standbys from scratch *after* that
> delete, all the nodes should have those hint bits set (Guillaume confirmed
> off-list that they used a fresh BASE_BACKUP). Note that Guillaume also
> confirmed off-list that the customer has checksums enabled, which means that
> MarkBufferDirtyHint() should be guaranteed to mark the buffers as dirty, so
> I'm out of ideas to explain the different behavior on standbys.

Do we propagate visibility-map bits to standbys?

			regards, tom lane
On Tue, Oct 11, 2022 at 9:27 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I remember having a hypothesis, upon getting a report of this exact
> problem on a customer system once, that it could be due to killtuple not
> propagating to standbys except by FPIs. I do not remember if we proved
> that true or not. I do not remember observing that tables were being
> read, however.

That's true, but it doesn't matter whether or not there are LP_DEAD bits
set on the standby, since in any case they cannot be trusted when in Hot
Standby mode. IndexScanDescData.ignore_killed_tuples will be set to false
on the standby.

--
Peter Geoghegan
On Tue, Oct 11, 2022 at 10:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Do we propagate visibility-map bits to standbys?

Yes.

--
Peter Geoghegan
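For what it's worth, the visibility map can be inspected directly with the pg_visibility extension; a sketch against the hypothetical demo table from above (the extension must be created on the primary, since a hot standby rejects DDL, but the query itself runs fine on a standby):

CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- count pages by all-visible status; matching output on primary and
-- standby is consistent with VM bits being replicated
SELECT all_visible, count(*)
FROM pg_visibility_map('demo'::regclass)
GROUP BY all_visible;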
Le mer. 12 oct. 2022 à 06:08, Ron <ronljohnsonjr@gmail.com> a écrit :
> Would EXPLAIN (VERBOSE, COSTS, FORMAT JSON) run on both nodes help show
> any differences?
No differences.
--
Guillaume.
Le mar. 11 oct. 2022 à 19:42, Guillaume Lelarge <guillaume@lelarge.info> a écrit :
> On the VACUUM question, I didn't say, but we're wondering if the database
> is lacking a recent-enough VACUUM. So they're running a VACUUM tonight on
> the database (and especially on the 1.6TB table that is part of the
> query). I'm kind of skeptical, because if a missing VACUUM hurt the
> standbys, it should have hurt the primary too.
It appears that I was wrong. I just got an email from my customer saying they got their performance back after a VACUUM on the two main tables of the query. I'll have them on the phone in about an hour, so I'll probably know more then. I'm still wondering why it was an issue on the standbys and not on the primary. VACUUM cleans up tables and indexes, and that activity goes through WAL, doesn't it?
--
Guillaume.
Le mer. 12 oct. 2022 à 08:56, Guillaume Lelarge <guillaume@lelarge.info> a écrit :
> It appears that I was wrong. I just got an email from my customer saying
> they got their performance back after a VACUUM on the two main tables of
> the query. I'll have them on the phone in about an hour, so I'll probably
> know more then. I'm still wondering why it was an issue on the standbys
> and not on the primary. VACUUM cleans up tables and indexes, and that
> activity goes through WAL, doesn't it?
Just finished my phone call. So, they definitely have their performance back. All they did was a VACUUM on two tables.
If I understand correctly, during "normal" operations, some information is stored on the primary and sent to the standbys. For some reason, only the primary takes it into account; the standbys ignore it. That would explain why, when we promoted a standby without doing anything else, it got much better performance. VACUUM fixes the issue on a standby, probably by storing this information in a different way. After a VACUUM, the standbys stop ignoring this information, which brings the performance back.
That sounds like a plausible explanation. I still have questions if you don't mind:
* what is this information?
* where is it stored? my guess would be indexes
* why is it ignored on standbys and used on primary?
We didn't talk much about releases, so I guess the "standby-ignores-some-information" behaviour is present in all currently available releases?
Thank you.
--
Guillaume.
On Wed, 2022-10-12 at 10:51 +0200, Guillaume Lelarge wrote:
> Just finished my phone call. So, they definitely have their performance
> back. All they did was a VACUUM on two tables.
>
> That sounds like a plausible explanation. I still have questions if you
> don't mind:
> * what is this information?
> * where is it stored? my guess would be indexes
> * why is it ignored on standbys and used on primary?

That sounds indeed like killed (LP_DEAD) index tuples on the primary.
Peter says they are ignored on the standby anyway, so on the standby
PostgreSQL went through a lot of index entries pointing to dead table
tuples, and it took a long time to find the maximal entry in the table,
which is done by the optimizer.

VACUUM removed those dead tuples and their associated index entries on
both primary and standby.

Yours,
Laurenz Albe
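Continuing the hypothetical demo from earlier in the thread, the effect Laurenz describes is easy to reproduce: once VACUUM has removed the dead tuples and their index entries (cleanup that is WAL-logged and replayed on standbys), the endpoint probe becomes cheap again on every node:

VACUUM demo;  -- removes dead heap tuples and their index entries; replicated via WAL
-- now cheap on the primary and on standbys alike
EXPLAIN (ANALYZE, BUFFERS) SELECT max(id) FROM demo;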
Peter Geoghegan <pg@bowt.ie> writes:
> That's true, but it doesn't matter whether or not there are LP_DEAD
> bits set on the standby, since in any case they cannot be trusted when
> in Hot Standby mode. IndexScanDescData.ignore_killed_tuples will be
> set to false on the standby.

Hmm. I think that might break this argument in get_actual_variable_endpoint:

 * A crucial point here is that SnapshotNonVacuumable, with
 * GlobalVisTestFor(heapRel) as horizon, yields the inverse of the
 * condition that the indexscan will use to decide that index entries are
 * killable (see heap_hot_search_buffer()). Therefore, if the snapshot
 * rejects a tuple (or more precisely, all tuples of a HOT chain) and we
 * have to continue scanning past it, we know that the indexscan will mark
 * that index entry killed. That means that the next
 * get_actual_variable_endpoint() call will not have to re-consider that
 * index entry. In this way we avoid repetitive work when this function
 * is used a lot during planning.

However, that doesn't explain the downthread report that a VACUUM on the
primary fixed it. What I suspect is that that caused some in-fact-dead
index entries to get cleaned out.

But ... if the primary is allowed to vacuum away an index entry that it
thinks is dead, exactly what is the point of making standbys ignore
LP_DEAD bits? There's no additional interlock that guarantees the tuple
will be there at all.

			regards, tom lane
On Wed, Oct 12, 2022 at 6:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, that doesn't explain the downthread report that a
> VACUUM on the primary fixed it. What I suspect is that that
> caused some in-fact-dead index entries to get cleaned out.

Seems likely.

> But ... if the primary is allowed to vacuum away an index
> entry that it thinks is dead, exactly what is the point of
> making standbys ignore LP_DEAD bits? There's no additional
> interlock that guarantees the tuple will be there at all.

The interlock doesn't really protect the leaf page or its index tuples so
much as the referenced TIDs themselves. In other words it's a TID
recycling interlock.

That's why we don't need a cleanup lock to perform index tuple deletions,
even though the WAL records for those are almost identical to the WAL
records used by index vacuuming (in the case of nbtree the only difference
is the extra latestRemovedXid field in the deletion variant WAL record).
We know that there is no VACUUM process involved, and no question of heap
vacuuming going ahead for the same TIDs once index vacuuming is allowed to
complete.

We can get away with not having the interlock at all in the case of nbtree
index scans with MVCC snapshots -- but *not* with index-only scans. See
"Making concurrent TID recycling safe" in the nbtree README. I only got
around to documenting all of the details here quite recently. The
index-only scan thing dates back to 9.5.

--
Peter Geoghegan