Thread: Weird planner issue on a standby
Hello,
My customer has a really weird planner issue on a standby.
First, the context. There's a primary and two standbys, all running release 11.8. (I know this isn't the latest release of the v11 branch.) The cluster is 5.5TB. The PostgreSQL configuration is the same on all servers. pg_db_role_setting is almost empty, and there's nothing specific to planning or replication.
Here is the issue: queries on both standbys take a lot more time than on the primary. We eventually found that the queries take a long time to plan, not to execute. For example:
On the primary:
Planning Time: 1806.574 ms
Execution Time: 771.888 ms
On any of the standbys:
Planning Time: 41566.439 ms
Execution Time: 1125.352 ms
A simple EXPLAIN already shows this difference in duration.
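Note that a plain EXPLAIN plans the query without executing it, and since v10 the SUMMARY option makes it print the planning-time line on its own. A minimal, self-contained sketch (the table here is a hypothetical stand-in for the real, anonymized schema):

CREATE TABLE t1 (c1 text, c2 text, c3 date, c4 date);

-- plans only, never executes; SUMMARY ON adds the planning-time line
EXPLAIN (SUMMARY ON)
SELECT * FROM t1
WHERE c1 = 'x' AND c2 = 'y' AND c3 >= '2020-10-10' AND c4 <= '2022-10-10';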
The query in itself isn't really that big: 8 tables (no partitioning) and a few predicates. Nothing fancy, nothing huge to plan. Here it is, a bit anonymized:
select ...
from t1
left outer join t2 on ...
left outer join t3 on ...
left outer join t4 on ...
left outer join t5 on ...
left outer join t6 on ...
left outer join t7 on ...
left outer join t8 on ...
where c1='...' and c2='...'
and c3>='...' and c4<='...' ;
Join conditions are really simple, and no functions are called anywhere.
Plans on the three servers are exactly the same: same nodes, same statistics, same cost estimates.
Moreover, we looked into what the planner was doing with strace, and we were surprised to see massive reads on tables (more than 130,000 calls to read() on 15 files). Most of these files belong to one of the tables in the query. Also, this massive reading happens only on the standbys, not on the primary.
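To confirm that the reads happen during planning rather than execution, one can snapshot the heap I/O counters around a plain EXPLAIN; any increase is attributable to the planner. A sketch reusing the hypothetical t1 above (counters are per-node and include other sessions' activity, so this is best run on a quiet system):

SELECT heap_blks_read, heap_blks_hit
FROM pg_statio_user_tables WHERE relname = 't1';

EXPLAIN SELECT * FROM t1 WHERE c3 >= '2020-10-10';  -- plans only

SELECT heap_blks_read, heap_blks_hit
FROM pg_statio_user_tables WHERE relname = 't1';    -- the diff is planner I/O on the heap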
Our customer took the time to drop and recreate the standbys yesterday night, but the new standbys still have the issue.
And finally, we discovered that promoting a standby brings the old performance back: the promoted standby performs just like the primary.
To be honest, I'm lost, I don't know what to do next. But I have some questions :)
* The planner seems to read tables to generate plans and/or select the right plan. Why does it do that? I thought it only read indexes when needed, but it quite clearly reads tables as well.
* How can replication have an impact on the planner? (I'm not sure I'm asking the right question... I mean, why do I see a difference in behaviour between a standby and that same server once promoted and autonomous?)
* Do you have any idea how to fix this?
Thank you.
Regards.
--
Guillaume.
Guillaume Lelarge <guillaume@lelarge.info> writes:
> * The planner seems to read tables to generate plans and/or select the
> right plan. Why does it do that? I thought it only read indexes when
> needed, but it quite clearly reads tables as well.

I'm guessing to some extent, but it seems plausible that this is an
artifact of the logic that tries to get the current min and/or max column
values to be sure we are estimating mergejoin costs accurately. Normally,
yeah, that is only supposed to consult the extremal values in indexes and
therefore not take very long. However, we've had to go back to the
drawing board several times to fix cases where it made a lot of expensive
table accesses because the extremal index entries were of uncertain
liveness. That behavior could very easily depend on whether you're in a
primary or standby server.

Are there any tables in this query where extremal values of the join
key are likely to be in recently-added or recently-dead rows? Does
VACUUM'ing on the primary help?

And, not to put too fine a point on it, but does updating to a recent
minor release help? In a quick scan of the git history I don't see any
related bugfixes later than 11.5, but I might've missed something.

			regards, tom lane
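In SQL terms, the probe Tom describes behaves roughly like a pair of index-endpoint lookups; a conceptual sketch, reusing the hypothetical t1 from above and assuming an index on the join key (this is not the literal planner code path):

CREATE INDEX ON t1 (c3);
SELECT min(c3) FROM t1;  -- normally answered by the leftmost live index entry
SELECT max(c3) FROM t1;  -- ...and the rightmost; if many entries at either end
                         -- are of uncertain liveness, each one must be checked
                         -- against the heap, which is where the reads come from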
On 2022-Oct-11, Tom Lane wrote:
> Are there any tables in this query where extremal values of the join
> key are likely to be in recently-added or recently-dead rows? Does
> VACUUM'ing on the primary help?

I remember having a hypothesis, upon getting a report of this exact
problem on a customer system once, that it could be due to killtuple not
propagating to standbys except by FPIs. I do not remember if we proved
that true or not. I do not remember observing that tables were being
read, however.

--
Álvaro Herrera        PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera <alvherre@alvh.no-ip.org> a écrit :
> On 2022-Oct-11, Tom Lane wrote:
>> Are there any tables in this query where extremal values of the join
>> key are likely to be in recently-added or recently-dead rows? Does
>> VACUUM'ing on the primary help?
>
> I remember having a hypothesis, upon getting a report of this exact
> problem on a customer system once, that it could be due to killtuple not
> propagating to standbys except by FPIs. I do not remember if we proved
> that true or not. I do not remember observing that tables were being
> read, however.
Thanks for your answers.
The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have no idea of the actual use of c3, but rows with c3 at '2022-10-10' (which is yesterday) were most probably added recently. I can ask my customer if you want, but this looks like a pretty safe bet.
On the VACUUM question, I didn't say, but we're wondering if the database is lacking a recent-enough VACUUM. So they're running a VACUUM tonight on the database (and especially on the 1.6TB table that is part of the query). I'm kind of skeptical, because if a missing VACUUM hurt the standbys, it should have hurt the primary too.
Actually, there are two things that really bug me:
* why the difference between the primary and both standbys?
* why now? (it worked great until this weekend, and the only thing I know happened in between is a batch delete on Sunday... which may be a good-enough reason for things to get screwed up, but once again, why only on both standbys?)
Julien Rouhaud also told me about killtuples, but I have no idea what they are. I suppose this is different from dead tuples. Anyway, if you can enlighten me, I'll be happy :)
Guillaume.
On Tue, Oct 11, 2022 at 07:42:55PM +0200, Guillaume Lelarge wrote:
> Actually, there are two things that really bug me:
> * why the difference between the primary and both standbys?
> * why now? (it worked great until this weekend, and the only thing I know
> happened in between is a batch delete on Sunday... which may be a
> good-enough reason for things to get screwed up, but once again, why only
> on both standbys?)
>
> Julien Rouhaud also told me about killtuples, but I have no idea what they
> are. I suppose this is different from dead tuples. Anyway, if you can
> enlighten me, I'll be happy :)

That's an optimisation where an index scan can mark an index entry as dead
(LP_DEAD) if it tries to fetch some data from the heap that turns out to be
all dead, so further scans won't have to check again (you can grep
kill_prior_tuple in the source for more details). As that's a hint bit, it
may not be replicated unless you enable wal_log_hints or data_checksums (or
write it as an FPI indeed), which could explain the discrepancy between the
primary (after a first slow index scan) and standby nodes.

But since your customer recreated their standbys from scratch *after* that
delete, all the nodes should have those hint bits set (Guillaume confirmed
off-list that they used a fresh BASE_BACKUP). Note that Guillaume also
confirmed off-list that the customer has checksums enabled, which means that
MarkBufferDirtyHint() should be guaranteed to mark the buffers as dirty, so
I'm out of ideas to explain the different behavior on standbys.
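The kill_prior_tuple effect Julien describes is easy to see on a primary with a hypothetical, self-contained demo like this (assuming no long-lived snapshots are held and autovacuum doesn't get there first):

CREATE TABLE demo (id int PRIMARY KEY);
INSERT INTO demo SELECT g FROM generate_series(1, 1000000) g;
DELETE FROM demo WHERE id > 900000;  -- lots of dead tuples at the index's right end

-- the first scan visits the heap, finds the tuples dead to everyone,
-- and sets LP_DEAD on the corresponding index entries...
EXPLAIN (ANALYZE, BUFFERS) SELECT max(id) FROM demo;
-- ...so the second, identical scan reads far fewer buffers
EXPLAIN (ANALYZE, BUFFERS) SELECT max(id) FROM demo;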
On 10/11/22 22:35, Julien Rouhaud wrote:
> But since your customer recreated their standbys from scratch *after* that
> delete, all the nodes should have those hint bits set (Guillaume confirmed
> off-list that they used a fresh BASE_BACKUP). Note that Guillaume also
> confirmed off-list that the customer has checksums enabled, which means that
> MarkBufferDirtyHint() should be guaranteed to mark the buffers as dirty, so
> I'm out of ideas to explain the different behavior on standbys.

Would EXPLAIN (VERBOSE, COSTS, FORMAT JSON) run on both nodes help show any
differences?

--
Angular momentum makes the world go 'round.
Julien Rouhaud <rjuju123@gmail.com> writes:
> But since your customer recreated their standbys from scratch *after* that
> delete, all the nodes should have those hint bits set (Guillaume confirmed
> off-list that they used a fresh BASE_BACKUP). Note that Guillaume also
> confirmed off-list that the customer has checksums enabled, which means that
> MarkBufferDirtyHint() should be guaranteed to mark the buffers as dirty, so
> I'm out of ideas to explain the different behavior on standbys.

Do we propagate visibility-map bits to standbys?

			regards, tom lane
On Tue, Oct 11, 2022 at 9:27 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I remember having a hypothesis, upon getting a report of this exact
> problem on a customer system once, that it could be due to killtuple not
> propagating to standbys except by FPIs. I do not remember if we proved
> that true or not. I do not remember observing that tables were being
> read, however.

That's true, but it doesn't matter whether or not there are LP_DEAD bits
set on the standby, since in any case they cannot be trusted when in Hot
Standby mode. IndexScanDescData.ignore_killed_tuples will be set to false
on the standby.

--
Peter Geoghegan
On Tue, Oct 11, 2022 at 10:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Do we propagate visibility-map bits to standbys?

Yes.

--
Peter Geoghegan
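For what it's worth, the visibility map can be inspected directly with the pg_visibility extension; a sketch against the hypothetical demo table from above (the extension must be created on the primary, since a hot standby rejects DDL, but the query itself runs fine on a standby):

CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- count pages by all-visible status; matching output on primary and
-- standby is consistent with VM bits being replicated
SELECT all_visible, count(*)
FROM pg_visibility_map('demo'::regclass)
GROUP BY all_visible;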
Le mer. 12 oct. 2022 à 06:08, Ron <ronljohnsonjr@gmail.com> a écrit :
> Would EXPLAIN (VERBOSE, COSTS, FORMAT JSON) run on both nodes help show
> any differences?
No differences.
--
Guillaume.
Le mar. 11 oct. 2022 à 19:42, Guillaume Lelarge <guillaume@lelarge.info> a écrit :
> On the VACUUM question, I didn't say, but we're wondering if the database
> is lacking a recent-enough VACUUM. So they're running a VACUUM tonight on
> the database (and especially on the 1.6TB table that is part of the
> query). I'm kind of skeptical, because if a missing VACUUM hurt the
> standbys, it should have hurt the primary too.
It appears that I was wrong. I just got an email from my customer saying they got their performance back after a VACUUM on the two main tables of the query. I'll have them on the phone in about an hour, so I'll probably know more then. I'm still wondering why it was an issue on the standbys and not on the primary. VACUUM cleans up tables and indexes, and that activity goes through WAL, doesn't it?
--
Guillaume.
Le mer. 12 oct. 2022 à 08:56, Guillaume Lelarge <guillaume@lelarge.info> a écrit :
> It appears that I was wrong. I just got an email from my customer saying
> they got their performance back after a VACUUM on the two main tables of
> the query. I'll have them on the phone in about an hour, so I'll probably
> know more then. I'm still wondering why it was an issue on the standbys
> and not on the primary. VACUUM cleans up tables and indexes, and that
> activity goes through WAL, doesn't it?
Just finished my phone call. So, they definitely have their performance back. All they did was a VACUUM on two tables.
If I understand correctly, during "normal" operations, some information is stored on the primary and sent to the standbys. For some reason, only the primary takes it into account; the standbys ignore it. That would explain why, when we promoted a standby without doing anything else, it got much better performance. VACUUM fixes the issue on a standby, probably by storing this information in a different way. After a VACUUM, the standbys stop ignoring this information, which brings the performance back.
That sounds like a plausible explanation. I still have questions if you don't mind:
* what is this information?
* where is it stored? my guess would be indexes
* why is it ignored on standbys and used on primary?
We didn't talk much about releases, so I guess the "standby-ignores-some-information" behaviour is present in all currently available releases?
Thank you.
--
Guillaume.
On Wed, 2022-10-12 at 10:51 +0200, Guillaume Lelarge wrote:
> Just finished my phone call. So, they definitely have their performance
> back. All they did was a VACUUM on two tables.
>
> That sounds like a plausible explanation. I still have questions if you
> don't mind:
> * what is this information?
> * where is it stored? my guess would be indexes
> * why is it ignored on standbys and used on primary?

That sounds indeed like killed (LP_DEAD) index tuples on the primary.
Peter says they are ignored on the standby anyway, so on the standby
PostgreSQL went through a lot of index entries pointing to dead table
tuples, and it took a long time to find the maximal entry in the table,
which is done by the optimizer.

VACUUM removed those dead tuples and their associated index entries on
both primary and standby.

Yours,
Laurenz Albe
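Continuing the hypothetical demo from earlier in the thread, the effect Laurenz describes is easy to reproduce: once VACUUM has removed the dead tuples and their index entries (cleanup that is WAL-logged and replayed on standbys), the endpoint probe becomes cheap again on every node:

VACUUM demo;  -- removes dead heap tuples and their index entries; replicated via WAL
-- now cheap on the primary and on standbys alike
EXPLAIN (ANALYZE, BUFFERS) SELECT max(id) FROM demo;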
Peter Geoghegan <pg@bowt.ie> writes:
> That's true, but it doesn't matter whether or not there are LP_DEAD
> bits set on the standby, since in any case they cannot be trusted when
> in Hot Standby mode. IndexScanDescData.ignore_killed_tuples will be
> set to false on the standby.

Hmm. I think that might break this argument in get_actual_variable_endpoint:

 * A crucial point here is that SnapshotNonVacuumable, with
 * GlobalVisTestFor(heapRel) as horizon, yields the inverse of the
 * condition that the indexscan will use to decide that index entries are
 * killable (see heap_hot_search_buffer()). Therefore, if the snapshot
 * rejects a tuple (or more precisely, all tuples of a HOT chain) and we
 * have to continue scanning past it, we know that the indexscan will mark
 * that index entry killed. That means that the next
 * get_actual_variable_endpoint() call will not have to re-consider that
 * index entry. In this way we avoid repetitive work when this function
 * is used a lot during planning.

However, that doesn't explain the downthread report that a VACUUM on the
primary fixed it. What I suspect is that that caused some in-fact-dead
index entries to get cleaned out.

But ... if the primary is allowed to vacuum away an index entry that it
thinks is dead, exactly what is the point of making standbys ignore
LP_DEAD bits? There's no additional interlock that guarantees the tuple
will be there at all.

			regards, tom lane
On Wed, Oct 12, 2022 at 6:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, that doesn't explain the downthread report that a
> VACUUM on the primary fixed it. What I suspect is that that
> caused some in-fact-dead index entries to get cleaned out.

Seems likely.

> But ... if the primary is allowed to vacuum away an index
> entry that it thinks is dead, exactly what is the point of
> making standbys ignore LP_DEAD bits? There's no additional
> interlock that guarantees the tuple will be there at all.

The interlock doesn't really protect the leaf page or its index tuples so
much as the referenced TIDs themselves. In other words it's a TID
recycling interlock.

That's why we don't need a cleanup lock to perform index tuple deletions,
even though the WAL records for those are almost identical to the WAL
records used by index vacuuming (in the case of nbtree the only difference
is the extra latestRemovedXid field in the deletion variant WAL record).
We know that there is no VACUUM process involved, and no question of heap
vacuuming going ahead for the same TIDs once index vacuuming is allowed to
complete.

We can get away with not having the interlock at all in the case of nbtree
index scans with MVCC snapshots -- but *not* with index-only scans. See
"Making concurrent TID recycling safe" in the nbtree README. I only got
around to documenting all of the details here quite recently. The
index-only scan thing dates back to 9.5.

--
Peter Geoghegan