Thread: Intermittent test plan change in "privileges" test on BF animal prion
Hello,

Two recent failures show plan changes in RLS queries on master. Based on nearby comments, the choice of plan is being used to verify access (or lack of access) to row estimates, so I guess that means something could be amiss here. (Or it could be due to the dropped-UDP flaky stats problem, but then why in the same place twice, and why twice in a week, only on master, and not for months before that?)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2020-06-08%2002%3A58%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2020-06-06%2003%3A18%3A03
Thomas Munro <thomas.munro@gmail.com> writes:
> Two recent failures show plan changes in RLS queries on master.

Yeah. I assume this is related to 0c882e52a, but I'm not sure how. The fact that we've only seen it on prion (which runs -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE) is suggestive, but it's not clear why those options would lead to unstable planner estimates. I've been waiting to see if we start to get similar reports from the full-fledged CLOBBER_CACHE_ALWAYS critters.

			regards, tom lane
On Tue, 9 Jun 2020 at 14:27, Thomas Munro <thomas.munro@gmail.com> wrote:
> Two recent failures show plan changes in RLS queries on master. Based
> on nearby comments, the choice of plan is being used to verify access
> (or lack of access) to row estimates, so I guess that means something
> could be amiss here. (Or it could be due to the dropped-UDP flaky
> stats problem, but then why in the same place twice, and why twice in
> a week, only on master, and not for months before that?)

I see 0c882e52a did change the statistics target on that table. The first failure was on the commit directly after that one. I'm not sure what instability Tom meant when he wrote "-- results below depend on having quite accurate stats for atest12".

It does seem plausible, given how slow prion is, that autovacuum might be triggered after the manual vacuum somehow, building stats with just 1k buckets instead of 10k. 0936d1b6 made some changes to disable autovacuum because it was sometimes coming in and messing with the statistics; maybe we need to do the same here, or at least do something less temporary than changing default_statistics_target.

    select attname,array_length(histogram_bounds,1) from pg_stats
    where tablename = 'atest12' order by attname;

That should show an array length of 10000 for each column if it's working as intended. Is it worth sticking that query in there before and after the failures to ensure we're working with the stats we think we are?

David
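The check David proposes, written out as it might be run against the regression database (the expected bucket count is inferred from the raised statistics target discussed upthread, not captured from an actual run):

    -- Per-column histogram sizes for atest12; with the raised statistics
    -- target in effect, each histogram should report 10000 buckets.
    SELECT attname, array_length(histogram_bounds, 1) AS hist_len
    FROM pg_stats
    WHERE tablename = 'atest12'
    ORDER BY attname;
    -- If an autovacuum-triggered ANALYZE has overwritten the manual one,
    -- hist_len comes out at a default-sized value instead.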
David Rowley <dgrowleyml@gmail.com> writes:
> I see 0c882e52a did change the statistics target on that table. The
> first failure was on the commit directly after that one. I'm not sure
> what instability Tom meant when he wrote "-- results below depend on
> having quite accurate stats for atest12".

See [1], particularly the para about "When I went to test 0002". At least one of those test cases fails if the planner estimates more than one row being selected by the user-defined operator, and since the table has 10K rows, that means we need 1/10000 selectivity precision.

> It does seem plausible, given how slow prion is, that autovacuum might
> be triggered after the manual vacuum somehow, building stats with just
> 1k buckets instead of 10k.

Hmm ... that's a plausible theory, perhaps. I forget: does autovac recheck, after acquiring the requisite table lock, whether the table still needs to be processed?

> 0936d1b6 made some changes to disable autovacuum because it was
> sometimes coming in and messing with the statistics; maybe we need to
> do the same here, or at least do something less temporary than
> changing default_statistics_target.

Yeah, setting that as a table parameter seems like a better idea than setting default_statistics_target.

			regards, tom lane
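Tom's suggestion could look something like the following; the statements are illustrative only (the column name and the exact parameters chosen for the test are assumptions, not quoted from any commit):

    -- Pin the behaviour at the table level instead of relying on a
    -- session-wide default_statistics_target change (illustrative sketch):
    ALTER TABLE atest12 SET (autovacuum_enabled = off);
    -- Alternatively, or additionally, fix the per-column target so any
    -- later ANALYZE still builds 10000-bucket histograms (column name "a"
    -- is assumed here for illustration):
    ALTER TABLE atest12 ALTER COLUMN a SET STATISTICS 10000;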
On Tue, 9 Jun 2020 at 15:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > It does seem plausible, given how slow prion is, that autovacuum might
> > be triggered after the manual vacuum somehow, building stats with just
> > 1k buckets instead of 10k.
>
> Hmm ... that's a plausible theory, perhaps. I forget: does autovac
> recheck, after acquiring the requisite table lock, whether the table
> still needs to be processed?

It does, but I wondered if there was a window between when the manual vacuum resets n_ins_since_vacuum and when autovacuum looks at it.

David
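For reference, the counter David mentions is visible through the cumulative statistics views on then-current master; this query is illustrative and not part of the test:

    -- Inspect the insert counter and vacuum timestamps for atest12; autovacuum
    -- bases its decision on the stats collector's (lagging) copy of these numbers.
    SELECT relname, n_ins_since_vacuum, last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
    WHERE relname = 'atest12';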
David Rowley <dgrowleyml@gmail.com> writes:
> On Tue, 9 Jun 2020 at 15:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm ... that's a plausible theory, perhaps. I forget: does autovac
>> recheck, after acquiring the requisite table lock, whether the table
>> still needs to be processed?

> It does, but I wondered if there was a window between when the manual
> vacuum resets n_ins_since_vacuum and when autovacuum looks at it.

Oh, there surely is, because of the lag in the stats collection mechanism. I'm trying to reproduce this now, but it's sounding pretty plausible.

BTW, it looks like I managed to trim the reference off my prior message, but I meant [1] to refer to

https://www.postgresql.org/message-id/666679.1591138428%40sss.pgh.pa.us

			regards, tom lane
I wrote:
> I'm trying to reproduce this now, but it's sounding pretty plausible.

Yeah, that's definitely it. I was able to reproduce the failure semi-reliably (every two or three tries) after adding -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE and inserting a "pg_sleep(1)" just after the manual vacuum in privileges.sql; and as you'd guessed, the stats arrays were just normal size in the failing runs. After disabling autovac on the table, the failure went away.

Thanks for the insight!

			regards, tom lane
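A sketch of the reproduction Tom describes; the surrounding lines of privileges.sql are paraphrased from the thread rather than quoted, so the exact statements may differ:

    -- Build with -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE to match
    -- prion, then widen the race window in privileges.sql:
    SET default_statistics_target = 10000;
    VACUUM ANALYZE atest12;
    SELECT pg_sleep(1);   -- gives autovacuum time to re-analyze with the default target
    RESET default_statistics_target;

With autovacuum disabled for the table (as sketched upthread), that re-analyze can no longer clobber the manually built stats, which matches Tom's observation that the failure then went away.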