Thread: Something flaky in the "relfilenode mapping" infrastructure
Buildfarm member prairiedog thinks there's something unreliable about commit f01d1ae3a104019d6d68aeff85c4816a275130b3: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-03-27%2008%3A12%3A11 ================== pgsql.13462/src/test/regress/regression.diffs =================== *** /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/expected/alter_table.out Thu Mar 27 04:12:40 2014 --- /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/results/alter_table.out Thu Mar 27 04:52:02 2014 *************** *** 2333,2339 **** ) mapped; incorrectly_mapped | have_mappings --------------------+--------------- ! 0 | t (1 row) -- Checks on creating and manipulation of user defined relations in --- 2333,2339 ---- ) mapped; incorrectly_mapped | have_mappings --------------------+--------------- ! 1 | t (1 row) -- Checks on creating and manipulation of user defined relations in ====================================================================== regards, tom lane
Hi, On 2014-03-27 08:02:35 -0400, Tom Lane wrote: > Buildfarm member prairiedog thinks there's something unreliable about > commit f01d1ae3a104019d6d68aeff85c4816a275130b3: > *** /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/expected/alter_table.out Thu Mar 27 04:12:40 2014 > --- /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/results/alter_table.out Thu Mar 27 04:52:02 2014 > *************** > *** 2333,2339 **** > ) mapped; > incorrectly_mapped | have_mappings > --------------------+--------------- > ! 0 | t > (1 row) That's rather odd. It has survived for a couple of months on the other buildfarm animals now... Could one of you apply the attached patch adding more details to eventual failures? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-03-27 08:02:35 -0400, Tom Lane wrote: >> Buildfarm member prairiedog thinks there's something unreliable about >> commit f01d1ae3a104019d6d68aeff85c4816a275130b3: > That's rather odd. It has survived for a couple of months on the other > buildfarm animals now... Could one of you apply the attached patch > adding more details to eventual failures? Any objection to separating out the have_mappings bit? It wasn't terribly appropriate before, but it seems really out of place in this formulation. regards, tom lane
On 2014-03-28 16:41:15 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-03-27 08:02:35 -0400, Tom Lane wrote: > >> Buildfarm member prairiedog thinks there's something unreliable about > >> commit f01d1ae3a104019d6d68aeff85c4816a275130b3: > > > That's rather odd. It has survived for a couple of months on the other > > buildfarm animals now... Could one of you apply the attached patch > > adding more details to eventual failures? > > Any objection to separating out the have_mappings bit? It wasn't terribly > appropriate before, but it seems really out of place in this formulation. The patch I sent removed the have_mapping thing entirely? Do you mean it should be there, but as a separate query? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-03-28 16:41:15 -0400, Tom Lane wrote: >> Any objection to separating out the have_mappings bit? It wasn't terribly >> appropriate before, but it seems really out of place in this formulation. > The patch I sent removed the have_mapping thing entirely? Do you mean it > should be there, but as a separate query? Oh, so it did. Well, do you think we need a query checking that? I hadn't questioned the need to do so, but if you feel it's unnecessary I'm certainly willing to pull it. regards, tom lane
On 2014-03-28 16:45:28 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-03-28 16:41:15 -0400, Tom Lane wrote: > >> Any objection to separating out the have_mappings bit? It wasn't terribly > >> appropriate before, but it seems really out of place in this formulation. > > > The patch I sent removed the have_mapping thing entirely? Do you mean it > > should be there, but as a separate query? > > Oh, so it did. Well, do you think we need a query checking that? > I hadn't questioned the need to do so, but if you feel it's unnecessary > I'm certainly willing to pull it. I don't think it's necessary. As far as I understand LATERAL, a join to a function returning NULL will still return the row. So, the test now would only test whether there are rows in pg_class which seems a bit pointless. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > I don't think it's necessary. As far as I understand LATERAL, a join to > a function returning NULL will still return the row. So, the test now > would only test whether there are rows in pg_class which seems a bit > pointless. Yeah, after looking closer I'd come to the same conclusion. If the lateral function call could generate zero rows it'd perhaps be a risk, but not in this formulation. Will commit in a moment. regards, tom lane
On 2014-03-28 21:36:11 +0100, Andres Freund wrote: > Hi, > > On 2014-03-27 08:02:35 -0400, Tom Lane wrote: > > Buildfarm member prairiedog thinks there's something unreliable about > > commit f01d1ae3a104019d6d68aeff85c4816a275130b3: > > > *** /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/expected/alter_table.out Thu Mar 27 04:12:40 2014 > > --- /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/results/alter_table.out Thu Mar 27 04:52:02 2014 > > *************** > > *** 2333,2339 **** > > ) mapped; > > incorrectly_mapped | have_mappings > > --------------------+--------------- > > ! 0 | t > > (1 row) > > That's rather odd. It has survived for a couple of months on the other > buildfarm animals now... Could one of you apply the attached patch > adding more details to eventual failures? So I had made a notice to recheck on this. http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prairiedog&br=HEAD indicates there haven't been any further failures... So, for now I assume this was caused by some problem fixed elsewhere. Greetings, Andres Freund
Andres Freund <andres@2ndquadrant.com> writes: >> On 2014-03-27 08:02:35 -0400, Tom Lane wrote: >>> Buildfarm member prairiedog thinks there's something unreliable about >>> commit f01d1ae3a104019d6d68aeff85c4816a275130b3: > So I had made a notice to recheck on > this. http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prairiedog&br=HEAD > indicates there haven't been any further failures... So, for now I > assume this was caused by some problem fixed elsewhere. Hard to say. In any case, I agree we can't make any progress unless we see it again. regards, tom lane
On Tue, Apr 15, 2014 at 03:28:41PM -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > >> On 2014-03-27 08:02:35 -0400, Tom Lane wrote: > >>> Buildfarm member prairiedog thinks there's something unreliable about > >>> commit f01d1ae3a104019d6d68aeff85c4816a275130b3: > > > So I had made a notice to recheck on > > this. http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prairiedog&br=HEAD > > indicates there haven't been any further failures... So, for now I > > assume this was caused by some problem fixed elsewhere. > > Hard to say. In any case, I agree we can't make any progress unless we > see it again. The improved test just tripped: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07 -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Noah Misch <noah@leadboat.com> writes: > The improved test just tripped: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07 Ugh. If the MTBF is circa three months, how will we catch this before we're dead? A quick look around the machine's buildfarm directory says there's nothing left behind that's not included in the buildfarm server report, so I can't offer any immediate insight. But I can certainly load it up running some additional tests if anyone has an idea what to look for. regards, tom lane
On 2014-06-12 00:38:36 -0400, Noah Misch wrote: > On Tue, Apr 15, 2014 at 03:28:41PM -0400, Tom Lane wrote: > > Andres Freund <andres@2ndquadrant.com> writes: > > >> On 2014-03-27 08:02:35 -0400, Tom Lane wrote: > > >>> Buildfarm member prairiedog thinks there's something unreliable about > > >>> commit f01d1ae3a104019d6d68aeff85c4816a275130b3: > > > > > So I had made a notice to recheck on > > > this. http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prairiedog&br=HEAD > > > indicates there haven't been any further failures... So, for now I > > > assume this was caused by some problem fixed elsewhere. > > > > Hard to say. In any case, I agree we can't make any progress unless we > > see it again. > > The improved test just tripped: Hrmpf. Just one of these days I was happy thinking it was gone... > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07 Hm. My guess it's that it's just a 'harmless' concurrency issue. The test currently run in concurrency with others: I think what happens is that the table gets dropped in the other relation after the query has acquired the mvcc snapshot (used for the pg_class) test. But why is it triggering on such a 'unusual' system and not on others? That's what worries me a bit. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-06-12 00:38:36 -0400, Noah Misch wrote: >> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07 > Hm. My guess it's that it's just a 'harmless' concurrency issue. The > test currently run in concurrency with others: I think what happens is > that the table gets dropped in the other relation after the query has > acquired the mvcc snapshot (used for the pg_class) test. > But why is it triggering on such a 'unusual' system and not on others? > That's what worries me a bit. prairiedog is pretty damn slow by modern standards. OTOH, I think it is not the slowest machine in the buildfarm; hamster for instance seems to be at least a factor of 2 slower. So I'm not sure whether to believe it's just a timing issue. regards, tom lane
On Thu, Jun 12, 2014 at 02:44:10AM -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-06-12 00:38:36 -0400, Noah Misch wrote: > >> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07 > > > Hm. My guess it's that it's just a 'harmless' concurrency issue. The > > test currently run in concurrency with others: I think what happens is > > that the table gets dropped in the other relation after the query has > > acquired the mvcc snapshot (used for the pg_class) test. > > But why is it triggering on such a 'unusual' system and not on others? > > That's what worries me a bit. I can reproduce a similar disturbance in the test query using gdb and a concurrent table drop, and the table reported in the prairiedog failure is a table dropped in a concurrent test group. That explanation may not be the full story behind these particular failures, but it certainly could cause similar failures in the future. Let's prevent this by only reporting rows for relations that still exist after the query is complete. > prairiedog is pretty damn slow by modern standards. OTOH, I think it > is not the slowest machine in the buildfarm; hamster for instance seems > to be at least a factor of 2 slower. So I'm not sure whether to believe > it's just a timing issue. That kernel's process scheduler could be a factor. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Attachment
Noah Misch <noah@leadboat.com> writes: > On Thu, Jun 12, 2014 at 02:44:10AM -0400, Tom Lane wrote: >> Andres Freund <andres@2ndquadrant.com> writes: >>> On 2014-06-12 00:38:36 -0400, Noah Misch wrote: >>>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07 >>> Hm. My guess it's that it's just a 'harmless' concurrency issue. The >>> test currently run in concurrency with others: I think what happens is >>> that the table gets dropped in the other relation after the query has >>> acquired the mvcc snapshot (used for the pg_class) test. >>> But why is it triggering on such a 'unusual' system and not on others? >>> That's what worries me a bit. > I can reproduce a similar disturbance in the test query using gdb and a > concurrent table drop, and the table reported in the prairiedog failure is a > table dropped in a concurrent test group. That explanation may not be the > full story behind these particular failures, but it certainly could cause > similar failures in the future. Yeah, that seems like a plausible explanation, since the table shown in the failure report is one that would be getting dropped concurrently, and the discrepancy is that we get NULL rather than the expected value for the pg_filenode_relation result, which is expected if the table is already dropped when the mapping function is called. > Let's prevent this by only reporting rows for relations that still exist after > the query is complete. I think this is a bad solution though; it risks masking actual problems. What seems like a better fix to me is to change the test mapped_oid IS DISTINCT FROM oid to mapped_oid <> oid pg_class.oid will certainly never read as NULL, so what this will do is allow the single case where the function returns NULL. AFAIK there is no reason to suppose that a NULL result would mean anything except "the table's been dropped", so changing it this way will allow only that case and not any others. Alternatively, we could do something like you suggest but adjust the second join so that it suppresses only rows in which mapped_oid is null *and* there's no longer a matching OID in pg_class. That would provide additional confidence that the null result is a valid indicator of a just-dropped table. regards, tom lane
On Thu, Jun 12, 2014 at 10:50:44PM -0400, Tom Lane wrote: > Alternatively, we could do something like you suggest but adjust the > second join so that it suppresses only rows in which mapped_oid is null > *and* there's no longer a matching OID in pg_class. That would provide > additional confidence that the null result is a valid indicator of a > just-dropped table. Can't hurt; I adjusted it along those lines and committed. For the record, the failure on prairiedog has appeared a third time: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-13%2005%3A19%3A35 -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Noah Misch <noah@leadboat.com> writes: > On Thu, Jun 12, 2014 at 10:50:44PM -0400, Tom Lane wrote: >> Alternatively, we could do something like you suggest but adjust the >> second join so that it suppresses only rows in which mapped_oid is null >> *and* there's no longer a matching OID in pg_class. That would provide >> additional confidence that the null result is a valid indicator of a >> just-dropped table. > Can't hurt; I adjusted it along those lines and committed. Looks good. > For the record, > the failure on prairiedog has appeared a third time: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-13%2005%3A19%3A35 Yeah, I saw that. I wonder if we recently changed something that improved the odds of the timing being just so. Two failures in two days seems out of line with the critter's previous history. regards, tom lane