Thread: Something flaky in the "relfilenode mapping" infrastructure

Something flaky in the "relfilenode mapping" infrastructure

From
Tom Lane
Date:
Buildfarm member prairiedog thinks there's something unreliable about
commit f01d1ae3a104019d6d68aeff85c4816a275130b3:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-03-27%2008%3A12%3A11

================== pgsql.13462/src/test/regress/regression.diffs ===================
*** /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/expected/alter_table.out    Thu Mar 27 04:12:40 2014
--- /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/results/alter_table.out    Thu Mar 27 04:52:02 2014
***************
*** 2333,2339 ****     ) mapped;  incorrectly_mapped | have_mappings  --------------------+---------------
!                   0 | t (1 row)  -- Checks on creating and manipulation of user defined relations in
--- 2333,2339 ----     ) mapped;  incorrectly_mapped | have_mappings  --------------------+---------------
!                   1 | t (1 row)  -- Checks on creating and manipulation of user defined relations in

======================================================================
        regards, tom lane



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Andres Freund
Date:
Hi,

On 2014-03-27 08:02:35 -0400, Tom Lane wrote:
> Buildfarm member prairiedog thinks there's something unreliable about
> commit f01d1ae3a104019d6d68aeff85c4816a275130b3:

> *** /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/expected/alter_table.out    Thu Mar 27 04:12:40 2014
> --- /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/results/alter_table.out    Thu Mar 27 04:52:02 2014
> ***************
> *** 2333,2339 ****
>       ) mapped;
>    incorrectly_mapped | have_mappings
>   --------------------+---------------
> !                   0 | t
>   (1 row)

That's rather odd. It has survived for a couple of months on the other
buildfarm animals now... Could one of you apply the attached patch
adding more details to eventual failures?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Something flaky in the "relfilenode mapping" infrastructure

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-03-27 08:02:35 -0400, Tom Lane wrote:
>> Buildfarm member prairiedog thinks there's something unreliable about
>> commit f01d1ae3a104019d6d68aeff85c4816a275130b3:

> That's rather odd. It has survived for a couple of months on the other
> buildfarm animals now... Could one of you apply the attached patch
> adding more details to eventual failures?

Any objection to separating out the have_mappings bit?  It wasn't terribly
appropriate before, but it seems really out of place in this formulation.
        regards, tom lane



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Andres Freund
Date:
On 2014-03-28 16:41:15 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-03-27 08:02:35 -0400, Tom Lane wrote:
> >> Buildfarm member prairiedog thinks there's something unreliable about
> >> commit f01d1ae3a104019d6d68aeff85c4816a275130b3:
> 
> > That's rather odd. It has survived for a couple of months on the other
> > buildfarm animals now... Could one of you apply the attached patch
> > adding more details to eventual failures?
> 
> Any objection to separating out the have_mappings bit?  It wasn't terribly
> appropriate before, but it seems really out of place in this formulation.

The patch I sent removed the have_mapping thing entirely? Do you mean it
should be there, but as a separate query?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-03-28 16:41:15 -0400, Tom Lane wrote:
>> Any objection to separating out the have_mappings bit?  It wasn't terribly
>> appropriate before, but it seems really out of place in this formulation.

> The patch I sent removed the have_mapping thing entirely? Do you mean it
> should be there, but as a separate query?

Oh, so it did.  Well, do you think we need a query checking that?
I hadn't questioned the need to do so, but if you feel it's unnecessary
I'm certainly willing to pull it.
        regards, tom lane



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Andres Freund
Date:
On 2014-03-28 16:45:28 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-03-28 16:41:15 -0400, Tom Lane wrote:
> >> Any objection to separating out the have_mappings bit?  It wasn't terribly
> >> appropriate before, but it seems really out of place in this formulation.
> 
> > The patch I sent removed the have_mapping thing entirely? Do you mean it
> > should be there, but as a separate query?
> 
> Oh, so it did.  Well, do you think we need a query checking that?
> I hadn't questioned the need to do so, but if you feel it's unnecessary
> I'm certainly willing to pull it.

I don't think it's necessary. As far as I understand LATERAL, a join to
a function returning NULL will still return the row. So, the test now
would only test whether there are rows in pg_class which seems a bit
pointless.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> I don't think it's necessary. As far as I understand LATERAL, a join to
> a function returning NULL will still return the row. So, the test now
> would only test whether there are rows in pg_class which seems a bit
> pointless.

Yeah, after looking closer I'd come to the same conclusion.  If the
lateral function call could generate zero rows it'd perhaps be a risk,
but not in this formulation.

Will commit in a moment.
        regards, tom lane



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Andres Freund
Date:
On 2014-03-28 21:36:11 +0100, Andres Freund wrote:
> Hi,
> 
> On 2014-03-27 08:02:35 -0400, Tom Lane wrote:
> > Buildfarm member prairiedog thinks there's something unreliable about
> > commit f01d1ae3a104019d6d68aeff85c4816a275130b3:
> 
> > *** /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/expected/alter_table.out    Thu Mar 27 04:12:40
2014
> > --- /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/results/alter_table.out    Thu Mar 27 04:52:02 2014
> > ***************
> > *** 2333,2339 ****
> >       ) mapped;
> >    incorrectly_mapped | have_mappings 
> >   --------------------+---------------
> > !                   0 | t
> >   (1 row)
> 
> That's rather odd. It has survived for a couple of months on the other
> buildfarm animals now... Could one of you apply the attached patch
> adding more details to eventual failures?

So I had made a notice to recheck on
this. http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prairiedog&br=HEAD
indicates there haven't been any further failures... So, for now I
assume this was caused by some problem fixed elsewhere.

Greetings,

Andres Freund



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
>> On 2014-03-27 08:02:35 -0400, Tom Lane wrote:
>>> Buildfarm member prairiedog thinks there's something unreliable about
>>> commit f01d1ae3a104019d6d68aeff85c4816a275130b3:

> So I had made a notice to recheck on
> this. http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prairiedog&br=HEAD
> indicates there haven't been any further failures... So, for now I
> assume this was caused by some problem fixed elsewhere.

Hard to say.  In any case, I agree we can't make any progress unless we
see it again.
        regards, tom lane



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Noah Misch
Date:
On Tue, Apr 15, 2014 at 03:28:41PM -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> >> On 2014-03-27 08:02:35 -0400, Tom Lane wrote:
> >>> Buildfarm member prairiedog thinks there's something unreliable about
> >>> commit f01d1ae3a104019d6d68aeff85c4816a275130b3:
> 
> > So I had made a notice to recheck on
> > this. http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prairiedog&br=HEAD
> > indicates there haven't been any further failures... So, for now I
> > assume this was caused by some problem fixed elsewhere.
> 
> Hard to say.  In any case, I agree we can't make any progress unless we
> see it again.

The improved test just tripped:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> The improved test just tripped:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07

Ugh.  If the MTBF is circa three months, how will we catch this before
we're dead?

A quick look around the machine's buildfarm directory says there's nothing
left behind that's not included in the buildfarm server report, so I can't
offer any immediate insight.  But I can certainly load it up running some
additional tests if anyone has an idea what to look for.
        regards, tom lane



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Andres Freund
Date:
On 2014-06-12 00:38:36 -0400, Noah Misch wrote:
> On Tue, Apr 15, 2014 at 03:28:41PM -0400, Tom Lane wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> > >> On 2014-03-27 08:02:35 -0400, Tom Lane wrote:
> > >>> Buildfarm member prairiedog thinks there's something unreliable about
> > >>> commit f01d1ae3a104019d6d68aeff85c4816a275130b3:
> > 
> > > So I had made a notice to recheck on
> > > this. http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prairiedog&br=HEAD
> > > indicates there haven't been any further failures... So, for now I
> > > assume this was caused by some problem fixed elsewhere.
> > 
> > Hard to say.  In any case, I agree we can't make any progress unless we
> > see it again.
> 
> The improved test just tripped:

Hrmpf. Just one of these days I was happy thinking it was gone...

> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07

Hm. My guess it's that it's just a 'harmless' concurrency issue. The
test currently run in concurrency with others: I think what happens is
that the table gets dropped in the other relation after the query has
acquired the mvcc snapshot (used for the pg_class) test.
But why is it triggering on such a 'unusual' system and not on others?
That's what worries me a bit.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-06-12 00:38:36 -0400, Noah Misch wrote:
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07

> Hm. My guess it's that it's just a 'harmless' concurrency issue. The
> test currently run in concurrency with others: I think what happens is
> that the table gets dropped in the other relation after the query has
> acquired the mvcc snapshot (used for the pg_class) test.
> But why is it triggering on such a 'unusual' system and not on others?
> That's what worries me a bit.

prairiedog is pretty damn slow by modern standards.  OTOH, I think it
is not the slowest machine in the buildfarm; hamster for instance seems
to be at least a factor of 2 slower.  So I'm not sure whether to believe
it's just a timing issue.
        regards, tom lane



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Noah Misch
Date:
On Thu, Jun 12, 2014 at 02:44:10AM -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-06-12 00:38:36 -0400, Noah Misch wrote:
> >> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07
>
> > Hm. My guess it's that it's just a 'harmless' concurrency issue. The
> > test currently run in concurrency with others: I think what happens is
> > that the table gets dropped in the other relation after the query has
> > acquired the mvcc snapshot (used for the pg_class) test.
> > But why is it triggering on such a 'unusual' system and not on others?
> > That's what worries me a bit.

I can reproduce a similar disturbance in the test query using gdb and a
concurrent table drop, and the table reported in the prairiedog failure is a
table dropped in a concurrent test group.  That explanation may not be the
full story behind these particular failures, but it certainly could cause
similar failures in the future.

Let's prevent this by only reporting rows for relations that still exist after
the query is complete.

> prairiedog is pretty damn slow by modern standards.  OTOH, I think it
> is not the slowest machine in the buildfarm; hamster for instance seems
> to be at least a factor of 2 slower.  So I'm not sure whether to believe
> it's just a timing issue.

That kernel's process scheduler could be a factor.

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com

Attachment

Re: Something flaky in the "relfilenode mapping" infrastructure

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Thu, Jun 12, 2014 at 02:44:10AM -0400, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>>> On 2014-06-12 00:38:36 -0400, Noah Misch wrote:
>>>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07

>>> Hm. My guess it's that it's just a 'harmless' concurrency issue. The
>>> test currently run in concurrency with others: I think what happens is
>>> that the table gets dropped in the other relation after the query has
>>> acquired the mvcc snapshot (used for the pg_class) test.
>>> But why is it triggering on such a 'unusual' system and not on others?
>>> That's what worries me a bit.

> I can reproduce a similar disturbance in the test query using gdb and a
> concurrent table drop, and the table reported in the prairiedog failure is a
> table dropped in a concurrent test group.  That explanation may not be the
> full story behind these particular failures, but it certainly could cause
> similar failures in the future.

Yeah, that seems like a plausible explanation, since the table shown
in the failure report is one that would be getting dropped concurrently,
and the discrepancy is that we get NULL rather than the expected value
for the pg_filenode_relation result, which is expected if the table is
already dropped when the mapping function is called.

> Let's prevent this by only reporting rows for relations that still exist after
> the query is complete.

I think this is a bad solution though; it risks masking actual problems.

What seems like a better fix to me is to change the test
    mapped_oid IS DISTINCT FROM oid

to
    mapped_oid <> oid

pg_class.oid will certainly never read as NULL, so what this will do is
allow the single case where the function returns NULL.  AFAIK there is
no reason to suppose that a NULL result would mean anything except "the
table's been dropped", so changing it this way will allow only that case
and not any others.

Alternatively, we could do something like you suggest but adjust the
second join so that it suppresses only rows in which mapped_oid is null
*and* there's no longer a matching OID in pg_class.  That would provide
additional confidence that the null result is a valid indicator of a
just-dropped table.
        regards, tom lane



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Noah Misch
Date:
On Thu, Jun 12, 2014 at 10:50:44PM -0400, Tom Lane wrote:
> Alternatively, we could do something like you suggest but adjust the
> second join so that it suppresses only rows in which mapped_oid is null
> *and* there's no longer a matching OID in pg_class.  That would provide
> additional confidence that the null result is a valid indicator of a
> just-dropped table.

Can't hurt; I adjusted it along those lines and committed.  For the record,
the failure on prairiedog has appeared a third time:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-13%2005%3A19%3A35

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: Something flaky in the "relfilenode mapping" infrastructure

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Thu, Jun 12, 2014 at 10:50:44PM -0400, Tom Lane wrote:
>> Alternatively, we could do something like you suggest but adjust the
>> second join so that it suppresses only rows in which mapped_oid is null
>> *and* there's no longer a matching OID in pg_class.  That would provide
>> additional confidence that the null result is a valid indicator of a
>> just-dropped table.

> Can't hurt; I adjusted it along those lines and committed.

Looks good.

> For the record,
> the failure on prairiedog has appeared a third time:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-13%2005%3A19%3A35

Yeah, I saw that.  I wonder if we recently changed something that improved
the odds of the timing being just so.  Two failures in two days seems out
of line with the critter's previous history.
        regards, tom lane