Thread: Some other odd buildfarm failures

Some other odd buildfarm failures

From

Tom Lane

Date:

25 December 2014, 23:41:22

These two recent failures look suspiciously similar:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguarundi&dt=2014-12-24%2021%3A03%3A05
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kouprey&dt=2014-12-25%2018%3A43%3A17

to wit:

*** /usr/home/pgbf/buildenv/HEAD/pgsql.build/src/test/regress/expected/brin.out    Wed Dec 24 22:03:07 2014
--- /usr/home/pgbf/buildenv/HEAD/pgsql.build/src/test/regress/results/brin.out    Wed Dec 24 22:54:26 2014
***************
*** 146,151 ****
--- 146,154 ----         end loop; end; $x$;
+ ERROR:  cache lookup failed for function 30281
+ CONTEXT:  SQL statement "create temp table qry_2_ss (tid tid) ON COMMIT DROP"
+ PL/pgSQL function inline_code_block line 26 at EXECUTE statement INSERT INTO brintest SELECT     repeat(stringu1,
42)::bytea,    substr(stringu1, 1, 1)::"char",
 


*** /home/markwkm/buildroot/HEAD/pgsql.24814/src/test/regress/expected/matview.out    Thu Dec 25 18:43:31 2014
--- /home/markwkm/buildroot/HEAD/pgsql.24814/src/test/regress/results/matview.out    Thu Dec 25 18:45:25 2014
***************
*** 90,97 ****
--- 90,102 ----  CREATE MATERIALIZED VIEW tvvm AS SELECT * FROM tvv; CREATE VIEW tvvmv AS SELECT * FROM tvvm;
+ ERROR:  cache lookup failed for function 30311 CREATE MATERIALIZED VIEW bb AS SELECT * FROM tvvmv;


When I saw jagarundi's failure yesterday, I figured it was something to do
with CLOBBER_CACHE_ALWAYS ... but kouprey isn't using that option AFAICS,
so that idea fails to hold water.

I don't believe that the referenced function OIDs would ever actually
exist in the database in the current regression tests; and the two failing
statements have no reason to be accessing any user-defined functions
anyway.  So those OIDs are probably bogus.  It seems likely that something
is clobbering storage that is later expected to hold an OID.  Whatever's
going on, it's likely that this is a recently-introduced bug, because
I don't recall seeing reports like these before.
        regards, tom lane

Re: Some other odd buildfarm failures

From

Tom Lane

Date:

26 December 2014, 00:01:56

I wrote:
> These two recent failures look suspiciously similar:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguarundi&dt=2014-12-24%2021%3A03%3A05
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kouprey&dt=2014-12-25%2018%3A43%3A17

And I'd barely finished posting that before this one arrived:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shearwater&dt=2014-12-25%2020%3A24%3A32

*** /home/buildfarm/build_root/HEAD/pgsql.32096/src/test/regress/expected/privileges.out    Fri Dec 26 00:24:38 2014
--- /home/buildfarm/build_root/HEAD/pgsql.32096/src/test/regress/results/privileges.out    Fri Dec 26 00:25:54 2014
***************
*** 197,202 ****
--- 197,203 ---- CREATE VIEW atestv1 AS SELECT * FROM atest1; -- ok /* The next *should* fail, but it's not implemented
thatway yet. */ CREATE VIEW atestv2 AS SELECT * FROM atest2;
 
+ ERROR:  cache lookup failed for function 30274 CREATE VIEW atestv3 AS SELECT * FROM atest3; -- ok /* Empty view is a
cornercase that failed in 9.2. */ CREATE VIEW atestv0 AS SELECT 0 as x WHERE false; -- ok
 

I find it suspicious that all three examples are in the same group of
parallel tests.  A possible theory is that one of these tests:

test: brin gin gist spgist privileges security_label collate matview lock replica_identity rowsecurity object_address

is doing something that has bad side-effects on concurrent sessions.

In any case, it now seems dead certain that this is a recently introduced
bug.  Andres is fortunate that the first instance occurred before his
recent batch of commits, or I'd be on him to revert them.  As is, though,
I'm wondering if 37de8de9e33606a043e98fee64b5595aedaa8254 could possibly
be related.
        regards, tom lane

Re: Some other odd buildfarm failures

From

Andres Freund

Date:

26 December 2014, 04:19:38

Hi,

On 2014-12-25 16:01:47 -0500, Tom Lane wrote:
> In any case, it now seems dead certain that this is a recently introduced
> bug.  Andres is fortunate that the first instance occurred before his
> recent batch of commits, or I'd be on him to revert them.

Yes, Phew. These look rather odd.

> As is, though,
> I'm wondering if 37de8de9e33606a043e98fee64b5595aedaa8254 could possibly
> be related.

I really can't imagine how. If a additional barrier in that place can
cause such problems we'd surely have more reports by an accidental wait
in the right place than these.

My guess is that it's related to d7ee82e50f. It seems realistic that the
event trigger added by it to the object_address test can cause errors at
varying times.

I wonder if it'd not be a good idea if the event trigger code installed
a error context callback? Since they can be called in situations we
don't routinely expect that'd make diagnosis in many cases easier.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Some other odd buildfarm failures

From

Tom Lane

Date:

26 December 2014, 04:31:53

Andres Freund <andres@2ndquadrant.com> writes:
> I wonder if it'd not be a good idea if the event trigger code installed
> a error context callback? Since they can be called in situations we
> don't routinely expect that'd make diagnosis in many cases easier.

+1 ... even if that's not related to the immediate issue, it seems like
a good idea.
        regards, tom lane

Re: Some other odd buildfarm failures

From

Tom Lane

Date:

26 December 2014, 06:19:09

Andres Freund <andres@2ndquadrant.com> writes:
> My guess is that it's related to d7ee82e50f. It seems realistic that the
> event trigger added by it to the object_address test can cause errors at
> varying times.

[ squint... ]  Event triggers are global across the whole database, aren't
they?  Isn't it frickin insane to run a test like this in parallel with
others?

Not but what it seems to be exposing some bugs.  Still, I don't think this
is a reasonable test design.  We have absolutely no idea what behaviors
are being triggered in the other tests, except that they are unrelated to
what those tests think they are testing.
        regards, tom lane

Re: Some other odd buildfarm failures

From

Alvaro Herrera

Date:

26 December 2014, 06:32:46

Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > My guess is that it's related to d7ee82e50f. It seems realistic that the
> > event trigger added by it to the object_address test can cause errors at
> > varying times.
> 
> [ squint... ]  Event triggers are global across the whole database, aren't
> they?  Isn't it frickin insane to run a test like this in parallel with
> others?

Well, the event trigger function is BEGIN END; so I don't think it
should affect anything.

> Not but what it seems to be exposing some bugs.

That seems to me a good thing ... a bit inconvenient of course, but it
beats having users get strange behavior which they can't possibly debug.

> Still, I don't think this is a reasonable test design.  We have
> absolutely no idea what behaviors are being triggered in the other
> tests, except that they are unrelated to what those tests think they
> are testing.

I can of course move it to a separate parallel test, but I don't think
that should be really necessary.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Some other odd buildfarm failures

From

Tom Lane

Date:

26 December 2014, 18:32:10

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Still, I don't think this is a reasonable test design.  We have
>> absolutely no idea what behaviors are being triggered in the other
>> tests, except that they are unrelated to what those tests think they
>> are testing.

> I can of course move it to a separate parallel test, but I don't think
> that should be really necessary.

I've not proven this rigorously, but it seems obvious in hindsight:
what's happening is that when the object_address test drops everything
with DROP CASCADE, other processes are sometimes just starting to execute
the event trigger when the DROP commits.  When they go to look up the
trigger function, they don't find it, leading to "cache lookup failed for
function".  The fact that the complained-of OID is slightly variable, but
always in the range of OIDs that would be getting assigned around this
point in a "make check" run, buttresses the theory.

I thought about changing the object_address test so that it explicitly
drops the event trigger first.  But that would not be a fix, it would
just make the timing harder to hit (ie, a victim process would need to
lose control for longer at the critical point).

Since I remain of the opinion that a test called object_address has no
damn business causing global side-effects, I think there are two
reasonable fixes:

1. Remove the event trigger.  This would slightly reduce the test's
coverage.

2. Run that whole test as a single transaction, so that the event trigger
is created and dropped in one transaction and is never seen as valid by
any concurrent test.

A long-term idea is to try to fix things so that there's sufficient
locking to make dropping an event trigger and immediately dropping its
trigger function safe.  But I'm not sure that's either possible or a good
idea (the lock obtained by DROP would bring the entire database to a
standstill ...).
        regards, tom lane

Re: Some other odd buildfarm failures

From

Alvaro Herrera

Date:

26 December 2014, 19:24:01

Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> Still, I don't think this is a reasonable test design.  We have
> >> absolutely no idea what behaviors are being triggered in the other
> >> tests, except that they are unrelated to what those tests think they
> >> are testing.
> 
> > I can of course move it to a separate parallel test, but I don't think
> > that should be really necessary.
> 
> I've not proven this rigorously, but it seems obvious in hindsight:
> what's happening is that when the object_address test drops everything
> with DROP CASCADE, other processes are sometimes just starting to execute
> the event trigger when the DROP commits.  When they go to look up the
> trigger function, they don't find it, leading to "cache lookup failed for
> function".

Hm, maybe we can drop the event trigger explicitely first, then wait a
little bit, then drop the remaining objects with DROP CASCADE?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Some other odd buildfarm failures

From

Tom Lane

Date:

26 December 2014, 19:27:11

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> I've not proven this rigorously, but it seems obvious in hindsight:
>> what's happening is that when the object_address test drops everything
>> with DROP CASCADE, other processes are sometimes just starting to execute
>> the event trigger when the DROP commits.  When they go to look up the
>> trigger function, they don't find it, leading to "cache lookup failed for
>> function".

> Hm, maybe we can drop the event trigger explicitely first, then wait a
> little bit, then drop the remaining objects with DROP CASCADE?

As I said, that's no fix; it just makes the timing harder to hit.  Another
process could be paused at the critical point for longer than whatever "a
little bit" is.
        regards, tom lane

Re: Some other odd buildfarm failures

From

Alvaro Herrera

Date:

26 December 2014, 19:35:50

Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> I've not proven this rigorously, but it seems obvious in hindsight:
> >> what's happening is that when the object_address test drops everything
> >> with DROP CASCADE, other processes are sometimes just starting to execute
> >> the event trigger when the DROP commits.  When they go to look up the
> >> trigger function, they don't find it, leading to "cache lookup failed for
> >> function".
> 
> > Hm, maybe we can drop the event trigger explicitely first, then wait a
> > little bit, then drop the remaining objects with DROP CASCADE?
> 
> As I said, that's no fix; it just makes the timing harder to hit.  Another
> process could be paused at the critical point for longer than whatever "a
> little bit" is.

Yeah, I was thinking we could play some games with the currently running
XIDs from a txid_snapshot or some such, with a reasonable upper limit on
the waiting time (for the rare cases with a server doing other stuff
with long-running transactions.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Some other odd buildfarm failures

From

Tom Lane

Date:

26 December 2014, 19:47:48

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> Hm, maybe we can drop the event trigger explicitely first, then wait a
>>> little bit, then drop the remaining objects with DROP CASCADE?

>> As I said, that's no fix; it just makes the timing harder to hit.  Another
>> process could be paused at the critical point for longer than whatever "a
>> little bit" is.

> Yeah, I was thinking we could play some games with the currently running
> XIDs from a txid_snapshot or some such, with a reasonable upper limit on
> the waiting time (for the rare cases with a server doing other stuff
> with long-running transactions.)

Whether that's sane or not, the whole problem is so far out-of-scope for
a test of pg_get_object_address() that it's not even funny.  I think
we should adopt one of the two fixes I recommended and call it good.
If you want to work on making DROP EVENT TRIGGER safer in the long run,
that can be a separate activity.
        regards, tom lane

Re: Some other odd buildfarm failures

From

Alvaro Herrera

Date:

26 December 2014, 20:10:53

Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> >>> Hm, maybe we can drop the event trigger explicitely first, then wait a
> >>> little bit, then drop the remaining objects with DROP CASCADE?
> 
> >> As I said, that's no fix; it just makes the timing harder to hit.  Another
> >> process could be paused at the critical point for longer than whatever "a
> >> little bit" is.
> 
> > Yeah, I was thinking we could play some games with the currently running
> > XIDs from a txid_snapshot or some such, with a reasonable upper limit on
> > the waiting time (for the rare cases with a server doing other stuff
> > with long-running transactions.)
> 
> Whether that's sane or not, the whole problem is so far out-of-scope for
> a test of pg_get_object_address() that it's not even funny.  I think
> we should adopt one of the two fixes I recommended and call it good.

I think dropping the part involving an event trigger from the test is
reasonable.  I will go do that.

> If you want to work on making DROP EVENT TRIGGER safer in the long run,
> that can be a separate activity.

This sounds like a huge project -- it's not like event triggers are the
only objects in the system where this is an issue, is it?  I'm sure
there is value in fixing it, but I have enough other projects.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Some other odd buildfarm failures

From

Andres Freund

Date:

26 December 2014, 20:33:55

On December 26, 2014 6:10:51 PM CET, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>Tom Lane wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> > Tom Lane wrote:
>> >> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> >>> Hm, maybe we can drop the event trigger explicitely first, then
>wait a
>> >>> little bit, then drop the remaining objects with DROP CASCADE?
>> 
>> >> As I said, that's no fix; it just makes the timing harder to hit. 
>Another
>> >> process could be paused at the critical point for longer than
>whatever "a
>> >> little bit" is.
>> 
>> > Yeah, I was thinking we could play some games with the currently
>running
>> > XIDs from a txid_snapshot or some such, with a reasonable upper
>limit on
>> > the waiting time (for the rare cases with a server doing other
>stuff
>> > with long-running transactions.)
>> 
>> Whether that's sane or not, the whole problem is so far out-of-scope
>for
>> a test of pg_get_object_address() that it's not even funny.  I think
>> we should adopt one of the two fixes I recommended and call it good.
>
>I think dropping the part involving an event trigger from the test is
>reasonable.  I will go do that.
>
>> If you want to work on making DROP EVENT TRIGGER safer in the long
>run,
>> that can be a separate activity.
>
>This sounds like a huge project -- it's not like event triggers are the
>only objects in the system where this is an issue, is it?  I'm sure
>there is value in fixing it, but I have enough other projects.

Can't we just move the test to run without parallelism? Its quite quick, so I don't it'd have noticeable consequences
timewise.


-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Some other odd buildfarm failures

From

Tom Lane

Date:

26 December 2014, 20:36:36

Andres Freund <andres@2ndquadrant.com> writes:
> Can't we just move the test to run without parallelism? Its quite quick, so I don't it'd have noticeable consequences
timewise.

That just leaves the door open for somebody to add more tests parallel to
it in future.

TBH, I think we could have done without this test altogether; but if we're
going to have it, a minimum expectation is that it not be hazardous to
other tests around it.
        regards, tom lane

Re: Some other odd buildfarm failures

From

Alvaro Herrera

Date:

26 December 2014, 20:42:03

Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > Can't we just move the test to run without parallelism? Its quite quick, so I don't it'd have noticeable
consequencestimewise.

> 
> That just leaves the door open for somebody to add more tests parallel to
> it in future.

I've been long wanted to add declarative dependencies to tests: each
test file would declare what other tests it depends on, and we would
have a special clause to state "this one must not be run in concurrence
with anything else".  Of course, this is just wishful thinking at this
point.

> TBH, I think we could have done without this test altogether; but if we're
> going to have it, a minimum expectation is that it not be hazardous to
> other tests around it.

The number of assertion failures in get_object_address without all the
sanity checks I added in pg_get_object_address was a bit surprising.
That's the whole reason I decided to add the test.  I don't want to
blindly assume that all cases will work nicely in the future,
particularly as other object types are added.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Some other odd buildfarm failures

From

Alvaro Herrera

Date:

26 December 2014, 20:42:40

Andres Freund wrote:

> >This sounds like a huge project -- it's not like event triggers are the
> >only objects in the system where this is an issue, is it?  I'm sure
> >there is value in fixing it, but I have enough other projects.
> 
> Can't we just move the test to run without parallelism? Its quite
> quick, so I don't it'd have noticeable consequences timewise.

(I got this a minute too late anyhow.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Some other odd buildfarm failures

From

Tom Lane

Date:

26 December 2014, 21:00:01

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> TBH, I think we could have done without this test altogether; but if we're
>> going to have it, a minimum expectation is that it not be hazardous to
>> other tests around it.

> The number of assertion failures in get_object_address without all the
> sanity checks I added in pg_get_object_address was a bit surprising.
> That's the whole reason I decided to add the test.  I don't want to
> blindly assume that all cases will work nicely in the future,
> particularly as other object types are added.

I'm surprised then that you didn't prefer the other solution (wrap the
whole test in a single transaction).  But we've probably spent more
time on this than it deserves.
        regards, tom lane