Thread: Some other odd buildfarm failures
These two recent failures look suspiciously similar: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguarundi&dt=2014-12-24%2021%3A03%3A05 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kouprey&dt=2014-12-25%2018%3A43%3A17 to wit: *** /usr/home/pgbf/buildenv/HEAD/pgsql.build/src/test/regress/expected/brin.out Wed Dec 24 22:03:07 2014 --- /usr/home/pgbf/buildenv/HEAD/pgsql.build/src/test/regress/results/brin.out Wed Dec 24 22:54:26 2014 *************** *** 146,151 **** --- 146,154 ---- end loop; end; $x$; + ERROR: cache lookup failed for function 30281 + CONTEXT: SQL statement "create temp table qry_2_ss (tid tid) ON COMMIT DROP" + PL/pgSQL function inline_code_block line 26 at EXECUTE statement INSERT INTO brintest SELECT repeat(stringu1, 42)::bytea, substr(stringu1, 1, 1)::"char", *** /home/markwkm/buildroot/HEAD/pgsql.24814/src/test/regress/expected/matview.out Thu Dec 25 18:43:31 2014 --- /home/markwkm/buildroot/HEAD/pgsql.24814/src/test/regress/results/matview.out Thu Dec 25 18:45:25 2014 *************** *** 90,97 **** --- 90,102 ---- CREATE MATERIALIZED VIEW tvvm AS SELECT * FROM tvv; CREATE VIEW tvvmv AS SELECT * FROM tvvm; + ERROR: cache lookup failed for function 30311 CREATE MATERIALIZED VIEW bb AS SELECT * FROM tvvmv; When I saw jagarundi's failure yesterday, I figured it was something to do with CLOBBER_CACHE_ALWAYS ... but kouprey isn't using that option AFAICS, so that idea fails to hold water. I don't believe that the referenced function OIDs would ever actually exist in the database in the current regression tests; and the two failing statements have no reason to be accessing any user-defined functions anyway. So those OIDs are probably bogus. It seems likely that something is clobbering storage that is later expected to hold an OID. Whatever's going on, it's likely that this is a recently-introduced bug, because I don't recall seeing reports like these before. regards, tom lane
I wrote: > These two recent failures look suspiciously similar: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguarundi&dt=2014-12-24%2021%3A03%3A05 > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kouprey&dt=2014-12-25%2018%3A43%3A17 And I'd barely finished posting that before this one arrived: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shearwater&dt=2014-12-25%2020%3A24%3A32 *** /home/buildfarm/build_root/HEAD/pgsql.32096/src/test/regress/expected/privileges.out Fri Dec 26 00:24:38 2014 --- /home/buildfarm/build_root/HEAD/pgsql.32096/src/test/regress/results/privileges.out Fri Dec 26 00:25:54 2014 *************** *** 197,202 **** --- 197,203 ---- CREATE VIEW atestv1 AS SELECT * FROM atest1; -- ok /* The next *should* fail, but it's not implemented thatway yet. */ CREATE VIEW atestv2 AS SELECT * FROM atest2; + ERROR: cache lookup failed for function 30274 CREATE VIEW atestv3 AS SELECT * FROM atest3; -- ok /* Empty view is a cornercase that failed in 9.2. */ CREATE VIEW atestv0 AS SELECT 0 as x WHERE false; -- ok I find it suspicious that all three examples are in the same group of parallel tests. A possible theory is that one of these tests: test: brin gin gist spgist privileges security_label collate matview lock replica_identity rowsecurity object_address is doing something that has bad side-effects on concurrent sessions. In any case, it now seems dead certain that this is a recently introduced bug. Andres is fortunate that the first instance occurred before his recent batch of commits, or I'd be on him to revert them. As is, though, I'm wondering if 37de8de9e33606a043e98fee64b5595aedaa8254 could possibly be related. regards, tom lane
Hi, On 2014-12-25 16:01:47 -0500, Tom Lane wrote: > In any case, it now seems dead certain that this is a recently introduced > bug. Andres is fortunate that the first instance occurred before his > recent batch of commits, or I'd be on him to revert them. Yes, Phew. These look rather odd. > As is, though, > I'm wondering if 37de8de9e33606a043e98fee64b5595aedaa8254 could possibly > be related. I really can't imagine how. If a additional barrier in that place can cause such problems we'd surely have more reports by an accidental wait in the right place than these. My guess is that it's related to d7ee82e50f. It seems realistic that the event trigger added by it to the object_address test can cause errors at varying times. I wonder if it'd not be a good idea if the event trigger code installed a error context callback? Since they can be called in situations we don't routinely expect that'd make diagnosis in many cases easier. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > I wonder if it'd not be a good idea if the event trigger code installed > a error context callback? Since they can be called in situations we > don't routinely expect that'd make diagnosis in many cases easier. +1 ... even if that's not related to the immediate issue, it seems like a good idea. regards, tom lane
Andres Freund <andres@2ndquadrant.com> writes: > My guess is that it's related to d7ee82e50f. It seems realistic that the > event trigger added by it to the object_address test can cause errors at > varying times. [ squint... ] Event triggers are global across the whole database, aren't they? Isn't it frickin insane to run a test like this in parallel with others? Not but what it seems to be exposing some bugs. Still, I don't think this is a reasonable test design. We have absolutely no idea what behaviors are being triggered in the other tests, except that they are unrelated to what those tests think they are testing. regards, tom lane
Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > My guess is that it's related to d7ee82e50f. It seems realistic that the > > event trigger added by it to the object_address test can cause errors at > > varying times. > > [ squint... ] Event triggers are global across the whole database, aren't > they? Isn't it frickin insane to run a test like this in parallel with > others? Well, the event trigger function is BEGIN END; so I don't think it should affect anything. > Not but what it seems to be exposing some bugs. That seems to me a good thing ... a bit inconvenient of course, but it beats having users get strange behavior which they can't possibly debug. > Still, I don't think this is a reasonable test design. We have > absolutely no idea what behaviors are being triggered in the other > tests, except that they are unrelated to what those tests think they > are testing. I can of course move it to a separate parallel test, but I don't think that should be really necessary. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> Still, I don't think this is a reasonable test design. We have >> absolutely no idea what behaviors are being triggered in the other >> tests, except that they are unrelated to what those tests think they >> are testing. > I can of course move it to a separate parallel test, but I don't think > that should be really necessary. I've not proven this rigorously, but it seems obvious in hindsight: what's happening is that when the object_address test drops everything with DROP CASCADE, other processes are sometimes just starting to execute the event trigger when the DROP commits. When they go to look up the trigger function, they don't find it, leading to "cache lookup failed for function". The fact that the complained-of OID is slightly variable, but always in the range of OIDs that would be getting assigned around this point in a "make check" run, buttresses the theory. I thought about changing the object_address test so that it explicitly drops the event trigger first. But that would not be a fix, it would just make the timing harder to hit (ie, a victim process would need to lose control for longer at the critical point). Since I remain of the opinion that a test called object_address has no damn business causing global side-effects, I think there are two reasonable fixes: 1. Remove the event trigger. This would slightly reduce the test's coverage. 2. Run that whole test as a single transaction, so that the event trigger is created and dropped in one transaction and is never seen as valid by any concurrent test. A long-term idea is to try to fix things so that there's sufficient locking to make dropping an event trigger and immediately dropping its trigger function safe. But I'm not sure that's either possible or a good idea (the lock obtained by DROP would bring the entire database to a standstill ...). regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > Tom Lane wrote: > >> Still, I don't think this is a reasonable test design. We have > >> absolutely no idea what behaviors are being triggered in the other > >> tests, except that they are unrelated to what those tests think they > >> are testing. > > > I can of course move it to a separate parallel test, but I don't think > > that should be really necessary. > > I've not proven this rigorously, but it seems obvious in hindsight: > what's happening is that when the object_address test drops everything > with DROP CASCADE, other processes are sometimes just starting to execute > the event trigger when the DROP commits. When they go to look up the > trigger function, they don't find it, leading to "cache lookup failed for > function". Hm, maybe we can drop the event trigger explicitely first, then wait a little bit, then drop the remaining objects with DROP CASCADE? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> I've not proven this rigorously, but it seems obvious in hindsight: >> what's happening is that when the object_address test drops everything >> with DROP CASCADE, other processes are sometimes just starting to execute >> the event trigger when the DROP commits. When they go to look up the >> trigger function, they don't find it, leading to "cache lookup failed for >> function". > Hm, maybe we can drop the event trigger explicitely first, then wait a > little bit, then drop the remaining objects with DROP CASCADE? As I said, that's no fix; it just makes the timing harder to hit. Another process could be paused at the critical point for longer than whatever "a little bit" is. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > Tom Lane wrote: > >> I've not proven this rigorously, but it seems obvious in hindsight: > >> what's happening is that when the object_address test drops everything > >> with DROP CASCADE, other processes are sometimes just starting to execute > >> the event trigger when the DROP commits. When they go to look up the > >> trigger function, they don't find it, leading to "cache lookup failed for > >> function". > > > Hm, maybe we can drop the event trigger explicitely first, then wait a > > little bit, then drop the remaining objects with DROP CASCADE? > > As I said, that's no fix; it just makes the timing harder to hit. Another > process could be paused at the critical point for longer than whatever "a > little bit" is. Yeah, I was thinking we could play some games with the currently running XIDs from a txid_snapshot or some such, with a reasonable upper limit on the waiting time (for the rare cases with a server doing other stuff with long-running transactions.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> Alvaro Herrera <alvherre@2ndquadrant.com> writes: >>> Hm, maybe we can drop the event trigger explicitely first, then wait a >>> little bit, then drop the remaining objects with DROP CASCADE? >> As I said, that's no fix; it just makes the timing harder to hit. Another >> process could be paused at the critical point for longer than whatever "a >> little bit" is. > Yeah, I was thinking we could play some games with the currently running > XIDs from a txid_snapshot or some such, with a reasonable upper limit on > the waiting time (for the rare cases with a server doing other stuff > with long-running transactions.) Whether that's sane or not, the whole problem is so far out-of-scope for a test of pg_get_object_address() that it's not even funny. I think we should adopt one of the two fixes I recommended and call it good. If you want to work on making DROP EVENT TRIGGER safer in the long run, that can be a separate activity. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > Tom Lane wrote: > >> Alvaro Herrera <alvherre@2ndquadrant.com> writes: > >>> Hm, maybe we can drop the event trigger explicitely first, then wait a > >>> little bit, then drop the remaining objects with DROP CASCADE? > > >> As I said, that's no fix; it just makes the timing harder to hit. Another > >> process could be paused at the critical point for longer than whatever "a > >> little bit" is. > > > Yeah, I was thinking we could play some games with the currently running > > XIDs from a txid_snapshot or some such, with a reasonable upper limit on > > the waiting time (for the rare cases with a server doing other stuff > > with long-running transactions.) > > Whether that's sane or not, the whole problem is so far out-of-scope for > a test of pg_get_object_address() that it's not even funny. I think > we should adopt one of the two fixes I recommended and call it good. I think dropping the part involving an event trigger from the test is reasonable. I will go do that. > If you want to work on making DROP EVENT TRIGGER safer in the long run, > that can be a separate activity. This sounds like a huge project -- it's not like event triggers are the only objects in the system where this is an issue, is it? I'm sure there is value in fixing it, but I have enough other projects. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On December 26, 2014 6:10:51 PM CET, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >Tom Lane wrote: >> Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> > Tom Lane wrote: >> >> Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> >>> Hm, maybe we can drop the event trigger explicitely first, then >wait a >> >>> little bit, then drop the remaining objects with DROP CASCADE? >> >> >> As I said, that's no fix; it just makes the timing harder to hit. >Another >> >> process could be paused at the critical point for longer than >whatever "a >> >> little bit" is. >> >> > Yeah, I was thinking we could play some games with the currently >running >> > XIDs from a txid_snapshot or some such, with a reasonable upper >limit on >> > the waiting time (for the rare cases with a server doing other >stuff >> > with long-running transactions.) >> >> Whether that's sane or not, the whole problem is so far out-of-scope >for >> a test of pg_get_object_address() that it's not even funny. I think >> we should adopt one of the two fixes I recommended and call it good. > >I think dropping the part involving an event trigger from the test is >reasonable. I will go do that. > >> If you want to work on making DROP EVENT TRIGGER safer in the long >run, >> that can be a separate activity. > >This sounds like a huge project -- it's not like event triggers are the >only objects in the system where this is an issue, is it? I'm sure >there is value in fixing it, but I have enough other projects. Can't we just move the test to run without parallelism? Its quite quick, so I don't it'd have noticeable consequences timewise. -- Please excuse brevity and formatting - I am writing this on my mobile phone. Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > Can't we just move the test to run without parallelism? Its quite quick, so I don't it'd have noticeable consequences timewise. That just leaves the door open for somebody to add more tests parallel to it in future. TBH, I think we could have done without this test altogether; but if we're going to have it, a minimum expectation is that it not be hazardous to other tests around it. regards, tom lane
Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > Can't we just move the test to run without parallelism? Its quite quick, so I don't it'd have noticeable consequencestimewise. > > That just leaves the door open for somebody to add more tests parallel to > it in future. I've been long wanted to add declarative dependencies to tests: each test file would declare what other tests it depends on, and we would have a special clause to state "this one must not be run in concurrence with anything else". Of course, this is just wishful thinking at this point. > TBH, I think we could have done without this test altogether; but if we're > going to have it, a minimum expectation is that it not be hazardous to > other tests around it. The number of assertion failures in get_object_address without all the sanity checks I added in pg_get_object_address was a bit surprising. That's the whole reason I decided to add the test. I don't want to blindly assume that all cases will work nicely in the future, particularly as other object types are added. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund wrote: > >This sounds like a huge project -- it's not like event triggers are the > >only objects in the system where this is an issue, is it? I'm sure > >there is value in fixing it, but I have enough other projects. > > Can't we just move the test to run without parallelism? Its quite > quick, so I don't it'd have noticeable consequences timewise. (I got this a minute too late anyhow.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> TBH, I think we could have done without this test altogether; but if we're >> going to have it, a minimum expectation is that it not be hazardous to >> other tests around it. > The number of assertion failures in get_object_address without all the > sanity checks I added in pg_get_object_address was a bit surprising. > That's the whole reason I decided to add the test. I don't want to > blindly assume that all cases will work nicely in the future, > particularly as other object types are added. I'm surprised then that you didn't prefer the other solution (wrap the whole test in a single transaction). But we've probably spent more time on this than it deserves. regards, tom lane