Autovacuum breakage from a734fd5d1 - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Autovacuum breakage from a734fd5d1 |
Date | |
Msg-id | 21299.1480272347@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: Autovacuum breakage from a734fd5d1
|
List | pgsql-hackers |
Buildfarm member skink failed a couple days ago: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2016-11-25%2017%3A50%3A01 I believe the interesting parts of the log are 2016-11-25 18:29:03.285 UTC [583882e7.2a45:1] LOG: autovacuum: dropping orphan temp table "(null)"."(null)" in database"regression" 2016-11-25 18:29:03.919 UTC [583882e7.2a45:2] ERROR: could not open relation with OID 33895 2016-11-25 18:29:03.919 UTC [583882e7.2a45:3] CONTEXT: dropping of orphan temp table "(null)"."(null)" in database "regression" TRAP: FailedAssertion("!(pointer != ((void *)0))", File: "/home/andres/build/buildfarm/HEAD/pgsql.build/../pgsql/src/backend/utils/mmgr/mcxt.c",Line: 1001) 2016-11-25 18:29:07.524 UTC [58387ba7.609:3] LOG: server process (PID 10821) was terminated by signal 6: Aborted 2016-11-25 18:29:07.526 UTC [58387ba7.609:4] LOG: terminating any other active server processes 10821 = 2a45 hex so the first three log entries are from the process that later asserted. The assertion is evidently from a "pfree(NULL)" call. Looking into autovacuum.c, it's pretty clear that it was trying to drop what it thought was an orphan temp table, only the temp table was already gone, causing get_namespace_name() and get_rel_name() to return NULL. It's only a platform-specific artifact that the ereport calls printed "(null)" rather than crashing, and of course the pfree(relname) down at the bottom of the loop is what caused the assert failure. Presumably, what happened here is that somebody else already deleted the "orphan" table before autovacuum could get to it. Since there's no previous crash in this run, it's difficult to come to any conclusion other than that the "orphan" got deleted by its owning backend, meaning that the test for orphan status is jumping the gun and deciding that a table is orphaned before its owning session is really gone. That's rather disastrous if true. The other problem here is that the code is supposing that it can access a temp table's catalog entries without having obtained any lock on it. Correct coding would involve acquiring the relation lock and then checking to see if the table is still there. I am also not feeling terribly comfortable with the potentially long interval between where we decide that relation OID so-and-so is an orphaned table and where we summarily remove that relation OID. This is probably safe, if you assume that temp tables aren't likely to have lifespans comparable to the OID counter's wraparound interval, but I think we would be well advised to recheck that decision after acquiring lock. At least some of these behaviors were presumably hazards before, but were effectively masked by the long delay before we'd attempt to remove an orphan temp table. But I am wondering why we've not seen reports of crashes due to the old "found orphan temp table" log message trying to dereference null pointers, since that could have occurred immediately after the table became (ostensibly) orphan. I don't have proposed fixes for any of this yet, but it's clearly an area that needs review. regards, tom lane
pgsql-hackers by date: