Thread: Something is broken about connection startup
I noticed that buildfarm member piculet fell over this afternoon:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2016-11-10%2020%3A10%3A02
with this interesting failure during startup of the "collate" test:

psql: FATAL: cache lookup failed for relation 1255

1255 is pg_proc, and nosing around, I noticed that the concurrent "init_privs" test does this:

GRANT SELECT ON pg_proc TO CURRENT_USER;
GRANT SELECT (prosrc) ON pg_proc TO CURRENT_USER;

So that led me to hypothesize that GRANT on a system catalog can cause a concurrent connection failure, which I tested by running

pgbench -U postgres -n -f script1.sql -T 300 regression

with this script:

GRANT SELECT ON pg_proc TO CURRENT_USER;
GRANT SELECT (prosrc) ON pg_proc TO CURRENT_USER;
REVOKE SELECT ON pg_proc FROM CURRENT_USER;
REVOKE SELECT (prosrc) ON pg_proc FROM CURRENT_USER;

and concurrently

pgbench -C -U postgres -n -f script2.sql -c 10 -j 10 -T 300 regression

with this script:

select 2 + 2;

and sure enough, the second one falls over after a bit with

connection to database "regression" failed:
FATAL: cache lookup failed for relation 1255
client 5 aborted while establishing connection

For me, this typically happens within thirty seconds or less. I thought perhaps it only happened with --no-atomics which piculet is using, but nope, I can reproduce it in a stock debug build. For the record, I'm testing on an 8-core x86_64 machine running RHEL6.

Note: you can't merge this test scenario into one pgbench run with two scripts, because then you can't keep pgbench from sometimes running two instances of script1 concurrently, with ensuing "tuple concurrently updated" errors. That's something we've previously deemed not worth changing, and in any case it's not what I'm on about at the moment.

I tried to make script1 safe for concurrent calls by putting "begin; lock table pg_proc in share row exclusive mode; ...; commit;" around it, but that caused the error to go away, or at least become far less frequent. Which is odd in itself, since that lock level shouldn't block connection startup accesses to pg_proc.

A quick look through the sources confirms that this error implies that
SearchSysCache on the RELOID cache must have failed to find a tuple for
pg_proc --- there are many occurrences of this text, but they all are
reporting that. Which absolutely should not be happening now that we use
MVCC catalog scans, concurrent updates or no. So I think this is a bug,
and possibly a fairly-recently-introduced one, because I can't remember
seeing buildfarm failures like this one before.

I've not dug further than that yet. Any thoughts out there?

			regards, tom lane
I wrote:
> A quick look through the sources confirms that this error implies that
> SearchSysCache on the RELOID cache must have failed to find a tuple for
> pg_proc --- there are many occurrences of this text, but they all are
> reporting that. Which absolutely should not be happening now that we use
> MVCC catalog scans, concurrent updates or no. So I think this is a bug,
> and possibly a fairly-recently-introduced one, because I can't remember
> seeing buildfarm failures like this one before.

After tweaking elog.c to promote FATAL to PANIC, I got stack traces confirming that the error occurs here:

#0  0x0000003779a325e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003779a33dc5 in abort () at abort.c:92
#2  0x000000000080d177 in errfinish (dummy=<value optimized out>) at elog.c:560
#3  0x000000000080df94 in elog_finish (elevel=<value optimized out>, fmt=<value optimized out>) at elog.c:1381
#4  0x0000000000801859 in RelationCacheInitializePhase3 () at relcache.c:3444
#5  0x000000000081a145 in InitPostgres (in_dbname=<value optimized out>, dboid=0, username=<value optimized out>, useroid=<value optimized out>, out_dbname=0x0) at postinit.c:982
#6  0x0000000000710c81 in PostgresMain (argc=1, argv=<value optimized out>, dbname=0x24d4c40 "regression", username=0x24abc88 "postgres") at postgres.c:3728
#7  0x00000000006a6eae in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4271
#8  BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:3945
#9  ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1701
#10 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1309
#11 0x00000000006273d8 in main (argc=3, argv=0x24a9b20) at main.c:228

So it's happening when RelationCacheInitializePhase3 is trying to replace
a fake pg_class row for pg_proc (made by formrdesc) with the real one.
That's even odder, because that's late enough that this should be a pretty
ordinary catalog lookup. Now I wonder if it's possible that this can be
seen during ordinary relation opens after connection startup. If so, it
would almost surely be a recently-introduced bug, else we'd have heard
about this from the field.

			regards, tom lane
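[For reference, the lookup that fails at frame #4 is essentially of this shape -- a condensed approximation of the relcache.c code near the cited line, not an exact quotation:]

	/*
	 * RelationCacheInitializePhase3, approximately: replace the relcache
	 * entry built from formrdesc's hardwired data with the real pg_class
	 * row.  The FATAL fires when the syscache lookup finds no tuple that
	 * is visible to the snapshot used for the scan.
	 */
	HeapTuple	htup;
	Form_pg_class relp;

	htup = SearchSysCache1(RELOID,
						   ObjectIdGetDatum(RelationGetRelid(relation)));
	if (!HeapTupleIsValid(htup))
		elog(FATAL, "cache lookup failed for relation %u",
			 RelationGetRelid(relation));
	relp = (Form_pg_class) GETSTRUCT(htup);
	/* ... copy relp's fields into the relcache entry ... */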
I wrote:
> So it's happening when RelationCacheInitializePhase3 is trying to replace
> a fake pg_class row for pg_proc (made by formrdesc) with the real one.
> That's even odder, because that's late enough that this should be a pretty
> ordinary catalog lookup. Now I wonder if it's possible that this can be
> seen during ordinary relation opens after connection startup. If so, it
> would almost surely be a recently-introduced bug, else we'd have heard
> about this from the field.

Okay, I've been tracing through this, and the reason that the catalog search for the real pg_proc row is failing is that all it finds are versions too new for its snapshot to see. Thus, the failure happens when the process running the GRANTs has pruned away a committed-dead tuple that is the version the incoming process would have needed to see.

This is not the fault of the process running the GRANTs, because *the incoming process is advertising MyPgXact->xmin = zero* which should mean that it has no active snapshot. It has no right to complain that somebody truncated away data.

The sequence of events inside the incoming process seems to go approximately like this:

* RelationCacheInitializePhase3 goes to load up the pg_class row for some other catalog (maybe pg_class itself; whatever hash_seq_search comes to first). This results in systable_beginscan calling GetCatalogSnapshot which calls GetSnapshotData which sets MyPgXact->xmin nonzero as a side effect. All well so far.

* Once we've collected that row, systable_endscan unregisters the snapshot, which ends with SnapshotResetXmin resetting MyPgXact->xmin to zero because the RegisteredSnapshots list is now empty.

* RelationCacheInitializePhase3 now tries to read the pg_class row for pg_proc. This results in systable_beginscan calling GetCatalogSnapshot, which doesn't call GetSnapshotData this time, it just returns the snapshot it's already got. This is a fatal mistake, because there is nothing preventing data visible to that snapshot from being removed.

So this has pretty much been broken since we put in MVCC snapshots for
catalog searches. The problem would be masked when inside a transaction
that has already got a transaction snapshot, but whenever we don't have
one already, our catalog lookups are basically unprotected against
premature row removal. I don't see any reason to think that this is
specific to connection startup.

The basic problem here, therefore, is that SnapshotResetXmin isn't aware
that GetCatalogSnapshot is keeping a possibly-unregistered snapshot in
its hip pocket. That has to change. We could either treat the saved
CatalogSnapshot as always being registered, or we could add some logic
to force invalidating the CatalogSnapshot whenever we clear MyPgXact->xmin
or advance it past that snapshot's xmin.

Neither of those is very nice from a performance standpoint. With the
first option we risk delaying global cleanup by holding back
MyPgXact->xmin to protect a CatalogSnapshot we might not actually use
anytime soon. With the second option we will likely end up doing many
more GetSnapshotData calls than we do now, because in a transaction
that hasn't yet set a regular snapshot, we will need to get a new
CatalogSnapshot for every catalog access (since we'll go back to
MyPgXact->xmin = 0 after every access). And the parser, in particular,
tends to do a lot of catalog accesses before we ever get a regular
transaction snapshot.

Ideally, perhaps, we'd treat the saved CatalogSnapshot as registered
but automatically kill it "every so often". I'm not sure how to drive
that though.

Thoughts?

			regards, tom lane
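[To make the hazard concrete, here is a rough sketch -- not the actual snapmgr.c source -- of the pre-fix caching behavior being described, with the unprotected window marked in comments. The "_sketch" name and the exact control flow are illustrative only.]

static SnapshotData CatalogSnapshotData;
static Snapshot CatalogSnapshot = NULL;
static bool CatalogSnapshotStale = true;

/*
 * Sketch of GetCatalogSnapshot as it behaved before the fix: the snapshot
 * is cached in a static variable but never entered into the
 * RegisteredSnapshots pairing heap, so SnapshotResetXmin has no idea it
 * exists.
 */
Snapshot
GetCatalogSnapshot_sketch(Oid relid)
{
	if (CatalogSnapshotStale)
	{
		/* Sets MyPgXact->xmin nonzero as a side effect. */
		CatalogSnapshot = GetSnapshotData(&CatalogSnapshotData);
		CatalogSnapshotStale = false;
	}

	/*
	 * On a repeat call we just hand back the cached snapshot.  If the
	 * RegisteredSnapshots heap emptied out in the meantime,
	 * SnapshotResetXmin has already zeroed MyPgXact->xmin, and nothing
	 * stops concurrent pruning of rows this snapshot still needs to see.
	 */
	return CatalogSnapshot;
}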
On Fri, Nov 11, 2016 at 3:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So this has pretty much been broken since we put in MVCC snapshots for
> catalog searches. The problem would be masked when inside a transaction
> that has already got a transaction snapshot, but whenever we don't have
> one already, our catalog lookups are basically unprotected against
> premature row removal. I don't see any reason to think that this is
> specific to connection startup.

Yeah. I was focused on the case where the backend is already in a transaction, and obviously didn't think hard enough about the scenario where that isn't the case.

> The basic problem here, therefore, is that SnapshotResetXmin isn't aware
> that GetCatalogSnapshot is keeping a possibly-unregistered snapshot in
> its hip pocket. That has to change. We could either treat the saved
> CatalogSnapshot as always being registered, or we could add some logic
> to force invalidating the CatalogSnapshot whenever we clear MyPgXact->xmin
> or advance it past that snapshot's xmin.
>
> Neither of those is very nice from a performance standpoint. With the
> first option we risk delaying global cleanup by holding back
> MyPgXact->xmin to protect a CatalogSnapshot we might not actually use
> anytime soon. With the second option we will likely end up doing many
> more GetSnapshotData calls than we do now, because in a transaction
> that hasn't yet set a regular snapshot, we will need to get a new
> CatalogSnapshot for every catalog access (since we'll go back to
> MyPgXact->xmin = 0 after every access). And the parser, in particular,
> tends to do a lot of catalog accesses before we ever get a regular
> transaction snapshot.
>
> Ideally, perhaps, we'd treat the saved CatalogSnapshot as registered
> but automatically kill it "every so often". I'm not sure how to drive
> that though.

I think the really important thing is that we don't leave xmin set when the backend is idle. If we set MyProc->xmin during startup or in some other out-of-transaction activity and never updated it until we went idle, I think that wouldn't hurt much because we're not going to spend tons of time in that state anyway. But if we start leaving xmin set for idle backends, our users will be howling in agony.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Nov 11, 2016 at 3:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The basic problem here, therefore, is that SnapshotResetXmin isn't aware
>> that GetCatalogSnapshot is keeping a possibly-unregistered snapshot in
>> its hip pocket. That has to change. We could either treat the saved
>> CatalogSnapshot as always being registered, or we could add some logic
>> to force invalidating the CatalogSnapshot whenever we clear MyPgXact->xmin
>> or advance it past that snapshot's xmin.
>>
>> Neither of those is very nice from a performance standpoint. With the
>> first option we risk delaying global cleanup by holding back
>> MyPgXact->xmin to protect a CatalogSnapshot we might not actually use
>> anytime soon. With the second option we will likely end up doing many
>> more GetSnapshotData calls than we do now, because in a transaction
>> that hasn't yet set a regular snapshot, we will need to get a new
>> CatalogSnapshot for every catalog access (since we'll go back to
>> MyPgXact->xmin = 0 after every access). And the parser, in particular,
>> tends to do a lot of catalog accesses before we ever get a regular
>> transaction snapshot.
>>
>> Ideally, perhaps, we'd treat the saved CatalogSnapshot as registered
>> but automatically kill it "every so often". I'm not sure how to drive
>> that though.

> I think the really important thing is that we don't leave xmin set
> when the backend is idle.

Agreed. I did some stats-gathering on this over the weekend, seeing how often the various situations occur during the core regression tests. It's not as big a problem as I first thought, because we hold a snapshot while doing planning, but the case does arise for various utility commands (particularly LOCK TABLE) so it's not negligible.

What is looking like the best answer to me is to treat CatalogSnapshot as registered, but forcibly unregister it before going idle. We don't need to unregister it more often than that, because we'd hold a transaction snapshot of some description throughout any long-running command, and the CatalogSnapshot would be no older than that anyway.

"Before going idle" could be implemented with an InvalidateCatalogSnapshot call either in postgres.c's finish_xact_command(), or right in the main message-receipt loop. The latter seems like a wart, but it would cover cases like a client going to sleep between Bind and Execute phases of an extended query. On the other hand, I think we might be holding a transaction snapshot at that point anyway. It's at least arguable that we should be going through finish_xact_command() at any point where it's likely that we have a snapshot to release.

Thoughts?

			regards, tom lane
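[A rough sketch of what "treat it as registered, unregister before going idle" implies for the invalidation path. Names follow snapmgr.c; this is an illustration of the idea under discussion, not the committed code -- the actual patch appears later in the thread.]

/*
 * Illustration only: if the cached catalog snapshot lives in the
 * RegisteredSnapshots pairing heap, dropping it must pull it back out of
 * the heap and then let SnapshotResetXmin recompute (possibly clear)
 * MyPgXact->xmin before the backend goes idle.
 */
void
InvalidateCatalogSnapshot_sketch(void)
{
	if (CatalogSnapshot != NULL)
	{
		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
		CatalogSnapshot = NULL;
		SnapshotResetXmin();
	}
}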
On Mon, Nov 14, 2016 at 10:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think the really important thing is that we don't leave xmin set
>> when the backend is idle.
>
> Agreed. I did some stats-gathering on this over the weekend, seeing how
> often the various situations occur during the core regression tests.
> It's not as big a problem as I first thought, because we hold a snapshot
> while doing planning, but the case does arise for various utility commands
> (particularly LOCK TABLE) so it's not negligible.
>
> What is looking like the best answer to me is to treat CatalogSnapshot
> as registered, but forcibly unregister it before going idle. We don't
> need to unregister it more often than that, because we'd hold a
> transaction snapshot of some description throughout any long-running
> command, and the CatalogSnapshot would be no older than that anyway.

Makes sense.

> "Before going idle" could be implemented with an InvalidateCatalogSnapshot
> call either in postgres.c's finish_xact_command(), or right in the main
> message-receipt loop. The latter seems like a wart, but it would cover
> cases like a client going to sleep between Bind and Execute phases of an
> extended query. On the other hand, I think we might be holding a
> transaction snapshot at that point anyway. It's at least arguable that
> we should be going through finish_xact_command() at any point where it's
> likely that we have a snapshot to release.

I assume you're going to back-patch this, and the consequences of failing to reset it before going idle could easily be vastly worse than the problem you're trying to fix. So I'd rather not make assumptions like "the client will probably never sleep between Bind and Execute". I bet that's actually false, and I certainly wouldn't want to bet the farm on it being true.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I assume you're going to back-patch this, and the consequences of
> failing to reset it before going idle could easily be vastly worse
> than the problem you're trying to fix. So I'd rather not make
> assumptions like "the client will probably never sleep between Bind
> and Execute". I bet that's actually false, and I certainly wouldn't
> want to bet the farm on it being true.

No, I'm not at all proposing to assume that. But I may be willing to assume that "we don't hold a CatalogSnapshot between Bind and Execute unless we're also holding a transaction snapshot". I need to do a bit more research to see if that's actually true, though.

			regards, tom lane
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I assume you're going to back-patch this, and the consequences of
>> failing to reset it before going idle could easily be vastly worse
>> than the problem you're trying to fix. So I'd rather not make
>> assumptions like "the client will probably never sleep between Bind
>> and Execute". I bet that's actually false, and I certainly wouldn't
>> want to bet the farm on it being true.

> No, I'm not at all proposing to assume that. But I may be willing to
> assume that "we don't hold a CatalogSnapshot between Bind and Execute
> unless we're also holding a transaction snapshot". I need to do a bit
> more research to see if that's actually true, though.

Turns out it's not true.

I still don't much like having the main loop in PostgresMain know about this hack, so I experimented with putting the InvalidateCatalogSnapshot() calls into places in postgres.c that were already dealing with transaction commit/abort or snapshot management. I ended up needing five such calls (as in the attached patch), which is just about equally ugly. So at this point I'm inclined to hold my nose and stick a call into step (1) of the main loop instead.

Also, wherever we end up putting those calls, is it worth providing a variant invalidation function that only kills the catalog snapshot when it's the only one outstanding? (If it isn't, the transaction snapshot should be older, so there's no chance of advancing our xmin by killing it.) In principle this would save some catalog snapshot rebuilds for inside-a-transaction-block cases, but I'm not sure it's worth sweating that when we're doing client message exchange anyway.

Lastly, I find myself disliking the separate CatalogSnapshotStale flag variable. The other special snapshots in snapmgr.c are managed by setting the pointer to NULL when it's not valid, so I wonder why CatalogSnapshot wasn't done that way. Since this patch is touching almost every use of that flag already, it wouldn't take much to switch it over.

Comments?

			regards, tom lane

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 599874e..c9da8ed 100644
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
*************** exec_parse_message(const char *query_str
*** 1352,1357 ****
--- 1352,1358 ----
  		/* Done with the snapshot used for parsing */
  		if (snapshot_set)
  			PopActiveSnapshot();
+ 		InvalidateCatalogSnapshot();
  	}
  	else
  	{
*************** exec_bind_message(StringInfo input_messa
*** 1778,1783 ****
--- 1779,1785 ----
  	/* Done with the snapshot used for parameter I/O and parsing/planning */
  	if (snapshot_set)
  		PopActiveSnapshot();
+ 	InvalidateCatalogSnapshot();

  	/*
  	 * And we're ready to start portal execution.
*************** exec_execute_message(const char *portal_
*** 2000,2005 ****
--- 2002,2012 ----
  		 * those that start or end a transaction block.
  		 */
  		CommandCounterIncrement();
+
+ 		/*
+ 		 * Flush catalog snapshot after every query, too.
+ 		 */
+ 		InvalidateCatalogSnapshot();
  	}

  	/* Send appropriate CommandComplete to client */
*************** finish_xact_command(void)
*** 2456,2461 ****
--- 2463,2471 ----

  		CommitTransactionCommand();

+ 		/* Flush catalog snapshot if any */
+ 		InvalidateCatalogSnapshot();
+
  #ifdef MEMORY_CONTEXT_CHECKING
  		/* Check all memory contexts that weren't freed during commit */
  		/* (those that were, were checked before being deleted) */
*************** PostgresMain(int argc, char *argv[],
*** 3871,3876 ****
--- 3881,3888 ----
  		 */
  		AbortCurrentTransaction();

+ 		InvalidateCatalogSnapshot();
+
  		if (am_walsender)
  			WalSndErrorCleanup();

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1ec9f70..842e135 100644
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
*************** GetTransactionSnapshot(void)
*** 316,321 ****
--- 316,327 ----
  	/* First call in transaction? */
  	if (!FirstSnapshotSet)
  	{
+ 		/*
+ 		 * Don't allow catalog snapshot to be older than xact snapshot. Must
+ 		 * do this first to allow the empty-heap Assert to succeed.
+ 		 */
+ 		InvalidateCatalogSnapshot();
+
  		Assert(pairingheap_is_empty(&RegisteredSnapshots));
  		Assert(FirstXactSnapshot == NULL);
*************** GetTransactionSnapshot(void)
*** 347,355 ****
  		else
  			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);

- 		/* Don't allow catalog snapshot to be older than xact snapshot. */
- 		CatalogSnapshotStale = true;
-
  		FirstSnapshotSet = true;
  		return CurrentSnapshot;
  	}
--- 353,358 ----
*************** GetTransactionSnapshot(void)
*** 358,364 ****
  		return CurrentSnapshot;

  	/* Don't allow catalog snapshot to be older than xact snapshot. */
! 	CatalogSnapshotStale = true;

  	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
--- 361,367 ----
  		return CurrentSnapshot;

  	/* Don't allow catalog snapshot to be older than xact snapshot. */
! 	InvalidateCatalogSnapshot();

  	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
*************** GetNonHistoricCatalogSnapshot(Oid relid)
*** 465,471 ****
  	 */
  	if (!CatalogSnapshotStale && !RelationInvalidatesSnapshotsOnly(relid) &&
  		!RelationHasSysCache(relid))
! 		CatalogSnapshotStale = true;

  	if (CatalogSnapshotStale)
  	{
--- 468,474 ----
  	 */
  	if (!CatalogSnapshotStale && !RelationInvalidatesSnapshotsOnly(relid) &&
  		!RelationHasSysCache(relid))
! 		InvalidateCatalogSnapshot();

  	if (CatalogSnapshotStale)
  	{
*************** GetNonHistoricCatalogSnapshot(Oid relid)
*** 473,478 ****
--- 476,491 ----
  		CatalogSnapshot = GetSnapshotData(&CatalogSnapshotData);

  		/*
+ 		 * Make sure the catalog snapshot will be accounted for in decisions
+ 		 * about advancing PGXACT->xmin. We could apply RegisterSnapshot, but
+ 		 * that would result in making a physical copy, which is overkill; and
+ 		 * it would also create a dependency on some resource owner. Instead
+ 		 * just shove the CatalogSnapshot into the pairing heap manually.
+ 		 * This has to be reversed in InvalidateCatalogSnapshot, of course.
+ 		 */
+ 		pairingheap_add(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
+
+ 		/*
  		 * Mark new snapshost as valid. We must do this last, in case an
  		 * ERROR occurs inside GetSnapshotData().
  		 */
*************** GetNonHistoricCatalogSnapshot(Oid relid)
*** 492,498 ****
  void
  InvalidateCatalogSnapshot(void)
  {
! 	CatalogSnapshotStale = true;
  }

  /*
--- 505,516 ----
  void
  InvalidateCatalogSnapshot(void)
  {
! 	if (!CatalogSnapshotStale)
! 	{
! 		CatalogSnapshotStale = true;
! 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
! 		SnapshotResetXmin();
! 	}
  }

  /*
*************** SnapshotResetXmin(void)
*** 930,935 ****
--- 948,954 ----

  	if (pairingheap_is_empty(&RegisteredSnapshots))
  	{
+ 		Assert(CatalogSnapshotStale);
  		MyPgXact->xmin = InvalidTransactionId;
  		return;
  	}
*************** AtEOXact_Snapshot(bool isCommit)
*** 1058,1063 ****
--- 1077,1085 ----
  		exportedSnapshots = NIL;
  	}

+ 	/* Drop catalog snapshot if any */
+ 	InvalidateCatalogSnapshot();
+
  	/* On commit, complain about leftover snapshots */
  	if (isCommit)
  	{
On Mon, Nov 14, 2016 at 4:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> No, I'm not at all proposing to assume that. But I may be willing to
>> assume that "we don't hold a CatalogSnapshot between Bind and Execute
>> unless we're also holding a transaction snapshot". I need to do a bit
>> more research to see if that's actually true, though.
>
> Turns out it's not true.
>
> I still don't much like having the main loop in PostgresMain know about
> this hack, so I experimented with putting the InvalidateCatalogSnapshot()
> calls into places in postgres.c that were already dealing with transaction
> commit/abort or snapshot management. I ended up needing five such calls
> (as in the attached patch), which is just about equally ugly. So at this
> point I'm inclined to hold my nose and stick a call into step (1) of the
> main loop instead.

Seems like a good idea.

> Also, wherever we end up putting those calls, is it worth providing a
> variant invalidation function that only kills the catalog snapshot when
> it's the only one outstanding? (If it isn't, the transaction snapshot
> should be older, so there's no chance of advancing our xmin by killing
> it.) In principle this would save some catalog snapshot rebuilds for
> inside-a-transaction-block cases, but I'm not sure it's worth sweating
> that when we're doing client message exchange anyway.

I think that would be a fairly worthwhile thing to do.

> Lastly, I find myself disliking the separate CatalogSnapshotStale flag
> variable. The other special snapshots in snapmgr.c are managed by setting
> the pointer to NULL when it's not valid, so I wonder why CatalogSnapshot
> wasn't done that way. Since this patch is touching almost every use of
> that flag already, it wouldn't take much to switch it over.

I think I had some reason why I did it that way, but I don't think it was anything important, so I don't object to you revising it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
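[One possible shape for that "only when it's the only one outstanding" variant -- a sketch, not necessarily what got committed. It assumes snapmgr.c's ActiveSnapshot stack pointer is in scope and that a helper like pairingheap_is_singular(), testing that exactly one node remains in the heap, is available; both are assumptions for illustration.]

/*
 * Sketch: drop the catalog snapshot only when it is the sole registered
 * snapshot and no active snapshot is stacked, since otherwise an older
 * snapshot is still holding back our xmin and invalidating the catalog
 * snapshot buys nothing (it would just force a rebuild later).
 */
void
InvalidateCatalogSnapshotConditionally_sketch(void)
{
	if (CatalogSnapshot != NULL &&
		ActiveSnapshot == NULL &&
		pairingheap_is_singular(&RegisteredSnapshots))
		InvalidateCatalogSnapshot();
}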
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 14, 2016 at 4:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... I'm inclined to hold my nose and stick a call into step (1) of the
>> main loop instead.

> Seems like a good idea.

>> Also, wherever we end up putting those calls, is it worth providing a
>> variant invalidation function that only kills the catalog snapshot when
>> it's the only one outstanding? (If it isn't, the transaction snapshot
>> should be older, so there's no chance of advancing our xmin by killing
>> it.) In principle this would save some catalog snapshot rebuilds for
>> inside-a-transaction-block cases, but I'm not sure it's worth sweating
>> that when we're doing client message exchange anyway.

> I think that would be a fairly worthwhile thing to do.

>> Lastly, I find myself disliking the separate CatalogSnapshotStale flag
>> variable. The other special snapshots in snapmgr.c are managed by setting
>> the pointer to NULL when it's not valid, so I wonder why CatalogSnapshot
>> wasn't done that way. Since this patch is touching almost every use of
>> that flag already, it wouldn't take much to switch it over.

> I think I had some reason why I did it that way, but I don't think it
> was anything important, so I don't object to you revising it.

I've pushed a patch incorporating those changes. I can no longer reproduce piculet's failure on branches back to 9.4.

Still need to look at the secondary issues I mentioned upthread.

			regards, tom lane