Thread: Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?
I wrote: > ... what this sounds like is a > problem with somebody fetching temporary-table blocks into shared memory > (where they should never be), and then things going wrong after the > owning backend drops the temp table (without having cleared out shared > buffers, which it won't do because it doesn't think it needs to). Can > you say what was the exact command(s) you were using with pgstattuple? A quick look at contrib/pgstattuple shows that it makes no effort whatsoever to avoid reading temp tables belonging to other sessions. So even if that wasn't Stuart's problem (and I'll bet it was), this is quite broken. There is no way that pgstattuple can compute valid stats for temp tables of other sessions; it doesn't have access to pages in the other sessions' temp buffers. It seems that the alternatives we have are to make it throw error, or to silently return zeroes (or perhaps nulls?). Neither one is tremendously appetizing. The former would be especially unhelpful if someone tried to write a query to apply pgstattuple across all pg_class entries, which I kinda suspect is what Stuart did. Opinions? regards, tom lane
Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?
From
Stuart Bishop
Date:
On Tue, Mar 31, 2009 at 11:20 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > A quick look at contrib/pgstattuple shows that it makes no effort > whatsoever to avoid reading temp tables belonging to other sessions. > So even if that wasn't Stuart's problem (and I'll bet it was), this > is quite broken. > > There is no way that pgstattuple can compute valid stats for temp > tables of other sessions; it doesn't have access to pages in the other > sessions' temp buffers. It seems that the alternatives we have are > to make it throw error, or to silently return zeroes (or perhaps > nulls?). Neither one is tremendously appetizing. The former would > be especially unhelpful if someone tried to write a query to apply > pgstattuple across all pg_class entries, which I kinda suspect is > what Stuart did. This is exactly what happened, and temporary tables belonging to other sessions where fed to pgstattuple. -- Stuart Bishop <stuart@stuartbishop.net> http://www.stuartbishop.net/
Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?
From
Heikki Linnakangas
Date:
Tom Lane wrote: > A quick look at contrib/pgstattuple shows that it makes no effort > whatsoever to avoid reading temp tables belonging to other sessions. contrib/pageinspect has the same bug. Not surprising as it was largely inspired by pgstattuple. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?
From
Heikki Linnakangas
Date:
Stuart Bishop wrote: > On Tue, Mar 31, 2009 at 11:20 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> A quick look at contrib/pgstattuple shows that it makes no effort >> whatsoever to avoid reading temp tables belonging to other sessions. >> So even if that wasn't Stuart's problem (and I'll bet it was), this >> is quite broken. >> >> There is no way that pgstattuple can compute valid stats for temp >> tables of other sessions; it doesn't have access to pages in the other >> sessions' temp buffers. It seems that the alternatives we have are >> to make it throw error, or to silently return zeroes (or perhaps >> nulls?). Neither one is tremendously appetizing. The former would >> be especially unhelpful if someone tried to write a query to apply >> pgstattuple across all pg_class entries, which I kinda suspect is >> what Stuart did. > > This is exactly what happened, and temporary tables belonging to other > sessions where fed to pgstattuple. +1 for throwing an error. That's what we do for views, composite types, and GIN indexes as well. If you want to write a query to call pgstattuple for all tables in pg_class, you'll need to exclude all those cases anyway. To exclude temp tables of other sessions, you'll need to add "AND pg_is_other_temp_schema(relnamespace)". I'm ok with returning NULLs as well, but returning zeroes doesn't feel right. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?
From
Stuart Bishop
Date:
On Tue, Mar 31, 2009 at 2:20 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> This is exactly what happened, and temporary tables belonging to other >> sessions where fed to pgstattuple. > > +1 for throwing an error. That's what we do for views, composite types, and > GIN indexes as well. If you want to write a query to call pgstattuple for > all tables in pg_class, you'll need to exclude all those cases anyway. To > exclude temp tables of other sessions, you'll need to add "AND > pg_is_other_temp_schema(relnamespace)". I would have expected an exception to be raised personally. > I'm ok with returning NULLs as well, but returning zeroes doesn't feel > right. -- Stuart Bishop <stuart@stuartbishop.net> http://www.stuartbishop.net/
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> A quick look at contrib/pgstattuple shows that it makes no effort >> whatsoever to avoid reading temp tables belonging to other sessions. > contrib/pageinspect has the same bug. Not surprising as it was largely > inspired by pgstattuple. Given the seriousness of the consequences (forced database shutdown is no fun), I wonder whether we should install some low-level defense against this type of problem; ie teach ReadBuffer to throw error if asked to read a block from someone else's temp table. This isn't entirely trivial because it's presently expensive to determine whether a table is someone else's temp table: it takes a system catalog lookup. I'm not even sure that it'd be safe to have the relcache do it and cache the result --- it could lead to infinite recursion. (At the very least this would promote pg_namespace into the set of critical relcache entries.) The solution that seems most practical to me is to add a bool column to pg_class indicating "this is a temp table". Then, if that flag is set but it's not our own temp table (which we can tell easily), refuse to read. However, a patch of that size would take a little while to develop, and I'm not entirely sure it's worth the trouble. I can't remember having seen bugs of this type before. Comments? regards, tom lane
Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?
From
Alvaro Herrera
Date:
Tom Lane wrote: > The solution that seems most practical to me is to add a bool column > to pg_class indicating "this is a temp table". Then, if that flag > is set but it's not our own temp table (which we can tell easily), > refuse to read. However, a patch of that size would take a little > while to develop, and I'm not entirely sure it's worth the trouble. > I can't remember having seen bugs of this type before. If we had had this defense in place, it would have been obvious that reindex and cluster were buggy. The code to skip temp tables was not there from the beginning. (We already have rel->rd_istemp, but it's not what we need here.) -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> I can't remember having seen bugs of this type before. > If we had had this defense in place, it would have been obvious that > reindex and cluster were buggy. The code to skip temp tables was not > there from the beginning. I thought my memory was probably failing me (excuse: no caffeine yet). > (We already have rel->rd_istemp, but it's not what we need here.) Yeah. I was considering converting that into a three-state flag, but it might be simpler to remove it altogether and look to the new pg_class field; only after we've gone down the path into localbuf.c would we check relnamespace == our temp namespace before permitting a read or write. Barring objections, I'll go make that happen. (And fix the contrib bugs too, but not till after ... I'll need a test case ;-)) regards, tom lane
Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?
From
Heikki Linnakangas
Date:
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Tom Lane wrote: >>> A quick look at contrib/pgstattuple shows that it makes no effort >>> whatsoever to avoid reading temp tables belonging to other sessions. > >> contrib/pageinspect has the same bug. Not surprising as it was largely >> inspired by pgstattuple. > > Given the seriousness of the consequences (forced database shutdown is > no fun), I wonder whether we should install some low-level defense > against this type of problem; ie teach ReadBuffer to throw error if > asked to read a block from someone else's temp table. That would be nice. > This isn't entirely trivial because it's presently expensive to > determine whether a table is someone else's temp table: it takes a > system catalog lookup. I'm not even sure that it'd be safe to have > the relcache do it and cache the result --- it could lead to infinite > recursion. (At the very least this would promote pg_namespace into > the set of critical relcache entries.) You could hard code that PG_CATALOG_NAMESPACE is not a temp namespace. I believe that would stop the recursion. Would that avoid promoting pg_namespace to critical status, too? > The solution that seems most practical to me is to add a bool column > to pg_class indicating "this is a temp table". Then, if that flag > is set but it's not our own temp table (which we can tell easily), > refuse to read. However, a patch of that size would take a little > while to develop, and I'm not entirely sure it's worth the trouble. > I can't remember having seen bugs of this type before. In addition to the one Alvaro mentioned, I recall having problems with this when working on the patch to allow temporary file access with two phase commit in autumn. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> This isn't entirely trivial because it's presently expensive to >> determine whether a table is someone else's temp table: it takes a >> system catalog lookup. I'm not even sure that it'd be safe to have >> the relcache do it and cache the result --- it could lead to infinite >> recursion. (At the very least this would promote pg_namespace into >> the set of critical relcache entries.) > You could hard code that PG_CATALOG_NAMESPACE is not a temp namespace. I > believe that would stop the recursion. True. > Would that avoid promoting > pg_namespace to critical status, too? Not sure. It'd still be something you have to access while loading most relcache entries. On balance the extra pg_class column seems like a more robust and useful solution. I've wished for a clean way to see temp-table-ness at the SQL level before. And we've already changed pg_class's rowtype for 8.4, so it shouldn't pose any additional hardship from an application compatibility standpoint. regards, tom lane
I wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: >> (We already have rel->rd_istemp, but it's not what we need here.) > Yeah. I was considering converting that into a three-state flag, but > it might be simpler to remove it altogether and look to the new pg_class > field; only after we've gone down the path into localbuf.c would we > check relnamespace == our temp namespace before permitting a read or write. I looked at the uses of this flag a bit further. It seems that rd_istemp is being used with three subtly different meanings: * do we need to WAL-log operations on this relation* do we need to fsync this relation during checkpoint* do we use sharedor local buffers for this relation I thought briefly about trying to make these shades of meaning explicit, eg by introducing macros like RelationUsesWal(rel). It doesn't seem quite worth the trouble though. There would be a lot of places to change, and if we wanted to take it completely seriously, we'd have to pass more than one argument to, for example, smgrtruncate() (when it passes isTemp to DropRelFileNodeBuffers, the bufmgr is wanting to know about the buffer classification; but mdtruncate wants to know about fsync behavior). For the buffer-classification meaning we want to consider local and nonlocal temp tables the same, so that control passes down to localbuf.c where we can put the error test. For the other two meanings it probably doesn't matter, since in principle we shouldn't get as far as making such decisions for a nonlocal temp table --- but if we did, we'd want to treat local and nonlocal temp tables alike, ie, no WAL or fsync. In short, there are only a *very* small number of places, codewise, where we want to distinguish local from nonlocal temp tables. The test to be added to localbuf.c is one, and RELATION_IS_LOCAL() is another (it must not succeed for nonlocals), and right now I don't see any others. So my inclination is to leave rd_istemp as a boolean field of RelationData so as to minimize textual code changes, but redefine it as a copy of pg_class.relistemp (so that it will now be true for both local and nonlocal temp tables). I'm also thinking of adding a bool field rd_islocaltemp to be true only for local temp tables. This isn't really necessary in terms of code cleanliness, but it will cache the result of isTempOrToastNamespace so that we don't have to repeat that call every time through localbuf.c. (Although it's not *that* expensive, so maybe this is overkill...) Thoughts, better ideas? regards, tom lane