Thread: pg_dump tries to do too much per query
I was experimenting today with pg_dump's reaction to missing dependencies, such as a rule that refers to a no-longer-existing table. It's pretty bad. For example: create table test (f1 int); create view v_test as select f1+1 as f11 from test; drop table test; then run pg_dump: getTables(): SELECT failed. Explanation from backend: 'ERROR: cache lookup of attribute 1 in relation 400384 failed '. This is just about entirely useless as an error message, wouldn't you say? The immediate cause of this behavior is the initial data fetch in getTables(): appendPQExpBuffer(query, "SELECT pg_class.oid, relname, relkind, relacl, usename, " "relchecks,reltriggers, relhasindex, pg_get_viewdef(relname) as viewdef " "from pg_class, pg_user " "where relowner = usesysid and relname !~ '^pg_' " "and relkind in ('%c', '%c', '%c')" "order by oid", RELKIND_RELATION, RELKIND_SEQUENCE, RELKIND_VIEW); res = PQexec(g_conn, query->data); if (!res || PQresultStatus(res) != PGRES_TUPLES_OK) { fprintf(stderr,"getTables(): SELECT failed. Explanation from backend: '%s'.\n", PQerrorMessage(g_conn)); exit_nicely(g_conn); } This can be criticized on a couple of points: 1. It invokes pg_get_viewdef() on every table and sequence, which is a big waste of time even when it doesn't fail outright. When it does fail outright, as above, you have no way to identify which view it failed for. pg_get_viewdef() should be invoked retail, for one view at a time, and only for things you have determined are indeed views. 2. As somebody pointed out a few days ago, pg_dump silently loses tables whose owners can't be identified. The cause is the inner join being done here against pg_user --- pg_dump will never even notice that a table exists if there's not a matching pg_user row for it. This is not robust. You should be able to fix the latter problem by doing an outer join, though it doesn't quite work yet in current sources. pg_get_userbyid() offers a different solution, although it won't return NULL for unknown IDs, which might be an easier failure case to check for. More generally I think there are comparable problems elsewhere in pg_dump, caused by trying to do too much per query and not thinking about what will happen if there's a failure. It looks like the join- against-pg_user problem exists for all object types, not just tables. It'd be worth examining all the queries closely with an eye to failure modes and whether you can give a usefully specific error message when something is wrong. regards, tom lane
At 16:29 17/09/00 -0400, Tom Lane wrote: > >getTables(): SELECT failed. Explanation from backend: 'ERROR: cache lookup of attribute 1 in relation 400384 failed >'. > >This is just about entirely useless as an error message, wouldn't you >say? I agree but the is representative of the error handling throughout pg_dump (eg. the notorious 'you are hosed' error message). Over time I will try to clean it up where possible. There are a number of different kinds of errors to deal with; ones resulting a corrupt database seem to be low on the list. In fact I would argue that 'DROP TABLE' should not work on a view relation. Secondly, your comments probably highlight the need for a database verification utility. That said, I will address your posints below. > >1. It invokes pg_get_viewdef() on every table and sequence, which is a >big waste of time even when it doesn't fail outright. Either pg_get_viewdef is a lot less efficient that I expected, or this is an exageration. If it helps, I can replace it with a case statement: case when relkind='v' then pg_get_viewdef() else '' end but this seems a little pointless and won't prevent errors when the db is corrupt. Being forced to break up SQL statements because the backend produces unclear errors from a function seems to be a case of the tail wagging the dog: perhaps pg_get_viewdef should at least identify itself as the source of the error, if that is what is happening. > When it does fail >outright, as above, you have no way to identify which view it failed >for. Good point. This is going to affect anybody who calls get_viewdef. Maybe it can be modified to indicate (a) that the error occurred in get_viewdef, and (b) which view is corrupt. Try: select * from pg_views; Same error. >pg_get_viewdef() should be invoked retail, for one view at a time, >and only for things you have determined are indeed views. Do you truly, ruly believe the first part? >2. As somebody pointed out a few days ago, pg_dump silently loses tables >whose owners can't be identified. The cause is the inner join being >done here against pg_user --- pg_dump will never even notice that a >table exists if there's not a matching pg_user row for it. This is not >robust. > >You should be able to fix the latter problem by doing an outer join, >though it doesn't quite work yet in current sources. pg_get_userbyid() >offers a different solution, although it won't return NULL for unknown >IDs, which might be an easier failure case to check for. This sounds sensible; and I think you are right - pg_dump crosses with user info relations all the time. I'll look at using pg_get_userbyid, LOJ and/or column selects now that they are available. Based on this suggestion, maybe pg_get_viewdef should return NULL if the view table does not exist. But I would still prefer a meaningful error message, since it really does reflect DB corruption. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
At 12:48 18/09/00 +1000, Philip Warner wrote: >> >>You should be able to fix the latter problem by doing an outer join, >>though it doesn't quite work yet in current sources. pg_get_userbyid() >>offers a different solution, although it won't return NULL for unknown >>IDs, which might be an easier failure case to check for. > >This sounds sensible; and I think you are right - pg_dump crosses with user >info relations all the time. I'll look at using pg_get_userbyid, LOJ and/or >column selects now that they are available. > I've just made these changes, and will commit them once I put in some warnings about NULL usernames - my guess is that this should not stop pg_dump, just warn the user. It's a real pity about pg_get_userbyid - the output for non-existant users is pretty near useless. I presume it's there for convenience of a specific piece of code. Would you see any value in creating a pg_get_usernamebyid() or similar that returns NULL when there is no match? ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
At 16:29 17/09/00 -0400, Tom Lane wrote: >As somebody pointed out a few days ago, pg_dump silently loses tables >whose owners can't be identified. This is now fixed in CVS. The owner of all objects are now retrieved by using column select expressions. If you can recall where it was, I'd be interested in seeing the report since I have no recollection of it, and I try to keep abreast of pg_dump issues... ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
At 16:29 17/09/00 -0400, Tom Lane wrote: > >create table test (f1 int); >create view v_test as select f1+1 as f11 from test; >drop table test; > >then run pg_dump: > >getTables(): SELECT failed. Explanation from backend: 'ERROR: cache lookup of attribute 1 in relation 400384 failed >'. > FWIW, the patch below causes get_viewdef to produce: ERROR: pg_get_viewdef: cache lookup of attribute 1 in relation 19136 failed for rule v_test when a table has been deleted. ---------------------------------------------------- --- ruleutils.c.orig Wed Sep 13 22:08:04 2000 +++ ruleutils.c Mon Sep 18 20:59:25 2000 @@ -72,6 +72,7 @@ * ---------- */static char *rulename = NULL; +static char *toproutinename = NULL;static void *plan_getrule = NULL;static char *query_getrule = "SELECT * FROM pg_rewriteWHERE rulename = $1";static void *plan_getview = NULL; @@ -134,6 +135,12 @@ int len; /* ---------- + * We use this in reporting errors. + * ---------- + */ + toproutinename = "pg_get_ruledef"; + + /* ---------- * We need the rules name somewhere deep down: rulename is global * ---------- */ @@ -234,6 +241,12 @@ char *name; /* ---------- + * We use this in reporting errors. + * ---------- + */ + toproutinename = "pg_get_viewdef"; + + /* ---------- * We need the view name somewhere deep down * ---------- */ @@ -337,6 +350,13 @@ char *sep; /* ---------- + * We use this in reporting errors. + * ---------- + */ + toproutinename = "pg_get_indexdef"; + rulename = NULL; + + /* ---------- * Connect to SPI manager * ---------- */ @@ -554,6 +574,13 @@ Form_pg_shadow user_rec; /* ---------- + * We use this in reporting errors. + * ---------- + */ + toproutinename = "pg_get_userbyid"; + rulename = NULL; + + /* ---------- * Allocate space for the result * ---------- */ @@ -2014,8 +2041,16 @@ ObjectIdGetDatum(relid), (Datum) attnum, 0, 0); if (!HeapTupleIsValid(atttup)) - elog(ERROR, "cache lookup of attribute %d in relation %u failed", - attnum, relid); + { + if (rulename != NULL) + { + elog(ERROR, "%s: cache lookup of attribute %d in relation %u failed for rule %s", + toproutinename, attnum, relid, rulename); + } else { + elog(ERROR, "%s: cache lookup of attribute %d in relation %u failed", + toproutinename, attnum, relid); + } + } attStruct = (Form_pg_attribute) GETSTRUCT(atttup); return pstrdup(NameStr(attStruct->attname)); ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner <pjw@rhyme.com.au> writes: > FWIW, the patch below causes get_viewdef to produce: > ERROR: pg_get_viewdef: cache lookup of attribute 1 in relation 19136 > failed for rule v_test > when a table has been deleted. Not much of a solution --- or do you propose to go through and hack up every elog in every routine that could potentially be called during pg_get_ruledef? The reason that changing pg_dump is a superior solution for this problem is that there's only one place to change, not umpteen dozen ... regards, tom lane
At 11:37 18/09/00 -0400, Tom Lane wrote: >Philip Warner <pjw@rhyme.com.au> writes: >> FWIW, the patch below causes get_viewdef to produce: >> ERROR: pg_get_viewdef: cache lookup of attribute 1 in relation 19136 >> failed for rule v_test >> when a table has been deleted. > >Not much of a solution --- or do you propose to go through and hack up >every elog in every routine that could potentially be called during >pg_get_ruledef? Well, I had assumed that most routines/subsystems would identify themselves as sources of errors, and that the behaviour of pg_get_viewdef was unusual. Even the existing code for get_viewdef tries to output it's name in most places, just not all. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
At 11:37 18/09/00 -0400, Tom Lane wrote: > >The reason that changing pg_dump is a superior solution for this problem >is that there's only one place to change, not umpteen dozen ... > Well at least two, unless you like the following: zzz=# select * from pg_views; ERROR: cache lookup of attribute 1 in relation 3450464 failed ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
At 10:42 19/09/00 +1000, Philip Warner wrote: >At 11:37 18/09/00 -0400, Tom Lane wrote: >> >>The reason that changing pg_dump is a superior solution for this problem >>is that there's only one place to change, not umpteen dozen ... >> > >Well at least two, unless you like the following: > > zzz=# select * from pg_views; > ERROR: cache lookup of attribute 1 in relation 3450464 failed > Apologies - I just noticed you fixed this in CVS, so it now manages (somehow!) to output a valid view definition even without the underlying table. A little scary, though. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Doing the following: create table test (f1 int); create view v_test as select f1+1 as f11 from test; drop table test; then selecting from the view results in: ERROR: Relation 'test' does not exist which is fine. Then I try: create table test (f1 int); select * from v_test; and I get: ERROR: has_subclass: Relation 19417 not found which not very helpful for the person who does not know the history, and leads me to believe that there may be a few issues here. Should a 'DROP TABLE' drop the views, fail, or be recoverable from by recreating the table? ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner <pjw@rhyme.com.au> writes: > create table test (f1 int); > create view v_test as select f1+1 as f11 from test; > drop table test; > create table test (f1 int); > select * from v_test; > ERROR: has_subclass: Relation 19417 not found > which not very helpful for the person who does not know the history, and > leads me to believe that there may be a few issues here. Yes, this mistake needs to be detected earlier. The stored view contains both the name and the OID of the referenced table. It should *not* accept a new table with same name and different OID, since there's no guarantee that the new table has anything like the same column set. (ALTER TABLE has some issues here too...) > Should a 'DROP TABLE' drop the views, fail, or be recoverable from by > recreating the table? Yes ;-). Any of those behaviors would be better than what we have now. However, none of them is going to be easy to implement. There will need to be more info stored about views than there is now. regards, tom lane
Philip Warner <pjw@rhyme.com.au> writes: >>> The reason that changing pg_dump is a superior solution for this problem >>> is that there's only one place to change, not umpteen dozen ... >> >> Well at least two, unless you like the following: >> >> zzz=# select * from pg_views; >> ERROR: cache lookup of attribute 1 in relation 3450464 failed > Apologies - I just noticed you fixed this in CVS, so it now manages > (somehow!) to output a valid view definition even without the underlying > table. A little scary, though. Say what? (... tries it ...) Fascinating. I wouldn't rely on this behavior however; the fact that it works today is a totally unintended consequence of a change I made for column alias support. Next week ruleutils.c might try to access the underlying tables again. The general issue still remains: if a database contains an inconsistency or error, introduced by whatever means (and there'll always be bugs), a pg_dump failure is likely to be the first notice a dbadmin has about it. So it behooves us to make sure that pg_dump issues error messages that are as specific as possible. In particular, if there is a specific object such as a view or rule that's broken, pg_dump should take care that it can finger that particular object, not have to report a generic "SELECT failed" error message. This problem has been around for a long time, of course, but now that we have someone who's taking an active interest in fixing pg_dump ;-) I'm hoping something will get done about it... regards, tom lane
On Tue, 19 Sep 2000, Tom Lane wrote: > Yes, this mistake needs to be detected earlier. The stored view > contains both the name and the OID of the referenced table. It should > *not* accept a new table with same name and different OID, since there's > no guarantee that the new table has anything like the same column set. > (ALTER TABLE has some issues here too...) > > > Should a 'DROP TABLE' drop the views, fail, or be recoverable from by > > recreating the table? > > Yes ;-). > > Any of those behaviors would be better than what we have now. However, > none of them is going to be easy to implement. There will need to be > more info stored about views than there is now. This is an example of a place where the dependencies chart will come in handy. :) I do actually hope to get to it (if noone else does it) after my work job has their official release and I get a chance to take time off and after I've figured out match partial for the referential integrity stuff (which is more of a pain than I could have ever imagined).
...... > The general issue still remains: if a database contains an inconsistency > or error, introduced by whatever means (and there'll always be bugs), > a pg_dump failure is likely to be the first notice a dbadmin has about it. > So it behooves us to make sure that pg_dump issues error messages that > are as specific as possible. In particular, if there is a specific > object such as a view or rule that's broken, pg_dump should take care > that it can finger that particular object, not have to report a generic > "SELECT failed" error message. > > This problem has been around for a long time, of course, but now that > we have someone who's taking an active interest in fixing pg_dump ;-) > I'm hoping something will get done about it... > > regards, tom lane`-----Original Message-----From: Tom Lane [SMTP:tgl@sss.pgh.pa.us]Sent: Tuesday, September19, 2000 11:13 AMTo: Philip WarnerCc: pgsql-hackers@postgresql.orgSubject: Re: [HACKERS] Re: pg_dump triesto do too much per query I can't agree with this more... On several occasions I have had databases that apparently were working fine with no problems, and the only indication of the problem was that pg_dump would fail. while this is nice in that I now know there is a problem, I had very small info on where to start looking.
At 12:13 19/09/00 -0400, Tom Lane wrote: > >Say what? (... tries it ...) Fascinating. I wouldn't rely on this >behavior however; the fact that it works today is a totally unintended >consequence of a change I made for column alias support. Next week >ruleutils.c might try to access the underlying tables again. In this particular instance, I'd still be much happier to see the code that fails identify itself properly, identify the cause, and explain the error. In the case in point, all of this information is available. I always prefer to fix two bugs with one code change where possible (ie. 'select * from pg_views' as well as pg_dump). To carry this to extremes, at one point pg_dump crosses pg_index, pg_class and pg_am somewhere else it crosses pg_rewrite, pg_class, and pg_rules etc. If we want to allow for database corruption, pg_dump should really be doing this sort of cross internally to verify that each part is present (or use LOJ, and test for NULLs - but you hinted that there was a problem with this). >This problem has been around for a long time, of course, but now that >we have someone who's taking an active interest in fixing pg_dump ;-) >I'm hoping something will get done about it... Part of my interest in pg_dump is in trying to make it know less about the database structure, and so make it a lot more independant of versions. If I could use one pg_* view or function for each entity type, I'd be very happy. The not-so-recent discussions about definition schemas should indicate where pg_dump wants to go, and the growing use of functions like 'format_type' make it increasingly hard to know why an error occurred - I haven't looked to see what format_type does when it encounters an internal inconsistency, but I hope it returns NULL. [Thought: What would be the impact of pg_get_ruledef etc returning NULL when it can't find the relevant data? This sort of approach seems to me more consistent with DB-ish things, and pg_dump could easily test for a NULL defn of a view relation] I'd prefer to see a 'pg_verify' (or similar). The idea being that it would know about and be totally anal about interrelationships - and successfully report a missing view table - unlike the current (or proposed) situation. In some ways, like a vacuum, but on metadata. I'm not toally opposed to your suggestion, and I certainly agree that making pg_dump more carefull about the data it retrieves is a good idea. But I still need more convincing about the calls to get_viewdef. In the current situation, I think ruleutils.c might need to be looked at more closely: eg. several error messages prepend the main routine name, some do not. Some display the problem rule name, other do not. There are only three or four external routines defined, so by using (more) globals it is easily possible for all routines to indicate the primary routine responsible. Similarly, the rule/view name is already a global, and is available to all routines in ruleutils.c. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner <pjw@rhyme.com.au> writes: > It's a real pity about pg_get_userbyid - the output for non-existant users > is pretty near useless. I presume it's there for convenience of a specific > piece of code. No, I imagine it's just that way because it seemed like a good idea at the time. Returning NULL for an unknown userid seems at least as plausible as what it does now. Comments anyone? regards, tom lane
Philip Warner writes: > Doing the following: > > create table test (f1 int); > create view v_test as select f1+1 as f11 from test; > drop table test; > > then selecting from the view results in: > > ERROR: Relation 'test' does not exist > > which is fine. If you peak into the standard, all DROP commands have a trailing RESTRICT/CASCADE (mandatory, btw.), which will tell what to do. But it's extremely hard to implement and keep up to date. -- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/