Thread: postmaster segfaults with HUGE table
Hi,

this query makes postmaster (beta4) die with signal 11:

(echo "CREATE TABLE footest(";
 for i in `seq 0 66000`; do
     echo "col$i int NOT NULL,";
 done;
 echo "PRIMARY KEY(col0));") | psql test

ERROR: tables can have at most 1600 columns
LOG: server process (PID 2140) was terminated by signal 11
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing

Program received signal SIGSEGV, Segmentation fault.
0x4015d43c in mallopt () from /lib/tls/libc.so.6
(gdb) bt
#0  0x4015d43c in mallopt () from /lib/tls/libc.so.6
#1  0x00021680 in ?? ()
[...]
#16 0x00000001 in ?? ()
#17 0x0821bc9b in AllocSetDelete ()
Previous frame inner to this frame (corrupt stack?)

Furthermore, the backend doesn't respond to query cancel requests from psql
during execution of the query.

Joachim
Joachim Wieland wrote:
> this query makes postmaster (beta4) die with signal 11:
>
> (echo "CREATE TABLE footest(";
> for i in `seq 0 66000`; do
>     echo "col$i int NOT NULL,";
> done;
> echo "PRIMARY KEY(col0));") | psql test
>
> ERROR: tables can have at most 1600 columns
> LOG: server process (PID 2140) was terminated by signal 11
> LOG: terminating any other active server processes
> LOG: all server processes terminated; reinitializing

At best you're going to get the error message above: "tables can have at
most 1600 columns". But this is definitely a bug: we end up triggering
this assertion:

TRAP: BadArgument("!(attributeNumber >= 1)", File: "tupdesc.c", Line: 405)

This specific assertion is triggered because we represent attribute
numbers throughout the code base as a (signed) int16 -- the assertion
failure has occurred because an int16 has wrapped around due to overflow.
A fix would be to add a check to DefineRelation() (or even earlier) to
reject CREATE TABLEs with more than "MaxHeapAttributeNumber" columns. We
eventually do perform this check in heap_create_with_catalog(), but by
that point it is too late: various functions have been invoked that
assume we're dealing with a sane number of columns.

BTW, I noticed that there's an O(n^2) algorithm to check for duplicate
column names in DefineRelation() -- with 60,000 column names that takes
minutes to execute, but an inconsequential amount of time with 1500
column names. So I don't think there's any point rewriting the algorithm
to be faster -- provided we move the check for MaxHeapAttributeNumber
previous to this loop, we should be fine.

Thanks for the report.

-Neil
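(As a rough illustration of the kind of early check being proposed here,
something along these lines in DefineRelation() would reject the statement
before any attribute number can overflow an int16. The placement on
stmt->tableElts and the use of ERRCODE_TOO_MANY_COLUMNS are assumptions
for the sketch, not a quote of the eventual patch:)

    /* hypothetical early sanity check on the raw column list */
    if (list_length(stmt->tableElts) > MaxHeapAttributeNumber)
        ereport(ERROR,
                (errcode(ERRCODE_TOO_MANY_COLUMNS),
                 errmsg("tables can have at most %d columns",
                        MaxHeapAttributeNumber)));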
On Sun, 2004-11-14 at 10:05, Neil Conway wrote:
> Joachim Wieland wrote:
> > this query makes postmaster (beta4) die with signal 11:
> >
> > (echo "CREATE TABLE footest(";
> > for i in `seq 0 66000`; do
> >     echo "col$i int NOT NULL,";
> > done;
> > echo "PRIMARY KEY(col0));") | psql test
> >
> > ERROR: tables can have at most 1600 columns
> > LOG: server process (PID 2140) was terminated by signal 11
> > LOG: terminating any other active server processes
> > LOG: all server processes terminated; reinitializing
>
> At best you're going to get the error message above: "tables can have at
> most 1600 columns". But this is definitely a bug: we end up triggering
> this assertion:
>
> TRAP: BadArgument("!(attributeNumber >= 1)", File: "tupdesc.c", Line: 405)
>
> This specific assertion is triggered because we represent attribute
> numbers throughout the code base as a (signed) int16 -- the assertion
> failure has occurred because an int16 has wrapped around due to
> overflow. A fix would be to add a check to DefineRelation() (or even
> earlier) to reject CREATE TABLEs with more than "MaxHeapAttributeNumber"
> columns. We eventually do perform this check in
> heap_create_with_catalog(), but by that point it is too late: various
> functions have been invoked that assume we're dealing with a sane number
> of columns.
>
> BTW, I noticed that there's an O(n^2) algorithm to check for duplicate
> column names in DefineRelation() -- with 60,000 column names that takes
> minutes to execute, but an inconsequential amount of time with 1500
> column names. So I don't think there's any point rewriting the algorithm
> to be faster -- provided we move the check for MaxHeapAttributeNumber
> previous to this loop, we should be fine.

This seems too obvious a problem to have caused a bug ... presumably this
has been there for a while?

Does this mean that we do not have regression tests for each maximum
setting ... i.e. are we missing a whole class of tests in the regression
tests?

--
Best Regards, Simon Riggs
Neil Conway <neilc@samurai.com> writes:
> This specific assertion is triggered because we represent attribute
> numbers throughout the code base as a (signed) int16 -- the assertion
> failure has occurred because an int16 has wrapped around due to
> overflow. A fix would be to add a check to DefineRelation() (or even
> earlier) to reject CREATE TABLEs with more than "MaxHeapAttributeNumber"
> columns.

Good analysis. We can't check earlier than DefineRelation AFAICS, because
earlier stages don't know about inherited columns.

On reflection I suspect there are similar issues with SELECTs that have
more than 64K output columns. This probably has to be guarded against in
parser/analyze.c.

        regards, tom lane
On Sun, 2004-11-14 at 18:29 -0500, Tom Lane wrote:
> Good analysis. We can't check earlier than DefineRelation AFAICS,
> because earlier stages don't know about inherited columns.
>
> On reflection I suspect there are similar issues with SELECTs that have
> more than 64K output columns. This probably has to be guarded against
> in parser/analyze.c.

You're correct -- we also crash on extremely long SELECT statements.
Another variant of the problem would be a CREATE TABLE that inherits
from, say, 70 relations, each of which has 1,000 columns.

Attached is a patch. Not entirely sure that the checks I added are in the
right places, but at any rate this fixes the three identified problems
for me.

-Neil
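(Sketch of the companion check the inheritance variant calls for: the
column count has to be tested again after the parents' columns have been
merged in, otherwise "70 parents x 1,000 columns" slips past a check on
the new table's own column list alone. The variable names inhSchema and
schema are assumptions about the surrounding MergeAttributes() code, not
a quote of the attached patch:)

    /* hypothetical post-merge check in MergeAttributes()/DefineRelation() */
    if (list_length(inhSchema) + list_length(schema) > MaxHeapAttributeNumber)
        ereport(ERROR,
                (errcode(ERRCODE_TOO_MANY_COLUMNS),
                 errmsg("tables can have at most %d columns",
                        MaxHeapAttributeNumber)));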
On Sun, 2004-11-14 at 11:24 +0000, Simon Riggs wrote:
> This seems too obvious a problem to have caused a bug

Well, I'd imagine that we've checked CREATE TABLE et al. with
somewhat-too-large values (like 2000 columns), which wouldn't be
sufficiently large to trigger the problem.

> presumably this has been there for a while?

Not sure.

> Does this mean that we do not have regression tests for each maximum
> setting ... i.e. are we missing a whole class of tests in the
> regression tests?

I'm always in favour of more regression tests -- patches are welcome :)
That said, there are some minor logistical problems with testing that a
70,000 column CREATE TABLE fails cleanly instead of crashing (it would be
nice not to have to include all that text in the regression tests
themselves).

-Neil
Neil Conway <neilc@samurai.com> writes:
> Attached is a patch. Not entirely sure that the checks I added are in
> the right places, but at any rate this fixes the three identified
> problems for me.

I think the SELECT limit should be MaxTupleAttributeNumber not
MaxHeapAttributeNumber. The point of the differential is to allow you a
bit of slop to do extra stuff (like sorting) when selecting from a
max-width table, but the proposed patch takes that away.

As for the placement issue, I'm OK with what you did in tablecmds.c, but
for the SELECT case I think that testing in transformTargetList is
probably not good. It definitely doesn't do to test at the top of the
routine, because we haven't expanded '*' yet. But testing at the bottom
probably won't do either since it doesn't cover later addition of junk
attributes --- in other words you could probably still crash it by
specifying >64K distinct ORDER BY values.

What I think needs to happen is to check p_next_resno at some point after
the complete tlist has been built. Since that's an int, we don't need to
worry too much about it overflowing, so one test at the end should do
(though I suppose if you're really paranoid you could instead add a test
everywhere it's used/incremented).

        regards, tom lane
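(For context, the two limits are separate constants in access/htup.h; the
values below are recalled from headers of that era rather than taken from
the thread, so treat the exact numbers as assumed:)

    /* access/htup.h -- approximate excerpt, values assumed */
    #define MaxTupleAttributeNumber 1664    /* 8 * 208: any tuple, e.g. sort output */
    #define MaxHeapAttributeNumber  1600    /* 8 * 200: on-disk heap tuples */

The 64-attribute difference is the slop Tom refers to: it lets a sort or
junk column be tacked onto a query over a max-width heap table.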
Neil Conway <neilc@samurai.com> writes:
> On Sun, 2004-11-14 at 11:24 +0000, Simon Riggs wrote:
>> Does this mean that we do not have regression tests for each maximum
>> setting ... i.e. are we missing a whole class of tests in the
>> regression tests?

> That said, there are some minor logistical problems with testing that a
> 70,000 column CREATE TABLE fails cleanly instead of crashing (it would
> be nice not to have to include all that text in the regression tests
> themselves).

There are always limits; this just catches the next level up. Are we
going to try to test whether the behavior is appropriate when running out
of memory to store the tlist? (I think it's OK, but there's surely no
regression test for it.) Are we going to test whether the behavior is
appropriate when the list length overflows an int32? (I can pretty much
guarantee it isn't, but you'll need a 64-bit machine to reach this level
before you hit out-of-memory, and you'll need very large quantities of
patience as well as RAM.)

A regression test for the latter case is silly on its face, though if
Moore's law can hold up for another couple decades it might someday not
be.

        regards, tom lane
On Mon, 2004-11-15 at 21:08 -0500, Tom Lane wrote:
> Are we going to try to test whether the behavior is appropriate when
> running out of memory to store the tlist?

We absolutely should: segfaulting on OOM is not acceptable behavior.
Testing that we recover safely when palloc() elogs (or _any_ routine
elogs) would be a good idea. I'd guess model checking would help here.

-Neil
On Mon, 2004-11-15 at 20:53 -0500, Tom Lane wrote:
> I think the SELECT limit should be MaxTupleAttributeNumber not
> MaxHeapAttributeNumber.

Ah, true -- I forgot about the distinction...

> What I think needs to happen is to check p_next_resno at some point
> after the complete tlist has been built.

Attached is a revised patch -- I just did the check at the end of
transformStmt(), since otherwise we'll need to duplicate code in the
various places that resnos are used/incremented (set operation
statements, normal selects, updates, and so on). This is somewhat fragile
in that we usually assign p_next_resno to an AttrNumber and only check
for overflow at the end of the analysis phase, but it seems safe for the
moment...

BTW I figure this should be backpatched to REL7_4_STABLE. Barring any
objections I will do that (and apply to HEAD) this evening.

-Neil
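(A rough sketch of what such an end-of-analysis check could look like;
the exact comparison and message are assumptions, not necessarily the
patch as committed:)

    /*
     * Hypothetical check at the end of transformStmt(): by this point the
     * target list is complete, including junk attributes such as ORDER BY
     * expressions, and p_next_resno is one past the highest resno assigned.
     */
    if (pstate->p_next_resno - 1 > MaxTupleAttributeNumber)
        ereport(ERROR,
                (errcode(ERRCODE_TOO_MANY_COLUMNS),
                 errmsg("target lists can have at most %d entries",
                        MaxTupleAttributeNumber)));

Doing the test once here, as Tom suggests, avoids sprinkling the same
check over every place that uses or increments p_next_resno.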
Neil Conway <neilc@samurai.com> writes:
> Attached is a revised patch -- I just did the check at the end of
> transformStmt(),

Looks OK offhand.

> BTW I figure this should be backpatched to REL7_4_STABLE. Barring any
> objections I will do that (and apply to HEAD) this evening.

No objection here.

        regards, tom lane
On Tue, 2004-11-16 at 16:25 +1100, Neil Conway wrote:
> Attached is a revised patch

Applied to HEAD, and backpatched to REL7_4_STABLE.

-Neil