Thread: postmaster segfaults with HUGE table
Hi,

this query makes postmaster (beta4) die with signal 11:

(echo "CREATE TABLE footest(";
 for i in `seq 0 66000`; do
     echo "col$i int NOT NULL,";
 done;
 echo "PRIMARY KEY(col0));") | psql test

ERROR: tables can have at most 1600 columns
LOG: server process (PID 2140) was terminated by signal 11
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing

Program received signal SIGSEGV, Segmentation fault.
0x4015d43c in mallopt () from /lib/tls/libc.so.6
(gdb) bt
#0  0x4015d43c in mallopt () from /lib/tls/libc.so.6
#1  0x00021680 in ?? ()
[...]
#16 0x00000001 in ?? ()
#17 0x0821bc9b in AllocSetDelete ()
Previous frame inner to this frame (corrupt stack?)

Furthermore, the backend doesn't respond to query cancel requests from psql
during execution of the query.

Joachim
Joachim Wieland wrote:
> this query makes postmaster (beta4) die with signal 11:
>
> (echo "CREATE TABLE footest(";
> for i in `seq 0 66000`; do
>     echo "col$i int NOT NULL,";
> done;
> echo "PRIMARY KEY(col0));") | psql test
>
> ERROR: tables can have at most 1600 columns
> LOG: server process (PID 2140) was terminated by signal 11
> LOG: terminating any other active server processes
> LOG: all server processes terminated; reinitializing

At best you're going to get the error message above: "tables can have at
most 1600 columns". But this is definitely a bug: we end up triggering
this assertion:

TRAP: BadArgument("!(attributeNumber >= 1)", File: "tupdesc.c", Line: 405)

This specific assertion is triggered because we represent attribute
numbers throughout the code base as a (signed) int16 -- the assertion
failure has occurred because an int16 has wrapped around due to overflow.
A fix would be to add a check to DefineRelation() (or even earlier) to
reject CREATE TABLEs with more than "MaxHeapAttributeNumber" columns. We
eventually do perform this check in heap_create_with_catalog(), but by
that point it is too late: various functions have been invoked that
assume we're dealing with a sane number of columns.

BTW, I noticed that there's an O(n^2) algorithm to check for duplicate
column names in DefineRelation() -- with 60,000 column names that takes
minutes to execute, but an inconsequential amount of time with 1500
column names. So I don't think there's any point rewriting the algorithm
to be faster -- provided we move the check for MaxHeapAttributeNumber
previous to this loop, we should be fine.

Thanks for the report.

-Neil
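(As a rough illustration of the kind of early check being proposed here,
something along these lines in DefineRelation() would reject the statement
before any attribute number can overflow an int16. The placement on
stmt->tableElts and the use of ERRCODE_TOO_MANY_COLUMNS are assumptions
for the sketch, not a quote of the eventual patch:)

    /* hypothetical early sanity check on the raw column list */
    if (list_length(stmt->tableElts) > MaxHeapAttributeNumber)
        ereport(ERROR,
                (errcode(ERRCODE_TOO_MANY_COLUMNS),
                 errmsg("tables can have at most %d columns",
                        MaxHeapAttributeNumber)));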
On Sun, 2004-11-14 at 10:05, Neil Conway wrote:
> Joachim Wieland wrote:
> > this query makes postmaster (beta4) die with signal 11:
> >
> > (echo "CREATE TABLE footest(";
> > for i in `seq 0 66000`; do
> >     echo "col$i int NOT NULL,";
> > done;
> > echo "PRIMARY KEY(col0));") | psql test
> >
> > ERROR: tables can have at most 1600 columns
> > LOG: server process (PID 2140) was terminated by signal 11
> > LOG: terminating any other active server processes
> > LOG: all server processes terminated; reinitializing
>
> At best you're going to get the error message above: "tables can have at
> most 1600 columns". But this is definitely a bug: we end up triggering
> this assertion:
>
> TRAP: BadArgument("!(attributeNumber >= 1)", File: "tupdesc.c", Line: 405)
>
> This specific assertion is triggered because we represent attribute
> numbers throughout the code base as a (signed) int16 -- the assertion
> failure has occurred because an int16 has wrapped around due to
> overflow. A fix would be to add a check to DefineRelation() (or even
> earlier) to reject CREATE TABLEs with more than "MaxHeapAttributeNumber"
> columns. We eventually do perform this check in
> heap_create_with_catalog(), but by that point it is too late: various
> functions have been invoked that assume we're dealing with a sane number
> of columns.
>
> BTW, I noticed that there's an O(n^2) algorithm to check for duplicate
> column names in DefineRelation() -- with 60,000 column names that takes
> minutes to execute, but an inconsequential amount of time with 1500
> column names. So I don't think there's any point rewriting the algorithm
> to be faster -- provided we move the check for MaxHeapAttributeNumber
> previous to this loop, we should be fine.

This seems too obvious a problem to have caused a bug ... presumably this
has been there for a while?

Does this mean that we do not have regression tests for each maximum
setting ... i.e. are we missing a whole class of tests in the regression
tests?

--
Best Regards, Simon Riggs
Neil Conway <neilc@samurai.com> writes:
> This specific assertion is triggered because we represent attribute
> numbers throughout the code base as a (signed) int16 -- the assertion
> failure has occurred because an int16 has wrapped around due to
> overflow. A fix would be to add a check to DefineRelation() (or even
> earlier) to reject CREATE TABLEs with more than "MaxHeapAttributeNumber"
> columns.

Good analysis. We can't check earlier than DefineRelation AFAICS, because
earlier stages don't know about inherited columns.

On reflection I suspect there are similar issues with SELECTs that have
more than 64K output columns. This probably has to be guarded against in
parser/analyze.c.

        regards, tom lane
On Sun, 2004-11-14 at 18:29 -0500, Tom Lane wrote:
> Good analysis. We can't check earlier than DefineRelation AFAICS,
> because earlier stages don't know about inherited columns.
>
> On reflection I suspect there are similar issues with SELECTs that have
> more than 64K output columns. This probably has to be guarded against
> in parser/analyze.c.

You're correct -- we also crash on extremely long SELECT statements.
Another variant of the problem would be a CREATE TABLE that inherits
from, say, 70 relations, each of which has 1,000 columns.

Attached is a patch. Not entirely sure that the checks I added are in the
right places, but at any rate this fixes the three identified problems
for me.

-Neil
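(Sketch of the companion check the inheritance variant calls for: the
column count has to be tested again after the parents' columns have been
merged in, otherwise "70 parents x 1,000 columns" slips past a check on
the new table's own column list alone. The variable names inhSchema and
schema are assumptions about the surrounding MergeAttributes() code, not
a quote of the attached patch:)

    /* hypothetical post-merge check in MergeAttributes()/DefineRelation() */
    if (list_length(inhSchema) + list_length(schema) > MaxHeapAttributeNumber)
        ereport(ERROR,
                (errcode(ERRCODE_TOO_MANY_COLUMNS),
                 errmsg("tables can have at most %d columns",
                        MaxHeapAttributeNumber)));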
On Sun, 2004-11-14 at 11:24 +0000, Simon Riggs wrote:
> This seems too obvious a problem to have caused a bug

Well, I'd imagine that we've checked CREATE TABLE et al. with
somewhat-too-large values (like 2000 columns), which wouldn't be
sufficiently large to trigger the problem.

> presumably this has been there for a while?

Not sure.

> Does this mean that we do not have regression tests for each maximum
> setting ... i.e. are we missing a whole class of tests in the
> regression tests?

I'm always in favour of more regression tests -- patches are welcome :)
That said, there are some minor logistical problems with testing that a
70,000 column CREATE TABLE fails cleanly instead of crashing (it would be
nice not to have to include all that text in the regression tests
themselves).

-Neil
Neil Conway <neilc@samurai.com> writes:
> Attached is a patch. Not entirely sure that the checks I added are in
> the right places, but at any rate this fixes the three identified
> problems for me.

I think the SELECT limit should be MaxTupleAttributeNumber not
MaxHeapAttributeNumber. The point of the differential is to allow you a
bit of slop to do extra stuff (like sorting) when selecting from a
max-width table, but the proposed patch takes that away.

As for the placement issue, I'm OK with what you did in tablecmds.c, but
for the SELECT case I think that testing in transformTargetList is
probably not good. It definitely doesn't do to test at the top of the
routine, because we haven't expanded '*' yet. But testing at the bottom
probably won't do either since it doesn't cover later addition of junk
attributes --- in other words you could probably still crash it by
specifying >64K distinct ORDER BY values.

What I think needs to happen is to check p_next_resno at some point after
the complete tlist has been built. Since that's an int, we don't need to
worry too much about it overflowing, so one test at the end should do
(though I suppose if you're really paranoid you could instead add a test
everywhere it's used/incremented).

        regards, tom lane
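(For context, the two limits are separate constants in access/htup.h; the
values below are recalled from headers of that era rather than taken from
the thread, so treat the exact numbers as assumed:)

    /* access/htup.h -- approximate excerpt, values assumed */
    #define MaxTupleAttributeNumber 1664    /* 8 * 208: any tuple, e.g. sort output */
    #define MaxHeapAttributeNumber  1600    /* 8 * 200: on-disk heap tuples */

The 64-attribute difference is the slop Tom refers to: it lets a sort or
junk column be tacked onto a query over a max-width heap table.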
Neil Conway <neilc@samurai.com> writes:
> On Sun, 2004-11-14 at 11:24 +0000, Simon Riggs wrote:
>> Does this mean that we do not have regression tests for each maximum
>> setting ... i.e. are we missing a whole class of tests in the
>> regression tests?

> That said, there are some minor logistical problems with testing that a
> 70,000 column CREATE TABLE fails cleanly instead of crashing (it would
> be nice not to have to include all that text in the regression tests
> themselves).

There are always limits; this just catches the next level up. Are we
going to try to test whether the behavior is appropriate when running out
of memory to store the tlist? (I think it's OK, but there's surely no
regression test for it.) Are we going to test whether the behavior is
appropriate when the list length overflows an int32? (I can pretty much
guarantee it isn't, but you'll need a 64-bit machine to reach this level
before you hit out-of-memory, and you'll need very large quantities of
patience as well as RAM.)

A regression test for the latter case is silly on its face, though if
Moore's law can hold up for another couple decades it might someday not
be.

        regards, tom lane
On Mon, 2004-11-15 at 21:08 -0500, Tom Lane wrote:
> Are we going to try to test whether the behavior is appropriate when
> running out of memory to store the tlist?

We absolutely should: segfaulting on OOM is not acceptable behavior.
Testing that we recover safely when palloc() elogs (or _any_ routine
elogs) would be a good idea. I'd guess model checking would help here.

-Neil
On Mon, 2004-11-15 at 20:53 -0500, Tom Lane wrote:
> I think the SELECT limit should be MaxTupleAttributeNumber not
> MaxHeapAttributeNumber.

Ah, true -- I forgot about the distinction...

> What I think needs to happen is to check p_next_resno at some point
> after the complete tlist has been built.

Attached is a revised patch -- I just did the check at the end of
transformStmt(), since otherwise we'll need to duplicate code in the
various places that resnos are used/incremented (set operation
statements, normal selects, updates, and so on). This is somewhat fragile
in that we usually assign p_next_resno to an AttrNumber and only check
for overflow at the end of the analysis phase, but it seems safe for the
moment...

BTW I figure this should be backpatched to REL7_4_STABLE. Barring any
objections I will do that (and apply to HEAD) this evening.

-Neil
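(A rough sketch of what such an end-of-analysis check could look like;
the exact comparison and message are assumptions, not necessarily the
patch as committed:)

    /*
     * Hypothetical check at the end of transformStmt(): by this point the
     * target list is complete, including junk attributes such as ORDER BY
     * expressions, and p_next_resno is one past the highest resno assigned.
     */
    if (pstate->p_next_resno - 1 > MaxTupleAttributeNumber)
        ereport(ERROR,
                (errcode(ERRCODE_TOO_MANY_COLUMNS),
                 errmsg("target lists can have at most %d entries",
                        MaxTupleAttributeNumber)));

Doing the test once here, as Tom suggests, avoids sprinkling the same
check over every place that uses or increments p_next_resno.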
Neil Conway <neilc@samurai.com> writes:
> Attached is a revised patch -- I just did the check at the end of
> transformStmt(),

Looks OK offhand.

> BTW I figure this should be backpatched to REL7_4_STABLE. Barring any
> objections I will do that (and apply to HEAD) this evening.

No objection here.

        regards, tom lane
On Tue, 2004-11-16 at 16:25 +1100, Neil Conway wrote:
> Attached is a revised patch

Applied to HEAD, and backpatched to REL7_4_STABLE.

-Neil