Thread: BUG #5145: Complex query with lots of LEFT JOIN causes segfault
The following bug has been logged online: Bug reference: 5145 Logged by: Bernt Marius Johnsen Email address: bernt.johnsen@sun.com PostgreSQL version: 8.3.8 Operating system: Linux 2.6.27-14 (Ubuntu Interpid) Description: Complex query with lots of LEFT JOIN causes segfault Details: The below query generated by the Random Query Generator (https://launchpad.net/randgen) causes a segfault. It was caused running ./gentest.pl --dsn=dbi:Pg:user=bernt --gendata --queries=100000 --threads=1 --grammar=/home/bernt/xx.yy (I'll follow up with a mail with xx.yy as attachment) Query: SELECT * from B AS alias0 LEFT JOIN BB AS alias1 LEFT JOIN B AS alias2 LEFT JOIN A AS alias3 LEFT JOIN AA AS alias4 LEFT JOIN B AS alias5 ON alias4.int_key = alias5.int_key ON alias3.int_key = alias4.int_key LEFT JOIN AA AS alias6 LEFT JOIN A AS alias7 ON alias6.int_key = alias7.int_key LEFT JOIN BB AS alias8 ON alias7.int_key = alias8.int_key ON alias3.int_key = alias8.int_key LEFT JOIN AA AS alias9 ON alias6.int_key = alias9.int_key ON alias2.int_key = alias8.int_key LEFT JOIN BB AS alias10 LEFT JOIN AA AS alias11 LEFT JOIN B AS alias12 ON alias11.int_key = alias12.int_key ON alias10.int_key = alias11.int_key ON alias9.int_key = alias10.int_key ON alias1.int_key = alias8.int_key LEFT JOIN BB AS alias13 LEFT JOIN A AS alias14 LEFT JOIN AA AS alias15 LEFT JOIN A AS alias16 ON alias15.int_key = alias16.int_key LEFT JOIN B AS alias17 ON alias15.int_key = alias17.int_key ON alias14.int_key = alias16.int_key LEFT JOIN AA AS alias18 ON alias14.int_key = alias18.int_key LEFT JOIN B AS alias19 ON alias15.int_key = alias19.int_key LEFT JOIN AA AS alias20 ON alias16.int_key = alias20.int_key ON alias13.int_key = alias19.int_key LEFT JOIN A AS alias21 ON alias13.int_key = alias21.int_key ON alias3.int_key = alias17.int_key LEFT JOIN B AS alias22 ON alias7.int_key = alias22.int_key LEFT JOIN A AS alias23 ON alias20.int_key = alias23.int_key LEFT JOIN A AS alias24 ON alias14.int_key = alias24.int_key LEFT JOIN BB AS alias25 LEFT JOIN BB AS alias26 ON alias25.int_key = alias26.int_key LEFT JOIN A AS alias27 LEFT JOIN A AS alias28 ON alias27.int_key = alias28.int_key LEFT JOIN B AS alias29 LEFT JOIN BB AS alias30 LEFT JOIN B AS alias31 LEFT JOIN A AS alias32 LEFT JOIN B AS alias33 ON alias32.int_key = alias33.int_key LEFT JOIN A AS alias34 ON alias32.int_key = alias34.int_key ON alias31.int_key = alias33.int_key ON alias30.int_key = alias33.int_key ON alias29.int_key = alias34.int_key ON alias27.int_key = alias34.int_key LEFT JOIN AA AS alias35 LEFT JOIN A AS alias36 ON alias35.int_key = alias36.int_key ON alias34.int_key = alias36.int_key LEFT JOIN A AS alias37 ON alias33.int_key = alias37.int_key ON alias25.int_key = alias32.int_key LEFT JOIN A AS alias38 ON alias37.int_key = alias38.int_key ON alias15.int_key = alias37.int_key ON alias0.int_key = alias9.int_key
Re: BUG #5145: Complex query with lots of LEFT JOIN causes segfault
From
Euler Taveira de Oliveira
Date:
Bernt Marius Johnsen escreveu: > The below query generated by the Random Query Generator > (https://launchpad.net/randgen) causes a segfault. It was caused running > Could you get a core dump and post the gdb backtrace? $ ulimit -c unlimited $ pg_ctl start $ psql -c "<my query goes here>" mydb $ gdb /path/to/postgres $PGDATA/core (gdb) bt . . . (gdb) quit -- Euler Taveira de Oliveira http://www.timbira.com/
>>>>>>>>>>>> Euler Taveira de Oliveira wrote (2009-10-29 12:17:36): > Bernt Marius Johnsen escreveu: > > The below query generated by the Random Query Generator > > (https://launchpad.net/randgen) causes a segfault. It was caused running > > Checkout the latest RQG from launchpad and run ./gentest.pl as shown above (The lastest tarball misses a feature you need). xx.yy is attached. Run like this: ./gentest.pl --dsn=dbi:Pg:user=xxxx --gendata --queries=100000 --threads=1 --grammar=/path/to/xx.yy > Could you get a core dump and post the gdb backtrace? > > $ ulimit -c unlimited > $ pg_ctl start > $ psql -c "<my query goes here>" mydb > $ gdb /path/to/postgres $PGDATA/core > (gdb) bt > . > . > . > (gdb) quit We'll see next week If I can spare some time. > > > -- > Euler Taveira de Oliveira > http://www.timbira.com/ -- Bernt Marius Johnsen, Staff Engineer Database Technology Group, Sun Microsystems, Trondheim, Norway
Attachment
Bernt M. Johnsen wrote: >>>>>>>>>>>>> Euler Taveira de Oliveira wrote (2009-10-29 12:17:36): >> Bernt Marius Johnsen escreveu: >>> The below query generated by the Random Query Generator >>> (https://launchpad.net/randgen) causes a segfault. It was caused running >>> > > Checkout the latest RQG from launchpad and run ./gentest.pl as shown > above (The lastest tarball misses a feature you need). xx.yy is > attached. > > Run like this: > ./gentest.pl --dsn=dbi:Pg:user=xxxx --gendata --queries=100000 --threads=1 --grammar=/path/to/xx.yy > > > >> Could you get a core dump and post the gdb backtrace? >> >> $ ulimit -c unlimited >> $ pg_ctl start >> $ psql -c "<my query goes here>" mydb >> $ gdb /path/to/postgres $PGDATA/core >> (gdb) bt >> . >> . >> . >> (gdb) quit > > We'll see next week If I can spare some time. I can easily reproduce the segfault on 8.4 and 8.5a2: Program received signal SIGSEGV, Segmentation fault. ExecHashJoinSaveTuple (tuple=0xb49c8870, hashvalue=3316173823, fileptr=0x96185a8) at nodeHashjoin.c:775 775 BufFile *file = *fileptr; (gdb) bt #0 ExecHashJoinSaveTuple (tuple=0xb49c8870, hashvalue=3316173823, fileptr=0x96185a8) at nodeHashjoin.c:775 #1 0x081cf21f in ExecHashJoin (node=0x88c6540) at nodeHashjoin.c:224 #2 0x081bd898 in ExecProcNode (node=0x88c6540) at execProcnode.c:427 #3 0x081bc445 in standard_ExecutorRun (queryDesc=0x875d22c, direction=ForwardScanDirection, count=0) at execMain.c:1187 #4 0x0828215c in PortalRunSelect (portal=0x879197c, forward=1 '\001', count=0, dest=0xb4f7bfb8) at pquery.c:953 #5 0x082834be in PortalRun (portal=0x879197c, count=2147483647, isTopLevel=1 '\001', dest=0xb4f7bfb8, altdest=0xb4f7bfb8, completionTag=0xbfe8ff9a "") at pquery.c:807 #6 0x0827f760 in exec_simple_query ( query_string=0x8751d3c " SELECT * from B AS alias0 LEFT JOIN BB AS alias1 LEFT JOIN B AS alias2 LEFT JOIN A AS alias3 LEFT JOIN AA AS alias4 LEFT JOIN B AS alias5 ON alias4.int_key = alias5.int_key O"...) at postgres.c:1000 #7 0x08280b8e in PostgresMain (argc=2, argv=0x86d7960, username=0x86d7928 "mastermind") at postgres.c:3573 #8 0x082499be in ServerLoop () at postmaster.c:3366 #9 0x0824a9db in PostmasterMain (argc=3, argv=0x86d5a98) at postmaster.c:1064 #10 0x081ea466 in main (argc=3, argv=0x86d5a98) at main.c:188 Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > I can easily reproduce the segfault on 8.4 and 8.5a2: Doesn't crash here ... could we see the specific test data being used, please? regards, tom lane
On Thu, Oct 29, 2009 at 2:14 PM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > ExecHashJoinSaveTuple (tuple=3D0xb49c8870, hashvalue=3D3316173823, > fileptr=3D0x96185a8) at nodeHashjoin.c:775 > 775 =A0 =A0 =A0 =A0 =A0 =A0 BufFile =A0 =A0*file =3D *fileptr; > That back trace doesn't make much sense. That function only has one call site and it's called with fileptr set to an expression starting with &. Ie, there's no way it can get null or an invalid pointer. --=20 greg
Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> I can easily reproduce the segfault on 8.4 and 8.5a2: > > Doesn't crash here ... could we see the specific test data being used, > please? uploaded a dump of the dataset here: http://www.kaltenbrunner.cc/files/rand_gen_data.sql and the query that causes the segfault: http://www.kaltenbrunner.cc/files/rand_gen_query.sql Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > uploaded a dump of the dataset here: > http://www.kaltenbrunner.cc/files/rand_gen_data.sql > and the query that causes the segfault: > http://www.kaltenbrunner.cc/files/rand_gen_query.sql [ scratches head... ] Still no crash here, and I tried it on a couple different types of hardware. What configure parameters are you using, and what non-default postgresql.conf settings? regards, tom lane
Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> uploaded a dump of the dataset here: >> http://www.kaltenbrunner.cc/files/rand_gen_data.sql >> and the query that causes the segfault: >> http://www.kaltenbrunner.cc/files/rand_gen_query.sql > > [ scratches head... ] Still no crash here, and I tried it on a couple > different types of hardware. What configure parameters are you using, > and what non-default postgresql.conf settings? this is 8.5a2 configured with ./configure --enable-cassert --enable-debug and just default settings(ie plain initdb with default settings). The OS is Debian Lenny/AMD64. Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > this is 8.5a2 configured with ./configure --enable-cassert > --enable-debug and just default settings(ie plain initdb with default > settings). > The OS is Debian Lenny/AMD64. Huh. That should not be noticeably different from my F11/Xeon64 machine ... but I still can't make it crash. Somebody else is gonna have to debug this one. regards, tom lane
Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> this is 8.5a2 configured with ./configure --enable-cassert >> --enable-debug and just default settings(ie plain initdb with default >> settings). >> The OS is Debian Lenny/AMD64. > > Huh. That should not be noticeably different from my F11/Xeon64 > machine ... but I still can't make it crash. Somebody else is gonna > have to debug this one. hmm sorry - that was actually Lenny/i386 however I fail to reproduce this on some of my other test boxes as well though I don't have an idea why it crashes on my laptop(and for the original reporter) but not elsewhere :( Stefan
Stefan Kaltenbrunner wrote: > Tom Lane wrote: >> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >>> this is 8.5a2 configured with ./configure --enable-cassert >>> --enable-debug and just default settings(ie plain initdb with >>> default settings). >>> The OS is Debian Lenny/AMD64. >> >> Huh. That should not be noticeably different from my F11/Xeon64 >> machine ... but I still can't make it crash. Somebody else is gonna >> have to debug this one. > > hmm sorry - that was actually Lenny/i386 however I fail to reproduce > this on some of my other test boxes as well though I don't have an idea > why it crashes on my laptop(and for the original reporter) but not > elsewhere :( ok I now see why you (and I) failed to reproduce the problem - it only causes clusters/databases to crash that were actually generated using the upthread mentioned script. it does NOT fail using a dump generated by a database that fails(!). So the issue must be a bit more complex and somehow relate to some prior stuff the script does. Stefan
On Fri, Oct 30, 2009 at 11:22 AM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > ok I now see why you (and I) failed to reproduce the problem - it only > causes clusters/databases to crash that were actually generated using the > upthread mentioned script. it does NOT fail using a dump generated by a > database that fails(!). > So the issue must be a bit more complex and somehow relate to some prior > stuff the script does. Does it still crash if you compile with CFLAGS='-O0 -g' ? Could you send a backtrace from that? -- greg
Greg Stark wrote: > On Fri, Oct 30, 2009 at 11:22 AM, Stefan Kaltenbrunner > <stefan@kaltenbrunner.cc> wrote: >> ok I now see why you (and I) failed to reproduce the problem - it only >> causes clusters/databases to crash that were actually generated using the >> upthread mentioned script. it does NOT fail using a dump generated by a >> database that fails(!). >> So the issue must be a bit more complex and somehow relate to some prior >> stuff the script does. > > Does it still crash if you compile with CFLAGS='-O0 -g' ? Could you > send a backtrace from that? ok just assembled a new testcase from the querylog of the tool: http://www.kaltenbrunner.cc/files/rand_gen_crash.sql this crashes on i386 and results in something like: ERROR: invalid memory alloc request size 8589934592 on an AMD64 host for me. However the error seems to go away after an ANALYZE... so I wonder if this is just another case of "if we missestimated the size of the hashtable we are doomed" Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > ok just assembled a new testcase from the querylog of the tool: > http://www.kaltenbrunner.cc/files/rand_gen_crash.sql Ah, that works (or should I say fails). Will take a look. > However the error seems to go away after an ANALYZE... so I wonder if > this is just another case of "if we missestimated the size of the > hashtable we are doomed" Well, we shouldn't dump core in any case. regards, tom lane
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > However the error seems to go away after an ANALYZE... so I wonder if > this is just another case of "if we missestimated the size of the > hashtable we are doomed" Actually, what seems to be happening on a 32-bit machine is that ExecChooseHashTableSize sets nbatch = INT_MAX/2, and then when we try to do hashtable->outerBatchFile = (BufFile **) palloc0(nbatch * sizeof(BufFile *)); the memory size calculation overflows to zero, so we get an empty outerBatchFile array. So the fix is to make sure we limit the number of batches to something sane, perhaps work_mem / sizeof(pointer). regards, tom lane