Thread: Join query crashes 7.3.1 backend...
Hi everyone, I'm having a problem executing a join query which causes the postgresql backend to crash. The query I'm trying to execute is: select t2.resid from res t2, line t1 where t2.resid=t1.resid and t1.layer=355::bigint; When this runs then it causes the backend to die and disconnect from psql :(. I did some initial tests with a limit clause to ensure to see where things were going wrong and I managed to locate roughly where the problem was since I could find where the query failed. select t2.resid from res t2, line t1 where t2.resid=t1.resid and t1.layer=355::bigint limit 710050; ...worked fine whereas.... select t2.resid from res t2, line t1 where t2.resid=t1.resid and t1.layer=355::bigint limit 710075; ...caused the backend to die. So I was beginning to suspect that somehow the data in the table had been corrupted as I've been executing the same query with many different values of layer before without any problems. So firstly I tried the line table: select resid from line where layer=355::bigint; ...which worked without any problems whatsoever. So I was guessing that the corrupted rows were in the res table. Using the \o command, I dumped all the resids from the above query into a file and wrote a libpq program to read in all the resids from the file and query them one by one in the form 'select resid,* from res where resid=x::bigint' where x was each line from the above query. This is where things start getting strange because the program happily returned all rows produced by the 'select resid from line where layer=355::bigint' on line far beyond the problem area... up until the 2M+ mark when I decided to stop it. So I was now thinking that the data in res was not corrupted either. My next step was to verify that the rows in both the joined and unjoined queries on the line table were being returned in the same order. I did an explain which returned the following: QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------ Hash Join (cost=113489207.76..119550368.76 rows=28567936 width=16) Hash Cond: ("outer".resid = "inner".resid) -> Seq Scan on res t2 (cost=100000000.00..103939122.92 rows=72475392 width=8) -> Hash (cost=13306193.92..13306193.92 rows=28567936 width=8) -> Index Scan using line_resid_idx on line t1 (cost=0.00..13306193.92 rows=28567936 width=8) Filter: (layer = 355::bigint) (6 rows) So my assumption from this was that the data was being returned in the order of the line table. To check this, I executed the last query that worked again: select t2.resid from res t2, line t1 where t2.resid=t1.resid and t1.layer=355::bigint limit 710050; ...but this time recorded the output into a file. And then the following: select resid from line where layer=355::bigint; I then did a diff against the output of both queries to make sure that the resid rows were being returned in the same order. As suspected, the diff showed that the resids were being returned in the same order in both cases up to the 710050th result (which is within 50 rows of whatever was causing it to crash). So now I'm stuck as it seems pulling out the data around offset 710050 separately from both tables does not cause a problem whereas executing the join directly causes the backend to crash. I have verified (to within 25 rows) that the rows are being accessed in the same order in both cases. Can anyone suggest why this is happening? I can supply additional debug information if people can direct me as to what to do. Many thanks, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > I'm having a problem executing a join query which causes the postgresql > backend to crash. Please try 7.3.2. There's a (longstanding) overflow problem in the hash join code that only triggers on sufficiently large input sets. It was detected and fixed a couple months ago... regards, tom lane
Hi Tom, Thanks for the hint, I'll give that a go later today or tomorrow and see if that fixes the problem. Incidentally, I did try looking for a changelog for 7.3.2 to see if there were any known issues that had been fixed since 7.3.1, but it appears that the release notes pages in the documentation have not been updated since the release of 7.3 (see http://www.postgresql.org/docs/view.php?version=7.3&idoc=0&file=release. html) Cheers, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person. > -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: 25 February 2003 15:17 > To: Mark Cave-Ayland > Cc: PostgreSQL General > Subject: Re: [GENERAL] Join query crashes 7.3.1 backend... > > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > > I'm having a problem executing a join query which causes the postgresql > > backend to crash. > > Please try 7.3.2. There's a (longstanding) overflow problem in the > hash join code that only triggers on sufficiently large input sets. > It was detected and fixed a couple months ago... > > regards, tom lane
Hi Tom, I've upgraded to 7.3.2 and the problem still exists :(. I'm not that familiar with the workings of GDB but I managed to get a backtrace for you: (gdb) continue Continuing. Program received signal SIGSEGV, Segmentation fault. 0x0811ed8c in BufFileWrite () (gdb) bt #0 0x0811ed8c in BufFileWrite () #1 0x080dc237 in ExecHashJoinSaveTuple () #2 0x080db201 in ExecHash () #3 0x080d5f6d in ExecProcNode () #4 0x080dbb06 in ExecHashJoin () #5 0x080d5f7d in ExecProcNode () #6 0x080d4d44 in ExecutePlan () #7 0x080d4389 in ExecutorRun () #8 0x081294fb in ProcessQuery () #9 0x081279bb in pg_exec_query_string () #10 0x08128ad3 in PostgresMain () #11 0x0810f044 in DoBackend () #12 0x0810e92d in BackendStartup () #13 0x0810d9c0 in ServerLoop () #14 0x0810d3f9 in PostmasterMain () #15 0x080e8cdd in main () #16 0x42017499 in __libc_start_main () from /lib/i686/libc.so.6 (gdb) You may be on the right track with your overflow theory - we estimate that the query should return somewhere in the region of 29M records as this is one of the largest layers we have in the DB. Please let me know if you need any more information and I will do my best to help. Cheers, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > I've upgraded to 7.3.2 and the problem still exists :(. I'm not that > familiar with the workings of GDB but I managed to get a backtrace for > you: Boy, that backtrace certainly looks exactly like the hashjoin problem that I thought I fixed ... and Rae Stiening, the original complainant, confirmed that the patch fixed the problem for him. You're certain this is 7.3.2? I'd recommend rebuilding with --enable-debug so you can get a more detailed stack trace. The thing I'd want to look at is whether ExecHashJoinGetBatch() is returning an out-of-range batch number to ExecHashJoin() just before the crash occurs. It would also be useful to look at the nbatch, nbuckets, and totalbuckets fields of the hashtable structure in ExecHashJoin. regards, tom lane
> -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: 26 February 2003 15:17 > To: Mark Cave-Ayland > Cc: PostgreSQL General > Subject: Re: Join query crashes 7.3.1 backend... > > Boy, that backtrace certainly looks exactly like the hashjoin > problem that I thought I fixed ... and Rae Stiening, the > original complainant, confirmed that the patch fixed the > problem for him. You're certain this is 7.3.2? Hi Tom, Hope you got my last email (we've been having mail server problems). Just in case you didn't, I found that the SRPM I got yesterday was 7.3-2 and *NOT* 7.3.2-1. I've now recompiled the right RPMs and installed them... and the problem has gone away :) So yes, the problem was fixed between 7.3.1 and 7.3.2. I'm now off to PEBCAK school to study basic file identification.... Thanks again, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.