Thread: bug with aggregates
Hi,

while playing with aggregates I found this bug in the planner:

dz=> select count(1) from my_table;
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
We have lost the connection to the backend, so further processing is impossible.
  Terminating.

The debugger prints the following information:

(xxgdb) cont
Program received signal SIGSEGV, Segmentation fault.
0x80d93cf in set_agg_tlist_references (aggNode=0x82a4310) at setrefs.c:765
(xxgdb) info stack
#0  0x80d93cf in set_agg_tlist_references (aggNode=0x82a4310) at setrefs.c:765
#1  0x80d80ac in union_planner (parse=0x82a40a0) at planner.c:319
#2  0x80d7d05 in planner (parse=0x82a40a0) at planner.c:83
#3  0x80fd344 in pg_parse_and_plan (
    query_string=0xbffef2d8 "select count(1) from my_table;", typev=0x0,
    nargs=0, queryListP=0xbffef268, dest=Remote, aclOverride=0 '\000')
    at postgres.c:590
#4  0x80fd4a3 in pg_exec_query_dest (
    query_string=0xbffef2d8 "select count(1) from my_table;", dest=Remote,
    aclOverride=0) at postgres.c:678
#5  0x80fd454 in pg_exec_query (
    query_string=0xbffef2d8 "select count(1) from my_table;")
    at postgres.c:656
#6  0x80fe6c8 in PostgresMain (argc=9, argv=0xbffff850, real_argc=6,
    real_argv=0xbffffd6c) at postgres.c:1658
#7  0x80e32ec in DoBackend (port=0x8235ca8) at postmaster.c:1628
(xxgdb) print *aggNode
$2 = {
  plan = {
    type = T_Agg,
    cost = 0,
    plan_size = 0,
    plan_width = 0,
    plan_tupperpage = 0,
    state = 0x0,
    targetlist = 0x82a44f8,
    qual = 0x0,
    lefttree = 0x0,
    righttree = 0x0,
    extParam = 0x0,
    locParam = 0x0,
    chgParam = 0x0,
    initPlan = 0x0,
    subPlan = 0x0,
    nParamExec = 0
  },
  aggs = 0x0,
  aggstate = 0x0
}
(xxgdb)

The problem is caused by a null plan.lefttree in set_agg_tlist_references()
(setrefs.c:765), but I don't know what it means:

    subplanTargetList = aggNode->plan.lefttree->targetlist;

-- 
Massimo Dal Zotto

+----------------------------------------------------------------------+
|  Massimo Dal Zotto               email: dz@cs.unitn.it               |
|  Via Marconi, 141                phone: ++39-0461534251              |
|  38057 Pergine Valsugana (TN)      www: http://www.cs.unitn.it/~dz/  |
|  Italy                             pgp: finger dz@tango.cs.unitn.it  |
+----------------------------------------------------------------------+
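To make the failing line concrete: the gdb print shows lefttree = 0x0, so aggNode->plan.lefttree->targetlist dereferences a null pointer. The following is only a minimal sketch of that access pattern with a guard added. MiniPlan, MiniAgg and subplan_target_list are hypothetical stand-ins, not PostgreSQL's real node types or functions, and a guard like this only avoids the segfault; it says nothing about why the Agg node has no child plan in the first place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for the planner's node types. */
typedef struct MiniPlan {
    const char      *targetlist; /* stands in for the real target list */
    struct MiniPlan *lefttree;   /* child plan; NULL when there is none */
} MiniPlan;

typedef struct MiniAgg {
    MiniPlan plan;
} MiniAgg;

/*
 * Guarded form of the access reported at setrefs.c:765.
 * Returns NULL instead of dereferencing a missing child plan.
 */
const char *subplan_target_list(const MiniAgg *agg)
{
    assert(agg != NULL);
    if (agg->plan.lefttree == NULL)
        return NULL;             /* the case shown in the gdb print */
    return agg->plan.lefttree->targetlist;
}
```

With an Agg node whose lefttree is NULL, as in the debugger output, this sketch returns NULL rather than faulting; with a child plan attached it returns the child's target list.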
Massimo Dal Zotto <dz@cs.unitn.it> writes:
> dz=> select count(1) from my_table;
> pqReadData() -- backend closed the channel unexpectedly.

Oops.  Probably not a big enough bug to delay 6.5 release for, but I'll
look into it and commit a fix shortly after the release.  I think the
parser may be doing the wrong thing here.

Thanks!

                        regards, tom lane
Massimo Dal Zotto <dz@cs.unitn.it> writes:
> dz=> select count(1) from my_table;
> pqReadData() -- backend closed the channel unexpectedly.

Further notes --- I find that you can get the same crash with no table
at all,

    select count(1);

6.4.2 executes both queries --- but curiously enough, it produces "1"
regardless of the size of the table you mention, which is not surprising
when you look at its plan ... it optimizes out the scan of the table
entirely.

But if you do

    select a,count(1) from table group by a;

then you get a count of the number of rows in each group, which is more
or less what I'd expect.  This behavior is not consistent with the
ungrouped case.

After a quick gander at the SQL spec, I see no evidence that either of
these queries is allowed by the spec.  I'm inclined to think that
"select count(1);" ought to be disallowed and "select count(1) from
my_table;" ought to be treated the same as "select count(*) from
my_table;", like it is in the grouped case.

Comments?

                        regards, tom lane
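The proposal that count(1) over a table should match count(*) follows from the usual SQL rule that count(expr) counts the rows where expr is not null, and a constant 1 is never null. A minimal sketch of that rule, assuming nothing about the backend's implementation: NullableInt, count_star and count_expr are hypothetical names made up for this illustration.

```c
#include <stddef.h>

/* Hypothetical nullable integer standing in for an SQL value. */
typedef struct NullableInt {
    int is_null;   /* nonzero when the value is SQL NULL */
    int value;
} NullableInt;

/* count(*): every row counts, whatever it contains. */
size_t count_star(size_t nrows)
{
    return nrows;
}

/* count(expr): only rows whose expression value is not null count. */
size_t count_expr(const NullableInt *vals, size_t nrows)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < nrows; i++)
        if (!vals[i].is_null)
            n++;
    return n;
}
```

Evaluating the constant 1 once per row gives a non-null value every time, so count_expr equals count_star for any row count, including 0 for an empty table. That differs from the "1" that 6.4.2 reports in the ungrouped case.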
Massimo Dal Zotto <dz@cs.unitn.it> writes:
> dz=> select count(1) from my_table;
> pqReadData() -- backend closed the channel unexpectedly.

Poking into this failure revealed a potentially serious problem in
execQual.c, so I decided it would be wise to fix it now rather than
wait till after 6.5.

In the situation where ExecTargetList() is asked to generate a null
tuple --- which arises in the case above, and evidently in other cases
judging from the comments there and the multiple bogus ways that people
have tried to fix it before --- it was handing back a palloc'd but
uninitialized chunk of memory.  This would result in unpredictable
behavior if anyone actually tried to do anything with the tuple.  In the
case above, nodeAgg.c tried to copy the tuple, leading to coredumps
some of the time.  I fixed ExecTargetList to generate a valid tuple
containing zero attributes, which should work reliably.

I had managed to break the planner's handling of this case too, so I
figured I would fix that as long as I was annoying Marc anyway ;-).
The behavior is now back to that of 6.4.2: you get "1" when the query
is not grouped and row counts when it is.  I still think that that's
wrong, but I will not risk trying to change it just before release.

                        regards, tom lane
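The difference between an uninitialized chunk and a valid tuple with zero attributes can be sketched roughly as follows. This is an illustration under assumptions, not the actual execQual.c change: MiniTuple, make_empty_tuple and copy_tuple are hypothetical, malloc stands in for palloc, and the real heap tuple layout is more involved. The point is only that a copy routine trusts the header, so the header must be filled in even when there is no attribute data.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature tuple: just a header, no attribute data. */
typedef struct MiniTuple {
    size_t len;    /* total size in bytes, header included */
    int    natts;  /* number of attributes */
} MiniTuple;

/* Build a well-formed tuple with zero attributes: every header field set. */
MiniTuple *make_empty_tuple(void)
{
    MiniTuple *t = malloc(sizeof(MiniTuple));

    if (t == NULL)
        return NULL;
    t->len = sizeof(MiniTuple);  /* left unset, a copy would read garbage */
    t->natts = 0;
    return t;
}

/* Copy a tuple by trusting its header; safe only if src->len is valid. */
MiniTuple *copy_tuple(const MiniTuple *src)
{
    MiniTuple *dst;

    assert(src != NULL);
    dst = malloc(src->len);
    if (dst != NULL)
        memcpy(dst, src, src->len);
    return dst;
}
```

If make_empty_tuple skipped the two assignments, copy_tuple would allocate and copy a garbage number of bytes, which is one plausible way to get crashes only "some of the time".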