Thread: bug with aggregates

bug with aggregates

From
Massimo Dal Zotto
Date:
Hi,

while playing with aggregates I found this bug in the planner:

dz=> select count(1) from my_table;
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
We have lost the connection to the backend, so further processing is impossible.  Terminating.

The debugger prints the following information:

(xxgdb) cont
Program received signal SIGSEGV, Segmentation fault.
0x80d93cf in set_agg_tlist_references (aggNode=0x82a4310) at setrefs.c:765
(xxgdb) info stack
#0  0x80d93cf in set_agg_tlist_references (aggNode=0x82a4310) at setrefs.c:765
#1  0x80d80ac in union_planner (parse=0x82a40a0) at planner.c:319
#2  0x80d7d05 in planner (parse=0x82a40a0) at planner.c:83
#3  0x80fd344 in pg_parse_and_plan (query_string=0xbffef2d8 "select count(1) from my_table;", typev=0x0, nargs=0,
    queryListP=0xbffef268, dest=Remote, aclOverride=0 '\000') at postgres.c:590
#4  0x80fd4a3 in pg_exec_query_dest (query_string=0xbffef2d8 "select count(1) from my_table;", dest=Remote,
    aclOverride=0) at postgres.c:678
#5  0x80fd454 in pg_exec_query (query_string=0xbffef2d8 "select count(1) from my_table;") at postgres.c:656
#6  0x80fe6c8 in PostgresMain (argc=9, argv=0xbffff850, real_argc=6, real_argv=0xbffffd6c) at postgres.c:1658
#7  0x80e32ec in DoBackend (port=0x8235ca8) at postmaster.c:1628
(xxgdb) print *aggNode
$2 = {
  plan = {
    type = T_Agg,
    cost = 0,
    plan_size = 0,
    plan_width = 0,
    plan_tupperpage = 0,
    state = 0x0,
    targetlist = 0x82a44f8,
    qual = 0x0,
    lefttree = 0x0,
    righttree = 0x0,
    extParam = 0x0,
    locParam = 0x0,
    chgParam = 0x0,
    initPlan = 0x0,
    subPlan = 0x0,
    nParamExec = 0
  },
  aggs = 0x0,
  aggstate = 0x0
}
(xxgdb) 

The problem is caused by a null plan.lefttree in set_agg_tlist_references()
(setrefs.c:765), but I don't know what it means:
subplanTargetList = aggNode->plan.lefttree->targetlist;
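
The crash pattern can be shown in isolation. The sketch below is not the
planner's code: Plan, Agg and List are hypothetical, heavily simplified
stand-ins for the real node types, and agg_subplan_tlist() is an invented
helper. It only illustrates that the line above dereferences lefttree
unconditionally, and what a guarded version of the same access might look
like when an Agg node has no input plan.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the planner's node types. */
typedef struct List List;       /* opaque; only pointers are used here */

typedef struct Plan
{
    List        *targetlist;    /* output columns of this plan node */
    struct Plan *lefttree;      /* input plan, or NULL if there is none */
} Plan;

typedef struct Agg
{
    Plan plan;
} Agg;

/*
 * Hypothetical helper: return the target list of the Agg node's input
 * plan, or NULL when the node has no input plan. The unguarded form,
 * aggNode->plan.lefttree->targetlist, faults when lefttree is NULL.
 */
List *
agg_subplan_tlist(const Agg *aggNode)
{
    if (aggNode == NULL || aggNode->plan.lefttree == NULL)
        return NULL;
    return aggNode->plan.lefttree->targetlist;
}
```

A guard like this would only avoid the segmentation fault; whether an Agg
node without an input plan should exist at all is a separate question.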

-- 
Massimo Dal Zotto

+----------------------------------------------------------------------+
|  Massimo Dal Zotto               email: dz@cs.unitn.it               |
|  Via Marconi, 141                phone: ++39-0461534251              |
|  38057 Pergine Valsugana (TN)      www: http://www.cs.unitn.it/~dz/  |
|  Italy                             pgp: finger dz@tango.cs.unitn.it  |
+----------------------------------------------------------------------+


Re: [HACKERS] bug with aggregates

From
Tom Lane
Date:
Massimo Dal Zotto <dz@cs.unitn.it> writes:
> dz=> select count(1) from my_table;
> pqReadData() -- backend closed the channel unexpectedly.

Oops.  Probably not a big enough bug to delay 6.5 release for,
but I'll look into it and commit a fix shortly after the release.
I think the parser may be doing the wrong thing here.  Thanks!
        regards, tom lane


Re: [HACKERS] bug with aggregates

From
Tom Lane
Date:
Massimo Dal Zotto <dz@cs.unitn.it> writes:
> dz=> select count(1) from my_table;
> pqReadData() -- backend closed the channel unexpectedly.

Further notes --- I find that you can get the same crash with no table
at all:
    select count(1);

6.4.2 executes both queries --- but curiously enough, it produces "1"
regardless of the size of the table you mention, which is not surprising
when you look at its plan ... it optimizes out the scan of the table
entirely.  But if you do
    select a,count(1) from table group by a;
then you get a count of the number of rows in each group, which is more
or less what I'd expect.  This behavior is not consistent with the
ungrouped case.

After a quick gander at the SQL spec, I see no evidence that either of
these queries is allowed by the spec.  I'm inclined to think that
"select count(1);" ought to be disallowed and "select count(1) from
my_table;" ought to be treated the same as "select count(*) from
my_table;", like it is in the grouped case.  Comments?
        regards, tom lane


Re: [HACKERS] bug with aggregates

From
Tom Lane
Date:
Massimo Dal Zotto <dz@cs.unitn.it> writes:
> dz=> select count(1) from my_table;
> pqReadData() -- backend closed the channel unexpectedly.

Poking into this failure revealed a potentially serious problem in
execQual.c, so I decided it would be wise to fix it now rather than
wait till after 6.5.  In the situation where ExecTargetList() is asked
to generate a null tuple --- which arises in the case above, and
evidently in other cases judging from the comments there and the
multiple bogus ways that people have tried to fix it before ---
it was handing back a palloc'd but uninitialized chunk of memory.
This would result in unpredictable behavior if anyone actually tried
to do anything with the tuple.  In the case above, nodeAgg.c tried to
copy the tuple, leading to coredumps some of the time.  I fixed
ExecTargetList to generate a valid tuple containing zero attributes,
which should work reliably.
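
To illustrate the kind of failure described above, here is a minimal
sketch. It is not the code in execQual.c: MiniTuple, make_empty_tuple()
and copy_tuple() are hypothetical names, and plain malloc/calloc stand in
for palloc. The point is only that a copy routine trusts the length field
in the tuple header, so a header that was allocated but never filled in
makes the copy read garbage, while an explicitly initialized
zero-attribute tuple copies safely.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature tuple header; the real tuple layout differs. */
typedef struct MiniTuple
{
    size_t len;     /* total size of the tuple in bytes */
    int    natts;   /* number of attributes stored */
} MiniTuple;

/*
 * Build a well-defined tuple containing zero attributes.
 * Every header field is set, so later readers see valid values.
 * The caller frees the result.
 */
MiniTuple *
make_empty_tuple(void)
{
    MiniTuple *tup = calloc(1, sizeof(MiniTuple));

    if (tup == NULL)
        return NULL;
    tup->len = sizeof(MiniTuple);
    tup->natts = 0;
    return tup;
}

/*
 * Copy a tuple. This trusts src->len, which is why an uninitialized
 * header leads to unpredictable behavior here.
 */
MiniTuple *
copy_tuple(const MiniTuple *src)
{
    MiniTuple *dst;

    if (src == NULL)
        return NULL;
    dst = malloc(src->len);
    if (dst == NULL)
        return NULL;
    memcpy(dst, src, src->len);
    return dst;
}
```

With an uninitialized header, src->len could hold any value, which fits
the report that the crash happened only some of the time.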

I had managed to break the planner's handling of this case too, so I
figured I would fix that as long as I was annoying Marc anyway ;-).

The behavior is now back to that of 6.4.2: you get "1" when the query is
not grouped and row counts when it is.  I still think that that's wrong,
but I will not risk trying to change it just before release.
        regards, tom lane