DML and column count in aggregated subqueries - Mailing list pgsql-hackers

From Andres Freund
Subject DML and column count in aggregated subqueries
Date
Msg-id 20161031110245.o4ujpioamf62uhdo@alap3.anarazel.de
Responses Re: DML and column count in aggregated subqueries  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

while hacking up nodeAgg.c to use a dedicated slot/desc for the
hashtable, an assert I placed brought up a weird case:

regression[29535][1]=# EXPLAIN VERBOSE update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
┌───────────────────────────────────────────────────────────────────────────────────────┐
│                                      QUERY PLAN                                       │
├───────────────────────────────────────────────────────────────────────────────────────┤
│ Update on public.bar  (cost=42.75..109.25 rows=1130 width=20)                         │
│   ->  Hash Join  (cost=42.75..109.25 rows=1130 width=20)                              │
│         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid                            │
│         Hash Cond: (bar.f1 = foo.f1)                                                  │
│         ->  Seq Scan on public.bar  (cost=0.00..32.60 rows=2260 width=14)             │
│               Output: bar.f1, bar.f2, bar.ctid                                        │
│         ->  Hash  (cost=40.25..40.25 rows=200 width=10)                               │
│               Output: foo.ctid, foo.f1                                                │
│               ->  HashAggregate  (cost=38.25..40.25 rows=200 width=10)                │
│                     Output: foo.ctid, foo.f1                                          │
│                     Group Key: foo.f1                                                 │
│                     ->  Seq Scan on public.foo  (cost=0.00..32.60 rows=2260 width=10) │
│                           Output: foo.ctid, foo.f1                                    │
└───────────────────────────────────────────────────────────────────────────────────────┘
(13 rows)

This doesn't look right. The ctid shouldn't be in the aggregate output -
after grouping it can only come from some arbitrary row of the group, so
it's pretty much meaningless here.

Note that in this case find_hash_columns(aggstate) will return two
columns, but node->numCols will be 1 (and thus node->grpOperators will
only have one element).
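
To make the mismatch concrete, a cross-check along the following lines
fires for the plan above (this is just an illustrative sketch, not the
actual assert from my hashtable-slot changes): find_hash_columns() wants
to keep two columns (f1 and ctid), while the grouping information only
covers f1.

Agg        *node = (Agg *) aggstate->ss.ps.plan;

/*
 * Hypothetical sanity check, for illustration only: the columns stored in
 * hashtable entries should (one would think) match the grouping columns,
 * but here list_length(...) is 2 while node->numCols is 1.
 */
Assert(list_length(find_hash_columns(aggstate)) == node->numCols);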

So it seems we're doing something wrong here.

Casting a wider net: find_hash_columns() and its subroutines seem like
pretty much dead code, including outdated comments (look for "SQL99
semantics"). The main explanation says:

/*
 * Create a list of the tuple columns that actually need to be stored in
 * hashtable entries.  The incoming tuples from the child plan node will
 * contain grouping columns, other columns referenced in our targetlist and
 * qual, columns used to compute the aggregate functions, and perhaps just
 * junk columns we don't use at all.  Only columns of the first two types
 * need to be stored in the hashtable, and getting rid of the others can
 * make the table entries significantly smaller.  To avoid messing up Var
 * numbering, we keep the same tuple descriptor for hashtable entries as the
 * incoming tuples have, but set unwanted columns to NULL in the tuples that
 * go into the table.
 */

but that seems bogus - when are we referencing "other columns referenced
in our targetlist and qual" from the representative tuple that aren't also
grouped upon? This method would select a pretty arbitrary value for those
columns: since we don't hash or compare them, we'd just get whatever value
the group's first tuple happened to contain?
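
For reference, what find_hash_columns() effectively computes (paraphrased
from memory, not a verbatim copy of nodeAgg.c) is the union of the
grouping columns and whatever other columns the targetlist/qual reference -
which is how foo.ctid ends up being stored here even though we only group
on f1:

/*
 * Rough paraphrase of the current logic, not the real function: keep the
 * grouping columns plus any other columns referenced by the targetlist or
 * qual; everything else gets NULLed out before the tuple is stored.
 */
static List *
find_hash_columns_sketch(AggState *aggstate)
{
    Agg        *node = (Agg *) aggstate->ss.ps.plan;
    Bitmapset  *colnos = find_unaggregated_cols(aggstate); /* tlist/qual Vars */
    List       *collist = NIL;
    int         i;

    /* add the grouping columns themselves */
    for (i = 0; i < node->numCols; i++)
        colnos = bms_add_member(colnos, node->grpColIdx[i]);

    /* flatten into a list of attribute numbers */
    while ((i = bms_first_member(colnos)) >= 0)
        collist = lappend_int(collist, i);
    bms_free(colnos);

    return collist;
}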

Greetings,

Andres Freund


