Re: Solving hash table overrun problems - Mailing list pgsql-hackers

From Aaron Birkland
Subject Re: Solving hash table overrun problems
Date
Msg-id 19ab0ccd05030616463919926a@mail.gmail.com
Whole thread Raw
In response to Re: Solving hash table overrun problems  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
> > This also brings up a line of thought I had a while ago on a related
> > topic.  Something like a "HashDistinct" might be useful, if it had no
> > startup cost.  

> We already have that: the planner will use a HashAgg node in this
> fashion in some contexts (I think just as one of the ways to do IN,
> at the moment).  

Hmm.. I see HashAggregate being used that way in the IN queries, but I
have not observed it used in a way that incurrs no startup cost.  It
looked to me that in doing hash aggregation in ExecAgg (nodeAgg.c),
agg_fill_hash_table() would have to be called, which iterate through
every output of the child plan building the hash table before it
returns, thus incurring at least the startup cost of executing the
entire subplan of the child node at the aggregation stage.  I'm not
too familiar with the code, so there is probably something I'm missing
somewhere :(

> It's not yet bright enough to consider doing it for
> SELECT DISTINCT.  The DISTINCT planning code is old and crufty and
> pretty tightly interwired with ORDER BY ... it needs work.

Yes, SELECT DISTINCT was my motivating example, though in my specific
application latency (i.e. not having to wait for the entire query
below the DISTINCT operation to finish) was also an important factor,
hence my thoughts on a zero startup cost hash aggregation and
wondering if it would really be any kind of win in the end.
 -Aaron


pgsql-hackers by date:

Previous
From: mchron@aueb.gr
Date:
Subject: ERROR: unrecognized node type in PostgresMain( )
Next
From: Tom Lane
Date:
Subject: Re: ERROR: unrecognized node type in PostgresMain( )