Re: wip: functions median and percentile - Mailing list pgsql-hackers

From Tom Lane
Subject Re: wip: functions median and percentile
Date
Msg-id 22629.1286805820@sss.pgh.pa.us
Whole thread Raw
In response to Re: wip: functions median and percentile  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: wip: functions median and percentile
List pgsql-hackers
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 10 October 2010 22:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, as far as the implementation issues go, telling tuplesort that it
>> can use gigabytes of memory no matter what seems quite unacceptable.
>> Put this thing into a hash aggregation and you'll blow out your memory
>> in no time. �I don't think it's even a good idea to use work_mem there.

> Argh! Yes that sounds like a much more serious problem.

> Interestingly I couldn't seem to produce this effect. Every effort I
> make to write a query to test this with median ends up being executed
> using a GroupAggregate, while the equivalent query with avg uses a
> HashAggregate. I don't understand why they are being treated
> differently.

If you're using recent sources, there's some code in count_agg_clauses()
that assumes that an aggregate with transtype INTERNAL will use
ALLOCSET_DEFAULT_INITSIZE (ie 8K) workspace.  So that'll discourage the
planner from selecting HashAggregate except for a pretty small number of
groups.  The problem is that there's still a whole lot of daylight
between 8K and 2G, so plenty of room to go wrong.

The other approach that we could take here is to replace the
ALLOCSET_DEFAULT_INITSIZE hack (which is certainly no more than a hack)
with some way for an aggregate to declare how much space it'll eat,
or more simply to mark it as "never use in HashAgg".  This was discussed
earlier but it would require a significant amount of dogwork and no one
was real excited about doing it.  Maybe it's time to bite that bullet
though.  Reflecting on it, I think it'd be best to allow an agg to
provide an estimation function that'd be told the input data type and
expected number of rows --- even on a per-aggregate basis, a constant
estimate just isn't good enough.
        regards, tom lane


pgsql-hackers by date:

Previous
From: "Kevin Grittner"
Date:
Subject: Re: [JDBC] Support for JDBC setQueryTimeout, et al.
Next
From: Tom Lane
Date:
Subject: Re: wip: functions median and percentile