Re: Odd out of memory problem. - Mailing list pgsql-hackers

From Hitoshi Harada
Subject Re: Odd out of memory problem.
Date
Msg-id CAP7QgmmiB9XtnKv1v3FFhbum+Gy_rCmEo6gsoXCKFL3dFgVtmw@mail.gmail.com
In response to Re: Odd out of memory problem.  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Odd out of memory problem.  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Mon, Mar 26, 2012 at 5:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> On Mon, Mar 26, 2012 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Could you give us a brain dump on the sketch?  I've never seen how to
>>> do it without unreasonable overhead.
>
>> Hm. So my original plan was dependent on adding the state-merge
>> function we've talked about in the past. Not all aggregate functions
>> necessarily can support such a function but I think all or nearly all
>> the builtin aggregates can. Certainly min,max, count, sum, avg,
>> stddev, array_agg can which are most of what people do. That would be
>> a function which can take two state variables and produce a new state
>> variable.
>
> I'd rather not invent new requirements for aggregate implementations
> if we can avoid it.
>
>> However now that I've started thinking about it further I think you
>> could solve it with less complexity by cheating in various ways. For
>> example if you limit the hash size to 1/2 of work_mem then when
>> you reach that limit you could just stuff any tuple that doesn't match
>> a hash entry into a tuplesort with 1/2 of work_mem and do the regular
>> level break logic on the output of that.
>
> Or just start dumping such tuples into a tuplestore, while continuing to
> process tuples that match the hashagg entries that are already in
> existence.  Once the input is exhausted, read out the hashagg entries we
> have, flush the hashagg table, start reading from the tuplestore.
> Repeat as needed.
>
> I like this idea because the only thing you give up is predictability of
> the order of output of aggregated entries, which is something that a
> hashagg isn't guaranteeing anyway.  In particular, we would still have a
> guarantee that any one aggregate evaluation processes the matching
> tuples in arrival order, which is critical for some aggregates.
>
> The main problem I can see is that if we start to flush after work_mem
> is X% full, we're essentially hoping that the state values for the
> existing aggregates won't grow by more than 1-X%, which is safe for many
> common aggregates but fails for some like array_agg().  Ultimately, for
> ones like that, it'd probably be best to never consider hashing at all.
> I guess we could invent an "unsafe for hash aggregation" flag for
> aggregates that have unbounded state-size requirements.
>

According to what I've learned in the last couple of months, array_agg
is not ready for fallback strategies like dumping to a tuplestore, and
even a state-merge function would not work for it.  The problem is that
the executor doesn't know how to serialize/deserialize transition
values of internal type.  So in one implementation, the existence of a
merge function is used as the flag for switching back to sort grouping
instead of hash aggregation, and array_agg is one such aggregate.  That
said, if you invent a new flag to mark an aggregate as not dump-ready,
it would be worth adding a state-merge function to the aggregate
infrastructure anyway.
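
For illustration, for an avg-style aggregate the state-merge function
Greg describes above would just add the two partial states together.
A toy standalone sketch (all names made up, fmgr calling convention
ignored):

#include <stdint.h>
#include <stdio.h>

/* Transition state for an avg-style aggregate: (count, sum). */
typedef struct AvgState
{
    int64_t count;
    double  sum;
} AvgState;

/* Combine two partial states: the merged state is simply the sum of
 * the counts and the sum of the sums. */
static AvgState
avg_merge_states(AvgState a, AvgState b)
{
    AvgState merged;

    merged.count = a.count + b.count;
    merged.sum = a.sum + b.sum;
    return merged;
}

int main(void)
{
    AvgState s1 = {3, 30.0};   /* partial state from one batch of rows */
    AvgState s2 = {2, 12.0};   /* partial state from another batch */
    AvgState m = avg_merge_states(s1, s2);

    printf("avg = %f\n", m.sum / m.count);   /* 42.0 / 5 = 8.4 */
    return 0;
}

For array_agg there is no such trivial combination, because its
transition value is an in-memory structure the executor cannot
serialize.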

So a way I can imagine that needs neither a state-merge function nor
dumping to a tuplestore would be to sort the hash table contents and
the rest of the input so that we can switch to sort grouping.  Since
the hash table is in memory, we can certainly sort its entries there,
and we can spill to disk everything that arrives after the fallback
point and read it back after the scan finishes.  Once we have sorted
state values and sorted input, we can continue with the rest of the
work.
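
To make that concrete, here is a toy standalone sketch of the idea.
It is not PostgreSQL code; all names and numbers are invented, and the
aggregate is a simple per-key sum:

#include <stdio.h>
#include <stdlib.h>

typedef struct Group { int key; long sum; } Group;  /* partial state per key */
typedef struct Tuple { int key; int val; } Tuple;   /* spilled raw input */

static int group_cmp(const void *a, const void *b)
{
    return ((const Group *) a)->key - ((const Group *) b)->key;
}

static int tuple_cmp(const void *a, const void *b)
{
    return ((const Tuple *) a)->key - ((const Tuple *) b)->key;
}

int main(void)
{
    /* groups that were in the hash table when we hit the memory limit */
    Group groups[] = {{3, 10}, {1, 5}, {2, 7}};
    /* input that arrived after the fallback and was spilled to disk */
    Tuple spilled[] = {{2, 1}, {3, 4}, {4, 9}, {1, 2}, {4, 1}};
    int ngroups = 3, nspilled = 5;
    int g = 0, i = 0;

    qsort(groups, ngroups, sizeof(Group), group_cmp);    /* in-memory sort */
    qsort(spilled, nspilled, sizeof(Tuple), tuple_cmp);  /* "external" sort */

    /* merge the sorted partial states with the sorted spilled input,
     * emitting one finished group per distinct key, in key order */
    while (g < ngroups || i < nspilled)
    {
        int key;
        long sum = 0;

        if (i >= nspilled || (g < ngroups && groups[g].key <= spilled[i].key))
            key = groups[g].key;
        else
            key = spilled[i].key;

        if (g < ngroups && groups[g].key == key)
            sum += groups[g++].sum;
        while (i < nspilled && spilled[i].key == key)
            sum += spilled[i++].val;

        printf("key=%d sum=%ld\n", key, sum);
    }
    return 0;
}

The point is just that once both the in-memory partial states and the
spilled input are ordered on the grouping key, finishing the job is the
ordinary sorted-grouping merge.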

Anyway, the memory usage problem is not limited to array_agg and hash
aggregation.  Even if you can tell that the hash table exceeds X% of
work_mem, how can you tell that other operators such as Sort are not
using the rest of the memory?  One approach I can see to avoid this is
to assign each operator an arbitrary amount of memory out of work_mem
and account for it locally, but that approach will occasionally skew
and is not perfect either.
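
As a crude illustration of how a static split skews (numbers entirely
made up):

#include <stdio.h>

int main(void)
{
    int work_mem_kb = 65536;             /* work_mem = 64MB, in kB */
    int nodes = 2;                       /* say one HashAgg and one Sort */
    int budget_kb = work_mem_kb / nodes; /* static slice: 32MB per node */

    int hashagg_need_kb = 51200;         /* made up: HashAgg wants 50MB */
    int sort_need_kb = 5120;             /* made up: Sort wants 5MB */

    /* The static split makes the HashAgg spill even though the total
     * demand (55MB) would have fit within the overall work_mem budget. */
    printf("HashAgg %s (needs %d kB, slice is %d kB)\n",
           hashagg_need_kb > budget_kb ? "spills" : "fits",
           hashagg_need_kb, budget_kb);
    printf("Sort    %s (needs %d kB, slice is %d kB)\n",
           sort_need_kb > budget_kb ? "spills" : "fits",
           sort_need_kb, budget_kb);
    return 0;
}

With a 32MB/32MB split the HashAgg spills even though the two nodes
together would have fit in 64MB.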


Thanks,
--
Hitoshi Harada

