Thread: Workaround for custom aggregate which would need "internal" as statetype

From: "Florian G. Pflug"

Hi,

I'm trying to write an aggregate collect_distinct(int8) which puts all
distinct values into an array. My first try was defining an aggregate
"collect" using array_append, and doing "select collect(distinct <field>) ..",
but this is quite slow - probably because distinct sorts the values instead
of using a hash to filter out duplicates.

Using Perl and a Perl hash was even slower, so I wrote two C functions
(actually C++), which use an STL hash_set to filter out duplicates.
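
As a rough illustration of the hash-based approach described above, here is a minimal, self-contained sketch. It is not the code from this post: std::unordered_set stands in for the pre-standard STL hash_set, and the names CollectState, collect_distinct_step and collect_distinct_final are hypothetical. It shows only the duplicate filtering, not the PostgreSQL function-call glue.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the aggregate's state: the set of values
// seen so far. std::unordered_set replaces the pre-standard hash_set.
struct CollectState {
    std::unordered_set<std::int64_t> seen;
};

// Transition step: record one input value. Inserting a value that is
// already present leaves the set unchanged, which filters duplicates.
inline void collect_distinct_step(CollectState &state, std::int64_t value) {
    state.seen.insert(value);
}

// Final step: copy the distinct values out. The order is unspecified,
// because a hash set does not keep its elements sorted.
inline std::vector<std::int64_t> collect_distinct_final(const CollectState &state) {
    return std::vector<std::int64_t>(state.seen.begin(), state.seen.end());
}
```

Each insert is expected constant time on average, so no sort of the input is needed.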

I initially defined my state transition function as
"collect_distinct(internal, int8) returns internal". The parameter marked internal was
a pointer to an STL hash_set. But using this to define an aggregate failed, because
internal seems to be forbidden as a state type.

I now resorted to a crude hack to get this running - I changed "internal" to
"int4", and just cast my pointer to an int4 before returning it, and after receiving
it as an argument. This at least enabled me to do some benchmarking, and performance-wise
things look good...
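
One reason the cast is fragile, beyond being a hack: a 4-byte integer cannot hold a pointer on platforms where pointers are 8 bytes wide. The sketch below only illustrates that width problem and does nothing about the state-type restriction itself. The function names are hypothetical.

```cpp
#include <cstdint>

// Round-trip a pointer through an integer type that is wide enough to
// hold it. std::uintptr_t exists for exactly this purpose.
inline std::uintptr_t pointer_to_integer(void *p) {
    return reinterpret_cast<std::uintptr_t>(p);
}

inline void *integer_to_pointer(std::uintptr_t v) {
    return reinterpret_cast<void *>(v);
}

// True when a 4-byte integer (such as int4) is narrower than a pointer
// on this platform, in which case the cast described above would
// discard the upper bits of the address.
constexpr bool int4_truncates_pointers() {
    return sizeof(void *) > sizeof(std::int32_t);
}
```

On a platform where int4_truncates_pointers() is true, the pointer recovered from an int4 may not be the pointer that was stored.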

Before using this on a production system, I need to get rid of that hack, but I don't
see how this could be done ATM... Maybe someone here could give me a hint how this could
work...

greetings, Florian Pflug

"Florian G. Pflug" <fgp@phlo.org> writes:
> Using Perl and a Perl hash was even slower, so I wrote two C functions
> (actually C++), which use an STL hash_set to filter out duplicates.

This makes me fairly nervous, because what's going to ensure that the
memory used by the hash_set is reclaimed?  Particularly if the query
errors out partway through?

            regards, tom lane

Re: Workaround for custom aggregate which would need "internal"

From: "Florian G. Pflug"

Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>
>>Using Perl and a Perl hash was even slower, so I wrote two C functions
>>(actually C++), which use an STL hash_set to filter out duplicates.
>
> This makes me fairly nervous, because what's going to ensure that the
> memory used by the hash_set is reclaimed?  Particularly if the query
> errors out partway through?

hash_set can be told to use a user-defined allocator class, which in turn
can use palloc/pfree, with an appropriate memory context. I'm not
really sure what the "appropriate context" is, as using CurrentMemoryContext
leads to strange crashes. For now, I'm using the standard C++ allocator,
because I figured it should make debugging easier.
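
To make the allocator idea concrete, here is a minimal sketch of a C++11-style allocator plugged into a hash set. It is an assumption-laden illustration, not working extension code: fake_palloc and fake_pfree are hypothetical stand-ins for palloc/pfree so that the example compiles without the PostgreSQL headers, and std::unordered_set stands in for hash_set.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <new>
#include <unordered_set>

// Hypothetical stand-ins for palloc/pfree. They wrap malloc/free and
// count outstanding allocations so the behaviour can be observed.
static std::size_t live_allocations = 0;

inline void *fake_palloc(std::size_t size) {
    void *p = std::malloc(size ? size : 1);
    if (p == nullptr) throw std::bad_alloc();
    ++live_allocations;
    return p;
}

inline void fake_pfree(void *p) {
    --live_allocations;
    std::free(p);
}

// Minimal allocator that routes every request through the stand-ins.
template <typename T>
struct PallocAllocator {
    using value_type = T;

    PallocAllocator() = default;
    template <typename U>
    PallocAllocator(const PallocAllocator<U> &) noexcept {}

    T *allocate(std::size_t n) {
        return static_cast<T *>(fake_palloc(n * sizeof(T)));
    }
    void deallocate(T *p, std::size_t) noexcept { fake_pfree(p); }
};

// The allocator is stateless, so all instances compare equal.
template <typename T, typename U>
bool operator==(const PallocAllocator<T> &, const PallocAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const PallocAllocator<T> &, const PallocAllocator<U> &) { return false; }

// A set of int8 values whose nodes and buckets come from the allocator.
using Int8Set = std::unordered_set<std::int64_t,
                                   std::hash<std::int64_t>,
                                   std::equal_to<std::int64_t>,
                                   PallocAllocator<std::int64_t>>;
```

A real version would have to allocate from a memory context that outlives a single transition call. It would also have to account for the fact that PostgreSQL's error handling is longjmp-based, so C++ destructors are not run when a query errors out.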

Still, the question remains how I can sanely use a C++ object as the "state" of
an aggregate...

greetings, Florian Pflug



"Florian G. Pflug" <fgp@phlo.org> writes:
> hash_set can be told to use a user-defined allocator class, which in turn
> can use palloc/pfree, with an appropriate memory context. I'm not
> really sure what the "appropriate context" is, as using CurrentMemoryContext
> leads to strange crashes. For now, I'm using the standard C++ allocator,
> because I figured it should make debugging easier.

Yeah, the assumption is that anything allocated in CurrentMemoryContext
other than the actual return value is just memory leakage, and it'll
automatically get thrown away.  You could probably use
aggstate->aggcontext, which is accessible to aggregate functions since
PG 8.1 (see the comments at the head of nodeAgg.c).

            regards, tom lane
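
The lifetime difference described in the last reply can be pictured with a toy model. This is deliberately not the PostgreSQL API: ToyContext and run_toy_aggregate are hypothetical names, and the sketch assumes only what the reply states, namely that memory taken from the per-call context is thrown away automatically while the aggregate's own context persists across calls.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Toy model of a memory context: it owns every block allocated from it
// and releases them all on reset(). This is NOT the PostgreSQL API.
class ToyContext {
public:
    std::int64_t *alloc_int8(std::int64_t value) {
        blocks_.push_back(std::unique_ptr<std::int64_t>(new std::int64_t(value)));
        return blocks_.back().get();
    }
    void reset() { blocks_.clear(); }
    std::size_t live_blocks() const { return blocks_.size(); }

private:
    std::vector<std::unique_ptr<std::int64_t>> blocks_;
};

// Simulates a loop that drives an aggregate over `inputs`. The per-call
// context is reset after every transition call, while the aggregate
// context keeps its blocks until the caller resets it.
inline std::size_t run_toy_aggregate(ToyContext &per_call,
                                     ToyContext &agg_context,
                                     const std::vector<std::int64_t> &inputs) {
    for (std::int64_t v : inputs) {
        per_call.alloc_int8(v);     // scratch data: gone after this call
        agg_context.alloc_int8(v);  // state data: kept across calls
        per_call.reset();
    }
    return agg_context.live_blocks();
}
```

In this model, a state pointer taken from per_call would dangle as soon as the call returns, which matches the kind of crash reported earlier in the thread. A pointer taken from agg_context stays valid until the aggregate is finished.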