Re: wip: functions median and percentile - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: wip: functions median and percentile
Date
Msg-id AANLkTikFUa_82B6ybTWVKELw33reeyRT_4jdX+8Fn3n7@mail.gmail.com
In response to Re: wip: functions median and percentile  (Hitoshi Harada <umi.tanuki@gmail.com>)
Responses Re: wip: functions median and percentile
List pgsql-hackers
2010/9/23 Hitoshi Harada <umi.tanuki@gmail.com>:
> 2010/9/23 Pavel Stehule <pavel.stehule@gmail.com>:
>> Hello
>>
>> 2010/9/22 Hitoshi Harada <umi.tanuki@gmail.com>:
>>> 2010/9/22 Pavel Stehule <pavel.stehule@gmail.com>:
>>>> Hello
>>>>
>>>> I found what is probably a hard problem in cooperation with window functions :(
>>
>> Maybe I was confused. I found another possible problem.
>>
>> The problem with the median function is probably inside the final
>> function implementation. We actually require that the final function
>> can be called repeatedly. But the final function calls
>> tuplesort_performsort and tuplesort_end, and these functions change the
>> state of the tuplesort. The most basic question is: who has to call
>> tuplesort_end, and when?
>
> Reading the comment in array_userfuncs.c, array_agg_finalfn() doesn't
> clean up its internal state at all and says it's the executor's
> responsibility to free the memory. That is allowed because
> ArrayBuildState is purely in-memory state. On the other hand, TupleSort
> should be cleaned up by calling tuplesort_end() if it has a tapeset
> member (a file-based sort), in order to close the physical files.
>
> So 2 or 3 ways to go in my mind:

It is a little bit worse: we cannot call tuplesort_performsort repeatedly.

>
> 1. call tuplesort_begin_datum with INT_MAX workMem rather than the
> global work_mem, so that it does not spill the sort state out to files.
> It may sound dangerous, but memory exhaustion can actually happen in
> array_agg() as well.
>
> 2. add an argument to TupleSort that tells it not to use files at all.
> This has the same result as #1 but is a more generic approach.
>
> 3. don't use tuplesort in median() but implement its own sort
> management. This looks quite redundant and like a maintenance problem.
>
> #2 sounds like the best, being generic and consistent. The only question
> is whether the change is worth it just to implement median(), since
> tuplesort is a very common, system-wide fundamental.
>
> Other options?

#4 block median under a window clause

#5 use a C array instead of tuplesort under a window clause. It is very
impractical to use window clauses with large datasets, so it should
not be a problem. Moreover, this can be very quick, because for a C array
we can use the qsort function.

For now I prefer #5: it can be fast when used inside a window clause and
safe when a window clause is not used.
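
A rough sketch of what #5 could look like, written as standalone C rather than against the backend. The names MedianState, median_accum, median_final and median_free are hypothetical and do not come from the patch, and the values are assumed to be doubles. A real implementation would presumably allocate with palloc/repalloc in the aggregate's memory context instead of realloc. The point is only that the final step sorts the array in place and leaves the state usable, so it can be called again after more rows arrive, which is what a window clause needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical in-memory aggregate state: a growable C array of doubles. */
typedef struct MedianState
{
    double *vals;   /* collected values */
    size_t  n;      /* number of values stored */
    size_t  cap;    /* allocated capacity */
} MedianState;

/* qsort comparator for doubles (NaN handling is ignored in this sketch). */
static int
cmp_double(const void *a, const void *b)
{
    double x = *(const double *) a;
    double y = *(const double *) b;

    return (x > y) - (x < y);
}

/* Transition step: append one value, doubling the array when it is full. */
static bool
median_accum(MedianState *st, double v)
{
    if (st->n == st->cap)
    {
        size_t  newcap = st->cap ? st->cap * 2 : 8;
        double *p = realloc(st->vals, newcap * sizeof(double));

        if (p == NULL)
            return false;
        st->vals = p;
        st->cap = newcap;
    }
    st->vals[st->n++] = v;
    return true;
}

/*
 * Final step: sort in place and read the middle element(s).
 * Nothing is released here, so this can be called repeatedly,
 * also after further median_accum calls. Returns false when empty.
 */
static bool
median_final(MedianState *st, double *result)
{
    size_t mid;

    assert(st != NULL && result != NULL);
    if (st->n == 0)
        return false;

    qsort(st->vals, st->n, sizeof(double), cmp_double);
    mid = st->n / 2;
    *result = (st->n % 2) ? st->vals[mid]
                          : (st->vals[mid - 1] + st->vals[mid]) / 2.0;
    return true;
}

/* Cleanup is a separate step, done once by whoever owns the state. */
static void
median_free(MedianState *st)
{
    free(st->vals);
    st->vals = NULL;
    st->n = st->cap = 0;
}
```

Re-sorting on every final call is wasteful for a growing frame, but for the small datasets expected under a window clause that should be acceptable.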

Regards

Pavel
>
>
> Regards,
> --
> Hitoshi Harada
>

