Re: Built-in binning functions - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Built-in binning functions
Date
Msg-id 17571.1409517238@sss.pgh.pa.us
Whole thread Raw
In response to Re: Built-in binning functions  (Petr Jelinek <petr@2ndquadrant.com>)
Responses Re: Built-in binning functions
List pgsql-hackers
Petr Jelinek <petr@2ndquadrant.com> writes:
> On 30/08/14 19:24, Tom Lane wrote:
>> I wasn't terribly happy about that either.  I still think we should
>> reduce this to a single polymorphic function, as in the attached.

> I did try to write generic function very similar to what you wrote but 
> discarded it because of the performance reasons. If we really want to 
> support any data type I am all for having generic function but I still 
> would like to have one optimized for float8 because that seems to be the 
> most used use-case (at least that one is the reason why I even wrote 
> this) for performance reasons.

Well, part of the reason why your v3 float8 function looks so fast is that
it's cheating: it will not produce the right answers for comparisons
involving NaNs.  I'm not sure how good it would look once you'd added
some isnan() tests to make the behavior actually equivalent to
float8_cmp_internal().

However, assuming it still seems worthwhile once that's accounted for,
I don't have a strong objection to having an additional code path for
float8.  There are two ways we could do that:

1. Add a runtime test in the polymorphic function, eg
if (element_type == FLOAT8OID)    result = width_bucket_float8(...);else if (typentry->typlen > 0)    result =
width_bucket_fixed(...);else   result = width_bucket_variable(...);
 

2. Define a SQL function width_bucket(float8, float8[]) alongside
the polymorphic one.

While #2 might be marginally faster at runtime, the main difference
between these is what the parser would choose to do with ambiguous cases.

I experimented a bit and it seemed that the parser would prefer the
float8 function in any situation where the inputs were of numeric types,
probably because float8 is the preferred type in the numeric category.
I'm not real sure that would be a good thing: in particular it would
cast integer and numeric inputs to float8, which risks precision loss,
and there doesn't seem to be any way to get the parser to make the
other choice if the inputs are of numeric-category types.

As against that objection, it would make cross-type numeric cases easier
to use, for example "width_bucket(1, array[2.4])" would be accepted.
If we just offer the anyelement/anyarray version then the parser would
make you cast "1" to numeric before it would consider the function to
match.

On balance the runtime-test approach looks like a better idea, in that
it doesn't risk any unexpected semantic behaviors.

> The difference between my generic and Tom's generic is because Tom's is 
> slowed down by the deconstruct_array.

Meh.  It looked to me like your version would have O(N^2) performance
problems from computing array offsets repeatedly, depending on exactly
which array element it ended up on.  It would avoid repeat calculations
only if it always moved right.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: On partitioning
Next
From: Hannu Krosing
Date:
Subject: Re: On partitioning