Thread: width_bucket function for timestamps

width_bucket function for timestamps

From
Jeremy Drake
Date:
I just came across this code I wrote about a year ago which implements a
function equivilant to width_bucket for timestamps.

I wrote this when I was trying to plot some data over time, and I had more
points than I needed.  This function allowed me to create a pre-determined
number of "bins" to average the data inside of so that I could get a sane
number of points.  Part of the problem was that there were so many data
points, that a sql implementation of the function (or plpgsql, I forget,
it was a year ago) was painfully slow.  This C function provided much
better performance than any other means at my disposal.

I wanted to share this code since it may be useful for someone else, but I
don't know exactly what to do with it.  So I am putting it out there, and
asking what the proper home for such a function might be.  I believe it
would be generally useful for people, but it is so small that it hardly
seems like a reasonable pgFoundry project.  Maybe there is a home for such
a thing in the core distribution in a future release?

The code can be found at
http://www.jdrake.com/postgresql/bintimestamp.tar.gz for a buildable PGXS
module, or I attached just the C code.  There is no documentation, the
parameters work the same as the width_bucket function.  The code is not
necessarily the most readable in the world, I was trying to get as much
speed out of it as possible, since I was calling it over a million times
as a group by value.

Thanks for any pointers...

-- 
Fortune's Office Door Sign of the Week:
Incorrigible punster -- Do not incorrige.

Re: width_bucket function for timestamps

From
"Jim C. Nasby"
Date:
Sinte we already have width_bucket, I'd argue this should go in core. If
someone's feeling adventurous, there should probably be a double
precision version as well. Hrm... and maybe text...

Doesn't the backend already have something like this for calculating
histograms?

On Sun, Oct 08, 2006 at 10:30:47PM -0700, Jeremy Drake wrote:
> I just came across this code I wrote about a year ago which implements a
> function equivilant to width_bucket for timestamps.
> 
> I wrote this when I was trying to plot some data over time, and I had more
> points than I needed.  This function allowed me to create a pre-determined
> number of "bins" to average the data inside of so that I could get a sane
> number of points.  Part of the problem was that there were so many data
> points, that a sql implementation of the function (or plpgsql, I forget,
> it was a year ago) was painfully slow.  This C function provided much
> better performance than any other means at my disposal.
> 
> I wanted to share this code since it may be useful for someone else, but I
> don't know exactly what to do with it.  So I am putting it out there, and
> asking what the proper home for such a function might be.  I believe it
> would be generally useful for people, but it is so small that it hardly
> seems like a reasonable pgFoundry project.  Maybe there is a home for such
> a thing in the core distribution in a future release?
> 
> The code can be found at
> http://www.jdrake.com/postgresql/bintimestamp.tar.gz for a buildable PGXS
> module, or I attached just the C code.  There is no documentation, the
> parameters work the same as the width_bucket function.  The code is not
> necessarily the most readable in the world, I was trying to get as much
> speed out of it as possible, since I was calling it over a million times
> as a group by value.
> 
> Thanks for any pointers...
> 
> -- 
> Fortune's Office Door Sign of the Week:
> 
>     Incorrigible punster -- Do not incorrige.

> /*****************************************************************************
>  * file:        $RCSfile: bintimestamp.c,v $ $Revision: 1.1 $
>  * module:      timestamp
>  * authors:     jeremyd
>  * last mod:    $Author: jeremyd $ at $Date: 2005/10/28 20:26:38 $
>  * 
>  * created:     Fri Oct 28 13:26:38 PDT 2005
>  * 
>  *****************************************************************************/
> 
> #include <string.h>
> #include <math.h>
> #include "postgres.h"
> 
> #include "fmgr.h"
> #include "libpq/pqformat.h"
> #include "utils/builtins.h"
> #include "funcapi.h"
> #include "utils/timestamp.h"
> 
> #ifndef JROUND
> #    define JROUND(x) (x)
> #endif
> 
> Datum timestamp_get_bin_size(PG_FUNCTION_ARGS);
> Datum timestamp_bin(PG_FUNCTION_ARGS);
> 
> PG_FUNCTION_INFO_V1(timestamp_get_bin_size);
> Datum
> timestamp_get_bin_size(PG_FUNCTION_ARGS)
> {
>     Timestamp start = PG_GETARG_TIMESTAMP(0);
>     Timestamp stop = PG_GETARG_TIMESTAMP(1);
>     int32 nbuckets = PG_GETARG_INT32(2);
>     Interval * retval = (Interval *)palloc (sizeof(Interval));
> 
>     if (!retval)
>     {
>         ereport(ERROR, (errcode(ERRCODE_OUT_OF_MEMORY), errmsg("insufficient memory for Interval allocation")));
>         PG_RETURN_NULL();
>     }
> 
>     memset (retval, 0, sizeof(Interval));
> 
>     retval->time = JROUND ((stop - start) / nbuckets);
> 
>     PG_RETURN_INTERVAL_P(retval);
> }
> 
> PG_FUNCTION_INFO_V1(timestamp_bin);
> Datum
> timestamp_bin(PG_FUNCTION_ARGS)
> {
>     /*Timestamp op = PG_GETARG_TIMESTAMP(0);*/
>     Timestamp start = PG_GETARG_TIMESTAMP(1);
>     /*Timestamp stop = PG_GETARG_TIMESTAMP(2);*/
>     Timestamp binsz;
>     /*int32 nbuckets = PG_GETARG_INT32(3)*/;
> 
>     binsz = (PG_GETARG_TIMESTAMP(2) - start) / PG_GETARG_INT32(3);
> 
>     PG_RETURN_TIMESTAMP(JROUND((int)((PG_GETARG_TIMESTAMP(0) - start) / binsz) * binsz + start));
> }

> 
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly


-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: width_bucket function for timestamps

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> Sinte we already have width_bucket, I'd argue this should go in core. If
> someone's feeling adventurous, there should probably be a double
> precision version as well. Hrm... and maybe text...

It's not clear to me why we have width_bucket operating on numeric and
not float8 --- that seems like an oversight, if not outright
misunderstanding of the type hierarchy.  But if we had the float8
version, I think Jeremy's problem would be solved just by applying
the float8 version to "extract(epoch from timestamp)".  I don't really
see the use-case for putting N versions of the function in there.
        regards, tom lane


Re: width_bucket function for timestamps

From
"Jim C. Nasby"
Date:
On Mon, Oct 09, 2006 at 12:02:12PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > Sinte we already have width_bucket, I'd argue this should go in core. If
> > someone's feeling adventurous, there should probably be a double
> > precision version as well. Hrm... and maybe text...
> 
> It's not clear to me why we have width_bucket operating on numeric and
> not float8 --- that seems like an oversight, if not outright
> misunderstanding of the type hierarchy.  But if we had the float8
> version, I think Jeremy's problem would be solved just by applying
> the float8 version to "extract(epoch from timestamp)".  I don't really
> see the use-case for putting N versions of the function in there.

Well, it would be nice to have a timestamp version so that users didn't
have to keep typing "extract(epoch from timestamp)"... but yeah, I
suspect that would work fine for timestamps. For intervals I suspect you
could just convert to seconds (if we're going to add timestamps, it
seems like we should add intervals as well).
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: width_bucket function for timestamps

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> On Mon, Oct 09, 2006 at 12:02:12PM -0400, Tom Lane wrote:
>> ... I think Jeremy's problem would be solved just by applying
>> the float8 version to "extract(epoch from timestamp)".

Thinko there ... I meant to type "extract(epoch from interval)".

> Well, it would be nice to have a timestamp version so that users didn't
> have to keep typing "extract(epoch from timestamp)"... but yeah, I
> suspect that would work fine for timestamps. For intervals I suspect you
> could just convert to seconds (if we're going to add timestamps, it
> seems like we should add intervals as well).

This is exactly the slippery slope I don't care to start down.
        regards, tom lane


Re: width_bucket function for timestamps

From
"Jim C. Nasby"
Date:
On Mon, Oct 09, 2006 at 01:49:37PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > On Mon, Oct 09, 2006 at 12:02:12PM -0400, Tom Lane wrote:
> >> ... I think Jeremy's problem would be solved just by applying
> >> the float8 version to "extract(epoch from timestamp)".
> 
> Thinko there ... I meant to type "extract(epoch from interval)".

Except that the patch is for timestamp support, not intervals.

> > Well, it would be nice to have a timestamp version so that users didn't
> > have to keep typing "extract(epoch from timestamp)"... but yeah, I
> > suspect that would work fine for timestamps. For intervals I suspect you
> > could just convert to seconds (if we're going to add timestamps, it
> > seems like we should add intervals as well).
> 
> This is exactly the slippery slope I don't care to start down.

I guess I'm confused as to how this is any different from other
functions where we've provided multiple input arguments, such as the
aggregate functions. It certainly doesn't seem like it'd take a lot of
extra code to support this...
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: width_bucket function for timestamps

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> On Mon, Oct 09, 2006 at 01:49:37PM -0400, Tom Lane wrote:
>> This is exactly the slippery slope I don't care to start down.

> I guess I'm confused as to how this is any different from other
> functions where we've provided multiple input arguments, such as the
> aggregate functions.

The salient reason is that the spec only defines width_bucket for numeric
input arguments, whereas stuff like max/min is defined *by the spec* for
other data types.

Since there's no spec-based argument for allowing width_bucket for other
datatypes, and only an (IMHO) very weak use-case for it, I don't think
we should add the clutter.
        regards, tom lane


Re: width_bucket function for timestamps

From
Jeremy Drake
Date:
On Mon, 9 Oct 2006, Tom Lane wrote:

> It's not clear to me why we have width_bucket operating on numeric and
> not float8 --- that seems like an oversight, if not outright
> misunderstanding of the type hierarchy.

Would that make the below a lot faster?

> But if we had the float8
> version, I think Jeremy's problem would be solved just by applying
> the float8 version to "extract(epoch from timestamp)".  I don't really
> see the use-case for putting N versions of the function in there.

I found the function I used before I implemented the C version.  It was
significantly slower, which is why I wrote the C version.

-- given a date range and a number of buckets, round the given date to one
-- of the buckets such that any number of dates within the date range passed
-- in to this function will only return up to the number of buckets unique
-- values
CREATE OR REPLACE FUNCTION date_width_bucket       (tm TIMESTAMP WITHOUT TIME ZONE,       low TIMESTAMP WITHOUT TIME
ZONE,      high TIMESTAMP WITHOUT TIME ZONE,       nbuckets INTEGER
 
) RETURNS TIMESTAMP WITHOUT TIME ZONE AS $$       SELECT ((EXTRACT(epoch FROM $3) - EXTRACT(epoch FROM $2)) / $4) *
         (width_bucket(EXTRACT(epoch FROM $1)::NUMERIC,                       EXTRACT(epoch FROM $2)::NUMERIC,
            EXTRACT(epoch FROM $3)::NUMERIC,                       $4)               - 1) * '1 second'::INTERVAL + $2;
 
$$ LANGUAGE sql IMMUTABLE STRICT;


-- 
I don't think they could put him in a mental hospital.  On the other
hand, if he were already in, I don't think they'd let him out.


Re: width_bucket function for timestamps

From
"Jim C. Nasby"
Date:
On Mon, Oct 09, 2006 at 03:49:50PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > On Mon, Oct 09, 2006 at 01:49:37PM -0400, Tom Lane wrote:
> >> This is exactly the slippery slope I don't care to start down.
> 
> > I guess I'm confused as to how this is any different from other
> > functions where we've provided multiple input arguments, such as the
> > aggregate functions.
> 
> The salient reason is that the spec only defines width_bucket for numeric
> input arguments, whereas stuff like max/min is defined *by the spec* for
> other data types.
> 
> Since there's no spec-based argument for allowing width_bucket for other
> datatypes, and only an (IMHO) very weak use-case for it, I don't think
> we should add the clutter.

Catalog or code clutter? ISTM that it wouldn't take much extra work at
all to provide this for timestamps or intervals...

In any case, having a faster version that used double certainly seems
like it'd be useful. It'd probably allow the OP to go back to his
original, simple version.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: width_bucket function for timestamps

From
Tom Lane
Date:
Jeremy Drake <pgsql@jdrake.com> writes:
> I found the function I used before I implemented the C version.  It was
> significantly slower, which is why I wrote the C version.

I would imagine that most of the problem is the NUMERIC arithmetic
that's doing.
        regards, tom lane


Re: width_bucket function for timestamps

From
Neil Conway
Date:
On Mon, 2006-10-09 at 12:02 -0400, Tom Lane wrote:
> It's not clear to me why we have width_bucket operating on numeric and
> not float8

I asked about this when I originally implemented width_bucket(), I
recall[1]. At the time, there was scepticism about whether it was even
worth implementing width_bucket(), let alone providing multiple
implementations of it. I'd be happy to provide a float8 implementation
(I may even have one lying around somewhere...)

-Neil

[1] http://archives.postgresql.org/pgsql-patches/2004-04/msg00259.php



Re: width_bucket function for timestamps

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> On Mon, 2006-10-09 at 12:02 -0400, Tom Lane wrote:
>> It's not clear to me why we have width_bucket operating on numeric and
>> not float8

> I asked about this when I originally implemented width_bucket(), I
> recall[1]. At the time, there was scepticism about whether it was even
> worth implementing width_bucket(), let alone providing multiple
> implementations of it. I'd be happy to provide a float8 implementation
> (I may even have one lying around somewhere...)

It's probably too late for 8.2, unless some other compelling reason to
force an initdb surfaces.  But if you can find the code, please stick it
in when 8.3 devel starts.  As it stands, one must do an explicit
coercion to numeric to use width_bucket() with floats; which is tedious,
slow, and nowhere required by the spec.
        regards, tom lane