Thread: width_bucket function for timestamps
I just came across this code I wrote about a year ago which implements a function equivilant to width_bucket for timestamps. I wrote this when I was trying to plot some data over time, and I had more points than I needed. This function allowed me to create a pre-determined number of "bins" to average the data inside of so that I could get a sane number of points. Part of the problem was that there were so many data points, that a sql implementation of the function (or plpgsql, I forget, it was a year ago) was painfully slow. This C function provided much better performance than any other means at my disposal. I wanted to share this code since it may be useful for someone else, but I don't know exactly what to do with it. So I am putting it out there, and asking what the proper home for such a function might be. I believe it would be generally useful for people, but it is so small that it hardly seems like a reasonable pgFoundry project. Maybe there is a home for such a thing in the core distribution in a future release? The code can be found at http://www.jdrake.com/postgresql/bintimestamp.tar.gz for a buildable PGXS module, or I attached just the C code. There is no documentation, the parameters work the same as the width_bucket function. The code is not necessarily the most readable in the world, I was trying to get as much speed out of it as possible, since I was calling it over a million times as a group by value. Thanks for any pointers... -- Fortune's Office Door Sign of the Week: Incorrigible punster -- Do not incorrige.
Sinte we already have width_bucket, I'd argue this should go in core. If someone's feeling adventurous, there should probably be a double precision version as well. Hrm... and maybe text... Doesn't the backend already have something like this for calculating histograms? On Sun, Oct 08, 2006 at 10:30:47PM -0700, Jeremy Drake wrote: > I just came across this code I wrote about a year ago which implements a > function equivilant to width_bucket for timestamps. > > I wrote this when I was trying to plot some data over time, and I had more > points than I needed. This function allowed me to create a pre-determined > number of "bins" to average the data inside of so that I could get a sane > number of points. Part of the problem was that there were so many data > points, that a sql implementation of the function (or plpgsql, I forget, > it was a year ago) was painfully slow. This C function provided much > better performance than any other means at my disposal. > > I wanted to share this code since it may be useful for someone else, but I > don't know exactly what to do with it. So I am putting it out there, and > asking what the proper home for such a function might be. I believe it > would be generally useful for people, but it is so small that it hardly > seems like a reasonable pgFoundry project. Maybe there is a home for such > a thing in the core distribution in a future release? > > The code can be found at > http://www.jdrake.com/postgresql/bintimestamp.tar.gz for a buildable PGXS > module, or I attached just the C code. There is no documentation, the > parameters work the same as the width_bucket function. The code is not > necessarily the most readable in the world, I was trying to get as much > speed out of it as possible, since I was calling it over a million times > as a group by value. > > Thanks for any pointers... > > -- > Fortune's Office Door Sign of the Week: > > Incorrigible punster -- Do not incorrige. > /***************************************************************************** > * file: $RCSfile: bintimestamp.c,v $ $Revision: 1.1 $ > * module: timestamp > * authors: jeremyd > * last mod: $Author: jeremyd $ at $Date: 2005/10/28 20:26:38 $ > * > * created: Fri Oct 28 13:26:38 PDT 2005 > * > *****************************************************************************/ > > #include <string.h> > #include <math.h> > #include "postgres.h" > > #include "fmgr.h" > #include "libpq/pqformat.h" > #include "utils/builtins.h" > #include "funcapi.h" > #include "utils/timestamp.h" > > #ifndef JROUND > # define JROUND(x) (x) > #endif > > Datum timestamp_get_bin_size(PG_FUNCTION_ARGS); > Datum timestamp_bin(PG_FUNCTION_ARGS); > > PG_FUNCTION_INFO_V1(timestamp_get_bin_size); > Datum > timestamp_get_bin_size(PG_FUNCTION_ARGS) > { > Timestamp start = PG_GETARG_TIMESTAMP(0); > Timestamp stop = PG_GETARG_TIMESTAMP(1); > int32 nbuckets = PG_GETARG_INT32(2); > Interval * retval = (Interval *)palloc (sizeof(Interval)); > > if (!retval) > { > ereport(ERROR, (errcode(ERRCODE_OUT_OF_MEMORY), errmsg("insufficient memory for Interval allocation"))); > PG_RETURN_NULL(); > } > > memset (retval, 0, sizeof(Interval)); > > retval->time = JROUND ((stop - start) / nbuckets); > > PG_RETURN_INTERVAL_P(retval); > } > > PG_FUNCTION_INFO_V1(timestamp_bin); > Datum > timestamp_bin(PG_FUNCTION_ARGS) > { > /*Timestamp op = PG_GETARG_TIMESTAMP(0);*/ > Timestamp start = PG_GETARG_TIMESTAMP(1); > /*Timestamp stop = PG_GETARG_TIMESTAMP(2);*/ > Timestamp binsz; > /*int32 nbuckets = PG_GETARG_INT32(3)*/; > > binsz = (PG_GETARG_TIMESTAMP(2) - start) / PG_GETARG_INT32(3); > > PG_RETURN_TIMESTAMP(JROUND((int)((PG_GETARG_TIMESTAMP(0) - start) / binsz) * binsz + start)); > } > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
"Jim C. Nasby" <jim@nasby.net> writes: > Sinte we already have width_bucket, I'd argue this should go in core. If > someone's feeling adventurous, there should probably be a double > precision version as well. Hrm... and maybe text... It's not clear to me why we have width_bucket operating on numeric and not float8 --- that seems like an oversight, if not outright misunderstanding of the type hierarchy. But if we had the float8 version, I think Jeremy's problem would be solved just by applying the float8 version to "extract(epoch from timestamp)". I don't really see the use-case for putting N versions of the function in there. regards, tom lane
On Mon, Oct 09, 2006 at 12:02:12PM -0400, Tom Lane wrote: > "Jim C. Nasby" <jim@nasby.net> writes: > > Sinte we already have width_bucket, I'd argue this should go in core. If > > someone's feeling adventurous, there should probably be a double > > precision version as well. Hrm... and maybe text... > > It's not clear to me why we have width_bucket operating on numeric and > not float8 --- that seems like an oversight, if not outright > misunderstanding of the type hierarchy. But if we had the float8 > version, I think Jeremy's problem would be solved just by applying > the float8 version to "extract(epoch from timestamp)". I don't really > see the use-case for putting N versions of the function in there. Well, it would be nice to have a timestamp version so that users didn't have to keep typing "extract(epoch from timestamp)"... but yeah, I suspect that would work fine for timestamps. For intervals I suspect you could just convert to seconds (if we're going to add timestamps, it seems like we should add intervals as well). -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
"Jim C. Nasby" <jim@nasby.net> writes: > On Mon, Oct 09, 2006 at 12:02:12PM -0400, Tom Lane wrote: >> ... I think Jeremy's problem would be solved just by applying >> the float8 version to "extract(epoch from timestamp)". Thinko there ... I meant to type "extract(epoch from interval)". > Well, it would be nice to have a timestamp version so that users didn't > have to keep typing "extract(epoch from timestamp)"... but yeah, I > suspect that would work fine for timestamps. For intervals I suspect you > could just convert to seconds (if we're going to add timestamps, it > seems like we should add intervals as well). This is exactly the slippery slope I don't care to start down. regards, tom lane
On Mon, Oct 09, 2006 at 01:49:37PM -0400, Tom Lane wrote: > "Jim C. Nasby" <jim@nasby.net> writes: > > On Mon, Oct 09, 2006 at 12:02:12PM -0400, Tom Lane wrote: > >> ... I think Jeremy's problem would be solved just by applying > >> the float8 version to "extract(epoch from timestamp)". > > Thinko there ... I meant to type "extract(epoch from interval)". Except that the patch is for timestamp support, not intervals. > > Well, it would be nice to have a timestamp version so that users didn't > > have to keep typing "extract(epoch from timestamp)"... but yeah, I > > suspect that would work fine for timestamps. For intervals I suspect you > > could just convert to seconds (if we're going to add timestamps, it > > seems like we should add intervals as well). > > This is exactly the slippery slope I don't care to start down. I guess I'm confused as to how this is any different from other functions where we've provided multiple input arguments, such as the aggregate functions. It certainly doesn't seem like it'd take a lot of extra code to support this... -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
"Jim C. Nasby" <jim@nasby.net> writes: > On Mon, Oct 09, 2006 at 01:49:37PM -0400, Tom Lane wrote: >> This is exactly the slippery slope I don't care to start down. > I guess I'm confused as to how this is any different from other > functions where we've provided multiple input arguments, such as the > aggregate functions. The salient reason is that the spec only defines width_bucket for numeric input arguments, whereas stuff like max/min is defined *by the spec* for other data types. Since there's no spec-based argument for allowing width_bucket for other datatypes, and only an (IMHO) very weak use-case for it, I don't think we should add the clutter. regards, tom lane
On Mon, 9 Oct 2006, Tom Lane wrote: > It's not clear to me why we have width_bucket operating on numeric and > not float8 --- that seems like an oversight, if not outright > misunderstanding of the type hierarchy. Would that make the below a lot faster? > But if we had the float8 > version, I think Jeremy's problem would be solved just by applying > the float8 version to "extract(epoch from timestamp)". I don't really > see the use-case for putting N versions of the function in there. I found the function I used before I implemented the C version. It was significantly slower, which is why I wrote the C version. -- given a date range and a number of buckets, round the given date to one -- of the buckets such that any number of dates within the date range passed -- in to this function will only return up to the number of buckets unique -- values CREATE OR REPLACE FUNCTION date_width_bucket (tm TIMESTAMP WITHOUT TIME ZONE, low TIMESTAMP WITHOUT TIME ZONE, high TIMESTAMP WITHOUT TIME ZONE, nbuckets INTEGER ) RETURNS TIMESTAMP WITHOUT TIME ZONE AS $$ SELECT ((EXTRACT(epoch FROM $3) - EXTRACT(epoch FROM $2)) / $4) * (width_bucket(EXTRACT(epoch FROM $1)::NUMERIC, EXTRACT(epoch FROM $2)::NUMERIC, EXTRACT(epoch FROM $3)::NUMERIC, $4) - 1) * '1 second'::INTERVAL + $2; $$ LANGUAGE sql IMMUTABLE STRICT; -- I don't think they could put him in a mental hospital. On the other hand, if he were already in, I don't think they'd let him out.
On Mon, Oct 09, 2006 at 03:49:50PM -0400, Tom Lane wrote: > "Jim C. Nasby" <jim@nasby.net> writes: > > On Mon, Oct 09, 2006 at 01:49:37PM -0400, Tom Lane wrote: > >> This is exactly the slippery slope I don't care to start down. > > > I guess I'm confused as to how this is any different from other > > functions where we've provided multiple input arguments, such as the > > aggregate functions. > > The salient reason is that the spec only defines width_bucket for numeric > input arguments, whereas stuff like max/min is defined *by the spec* for > other data types. > > Since there's no spec-based argument for allowing width_bucket for other > datatypes, and only an (IMHO) very weak use-case for it, I don't think > we should add the clutter. Catalog or code clutter? ISTM that it wouldn't take much extra work at all to provide this for timestamps or intervals... In any case, having a faster version that used double certainly seems like it'd be useful. It'd probably allow the OP to go back to his original, simple version. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jeremy Drake <pgsql@jdrake.com> writes: > I found the function I used before I implemented the C version. It was > significantly slower, which is why I wrote the C version. I would imagine that most of the problem is the NUMERIC arithmetic that's doing. regards, tom lane
On Mon, 2006-10-09 at 12:02 -0400, Tom Lane wrote: > It's not clear to me why we have width_bucket operating on numeric and > not float8 I asked about this when I originally implemented width_bucket(), I recall[1]. At the time, there was scepticism about whether it was even worth implementing width_bucket(), let alone providing multiple implementations of it. I'd be happy to provide a float8 implementation (I may even have one lying around somewhere...) -Neil [1] http://archives.postgresql.org/pgsql-patches/2004-04/msg00259.php
Neil Conway <neilc@samurai.com> writes: > On Mon, 2006-10-09 at 12:02 -0400, Tom Lane wrote: >> It's not clear to me why we have width_bucket operating on numeric and >> not float8 > I asked about this when I originally implemented width_bucket(), I > recall[1]. At the time, there was scepticism about whether it was even > worth implementing width_bucket(), let alone providing multiple > implementations of it. I'd be happy to provide a float8 implementation > (I may even have one lying around somewhere...) It's probably too late for 8.2, unless some other compelling reason to force an initdb surfaces. But if you can find the code, please stick it in when 8.3 devel starts. As it stands, one must do an explicit coercion to numeric to use width_bucket() with floats; which is tedious, slow, and nowhere required by the spec. regards, tom lane