Thread: group by week (ww), plus -S performance

group by week (ww), plus -S performance

From
Michael Blakeley
Date:
I thought I'd pass along a work-around I came up with for the limits
in 'ww' support (7.0 final). Maybe this would be a useful example for
the docs? They're a little lean on date/time examples, IMO. So is the
new book.

Task:
Select a count of records from a table, grouped by the week of the
record. The table is something like

    CREATE table EVENTS
        (event varchar(128) not null, stamp datetime default now());

And I want the output to use human-readable dates, like

     week    | count
------------+-------
  2000-03-06 |     4
  2000-03-13 |     5
  2000-03-20 |     3

My immediate solution was something like
    SELECT to_date(date_part('year',stamp)||'-'||date_part('week',stamp),
        'yyyy-ww'),count(*) FROM EVENTS GROUP BY to_date;

but to_date() doesn't seem to support 'ww' format for text-to-date
translation (not documented, AFAIK).

The solution I eventually found was
    SELECT to_date(date_part('year',stamp),'yyyy')-5+7*date_part('week',stamp)
        as week,count(*) FROM EVENTS GROUP BY week ORDER BY week;

I haven't tested the '-5' kludge very extensively. It seems to work
ok, for the dates I tested in 2000. I'm sure it would run into
trouble with calendar-change weirdness pre-1900.

I'd also love to hear any suggestions for performance improvements -
it's cpu-bound on my system, and takes about 70 sec to run with
86,000 rows.

The query plan is
Aggregate  (cost=9155.76..9584.66 rows=8578 width=20)
   ->  Group  (cost=9155.76..9370.21 rows=85780 width=20)
         ->  Sort  (cost=9155.76..9155.76 rows=85780 width=20)
               ->  Seq Scan on events  (cost=0.00..2126.80 rows=85780 width=20)

The order-by clause doesn't seem to add much overhead - the query
plan is the same with or without it.

I'm running with "-i -N 64 -B 1024 -o '-F'", and I've tried up to '-S
8192' without seeing any noticeable improvement. At higher values,
performance actually went down by almost 50% - something to do with
shmem segment sizes? This is on Solaris 2.6, and I compiled PG7.0
using gcc 2.95.

The -S does reduce disk I/O, but I think that's only about 5% of the
work that's going on (90% user time). An index on 'stamp' doesn't
seem to help, either.

thanks,
-- Mike

Re: group by week (ww), plus -S performance

From
Tom Lane
Date:
Michael Blakeley <mike@blakeley.com> writes:
> SELECT to_date(date_part('year',stamp),'yyyy')-5+7*date_part('week',stamp)
>     as week,count(*) FROM EVENTS GROUP BY week ORDER BY week;

Seems like the hard way.  You are relying on an implicit conversion from
the float8 output of date_part() to the text input to_date expects,
which is kind of slow (and IMHO it's a bug in 7.0 that it will do such
a conversion silently, anyhow).  Better to use date_trunc to save the
conversion step:

SELECT date_trunc('year',stamp)-5+7*date_part('week',stamp)
    as week,count(*) FROM EVENTS GROUP BY week ORDER BY week;

But the real problem is that this is going to use a start-of-week day
that is offset five days from whatever day-of-the-week 1 January is.
If you'd tried a few other years than 2000 you'd likely have been
dissatisfied with the results...

It seems like the right answer is that date_trunc('week',stamp) ought
to do what you want, but it doesn't seem to be implemented.  That's
definitely a shortcoming --- anyone want to fix it?

> The order-by clause doesn't seem to add much overhead - the query
> plan is the same with or without it.

Right, the GROUP BY requires a sort anyway, so the planner knows
there's no need to sort again on the same key.

            regards, tom lane

Re: group by week (ww), plus -S performance

From
Michael Blakeley
Date:
At 12:54 AM -0400 5/29/2000, Tom Lane wrote:
>Michael Blakeley <mike@blakeley.com> writes:
>>  SELECT to_date(date_part('year',stamp),'yyyy')-5+7*date_part('week',stamp)
>>    as week,count(*) FROM EVENTS GROUP BY week ORDER BY week;
>
>Seems like the hard way.  You are relying on an implicit conversion from
>the float8 output of date_part() to the text input to_date expects,
>which is kind of slow (and IMHO it's a bug in 7.0 that it will do such
>a conversion silently, anyhow).  Better to use date_trunc to save the
>conversion step:
>
>SELECT date_trunc('year',stamp)-5+7*date_part('week',stamp)
>    as week,count(*) FROM EVENTS GROUP BY week ORDER BY week;

Thanks - that seems to be about 10% faster. I also got a 25% boost by
recompiling PG7 with -O3.

>But the real problem is that this is going to use a start-of-week day
>that is offset five days from whatever day-of-the-week 1 January is.
>If you'd tried a few other years than 2000 you'd likely have been
>dissatisfied with the results...

Here's something that works a bit better, at the expense of falling
in line with 'dow' and starting each week on Sunday, which I don't
mind. I'd only gone with Monday in the first place because that was
what 'ww' seemed to do.

$ cal 12 1999
    December 1999
  S  M Tu  W Th  F  S
           1  2  3  4
  5  6  7  8  9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30 31

$ cal 1 2000
    January 2000
  S  M Tu  W Th  F  S
                    1
  2  3  4  5  6  7  8
  9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31

$ cal 1 2001
    January 2001
  S  M Tu  W Th  F  S
     1  2  3  4  5  6
  7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31

$ D="'2000-01-01'"; psql -c "SELECT $D::date-date_part('dow',$D::date);"
   ?column?
------------
  1999-12-26
(1 row)

$ D="'2001-01-01'"; psql -c "SELECT $D::date-date_part('dow',$D::date);"
   ?column?
------------
  2000-12-31
(1 row)

$ D="'2000-01-15'"; psql -c "SELECT $D::date-date_part('dow',$D::date);"
   ?column?
------------
  2000-01-09
(1 row)

$ D="'2000-01-16'"; psql -c "SELECT $D::date-date_part('dow',$D::date);"
   ?column?
------------
  2000-01-16
(1 row)

It's also about 35% faster, and hits me with a healthy "duh!" factor
since it's so obvious and so much cleaner. The total time for my
original query has been cut by more than half - from 70 sec to 28
sec. Thanks!

>It seems like the right answer is that date_trunc('week',stamp) ought
>to do what you want, but it doesn't seem to be implemented.  That's
>definitely a shortcoming --- anyone want to fix it?

You know what they say - if you want the right answer, just post the
wrong answer and wait to be corrected :-).

If you can't find fault with the query above, maybe it could be
plugged into date_trunc('week', stamp) as-is? I haven't even looked
at the source code, myself....

-- Mike

Re: group by week (ww), plus -S performance

From
Erich
Date:
> Thanks - that seems to be about 10% faster. I also got a 25% boost
> by recompiling PG7 with -O3.

Do you mean you got a 25% boost by recompiling with -O3 instead of
-O2, or instead of -g?  If going from -O2 to -O3 gives a 25% boost,
time for me to recompile.  I'm just now moving from MySQL to
PostgreSQL, and I'm also totally changing the data structures into
something much less efficient, so I need all the speed boosts I can
get.  Recompiling is much cheaper than buying a new computer.

e

Re: group by week (ww), plus -S performance

From
Karel Zak
Date:
On Sun, 28 May 2000, Michael Blakeley wrote:

> At 12:54 AM -0400 5/29/2000, Tom Lane wrote:
> >Michael Blakeley <mike@blakeley.com> writes:
> >>  SELECT to_date(date_part('year',stamp),'yyyy')-5+7*date_part('week',stamp)
> >>    as week,count(*) FROM EVENTS GROUP BY week ORDER BY week;
> >
> >Seems like the hard way.  You are relying on an implicit conversion from
> >the float8 output of date_part() to the text input to_date expects,
> >which is kind of slow (and IMHO it's a bug in 7.0 that it will do such
> >a conversion silently, anyhow).  Better to use date_trunc to save the
> >conversion step:
> >
> >SELECT date_trunc('year',stamp)-5+7*date_part('week',stamp)
> >    as week,count(*) FROM EVENTS GROUP BY week ORDER BY week;
>
> Thanks - that seems to be about 10% faster. I also got a 25% boost by
> recompiling PG7 with -O3.


Well, support for 'WW' in 'to_date()' is just in my TODO.


Probably will more faster

    select to_date( to_char(now(), 'YYYY-WW'), 'YYYY-WW');

than 3 operators and 2 functions in one query.


Or best way will add support for 'start-date' per week / month / quarter to
'to_char()'. Hmm, I try it. But how format-pictires use for this?

    SWW / SMM / SQ ?

 IMHO, the postgreSQL will first DB with this feature :-)

                            Karel