Thread: update and group by/aggregate

update and group by/aggregate

From: Ivan Sergio Borgonovo
How can I update catalog_items.Authors

create table catalog_items(
  ItemID bigint primary key,
  Authors varchar(1024)
);

taking results from

select ia.ItemID, array_accum(trim(' \t]' from a.Name))
  from catalog_itemauthor ia
  join catalog_author a on a.AuthorID=ia.AuthorID
  where a.Name is not null and length(trim(' \t' from a.Name))>1
  group by ia.ItemID;

Currently I'm achieving the same result with a plpgsql function with
a for loop, and I'm planning to see if using aggregates is going to
be faster, and then see if it is even worth keeping an extra column...

create or replace function UpdateAuthors()
returns void
as
$$
declare
  _row record;
  _ItemID bigint;
  _Authors varchar(1024);
  _Name varchar(50);
begin
  _Authors:='';
  _ItemID:=null;
  for _row in select a.Name, ia.ItemID from {catalog_itemauthor} ia
    join {catalog_author} a on a.AuthorID=ia.AuthorID
    order by ia.ItemID
      loop
    if(_row.ItemID<>_ItemID) then
      if(length(_Authors)>2) then
        _Authors:=substring(_Authors from 3);
        update {catalog_items} set Authors=_Authors
          where ItemID=_ItemID;
      end if;
      _Authors:='';
    end if;
    _ItemID:=_row.ItemID;
    _Name:=trim(' \t' from _row.Name);
    if(length(_Name)>0) then
      _Authors:= _Authors || ', ' || _Name;
    end if;
  end loop;
return;
end;
$$ language plpgsql volatile;

BTW
I've noticed that array_append gives back non-uniform results:

select array_append(ARRAY['az','e','i'],'');
{az,e,i,""}
while I'd expect
{"az","e","i",""}

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: update and group by/aggregate

From: "Pavel Stehule"
2008/8/27 Ivan Sergio Borgonovo <mail@webthatworks.it>:
> How can I update catalog_items.Authors
>
> create table catalog_items(
>  ItemID bigint primary key,
>  Authors varchar(1024)
> );
>
> taking results from
>
> select ia.ItemID, array_accum(trim(' \t]' from a.Name))
>  from catalog_itemauthor ia
>  join catalog_author a on a.AuthorID=ia.AuthorID
>  where a.Name is not null and length(trim(' \t' from a.Name))>1
>  group by ia.ItemID;
>
> Currently I'm achieving the same result with a plpsql function with
> a for loop, and I'm planning to see if using aggregates is going to
> be faster and then see if it is even worth to keep an extra column...
>
> create or replace function UpdateAuthors()
> returns void
> as
> $$
> declare
>  _row record;
>  _ItemID bigint;
>  _Authors varchar(1024);
>  _Name varchar(50);
> begin
>  _Authors:='';
>  _ItemID:=null;
>  for _row in select a.Name, ia.ItemID from {catalog_itemauthor} ia
>    join {catalog_author} a on a.AuthorID=ia.AuthorID
>    order by ia.ItemID
>      loop
>    if(_row.ItemID<>_ItemID) then
>      if(length(_Authors)>2) then
>        _Authors:=substring(_Authors from 3);
>        update {catalog_items} set Authors=_Authors
>          where ItemID=_ItemID;
>      end if;
>      _Authors:='';
>    end if;
>    _ItemID:=_row.ItemID;
>    _Name:=trim(' \t' from _row.Name);
>    if(length(_Name)>0) then
>      _Authors:= _Authors || ', ' || _Name;
>    end if;
>  end loop;
> return;
> end;
> $$ language plpgsql volatile;
>
> BTW
> I've noticed that array_append gives back not uniform results:
>
> select array_append(ARRAY['az','e','i'],'');
> {az,e,i,""}

It's correct: double quotes are used only for elements that contain
spaces and for the empty string.

If you want to print the array content nicely, use the array_to_string
function.
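
For example (untested, throwaway literals, just to show the array
literal output next to the joined string):

select array_append(ARRAY['az','e','i'], '');
-- gives: {az,e,i,""}
select array_to_string(ARRAY['az','e','i',''], ', ');
-- gives: az, e, i,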

regards
Pavel Stehule

> while I'd expect
> {"az","e","i",""}
>
> --
> Ivan Sergio Borgonovo
> http://www.webthatworks.it

Re: update and group by/aggregate

From: Sam Mason
On Wed, Aug 27, 2008 at 10:32:43AM +0200, Ivan Sergio Borgonovo wrote:
> How can I update catalog_items.Authors
>
> create table catalog_items(
>   ItemID bigint primary key,
>   Authors varchar(1024)
> );

The type "varchar(1024)" looks a little awkward; wouldn't an unadorned
TEXT be easier?  If you explicitly want to limit things to 1024
characters then what you're doing is correct.
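
For example (untested sketch, not something you have to do): if the
1024 is only meant as a sanity check, a TEXT column with an explicit
CHECK constraint keeps the early warning while making the intent
visible:

create table catalog_items(
  ItemID bigint primary key,
  Authors text check (length(Authors) <= 1024)
);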

> taking results from
>
> select ia.ItemID, array_accum(trim(' \t]' from a.Name))
>   from catalog_itemauthor ia
>   join catalog_author a on a.AuthorID=ia.AuthorID
>   where a.Name is not null and length(trim(' \t' from a.Name))>1
>   group by ia.ItemID;

UPDATE catalog_items SET authors=array_to_string(x.authors,', ')
  FROM (
    SELECT ia.itemid, array_accum(trim(' \t]' from a.name)) AS authors
    FROM catalog_itemauthor ia
      JOIN catalog_author a ON a.authorid = ia.authorid
    WHERE a.name IS NOT NULL
      AND length(trim(' \t' from a.name))>1
    GROUP BY ia.itemid) x;

is a reasonably direct translation.  Though I may be tempted to use
something more like:

UPDATE catalog_items SET authors=array_to_string(x.authors,', ')
  FROM (
    SELECT ia.itemid, array_accum(a.name) AS authors
    FROM catalog_itemauthor ia, (
      SELECT authorid, trim(' \t' from name) AS name
      FROM catalog_author) a
    WHERE ia.authorid = a.authorid
      AND a.name IS NOT NULL
      AND length(a.name) > 1
    GROUP BY ia.itemid) x;

to ensure that the characters trimmed from the authors' names are
consistent.


  Sam

Re: update and group by/aggregate

From: Sam Mason
Oops, both my statements were horribly broken.  They needed a WHERE
clause for the UPDATE.

On Wed, Aug 27, 2008 at 11:44:20AM +0100, Sam Mason wrote:
> UPDATE catalog_items SET authors=array_to_string(x.authors,', ')
>   FROM (
>     SELECT ia.itemid, array_accum(trim(' \t]' from a.name)) AS authors
>     FROM catalog_itemauthor ia
>       JOIN catalog_author a ON a.authorid = ia.authorid
>     WHERE a.name IS NOT NULL
>       AND length(trim(' \t' from a.name))>1
>     GROUP BY ia.itemid) x;

should be:

UPDATE catalog_items i SET authors=array_to_string(x.authors,', ')
  FROM (
    SELECT ia.itemid, array_accum(trim(' \t]' from a.name)) AS authors
    FROM catalog_itemauthor ia
      JOIN catalog_author a ON a.authorid = ia.authorid
    WHERE a.name IS NOT NULL
      AND length(trim(' \t' from a.name))>1
    GROUP BY ia.itemid) x
  WHERE i.itemid = x.itemid;

> UPDATE catalog_items SET authors=array_to_string(x.authors,', ')
>   FROM (
>     SELECT ia.itemid, array_accum(a.name) AS authors
>     FROM catalog_itemauthor ia, (
>       SELECT authorid, trim(' \t' from name) AS name
>       FROM catalog_author) a
>     WHERE ia.authorid = a.authorid
>       AND a.name IS NOT NULL
>       AND length(a.name) > 1
>     GROUP BY ia.itemid) x;

should be:

UPDATE catalog_items i SET authors=array_to_string(x.authors,', ')
  FROM (
    SELECT ia.itemid, array_accum(a.name) AS authors
    FROM catalog_itemauthor ia, (
      SELECT authorid, trim(' \t' from name) AS name
      FROM catalog_author) a
    WHERE ia.authorid = a.authorid
      AND a.name IS NOT NULL
      AND length(a.name) > 1
    GROUP BY ia.itemid) x
  WHERE i.itemid = x.itemid;

Sorry!


  Sam

Re: loop vs. aggregate was: update and group by/aggregate

From: Ivan Sergio Borgonovo
On Wed, 27 Aug 2008 10:32:43 +0200
Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:

> create or replace function UpdateAuthors()
> returns void
> as
> $$
> declare
>   _row record;
>   _ItemID bigint;
>   _Authors varchar(1024);
>   _Name varchar(50);
> begin
>   _Authors:='';
>   _ItemID:=null;
>   for _row in select a.Name, ia.ItemID from {catalog_itemauthor} ia
>     join {catalog_author} a on a.AuthorID=ia.AuthorID
>     order by ia.ItemID
>       loop
>     if(_row.ItemID<>_ItemID) then
>       if(length(_Authors)>2) then
>         _Authors:=substring(_Authors from 3);
>         update {catalog_items} set Authors=_Authors
>           where ItemID=_ItemID;
>       end if;
>       _Authors:='';
>     end if;
>     _ItemID:=_row.ItemID;
>     _Name:=trim(' \t' from _row.Name);
>     if(length(_Name)>0) then
>       _Authors:= _Authors || ', ' || _Name;
>     end if;
>   end loop;
> return;
> end;
> $$ language plpgsql volatile;

Replying to myself:
update catalog_items set Authors=s.Authors
from (
  select ia.ItemID,
  array_to_string(array_accum(trim(' \t]' from a.Name)),', ') as
  Authors
  from catalog_itemauthor ia
  join catalog_author a on a.AuthorID=ia.AuthorID
  where a.Name is not null and
  length(trim(' \t' from a.Name))>1
  group by ia.ItemID)
  as s where s.ItemID=catalog_items.ItemID;

but this looks much slower than the function:
function: 113sec
vs.
single statement: 488sec
I repeated the test 3 times with similar results.
Can anybody explain why aggregates underperform so badly?
I just read that most of the time I can't be smarter than the
planner, and I thought that this would be one of the circumstances
where the planner could outperform my handwritten function.

here is the explain:
"Hash Join  (cost=137408.51..271599.78 rows=209674 width=221)"
"  Hash Cond: ("outer".itemid = "inner".itemid)"
"  ->  HashAggregate  (cost=32994.81..36664.11 rows=209674 width=58)"
"        ->  Hash Join  (cost=8544.62..31946.44 rows=209674 width=58)"
"              Hash Cond: ("outer".authorid = "inner".authorid)"
"              ->  Seq Scan on catalog_itemauthor ia  (cost=0.00..10297.21 rows=629021 width=12)"
"              ->  Hash  (cost=8309.00..8309.00 rows=94248 width=54)"
"                    ->  Seq Scan on catalog_author a  (cost=0.00..8309.00 rows=94248 width=54)"
"                          Filter: ((name IS NOT NULL) AND (length(btrim((name)::text, E' \011'::text)) > 1))"
"  ->  Hash  (cost=79538.96..79538.96 rows=833496 width=189)"
"        ->  Seq Scan on catalog_items  (cost=0.00..79538.96 rows=833496 width=189)"

Thanks to Pavel, who suggested the use of array_to_string.

Sam... I made the same error the first time I wrote the above
statement... missing the where clause. But I can't understand the
need for your longer version "to ensure that the characters trimmed
from the authors' names are consistent".

I prefer to put a constraint on the length of the varchar as an early
warning for something unexpected.
E.g. if the loop was not working as expected I'd get an error. In fact
that's exactly what happened during development of the above
function.
If Authors were a text field I'd be waiting forever, filling the DB
with rubbish, etc...
There are no truly unlimited fields... there are fields where the
upper limit may be hard to guess.
1024 is actually an inherited choice... but I did some research to
find a reasonable upper limit (highest # of authors * longest
author name).


thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: loop vs. aggregate was: update and group by/aggregate

From: Sam Mason
On Wed, Aug 27, 2008 at 01:19:30PM +0200, Ivan Sergio Borgonovo wrote:
> but this looks much slower than the function:
> function: 113sec
> vs.
> single statement: 488sec
> I repeated the test 3 times with similar results.

Have you been vacuuming (non-full) between runs? And, as always, are
the stats reasonably up to date?

> Can anybody explain why aggregates under perform so badly?

You could try just running the SELECT part to see how long the
aggregation takes.
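
For example, in psql (untested; this is just the SELECT part of your
statement, with the stray ']' dropped from the trim):

\timing
select ia.ItemID, array_accum(trim(' \t' from a.Name))
  from catalog_itemauthor ia
  join catalog_author a on a.AuthorID=ia.AuthorID
  where a.Name is not null and length(trim(' \t' from a.Name))>1
  group by ia.ItemID;
-- or prefix the query with EXPLAIN ANALYZE to get per-node timings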

> I just read that most of the times I can't be smarter than the
> planner and I thought that this would be one of the circumstances
> theplanner could outperform my handwritten function.

It's quite easy to outsmart the planner with large amounts of data, but
it's surprising how well it does most of the time.  Generally you can
just write whatever is obvious and the planner will do something good
with it.  If it doesn't do the right thing then you can worry about
performance, unlike most languages where you have to worry about
performance from the start.

> here is the explain:
> "Hash Join  (cost=137408.51..271599.78 rows=209674 width=221)"
> "  Hash Cond: ("outer".itemid = "inner".itemid)"
> "  ->  HashAggregate  (cost=32994.81..36664.11 rows=209674 width=58)"
> "        ->  Hash Join  (cost=8544.62..31946.44 rows=209674 width=58)"
> "              Hash Cond: ("outer".authorid = "inner".authorid)"
> "              ->  Seq Scan on catalog_itemauthor ia  (cost=0.00..10297.21 rows=629021 width=12)"
> "              ->  Hash  (cost=8309.00..8309.00 rows=94248 width=54)"
> "                    ->  Seq Scan on catalog_author a  (cost=0.00..8309.00 rows=94248 width=54)"
> "                          Filter: ((name IS NOT NULL) AND (length(btrim((name)::text, E' \011'::text)) > 1))"
> "  ->  Hash  (cost=79538.96..79538.96 rows=833496 width=189)"
> "        ->  Seq Scan on catalog_items  (cost=0.00..79538.96 rows=833496 width=189)"

The seqscan of catalog_items looks a little strange to me, have you set
work_mem to some obscenely big value?

> Sam... I did your same error the first time I wrote the above
> statement... missing the where clause but I can't understand the
> need for your longer version "to ensure that characters trimmed from
> the authors' name are consistent.

You're trimming an extra close square bracket (']') in one of the trim
statements and not in the other.  If you just do it in one place then
you don't have to worry about inconsistency.

> I prefer to put constraint on the length of varchar as an early
> warning for something that is unexpected.
> eg. if the loop was not working as expected I'd get an error. Infact
> that's exactly what happened during development of the above
> function.

OK, Knuth generally seemed to recommend doing similar things.  Most
modern programming people tend to advocate removing artificial limits as
some fundamental good.  I'm never quite so sure, either way!


  Sam

Re: loop vs. aggregate was: update and group by/aggregate

From: Ivan Sergio Borgonovo
On Wed, 27 Aug 2008 12:45:42 +0100
Sam Mason <sam@samason.me.uk> wrote:

> On Wed, Aug 27, 2008 at 01:19:30PM +0200, Ivan Sergio Borgonovo
> wrote:
> > but this looks much slower than the function:
> > function: 113sec
> > vs.
> > single statement: 488sec
> > I repeated the test 3 times with similar results.
>
> Have you been vacuuming (non-full) between runs? and as always,
> are the stats reasonably up to date?

There is autovacuum running regularly, and I vacuumed full, analyzed
and reindexed everything recently: that means all changes to the DB
were done BEFORE the maintenance stuff, but then I executed several
selects on the tables (including the ones involved).
I tried to run the function and the statement 3 times one after the
other... so they should have been in a similar situation.

> > Can anybody explain why aggregates under perform so badly?

> You could try just running the SELECT part to see how long the
> aggregation takes.

33sec
Surprising.

> > I just read that most of the times I can't be smarter than the
> > planner and I thought that this would be one of the
> > circumstances theplanner could outperform my handwritten
> > function.

> It's quite easy to outsmart the planner with large amounts of
> data, but it's surprising how well it does most of the time.
> Generally you can just write whatever is obvious and the planer
> will do something good with it.  If it doesn't do the right thing
> then you can worry about performance, rather than most languages
> where you have to worry about performance from the start.

I really thought that in this case the planner was going to outsmart
me, since in that statement it could see a lot more chances for
optimisation than I could, knowing the nature of the data.

> > here is the explain:
> > "Hash Join  (cost=137408.51..271599.78 rows=209674 width=221)"
> > "  Hash Cond: ("outer".itemid = "inner".itemid)"
> > "  ->  HashAggregate  (cost=32994.81..36664.11 rows=209674
> > width=58)" "        ->  Hash Join  (cost=8544.62..31946.44
> > rows=209674 width=58)" "              Hash Cond:
> > ("outer".authorid = "inner".authorid)" "              ->  Seq
> > Scan on catalog_itemauthor ia  (cost=0.00..10297.21 rows=629021
> > width=12)" "              ->  Hash  (cost=8309.00..8309.00
> > rows=94248 width=54)" "                    ->  Seq Scan on
> > catalog_author a  (cost=0.00..8309.00 rows=94248 width=54)"
> > "                          Filter: ((name IS NOT NULL) AND
> > (length(btrim((name)::text, E' \011'::text)) > 1))" "  ->  Hash
> > (cost=79538.96..79538.96 rows=833496 width=189)" "        ->
> > Seq Scan on catalog_items  (cost=0.00..79538.96 rows=833496
> > width=189)"
>
> The seqscan of catalog_items looks a little strange to me, have
> you set work_mem to some obscenely big value?

32MB
I think PostgreSQL users, and its image, would greatly benefit from
some sort of tool that gives suggestions on how to set postgresql.conf
according to the hardware... and maybe asks the user some questions.
Tom suggested raising that number into the range of tens of MB for
another problem. I saw the sweet spot was at 16MB and, considering I
was expecting the DB to get bigger, I increased it to 32MB.

> > Sam... I did your same error the first time I wrote the above
> > statement... missing the where clause but I can't understand the
> > need for your longer version "to ensure that characters trimmed
> > from the authors' name are consistent.

> You're trimming an extra close square bracket (']') in one of the
> trim statements and not in the other.  If you just do it in one
> place then you don't have to worry about inconsistency.

Guess: that was a typo. regexp brain context switching :)

> > I prefer to put constraint on the length of varchar as an early
> > warning for something that is unexpected.
> > eg. if the loop was not working as expected I'd get an error.
> > Infact that's exactly what happened during development of the
> > above function.

> OK, Knuth generally seemed to recommend doing similar things.  Most
> modern programming people tend to advocate removing artificial
> limits as some fundamental good.  I'm never quite so sure, either
> way!

I do know Knuth and I think I share his opinions. I don't know
"modern programming people" and I'm alien to the concept of
"fundamental good".
But well, I'm here to learn. Could you point me to some explanation
of why it should be a "fundamental good" in a DB context?
I'd say that:
a) you may discover that your forecast of the upper limit was wrong
and see things break suddenly
b) you incur a small overhead (do you?)

I do think that most of the time it is worth it (and part of the
problem) to make a reasonable forecast. If I can't really make one,
or it requires too much effort to make a reasonable one, at least I
start developing with an "unreasonable" constraint just to catch some
errors earlier, then drop the constraint and leave a comment.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: loop vs. aggregate was: update and group by/aggregate

From: Sam Mason
On Wed, Aug 27, 2008 at 02:58:18PM +0200, Ivan Sergio Borgonovo wrote:
> On Wed, 27 Aug 2008 12:45:42 +0100 Sam Mason <sam@samason.me.uk> wrote:
> > Have you been vacuuming (non-full) between runs? and as always,
> > are the stats reasonably up to date?
>
> there is autovacuum running regularly and I:
> vacuumed full, analyze and reindexed everything recently:
> that means that all changes to DB were done BEFORE maintenance stuff
> but then I executes several selects on tables (including the ones
> involved).
> I tried to run the function and the statement 3 times one after the
> other... so they should have been in similar situation.

I'd probably start by doing a normal vacuum, then run a command that
touches every row, then run a normal vacuum again.  Between every
iteration you want to run a normal vacuum, otherwise the later runs
will see more garbage than the previous run (updating a row leaves a
dead row behind).
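
Something like this between each timed run (table names taken from
your statements):

VACUUM ANALYZE catalog_items;
VACUUM ANALYZE catalog_itemauthor;
VACUUM ANALYZE catalog_author;
-- run the UPDATE (or the function), then VACUUM ANALYZE again before
-- timing the next variant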

> > > Can anybody explain why aggregates under perform so badly?
>
> > You could try just running the SELECT part to see how long the
> > aggregation takes.
>
> 33sec
> Surprising.

OK, so it's the UPDATE that's taking the time.  I'm not experienced
enough to say why though.

> > It's quite easy to outsmart the planner with large amounts of
> > data, but it's surprising how well it does most of the time.
> > Generally you can just write whatever is obvious and the planer
> > will do something good with it.  If it doesn't do the right thing
> > then you can worry about performance, rather than most languages
> > where you have to worry about performance from the start.
>
> I really thought that in this case the planner was going to outsmart
> me since well I think in that statement it could see a lot more
> optimisation than me knowing the nature of the data.

an experienced human thinking for several seconds (or probably minutes
in this example) is almost always going to be better than a computer
thinking for a millisecond.

> > have you set work_mem to some obscenely big value?
>
> 32Mb

OK.  It's just that it was doing a hash aggregation involving 160MB of
data.  That will cause it to spill to disk and I'd think it would
prefer to do something else.
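
If you want to test that idea, you could bump it for the session only
and re-run the statement (256MB below is an arbitrary guess, not a
recommendation):

SET work_mem = 262144;   -- in kB, i.e. 256MB, session-local only
-- re-run the UPDATE ... FROM (...) statement here
RESET work_mem;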

> Tom suggested to raise that number in the range of tens of Mb for
> another problem. I saw the hot spot was at 16Mb and considering I
> was expecting the DB to get bigger I increased it to 32Mb.

32MB seems a reasonable default on modern hardware.

> > You're trimming an extra close square bracket (']') in one of the
> > trim statements and not in the other.  If you just do it in one
> > place then you don't have to worry about inconsistency.
>
> Guess: that was a typo. regexp brain context switching :)

I do the same thing far too often, hence I tend to do similar query
rewrites, as I did on yours, to prevent this (human bug/feature) from
happening.

> I do know Knuth and I think I share his opinions. I don't know
> "modern programming people" and I'm alien to the concept of
> "fundamental good".

A lot of programming courses will try and teach you to remove all
constraints from your code, whether they're going to be hit or not.
In the real world, resources are finite and effort has to be spent
appropriately.

In my eyes this is the difference between computer science and software
engineering.  Software engineering is just interested in making
something that works now; computer science is about pushing back the
boundaries of what's possible.  Too often the two get confused.

> But well, I'm here to learn. Could you point me to some explanation
> on why it should be a "fundamental good" in DB context?

As in why using TEXT is good over a large VARCHAR?  It's normally an
engineering choice.  Some database systems optimize one or the other a
lot more, so people tend to prefer one for somewhat arbitrary reasons.
In PG it doesn't really matter in terms of performance, and you should
use whichever expresses the data you're trying to store appropriately.

> I do think that most of the time it is worth (and part of the
> problem) to make a reasonable forecast. If I can't really make one
> or it requires too much effort to make a reasonable one at least I
> start developing with an "unreasonable" constraint just to catch some
> errors earlier, drop the constraint and leave a comment.

Yes, however this requires experience of what's worked so far and is
difficult to teach.


  Sam

Re: loop vs. aggregate was: update and group by/aggregate

From: Ivan Sergio Borgonovo
On Wed, 27 Aug 2008 14:47:24 +0100
Sam Mason <sam@samason.me.uk> wrote:

> On Wed, Aug 27, 2008 at 02:58:18PM +0200, Ivan Sergio Borgonovo
> wrote:
> > On Wed, 27 Aug 2008 12:45:42 +0100 Sam Mason <sam@samason.me.uk>
> > wrote:
> > > Have you been vacuuming (non-full) between runs? and as always,
> > > are the stats reasonably up to date?
> >
> > there is autovacuum running regularly and I:
> > vacuumed full, analyze and reindexed everything recently:
> > that means that all changes to DB were done BEFORE maintenance
> > stuff but then I executes several selects on tables (including
> > the ones involved).
> > I tried to run the function and the statement 3 times one after
> > the other... so they should have been in similar situation.

> I'd probably start by doing a normal vacuum, then run a command

[snip]
Pardon me if I waste 2 posts... I'll try it later, but I doubt it
will make any difference.

> > > > Can anybody explain why aggregates under perform so badly?

> > > You could try just running the SELECT part to see how long the
> > > aggregation takes.
> >
> > 33sec
> > Surprising.

> OK, so it's the UPDATE that's taking the time.  I'm not experienced
> enough to say why though.

I can't figure it out either.
I'd say that:
- memory use is larger in the single statement that uses the aggregate
- even if swapping is not a factor, walking down a large memory
structure may introduce some overhead

Still, the aggregate version is 4 times slower. It is hard to justify
such a difference just with the overhead of walking down an array in
memory, considering the box has 4GB.
But both versions happen inside one transaction... I'd think that
one large update should be faster than several updates, which should
make the aggregate version faster.
It would be nice if someone came up with an explanation.

> > > It's quite easy to outsmart the planner with large amounts of
> > > data, but it's surprising how well it does most of the time.
> > > Generally you can just write whatever is obvious and the planer
> > > will do something good with it.  If it doesn't do the right
> > > thing then you can worry about performance, rather than most
> > > languages where you have to worry about performance from the
> > > start.

Well... but the function version doesn't exploit any information on
the actual data contained in the tables.
In the single statement version the planner knows everything that is
going to be touched. Does the planner consider all the statements in a
function to optimise it, or does the optimising work happen
statement by statement?

> > I really thought that in this case the planner was going to
> > outsmart me since well I think in that statement it could see a
> > lot more optimisation than me knowing the nature of the data.

> an experienced human thinking for several seconds (or probably
> minutes in this example) is almost always going to be better than
> a computer thinking for a millisecond.

I just couldn't come up with the single statement version.
I asked if anyone knew if I could rewrite the function in a single
statement because I thought it was going to run faster and be easier
to understand once written.
When I wrote the function, optimisation was the last of my thoughts.

> > > have you set work_mem to some obscenely big value?
> >
> > 32Mb

> OK.  It's just that it was doing a hash aggregation involving
> 160MB of data.  That will cause it to disk and I'd think it would
> prefer to do something else.

On a 4GB RAM box running Apache, PHP and Postgres, considering the
main table is going to contain 1M records (currently ~800K, data
size should be around 300-400MB), is 32MB a reasonable choice?

> > > You're trimming an extra close square bracket (']') in one of
> > > the trim statements and not in the other.  If you just do it
> > > in one place then you don't have to worry about inconsistency.
> >
> > Guess: that was a typo. regexp brain context switching :)

> I do the same thing far too often, hence I tend to do similar query
> rewrites, as I did on yours, to prevent this (human bug/feature)
> from happening.

Oh well, every time I cut&paste I know I'm doing something I'll be
punished for ;) and that there is something to be fixed.
In fact, once you made me notice it, I did... I thought I could just
clean catalog_author.Name during import.
Still it is annoying that this doesn't work, so you still have to
keep 2 copies of the same snippet around:
select trim(' \t' from Name) as _Name from catalog_author
  where length(_Name)>1;
can't work.

> > I do know Knuth and I think I share his opinions. I don't know
> > "modern programming people" and I'm alien to the concept of
> > "fundamental good".

> A lot of programming courses will try and teach you to remove all
> constraints from your code, whether they're going to be hit or not.

Maybe that's why I'm so ignorant. I studied physics ;)

> In the real world, resources are finite and effort has to be spent
> appropriately.

> In my eyes this is the difference between computer science and
> software engineering.  Software engineering is just interested in
> making something that works now, computer science is about pushing
> back of the boundaries of what's possible.  Too often the two get
> confused.

Well... coming from a different background, I'd say that if you can't
measure where the limits are, you can't know if you're pushing them
back.

> > But well, I'm here to learn. Could you point me to some
> > explanation on why it should be a "fundamental good" in DB
> > context?

> as in why using TEXT is good over a large VARCHAR ? it's an
> engineering choice normally.  Some database systems optimize one
> or the other a lot more, so people tend to prefer one for
> arbitrary reasons.  In PG it doesn't really matter in terms of
> performance and you should use whichever expresses the data you're
> trying to store appropriately.

To my knowledge, using varchar(N) should have a small overhead in
PostgreSQL, not the other way around.
Some implementations (MySQL? SQLite?) may take advantage of knowing
the size of the data... but pg treats text and varchar the same,
just adding a check to the latter.
I may be wrong; I still haven't paid enough attention to this kind of
implementation detail to be sure of what I'm writing.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: loop vs. aggregate was: update and group by/aggregate

From: Sam Mason
On Wed, Aug 27, 2008 at 05:14:46PM +0200, Ivan Sergio Borgonovo wrote:
> On Wed, 27 Aug 2008 14:47:24 +0100 Sam Mason <sam@samason.me.uk> wrote:
> > OK, so it's the UPDATE that's taking the time.  I'm not experienced
> > enough to say why though.
>
> I can't get it either.
> I'd say that:
> - use of memory is larger in the single statement that use aggregate
> - if swapping may not be a factor, walking down large memory
> structure may introduce some overhead

Doing everything in memory in PG would be quicker than lots of
transitions into and out of the kernel to read data from the cache.

I'm guessing the performance difference is because the function ends
up using an index scan to actually update the table.  I've not tried
doing this myself, but you may have luck convincing PG to use index
scans by increasing effective_cache_size and/or reducing
random_page_cost.
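
For example (untested, and the numbers are placeholders rather than
tuned values):

SET effective_cache_size = 262144;  -- in 8kB pages, i.e. roughly 2GB
SET random_page_cost = 2.0;         -- makes index scans look cheaper
-- then re-run EXPLAIN on the UPDATE to see whether the plan changes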

> Well... but the function version doesn't exploit any information on
> actual data contained in the tables.
> In the single statement version the planner know everything is going
> to be touched. Does the planner consider all the statement in a
> function to optimise it or is the optimising work happening
> statement by statement?

It's just working statement by statement; it'll plan right at the
beginning, the first time the function is called, as well.
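
As an illustration of the caching (untested sketch; count_items is
just a made-up name): static SQL in plpgsql is planned on the first
call in a session and the plan is reused, while anything run through
EXECUTE is planned each time it runs.

create or replace function count_items() returns bigint as $$
declare
  n bigint;
begin
  -- static SQL: planned on the first call in a session, plan reused
  select count(*) into n from catalog_items;
  -- dynamic SQL: re-planned every time it runs
  execute 'select count(*) from catalog_items' into n;
  return n;
end;
$$ language plpgsql;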

> I just couldn't come up with the single statement version.
> I asked if anyone knew if I could rewrite the function in a single
> statement because I thought it was going to run faster and be easier
> to understand once written.

I'd certainly expect the UPDATE to be quicker!

> When I wrote the function optimisation was the last of my thought.

But then you got lucky with your implementation.  It would be easy to
bung everything into a big hash table and work from there; doing some
of the work in the database, and having it come out already ordered,
makes it run quickly.  So you've kind of implicitly optimized it.

> On a 4Gb RAM box running Apache, PHP and Postgres, considering the
> main table is going to contain 1M records (currently ~800K, data
> size should be around 300-400Mb) is 32Mb a reasonable choice?

No idea; it certainly isn't a bad choice.  Putting some connection
pooling software in front of PG is probably a better thing to think
about than work_mem size, if you want to worry about something!

> Still it is annoying that this doesn't work and you still have to
> have 2 copies of the same snippet around:
> select trim(' \t' from Name) as _Name from catalog_author where
> length(_Name)>1;
> can't work.

Yes, SQL is a bit annoying; you have to make a subquery to do anything
like that.
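
i.e. something like (untested):

SELECT _Name
  FROM (
    SELECT trim(' \t' from Name) AS _Name
    FROM catalog_author) a
  WHERE length(_Name) > 1;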

> > In the real world, resources are finite and effort has to be spent
> > appropriately.
>
> > In my eyes this is the difference between computer science and
> > software engineering.  Software engineering is just interested in
> > making something that works now, computer science is about pushing
> > back of the boundaries of what's possible.  Too often the two get
> > confused.
>
> Well... coming from a different background I'd say if you can't
> measure where the limit are, you can't know if you're pushing them
> back.

But you'll know if you're solving an immediate problem or some
theoretical possibility.

> As up to my knowledge using varchar(N) should have a small overhead
> in postgresql and not the other way around.

You're right, but it's small enough not to worry about.


  Sam