Re: loop vs. aggregate was: update and group by/aggregate - Mailing list pgsql-general

From Ivan Sergio Borgonovo
Subject Re: loop vs. aggregate was: update and group by/aggregate
Date
Msg-id 20080827145818.48b38a5e@dawn.webthatworks.it
In response to Re: loop vs. aggregate was: update and group by/aggregate  (Sam Mason <sam@samason.me.uk>)
Responses Re: loop vs. aggregate was: update and group by/aggregate  (Sam Mason <sam@samason.me.uk>)
List pgsql-general
On Wed, 27 Aug 2008 12:45:42 +0100
Sam Mason <sam@samason.me.uk> wrote:

> On Wed, Aug 27, 2008 at 01:19:30PM +0200, Ivan Sergio Borgonovo
> wrote:
> > but this looks much slower than the function:
> > function: 113sec
> > vs.
> > single statement: 488sec
> > I repeated the test 3 times with similar results.
>
> Have you been vacuuming (non-full) between runs? and as always,
> are the stats reasonably up to date?

there is autovacuum running regularly, and I recently did a VACUUM
FULL, ANALYZE and REINDEX of everything. All the changes to the DB
were made BEFORE that maintenance; afterwards I executed several
selects on the tables (including the ones involved).
I ran the function and the statement 3 times each, one after the
other... so they should have been in a similar situation.
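Concretely, the maintenance was along these lines (the database name
is just a placeholder; a sketch, not a transcript):

  VACUUM FULL ANALYZE;     -- whole database: reclaim space, refresh planner stats
  REINDEX DATABASE mydb;   -- rebuild all indexes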

> > Can anybody explain why aggregates under perform so badly?

> You could try just running the SELECT part to see how long the
> aggregation takes.

33sec
Surprising.

> > I just read that most of the time I can't be smarter than the
> > planner and I thought that this would be one of the
> > circumstances where the planner could outperform my handwritten
> > function.

> It's quite easy to outsmart the planner with large amounts of
> data, but it's surprising how well it does most of the time.
> Generally you can just write whatever is obvious and the planner
> will do something good with it.  If it doesn't do the right thing
> then you can worry about performance, rather than most languages
> where you have to worry about performance from the start.

I really thought that in this case the planner was going to outsmart
me, since in that statement it can see a lot more opportunities for
optimisation than I can, even knowing the nature of the data.
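For reference, the two contenders have roughly this shape (table and
column names are taken from the plan below; the target column and the
array_agg/btrim aggregate are stand-ins for what the real code builds,
so treat it as a sketch rather than the actual statements):

  -- single statement: one UPDATE driven by a GROUP BY subquery
  UPDATE catalog_items ci
     SET authors = agg.authors
    FROM (SELECT ia.itemid,
                 array_to_string(array_agg(btrim(a.name, E' \011')), ', ')
                   AS authors
            FROM catalog_itemauthor ia
            JOIN catalog_author a ON a.authorid = ia.authorid
           WHERE a.name IS NOT NULL
             AND length(btrim(a.name, E' \011')) > 1
           GROUP BY ia.itemid) agg
   WHERE ci.itemid = agg.itemid;

  -- function: walk the same join/aggregate in a FOR loop and issue one
  -- UPDATE per itemid (much simplified compared to the real function)
  CREATE OR REPLACE FUNCTION update_item_authors() RETURNS void AS $$
  DECLARE
    r record;
  BEGIN
    FOR r IN SELECT ia.itemid,
                    array_to_string(array_agg(btrim(a.name, E' \011')), ', ')
                      AS authors
               FROM catalog_itemauthor ia
               JOIN catalog_author a ON a.authorid = ia.authorid
              WHERE a.name IS NOT NULL
                AND length(btrim(a.name, E' \011')) > 1
              GROUP BY ia.itemid
    LOOP
      UPDATE catalog_items SET authors = r.authors WHERE itemid = r.itemid;
    END LOOP;
  END;
  $$ LANGUAGE plpgsql;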

> > here is the explain:
> > "Hash Join  (cost=137408.51..271599.78 rows=209674 width=221)"
> > "  Hash Cond: ("outer".itemid = "inner".itemid)"
> > "  ->  HashAggregate  (cost=32994.81..36664.11 rows=209674
> > width=58)" "        ->  Hash Join  (cost=8544.62..31946.44
> > rows=209674 width=58)" "              Hash Cond:
> > ("outer".authorid = "inner".authorid)" "              ->  Seq
> > Scan on catalog_itemauthor ia  (cost=0.00..10297.21 rows=629021
> > width=12)" "              ->  Hash  (cost=8309.00..8309.00
> > rows=94248 width=54)" "                    ->  Seq Scan on
> > catalog_author a  (cost=0.00..8309.00 rows=94248 width=54)"
> > "                          Filter: ((name IS NOT NULL) AND
> > (length(btrim((name)::text, E' \011'::text)) > 1))" "  ->  Hash
> > (cost=79538.96..79538.96 rows=833496 width=189)" "        ->
> > Seq Scan on catalog_items  (cost=0.00..79538.96 rows=833496
> > width=189)"
>
> The seqscan of catalog_items looks a little strange to me, have
> you set work_mem to some obscenely big value?

32MB
I think PostgreSQL users (and its image) would greatly benefit from
some sort of tool that suggests postgresql.conf settings according
to the hardware... maybe after asking the user a few questions.
Tom suggested raising that number into the tens of MB for another
problem. I saw the sweet spot was at 16MB and, since I was expecting
the DB to get bigger, I increased it to 32MB.
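For reference, checking and experimenting with it is just the usual
(the 64MB value below is only an example):

  SHOW work_mem;            -- current setting (32MB here)
  SET work_mem = '64MB';    -- per-session experiment, to see if the plan changes
  -- or permanently in postgresql.conf:  work_mem = 32MB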

> > Sam... I did your same error the first time I wrote the above
> > statement... missing the where clause but I can't understand the
> > need for your longer version "to ensure that characters trimmed
> > from the authors' name are consistent".

> You're trimming an extra close square bracket (']') in one of the
> trim statements and not in the other.  If you just do it in one
> place then you don't have to worry about inconsistency.

Guess: that was a typo. regexp brain context switching :)
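To restate it in code (illustrative only, not my literal statements):

  -- inconsistent: one call also strips ']' while the other doesn't,
  -- so the same name can come out differently in the two places
  SELECT btrim(name, E' \011]'), btrim(name, E' \011')
    FROM catalog_author;

  -- safer: trim once, in one place, and reuse the result
  SELECT trimmed
    FROM (SELECT btrim(name, E' \011') AS trimmed FROM catalog_author) t
   WHERE trimmed IS NOT NULL AND length(trimmed) > 1;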

> > I prefer to put constraint on the length of varchar as an early
> > warning for something that is unexpected.
> > eg. if the loop was not working as expected I'd get an error.
> > In fact that's exactly what happened during development of the
> > above function.

> OK, Knuth generally seemed to recommend doing similar things.  Most
> modern programming people tend to advocate removing artificial
> limits as some fundamental good.  I'm never quite so sure, either
> way!

I do know Knuth and I think I share his opinions. I don't know the
"modern programming people" and I'm alien to the concept of
"fundamental good".
But well, I'm here to learn. Could you point me to some explanation
of why it should be a "fundamental good" in a DB context?
I'd say that:
a) you may discover that your forecast of the upper limit was wrong
and see things break suddenly
b) you incur a small overhead (is there one?)

I do think that most of the time it is worth it (and part of
understanding the problem) to make a reasonable forecast. If I
really can't make one, or it takes too much effort to make a
reasonable one, I at least start developing with an "unreasonable"
constraint just to catch some errors earlier, then drop the
constraint and leave a comment.
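For example (a throwaway sketch of what I mean by an early warning):

  CREATE TEMP TABLE dev_check (authors varchar(16));  -- deliberately tight limit
  INSERT INTO dev_check VALUES (repeat('x', 17));
  -- ERROR:  value too long for type character varying(16)

With a plain text column the second statement would silently succeed;
during development I'd rather have it blow up in my face.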

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

