Re: loop vs. aggregate was: update and group by/aggregate - Mailing list pgsql-general

From Ivan Sergio Borgonovo
Subject Re: loop vs. aggregate was: update and group by/aggregate
Date
Msg-id 20080827171446.25d9aff4@dawn.webthatworks.it
In response to Re: loop vs. aggregate was: update and group by/aggregate  (Sam Mason <sam@samason.me.uk>)
Responses Re: loop vs. aggregate was: update and group by/aggregate  (Sam Mason <sam@samason.me.uk>)
List pgsql-general
On Wed, 27 Aug 2008 14:47:24 +0100
Sam Mason <sam@samason.me.uk> wrote:

> On Wed, Aug 27, 2008 at 02:58:18PM +0200, Ivan Sergio Borgonovo
> wrote:
> > On Wed, 27 Aug 2008 12:45:42 +0100 Sam Mason <sam@samason.me.uk>
> > wrote:
> > > Have you been vacuuming (non-full) between runs? and as always,
> > > are the stats reasonably up to date?
> >
> > there is autovacuum running regularly and I
> > vacuumed full, analyzed and reindexed everything recently:
> > that means that all changes to the DB were done BEFORE the
> > maintenance stuff, but then I executed several selects on tables
> > (including the ones involved).
> > I tried to run the function and the statement 3 times one after
> > the other... so they should have been in a similar situation.

> I'd probably start by doing a normal vacuum, then run a command

[snip]
Pardon me if I waste 2 posts... I'll try it later, but I doubt it
will make any difference.

> > > > Can anybody explain why aggregates underperform so badly?

> > > You could try just running the SELECT part to see how long the
> > > aggregation takes.
> >
> > 33sec
> > Surprising.

> OK, so it's the UPDATE that's taking the time.  I'm not experienced
> enough to say why though.

I can't figure it out either.
I'd say that:
- memory use is larger in the single statement that uses the
  aggregate
- even if swapping isn't a factor, walking down a large memory
  structure may introduce some overhead

Still, the aggregate version is 4 times slower. It's hard to justify
such a difference just with the overhead of walking down an array in
memory, considering the box has 4GB.
But both versions happen inside one transaction... I'd think that
one large update should be faster than several small updates, which
should make the aggregate version faster.
It would be nice if someone came up with an explanation.
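
Just to make the comparison concrete, this is roughly the shape of
the two versions I'm talking about (table and column names here are
made up for illustration, not my actual schema; array_agg needs 8.4+,
on older versions a custom array_accum aggregate does the same):

-- loop version: plpgsql runs one UPDATE per group
create or replace function update_authors_loop() returns void as $$
declare
  r record;
begin
  for r in
    select item_id,
           array_to_string(array_agg(author_name), ', ') as names
    from item_author
    group by item_id
  loop
    update items set authors = r.names where item_id = r.item_id;
  end loop;
end;
$$ language plpgsql;

-- aggregate version: one statement, the planner sees the whole job
update items i
set authors = a.names
from (select item_id,
             array_to_string(array_agg(author_name), ', ') as names
      from item_author
      group by item_id) a
where i.item_id = a.item_id;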

> > > It's quite easy to outsmart the planner with large amounts of
> > > data, but it's surprising how well it does most of the time.
> > > Generally you can just write whatever is obvious and the planner
> > > will do something good with it.  If it doesn't do the right
> > > thing then you can worry about performance, rather than most
> > > languages where you have to worry about performance from the
> > > start.

Well... but the function version doesn't exploit any information
about the actual data contained in the tables.
In the single statement version the planner knows everything that is
going to be touched. Does the planner consider all the statements in
a function together when optimising, or does the optimisation happen
statement by statement?
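(I'd guess each statement inside plpgsql is planned on its own, but
I'd like a confirmation.) In the meantime, the standalone statement
can at least be inspected directly, e.g. for the hypothetical update
sketched above:

-- explain analyze actually executes the statement, so wrap it in a
-- transaction and roll back to leave the data untouched
begin;
explain analyze
update items i
set authors = a.names
from (select item_id,
             array_to_string(array_agg(author_name), ', ') as names
      from item_author
      group by item_id) a
where i.item_id = a.item_id;
rollback;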

> > I really thought that in this case the planner was going to
> > outsmart me since well I think in that statement it could see a
> > lot more optimisation than me knowing the nature of the data.

> an experienced human thinking for several seconds (or probably
> minutes in this example) is almost always going to be better than
> a computer thinking for a millisecond.

I just couldn't come up with the single statement version.
I asked whether anyone knew how to rewrite the function as a single
statement because I thought it was going to run faster and be easier
to understand once written.
When I wrote the function, optimisation was the last of my thoughts.

> > > have you set work_mem to some obscenely big value?
> >
> > 32Mb

> OK.  It's just that it was doing a hash aggregation involving
> 160MB of data.  That will cause it to go to disk and I'd think it
> would prefer to do something else.
> prefer to do something else.

On a 4GB RAM box running Apache, PHP and Postgres, considering the
main table is going to contain 1M records (currently ~800K, data
size should be around 300-400MB), is 32MB a reasonable choice?
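(If it's just this one update that needs more, I guess I could raise
work_mem only for the session or transaction that runs it instead of
globally, something like the following; the 256MB figure is only an
example, not a recommendation:

begin;
set local work_mem = '256MB';  -- example figure, reverts at commit
-- ... run the big update here ...
commit;

but I'd still like to know whether 32MB is sane as a default.)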

> > > You're trimming an extra close square bracket (']') in one of
> > > the trim statements and not in the other.  If you just do it
> > > in one place then you don't have to worry about inconsistency.
> >
> > Guess: that was a typo. regexp brain context switching :)

> I do the same thing far too often, hence I tend to do similar query
> rewrites, as I did on yours, to prevent this (human bug/feature)
> from happening.

Oh well, every time I cut&paste I know I'm doing something I'll be
punished for ;) and there is something to be fixed.
In fact, once you made me notice it, I thought I could just clean
catalog_author.Name during import.
Still, it is annoying that this doesn't work and you still have to
keep 2 copies of the same snippet around:
select trim(' \t' from Name) as _Name from catalog_author where
length(_Name)>1;
can't work.
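A subquery would at least let me write the trim just once, something
like this (untested against the real data):

-- the alias _Name is only visible one level up, so the trim is
-- written once and still usable in the outer where clause
select _Name
from (select trim(' \t' from Name) as _Name
      from catalog_author) as t
where length(_Name) > 1;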

> > I do know Knuth and I think I share his opinions. I don't know
> > "modern programming people" and I'm alien to the concept of
> > "fundamental good".

> A lot of programming courses will try and teach you to remove all
> constraints from your code, whether they're going to be hit or not.

Maybe that's why I'm so ignorant. I studied physics ;)

> In the real world, resources are finite and effort has to be spent
> appropriately.

> In my eyes this is the difference between computer science and
> software engineering.  Software engineering is just interested in
> making something that works now, computer science is about pushing
> back the boundaries of what's possible.  Too often the two get
> confused.

Well... coming from a different background I'd say that if you can't
measure where the limits are, you can't know if you're pushing them
back.

> > But well, I'm here to learn. Could you point me to some
> > explanation on why it should be a "fundamental good" in DB
> > context?

> as in why using TEXT is good over a large VARCHAR ? it's an
> engineering choice normally.  Some database systems optimize one
> or the other a lot more, so people tend to prefer one for
> arbitrary reasons.  In PG it doesn't really matter in terms of
> performance and you should use whichever expresses the data you're
> trying to store appropriately.

To my knowledge, using varchar(N) should have a small overhead in
PostgreSQL, not the other way around.
Some implementations (MySQL? SQLite?) may take advantage of knowing
the size of the data... but pg treats text and varchar the same,
just adding a check to the latter.
I may be wrong. I still haven't paid enough attention to this kind
of implementation detail to be sure of what I'm writing.
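If I read the docs correctly, the two definitions below should be
roughly equivalent in pg, with varchar(N) just carrying the implicit
length check (demo names made up):

-- varchar(200) is basically text plus a length check
create table demo_varchar (name varchar(200));

create table demo_text (
    name text,
    check (char_length(name) <= 200)  -- roughly the same constraint
);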

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

