Re: loop vs. aggregate was: update and group by/aggregate - Mailing list pgsql-general

From Sam Mason
Subject Re: loop vs. aggregate was: update and group by/aggregate
Date
Msg-id 20080827170919.GR7271@frubble.xen.chris-lamb.co.uk
In response to Re: loop vs. aggregate was: update and group by/aggregate  (Ivan Sergio Borgonovo <mail@webthatworks.it>)
List pgsql-general
On Wed, Aug 27, 2008 at 05:14:46PM +0200, Ivan Sergio Borgonovo wrote:
> On Wed, 27 Aug 2008 14:47:24 +0100 Sam Mason <sam@samason.me.uk> wrote:
> > OK, so it's the UPDATE that's taking the time.  I'm not experienced
> > enough to say why though.
>
> I can't get it either.
> I'd say that:
> - use of memory is larger in the single statement that uses the aggregate
> - even if swapping isn't a factor, walking down a large memory
> structure may introduce some overhead

Doing everything in memory in PG would be quicker than lots of transitions
into and out of the kernel to read data from the cache.

I'm guessing the performance is because the code is using an index
scan to actually update the table.  I've not tried doing this myself,
but you may have luck convincing PG to use index scans by increasing
effective_cache_size and/or reducing random_page_cost.
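
I've not tested this against your data, but something along these lines
(the values and the UPDATE itself are purely illustrative, not
recommendations) lets you change both settings for one session and see
whether the plan changes:

  -- session-local; the values here are guesses, tune them to your box
  SET effective_cache_size = '1GB';
  SET random_page_cost = 2.0;

  -- plain EXPLAIN (no ANALYZE) so the table isn't actually modified
  EXPLAIN UPDATE catalog_author SET Name = trim(' \t' from Name);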

> Well... but the function version doesn't exploit any information on
> actual data contained in the tables.
> In the single statement version the planner knows everything is going
> to be touched. Does the planner consider all the statements in a
> function to optimise it, or does the optimising happen
> statement by statement?

It just works statement by statement; each statement is planned right at
the beginning, the first time the function is called, and those plans are
then reused.
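
As a made-up illustration (not your actual code), each statement in a
function like this gets its own plan the first time it's executed in a
session, and that plan is then cached:

  CREATE OR REPLACE FUNCTION tidy_authors() RETURNS void AS $$
  BEGIN
      -- planned on its own the first time through, then reused
      UPDATE catalog_author SET Name = trim(' \t' from Name);
      -- also planned independently; the planner never sees the two together
      DELETE FROM catalog_author WHERE length(Name) = 0;
  END;
  $$ LANGUAGE plpgsql;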

> I just couldn't come up with the single statement version.
> I asked if anyone knew if I could rewrite the function in a single
> statement because I thought it was going to run faster and be easier
> to understand once written.

I'd certainly expect the UPDATE to be quicker!
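
I don't have your schema in front of me, so the names below are invented,
but the shape I mean is a single UPDATE fed by a grouped subquery rather
than a row-by-row loop:

  UPDATE catalog_items AS i
     SET author_count = s.cnt
    FROM (SELECT item_id, count(*) AS cnt
            FROM catalog_item_author
           GROUP BY item_id) AS s
   WHERE i.item_id = s.item_id;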

> When I wrote the function, optimisation was the last of my thoughts.

But then you got lucky with your implementation. It would have been easy
to bung everything into a big hash table and work from there; doing some
of the work in the database, and having the rows come out already ordered,
is what makes it quick. So you've kind of implicitly optimized it.

> On a 4GB RAM box running Apache, PHP and Postgres, considering the
> main table is going to contain 1M records (currently ~800K, data
> size should be around 300-400MB), is 32MB a reasonable choice?

No idea, but it certainly isn't a bad choice. Putting some connection
pooling software in front of PG is probably a better thing to think about
than work_mem size, if you want something to worry about!
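
One thing that may be worth knowing (an untested sketch, reusing your 32MB
figure): work_mem can be raised just for the transaction doing the heavy
lifting, instead of globally for every Apache connection:

  BEGIN;
  SET LOCAL work_mem = '32MB';
  -- ... run the expensive statement(s) here ...
  COMMIT;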

> Still, it is annoying that this doesn't work and you have to keep
> 2 copies of the same snippet around:
> select trim(' \t' from Name) as _Name from catalog_author where
> length(_Name)>1;
> can't work.

Yes, SQL is a bit annoying; you have to use a subquery to do anything
like that.
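
i.e. wrap the expression up once and filter on the alias from the outside:

  SELECT _Name
    FROM (SELECT trim(' \t' from Name) AS _Name
            FROM catalog_author) AS t
   WHERE length(_Name) > 1;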

> > In the real world, resources are finite and effort has to be spent
> > appropriately.
>
> > In my eyes this is the difference between computer science and
> > software engineering.  Software engineering is just interested in
> > making something that works now; computer science is about pushing
> > back the boundaries of what's possible.  Too often the two get
> > confused.
>
> Well... coming from a different background I'd say if you can't
> measure where the limits are, you can't know if you're pushing them
> back.

But you'll know if you're solving an immediate problem or some
theoretical possibility.

> As far as I know, using varchar(N) should have a small overhead
> in PostgreSQL, and not the other way around.

You're right, but it's small enough not to worry about.
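
Under the hood a varchar(n) is stored much like text with a length check
run on insert/update; the table below (made-up name) is only a rough moral
equivalent, not how PG literally implements it:

  CREATE TABLE authors_sketch (
      Name text CHECK (char_length(Name) <= 50)  -- roughly what varchar(50) adds
  );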


  Sam
