Thread: update and group by/aggregate
How can I update catalog_items.Authors

create table catalog_items(
    ItemID bigint primary key,
    Authors varchar(1024)
);

taking results from

select ia.ItemID, array_accum(trim(' \t]' from a.Name))
    from catalog_itemauthor ia
    join catalog_author a on a.AuthorID=ia.AuthorID
    where a.Name is not null and length(trim(' \t' from a.Name))>1
    group by ia.ItemID;

Currently I'm achieving the same result with a plpgsql function with
a for loop, and I'm planning to see if using aggregates is going to
be faster, and then see if it is even worth keeping an extra column...

create or replace function UpdateAuthors()
returns void
as
$$
declare
    _row record;
    _ItemID bigint;
    _Authors varchar(1024);
    _Name varchar(50);
begin
    _Authors:='';
    _ItemID:=null;
    for _row in select a.Name, ia.ItemID from {catalog_itemauthor} ia
            join {catalog_author} a on a.AuthorID=ia.AuthorID
            order by ia.ItemID
    loop
        if(_row.ItemID<>_ItemID) then
            if(length(_Authors)>2) then
                _Authors:=substring(_Authors from 3);
                update {catalog_items} set Authors=_Authors
                    where ItemID=_ItemID;
            end if;
            _Authors:='';
        end if;
        _ItemID:=_row.ItemID;
        _Name:=trim(' \t' from _row.Name);
        if(length(_Name)>0) then
            _Authors:= _Authors || ', ' || _Name;
        end if;
    end loop;
    return;
end;
$$ language plpgsql volatile;

BTW I've noticed that array_append gives back non-uniform results:

select array_append(ARRAY['az','e','i'],'');
{az,e,i,""}

while I'd expect
{"az","e","i",""}

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it
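For reference, array_accum is not a built-in aggregate in PostgreSQL 8.x; it is the example user-defined aggregate from the PostgreSQL manual, defined as:

```sql
-- Example aggregate from the PostgreSQL 8.x manual: collects input
-- values into an array, one element per input row.
CREATE AGGREGATE array_accum (anyelement)
(
    sfunc = array_append,
    stype = anyarray,
    initcond = '{}'
);
```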
2008/8/27 Ivan Sergio Borgonovo <mail@webthatworks.it>:
> How can I update catalog_items.Authors

[snip]

> BTW
> I've noticed that array_append gives back non-uniform results:
>
> select array_append(ARRAY['az','e','i'],'');
> {az,e,i,""}

It's correct: double quotes are used only for elements containing
spaces, or for the empty string. If you want to print array contents
nicely, use the array_to_string function.

regards
Pavel Stehule

> while I'd expect
> {"az","e","i",""}
>
> --
> Ivan Sergio Borgonovo
> http://www.webthatworks.it
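A quick sketch of the quoting rule Pavel describes (plain PostgreSQL session; results shown as psql prints them):

```sql
-- Double quotes appear in array output only around elements that
-- need them: the empty string, or strings containing spaces.
select array_append(ARRAY['az','e','i'], '');        -- {az,e,i,""}
select ARRAY['a b', 'c'];                            -- {"a b",c}

-- array_to_string renders the elements without any quoting:
select array_to_string(ARRAY['az','e','i'], ', ');   -- az, e, i
```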
On Wed, Aug 27, 2008 at 10:32:43AM +0200, Ivan Sergio Borgonovo wrote:
> How can I update catalog_items.Authors
>
> create table catalog_items(
>     ItemID bigint primary key,
>     Authors varchar(1024)
> );

The type "varchar(1024)" looks a little awkward; wouldn't an unadorned
TEXT be easier? If you explicitly want to limit things to 1024
characters then what you're doing is correct.

> taking results from
>
> select ia.ItemID, array_accum(trim(' \t]' from a.Name))
>     from catalog_itemauthor ia
>     join catalog_author a on a.AuthorID=ia.AuthorID
>     where a.Name is not null and length(trim(' \t' from a.Name))>1
>     group by ia.ItemID;

UPDATE catalog_items SET authors=array_to_string(x.authors,', ')
FROM (
  SELECT ia.itemid, array_accum(trim(' \t]' from a.name)) AS authors
  FROM catalog_itemauthor ia
    JOIN catalog_author a ON a.authorid = ia.authorid
  WHERE a.name IS NOT NULL
    AND length(trim(' \t' from a.name))>1
  GROUP BY ia.itemid) x;

is a reasonably direct translation. Though I may be tempted to use
something more like:

UPDATE catalog_items SET authors=array_to_string(x.authors,', ')
FROM (
  SELECT ia.itemid, array_accum(a.name) AS authors
  FROM catalog_itemauthor ia, (
    SELECT authorid, trim(' \t' from name) AS name
    FROM catalog_author) a
  WHERE ia.authorid = a.authorid
    AND a.name IS NOT NULL
    AND length(a.name) > 1
  GROUP BY ia.itemid) x;

to ensure that the characters trimmed from the authors' names are
consistent.

  Sam
Oops, both my statements were horribly broken. They needed a WHERE
clause for the UPDATE.

On Wed, Aug 27, 2008 at 11:44:20AM +0100, Sam Mason wrote:
> UPDATE catalog_items SET authors=array_to_string(x.authors,', ')
> FROM (
>   SELECT ia.itemid, array_accum(trim(' \t]' from a.name)) AS authors
>   FROM catalog_itemauthor ia
>     JOIN catalog_author a ON a.authorid = ia.authorid
>   WHERE a.name IS NOT NULL
>     AND length(trim(' \t' from a.name))>1
>   GROUP BY ia.itemid) x;

should be:

UPDATE catalog_items i SET authors=array_to_string(x.authors,', ')
FROM (
  SELECT ia.itemid, array_accum(trim(' \t]' from a.name)) AS authors
  FROM catalog_itemauthor ia
    JOIN catalog_author a ON a.authorid = ia.authorid
  WHERE a.name IS NOT NULL
    AND length(trim(' \t' from a.name))>1
  GROUP BY ia.itemid) x
WHERE i.itemid = x.itemid;

> UPDATE catalog_items SET authors=array_to_string(x.authors,', ')
> FROM (
>   SELECT ia.itemid, array_accum(a.name) AS authors
>   FROM catalog_itemauthor ia, (
>     SELECT authorid, trim(' \t' from name) AS name
>     FROM catalog_author) a
>   WHERE ia.authorid = a.authorid
>     AND a.name IS NOT NULL
>     AND length(a.name) > 1
>   GROUP BY ia.itemid) x;

should be:

UPDATE catalog_items i SET authors=array_to_string(x.authors,', ')
FROM (
  SELECT ia.itemid, array_accum(a.name) AS authors
  FROM catalog_itemauthor ia, (
    SELECT authorid, trim(' \t' from name) AS name
    FROM catalog_author) a
  WHERE ia.authorid = a.authorid
    AND a.name IS NOT NULL
    AND length(a.name) > 1
  GROUP BY ia.itemid) x
WHERE i.itemid = x.itemid;

Sorry!

  Sam
On Wed, 27 Aug 2008 10:32:43 +0200
Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:

> create or replace function UpdateAuthors()
> returns void

[snip]

> $$ language plpgsql volatile;

Replying to myself:

update catalog_items set Authors=s.Authors
from (
  select ia.ItemID,
    array_to_string(array_accum(trim(' \t]' from a.Name)),', ')
      as Authors
  from catalog_itemauthor ia
    join catalog_author a on a.AuthorID=ia.AuthorID
  where a.Name is not null and length(trim(' \t' from a.Name))>1
  group by ia.ItemID) as s
where s.ItemID=catalog_items.ItemID;

but this looks much slower than the function:

function: 113sec
vs.
single statement: 488sec

I repeated the test 3 times with similar results.

Can anybody explain why aggregates underperform so badly?

I just read that most of the time I can't be smarter than the
planner, and I thought that this would be one of the circumstances
where the planner could outperform my handwritten function.
here is the explain:

Hash Join  (cost=137408.51..271599.78 rows=209674 width=221)
  Hash Cond: ("outer".itemid = "inner".itemid)
  ->  HashAggregate  (cost=32994.81..36664.11 rows=209674 width=58)
        ->  Hash Join  (cost=8544.62..31946.44 rows=209674 width=58)
              Hash Cond: ("outer".authorid = "inner".authorid)
              ->  Seq Scan on catalog_itemauthor ia  (cost=0.00..10297.21 rows=629021 width=12)
              ->  Hash  (cost=8309.00..8309.00 rows=94248 width=54)
                    ->  Seq Scan on catalog_author a  (cost=0.00..8309.00 rows=94248 width=54)
                          Filter: ((name IS NOT NULL) AND (length(btrim((name)::text, E' \011'::text)) > 1))
  ->  Hash  (cost=79538.96..79538.96 rows=833496 width=189)
        ->  Seq Scan on catalog_items  (cost=0.00..79538.96 rows=833496 width=189)

Thanks to Pavel, who suggested the use of array_to_string.

Sam... I made your same mistake the first time I wrote the above
statement... missing the WHERE clause. But I can't understand the
need for your longer version "to ensure that the characters trimmed
from the authors' names are consistent".

I prefer to put a constraint on the length of a varchar as an early
warning for something that is unexpected, e.g. if the loop were not
working as expected I'd get an error. In fact that's exactly what
happened during development of the above function. If Authors were a
text field I'd be waiting forever, filling the DB with rubbish, etc.

There are no truly unlimited fields... there are fields where the
upper limit may be hard to guess. 1024 is actually an inherited
choice... but I did some research to find a reasonable upper limit
(highest number of authors * longest author name).

thanks

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it
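As an aside: on PostgreSQL 9.0 and later the array_accum/array_to_string pair is unnecessary, since the built-in string_agg aggregate does the same job in one step. Ivan's single statement would become (a sketch against the thread's schema, not benchmarked):

```sql
-- string_agg(expr, delimiter) concatenates the grouped values
-- directly, replacing array_to_string(array_accum(...), ', ').
update catalog_items set Authors = s.Authors
from (
  select ia.ItemID,
         string_agg(trim(' \t' from a.Name), ', ') as Authors
    from catalog_itemauthor ia
    join catalog_author a on a.AuthorID = ia.AuthorID
   where a.Name is not null
     and length(trim(' \t' from a.Name)) > 1
   group by ia.ItemID) as s
where s.ItemID = catalog_items.ItemID;
```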
On Wed, Aug 27, 2008 at 01:19:30PM +0200, Ivan Sergio Borgonovo wrote:
> but this looks much slower than the function:
> function: 113sec
> vs.
> single statement: 488sec
> I repeated the test 3 times with similar results.

Have you been vacuuming (non-full) between runs? And, as always, are
the stats reasonably up to date?

> Can anybody explain why aggregates underperform so badly?

You could try just running the SELECT part to see how long the
aggregation takes.

> I just read that most of the time I can't be smarter than the
> planner, and I thought that this would be one of the circumstances
> where the planner could outperform my handwritten function.

It's quite easy to outsmart the planner with large amounts of data,
but it's surprising how well it does most of the time. Generally you
can just write whatever is obvious and the planner will do something
good with it. If it doesn't do the right thing then you can worry
about performance, rather than most languages where you have to worry
about performance from the start.

> here is the explain:
[snip]

The seqscan of catalog_items looks a little strange to me; have you
set work_mem to some obscenely big value?

> Sam... I made your same mistake the first time I wrote the above
> statement... missing the WHERE clause. But I can't understand the
> need for your longer version "to ensure that the characters trimmed
> from the authors' names are consistent".

You're trimming an extra close square bracket (']') in one of the
trim statements and not in the other. If you just do it in one place
then you don't have to worry about inconsistency.

> I prefer to put a constraint on the length of a varchar as an early
> warning for something that is unexpected, e.g. if the loop were not
> working as expected I'd get an error. In fact that's exactly what
> happened during development of the above function.

OK, Knuth generally seemed to recommend doing similar things. Most
modern programming people tend to advocate removing artificial limits
as some fundamental good. I'm never quite so sure, either way!

  Sam
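The inconsistency Sam points at is easy to demonstrate: trim()'s character list is a *set*, so the stray ']' silently changes what gets removed (an illustrative example, not from the thread):

```sql
-- With ']' in the character set, closing brackets are stripped from
-- both ends of the name as well; without it, they survive.
select trim(' \t]' from ' Tolkien] ');   -- Tolkien
select trim(' \t'  from ' Tolkien] ');   -- Tolkien]
```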
On Wed, 27 Aug 2008 12:45:42 +0100
Sam Mason <sam@samason.me.uk> wrote:

> Have you been vacuuming (non-full) between runs? And, as always,
> are the stats reasonably up to date?

There is autovacuum running regularly, and I vacuumed full, analyzed
and reindexed everything recently. That means that all changes to the
DB were done BEFORE the maintenance stuff, but then I executed
several selects on tables (including the ones involved).
I tried to run the function and the statement 3 times one after the
other... so they should have been in a similar situation.

> > Can anybody explain why aggregates underperform so badly?
>
> You could try just running the SELECT part to see how long the
> aggregation takes.

33sec. Surprising.

> > I just read that most of the time I can't be smarter than the
> > planner, and I thought that this would be one of the
> > circumstances where the planner could outperform my handwritten
> > function.
>
> It's quite easy to outsmart the planner with large amounts of
> data, but it's surprising how well it does most of the time.
> Generally you can just write whatever is obvious and the planner
> will do something good with it. If it doesn't do the right thing
> then you can worry about performance, rather than most languages
> where you have to worry about performance from the start.

I really thought that in this case the planner was going to outsmart
me, since in that statement it could see a lot more optimisation
opportunities than me, knowing the nature of the data.

> > here is the explain:
[snip]
>
> The seqscan of catalog_items looks a little strange to me; have
> you set work_mem to some obscenely big value?

32Mb.

I think postgresql users, and its image, would greatly benefit from
some sort of tool that gives suggestions on how to set
postgresql.conf according to the hardware... maybe asking the user
some questions.

Tom suggested raising that number into the range of tens of Mb for
another problem. I saw the hot spot was at 16Mb, and considering I
was expecting the DB to get bigger, I increased it to 32Mb.

> > Sam... I made your same mistake the first time I wrote the above
> > statement... missing the WHERE clause. But I can't understand
> > the need for your longer version "to ensure that the characters
> > trimmed from the authors' names are consistent".
>
> You're trimming an extra close square bracket (']') in one of the
> trim statements and not in the other. If you just do it in one
> place then you don't have to worry about inconsistency.

Guess: that was a typo. regexp brain context switching :)

> > I prefer to put a constraint on the length of a varchar as an
> > early warning for something that is unexpected, e.g. if the loop
> > were not working as expected I'd get an error. In fact that's
> > exactly what happened during development of the above function.
>
> OK, Knuth generally seemed to recommend doing similar things. Most
> modern programming people tend to advocate removing artificial
> limits as some fundamental good. I'm never quite so sure, either
> way!

I do know Knuth, and I think I share his opinions. I don't know
"modern programming people" and I'm alien to the concept of
"fundamental good". But well, I'm here to learn. Could you point me
to some explanation of why it should be a "fundamental good" in a DB
context?

I'd say that:
a) you may discover that your forecast of the upper limit was wrong,
   and you could see things breaking suddenly
b) you incur a small overhead (is it?)

I do think that most of the time it is worth it (and part of the
problem) to make a reasonable forecast. If I can't really make one,
or it requires too much effort to make a reasonable one, at least I
start developing with an "unreasonable" constraint just to catch
some errors earlier, then drop the constraint and leave a comment.

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it
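The "early warning" behaviour Ivan describes is simply the varchar(N) overflow error; a minimal illustration (table name invented for the example):

```sql
-- A varchar(N) column rejects over-long values outright instead of
-- silently accumulating rubbish -- the early warning described above.
create table t_demo (authors varchar(16));
insert into t_demo values (repeat('x', 17));
-- ERROR:  value too long for type character varying(16)
```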
On Wed, Aug 27, 2008 at 02:58:18PM +0200, Ivan Sergio Borgonovo wrote:
> There is autovacuum running regularly, and I vacuumed full,
> analyzed and reindexed everything recently. That means that all
> changes to the DB were done BEFORE the maintenance stuff, but then
> I executed several selects on tables (including the ones involved).
> I tried to run the function and the statement 3 times one after the
> other... so they should have been in a similar situation.

I'd probably start by doing a normal vacuum, then run a command that
touches every row, then run a normal vacuum again. Between every
iteration you want to run a normal vacuum, otherwise the later runs
will see more garbage than the previous run (updating a row leaves a
dead row behind).

> > You could try just running the SELECT part to see how long the
> > aggregation takes.
>
> 33sec. Surprising.

OK, so it's the UPDATE that's taking the time. I'm not experienced
enough to say why, though.

> I really thought that in this case the planner was going to
> outsmart me, since in that statement it could see a lot more
> optimisation opportunities than me, knowing the nature of the data.

An experienced human thinking for several seconds (or probably
minutes in this example) is almost always going to be better than a
computer thinking for a millisecond.

> > have you set work_mem to some obscenely big value?
>
> 32Mb

OK. It's just that it was doing a hash aggregation involving 160MB of
data. That will cause it to go to disk, and I'd think it would prefer
to do something else.

> Tom suggested raising that number into the range of tens of Mb for
> another problem. I saw the hot spot was at 16Mb, and considering I
> was expecting the DB to get bigger, I increased it to 32Mb.

32MB seems a reasonable default on modern hardware.

> > You're trimming an extra close square bracket (']') in one of
> > the trim statements and not in the other. If you just do it in
> > one place then you don't have to worry about inconsistency.
>
> Guess: that was a typo. regexp brain context switching :)

I do the same thing far too often, hence I tend to do similar query
rewrites, as I did on yours, to prevent this (human bug/feature) from
happening.

> I do know Knuth, and I think I share his opinions. I don't know
> "modern programming people" and I'm alien to the concept of
> "fundamental good".

A lot of programming courses will try and teach you to remove all
constraints from your code, whether they're going to be hit or not.
In the real world, resources are finite and effort has to be spent
appropriately.

In my eyes this is the difference between computer science and
software engineering. Software engineering is just interested in
making something that works now; computer science is about pushing
back the boundaries of what's possible. Too often the two get
confused.

> But well, I'm here to learn. Could you point me to some explanation
> of why it should be a "fundamental good" in a DB context?

As in why using TEXT is good over a large VARCHAR? It's an
engineering choice, normally. Some database systems optimize one or
the other a lot more, so people tend to prefer one for arbitrary
reasons. In PG it doesn't really matter in terms of performance, and
you should use whichever expresses the data you're trying to store
appropriately.

> I do think that most of the time it is worth it (and part of the
> problem) to make a reasonable forecast. If I can't really make one,
> or it requires too much effort to make a reasonable one, at least I
> start developing with an "unreasonable" constraint just to catch
> some errors earlier, then drop the constraint and leave a comment.

Yes, however this requires experience of what's worked so far and is
difficult to teach.

  Sam
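Sam's benchmarking advice amounts to something like the following around each timed run (table name from the thread; the UPDATE is whichever version is being timed):

```sql
-- A plain (non-full) vacuum reclaims the dead rows left behind by
-- the previous UPDATE and refreshes statistics, so every timed run
-- starts from a comparable table state.
VACUUM ANALYZE catalog_items;
-- ... run the timed UPDATE (function or single statement) here ...
VACUUM ANALYZE catalog_items;
```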
On Wed, 27 Aug 2008 14:47:24 +0100
Sam Mason <sam@samason.me.uk> wrote:

> I'd probably start by doing a normal vacuum, then run a command
[snip]

Pardon me if I'll waste 2 posts... I'll try to do it later, but I
doubt it will make any difference.

> > > You could try just running the SELECT part to see how long the
> > > aggregation takes.
> >
> > 33sec. Surprising.
>
> OK, so it's the UPDATE that's taking the time. I'm not experienced
> enough to say why, though.

I can't get it either. I'd say that:
- use of memory is larger in the single statement that uses the
  aggregate
- if swapping is not a factor, walking down a large memory structure
  may introduce some overhead

Still, the aggregate version is 4 times slower. It's hard to justify
such a difference just by the overhead of walking down an array in
memory, considering the box has 4Gb.
But both versions happen inside one transaction... and I'd think that
one large update should be faster than several updates, which should
make the aggregate version faster.
It would be nice if someone came up with an explanation.

> > > It's quite easy to outsmart the planner with large amounts of
> > > data, but it's surprising how well it does most of the time.
> > > Generally you can just write whatever is obvious and the
> > > planner will do something good with it. If it doesn't do the
> > > right thing then you can worry about performance, rather than
> > > most languages where you have to worry about performance from
> > > the start.

Well... but the function version doesn't exploit any information
about the actual data contained in the tables. In the
single-statement version the planner knows everything that is going
to be touched. Does the planner consider all the statements in a
function to optimise it, or does the optimising happen statement by
statement?

> > I really thought that in this case the planner was going to
> > outsmart me, since in that statement it could see a lot more
> > optimisation opportunities than me, knowing the nature of the
> > data.
>
> An experienced human thinking for several seconds (or probably
> minutes in this example) is almost always going to be better than
> a computer thinking for a millisecond.

I just couldn't come up with the single-statement version. I asked
whether anyone knew if I could rewrite the function as a single
statement because I thought it was going to run faster and be easier
to understand once written. When I wrote the function, optimisation
was the last of my thoughts.

> > > have you set work_mem to some obscenely big value?
> >
> > 32Mb
>
> OK. It's just that it was doing a hash aggregation involving 160MB
> of data. That will cause it to go to disk, and I'd think it would
> prefer to do something else.

On a 4Gb RAM box running Apache, PHP and Postgres, considering the
main table is going to contain 1M records (currently ~800K, data
size should be around 300-400Mb), is 32Mb a reasonable choice?

> > > You're trimming an extra close square bracket (']') in one of
> > > the trim statements and not in the other. If you just do it
> > > in one place then you don't have to worry about inconsistency.
> >
> > Guess: that was a typo. regexp brain context switching :)
>
> I do the same thing far too often, hence I tend to do similar query
> rewrites, as I did on yours, to prevent this (human bug/feature)
> from happening.

Oh well, every time I cut&paste I know I'm doing something I'll be
punished for ;) and that there is something to be fixed. In fact,
once you made me notice it, I did... I thought I could just clean
catalog_author.Name during import.

Still, it is annoying that this doesn't work, and you have to keep 2
copies of the same snippet around:

select trim(' \t' from Name) as _Name from catalog_author
    where length(_Name)>1;

can't work.

> > I do know Knuth, and I think I share his opinions. I don't know
> > "modern programming people" and I'm alien to the concept of
> > "fundamental good".
>
> A lot of programming courses will try and teach you to remove all
> constraints from your code, whether they're going to be hit or not.

Maybe that's why I'm so ignorant. I studied physics ;)

> In the real world, resources are finite and effort has to be spent
> appropriately.
>
> In my eyes this is the difference between computer science and
> software engineering. Software engineering is just interested in
> making something that works now; computer science is about pushing
> back the boundaries of what's possible. Too often the two get
> confused.

Well... coming from a different background, I'd say that if you
can't measure where the limits are, you can't know whether you're
pushing them back.

> > But well, I'm here to learn. Could you point me to some
> > explanation of why it should be a "fundamental good" in a DB
> > context?
>
> As in why using TEXT is good over a large VARCHAR? It's an
> engineering choice, normally. Some database systems optimize one
> or the other a lot more, so people tend to prefer one for
> arbitrary reasons. In PG it doesn't really matter in terms of
> performance, and you should use whichever expresses the data
> you're trying to store appropriately.

To my knowledge, using varchar(N) should have a small overhead in
postgresql, not the other way around. Some implementations (MySQL?
sqlite?) may take advantage of knowing the size of the data... but
pg treats text and varchar the same, just adding a check to the
latter.
I may be wrong; I still haven't paid enough attention to this kind
of implementation detail to be sure of what I'm writing.

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it
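SQL indeed doesn't let a WHERE clause reference a SELECT-list alias; the usual workaround for the snippet Ivan shows is a subquery, so the trim() expression is written only once (a sketch against the thread's schema):

```sql
-- Compute the trimmed name once in a subquery, then filter on the
-- alias from the outer query.
select _Name
  from (select trim(' \t' from Name) as _Name
          from catalog_author) s
 where length(_Name) > 1;
```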
On Wed, Aug 27, 2008 at 05:14:46PM +0200, Ivan Sergio Borgonovo wrote:
> > OK, so it's the UPDATE that's taking the time. I'm not
> > experienced enough to say why, though.
>
> I can't get it either. I'd say that:
> - use of memory is larger in the single statement that uses the
>   aggregate
> - if swapping is not a factor, walking down a large memory
>   structure may introduce some overhead

Doing everything in memory in PG would be quicker than lots of
transitions into and out of the kernel to read data from the cache.
I'm guessing the performance difference is because the function's
code ends up using an index scan to actually update the table. I've
not tried doing this myself, but you may have luck convincing PG to
use index scans by increasing effective_cache_size and/or reducing
random_page_cost.

> Well... but the function version doesn't exploit any information
> about the actual data contained in the tables.
> In the single-statement version the planner knows everything that
> is going to be touched. Does the planner consider all the
> statements in a function to optimise it, or does the optimising
> happen statement by statement?

It's just working statement by statement; it'll plan right at the
beginning, the first time the function is called, as well.

> I just couldn't come up with the single-statement version.
> I asked whether anyone knew if I could rewrite the function as a
> single statement because I thought it was going to run faster and
> be easier to understand once written.

I'd certainly expect the UPDATE to be quicker!

> When I wrote the function, optimisation was the last of my
> thoughts.

But then you got lucky with your implementation. It would be easy to
bung everything into a big hash table and work from there; doing
some of the work in the database and having it come out ordered
makes it work quickly. So you've kind of implicitly optimized it.

> On a 4Gb RAM box running Apache, PHP and Postgres, considering the
> main table is going to contain 1M records (currently ~800K, data
> size should be around 300-400Mb), is 32Mb a reasonable choice?

No idea; it certainly isn't a bad choice. Putting some connection
pooling software in front of PG is probably a better thing to think
about than work_mem size, if you want to worry about something!

> Still, it is annoying that this doesn't work, and you have to keep
> 2 copies of the same snippet around:
>
> select trim(' \t' from Name) as _Name from catalog_author
>     where length(_Name)>1;
>
> can't work.

Yes, SQL is a bit annoying; you have to make a subquery to do
anything like that.

> > In the real world, resources are finite and effort has to be
> > spent appropriately.
> >
> > In my eyes this is the difference between computer science and
> > software engineering. Software engineering is just interested in
> > making something that works now; computer science is about
> > pushing back the boundaries of what's possible. Too often the
> > two get confused.
>
> Well... coming from a different background, I'd say that if you
> can't measure where the limits are, you can't know whether you're
> pushing them back.

But you'll know whether you're solving an immediate problem or some
theoretical possibility.

> To my knowledge, using varchar(N) should have a small overhead in
> postgresql, not the other way around.

You're right, but it's small enough not to worry about.

  Sam
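The two planner settings Sam mentions can be tried per-session before rerunning the slow UPDATE; the values below are illustrative only, not recommendations:

```sql
-- Tell the planner that more of the database is likely to be cached
-- and that random I/O is cheaper than the default assumes; both
-- nudge it towards index scans over sequential scans.
SET effective_cache_size = '2GB';
SET random_page_cost = 2.0;
-- then re-run the UPDATE and compare the EXPLAIN output
```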