Re: High update activity, PostgreSQL vs BigDBMS - Mailing list pgsql-performance

From Craig A. James
Subject Re: High update activity, PostgreSQL vs BigDBMS
Msg-id 45A09FF4.7060906@modgraph-usa.com
In response to Re: High update activity, PostgreSQL vs BigDBMS  (Guy Rouillier <guyr-ml1@burntmail.com>)
List pgsql-performance
Guy,

> The application is fairly straightforward, but as you say, what is
> working okay with BigDBMS isn't working as well under PG.  I'm going to
> try other configuration suggestions made by others before I attempt
> logic changes.  The core logic is unchangeable; millions of rows of data
> in a single table will be updated throughout the day.  If PG can't
> handle high-volume updates well, this may be a brick wall.

Here are a couple things I learned.

ANALYZE is VERY important, surprisingly so even for small tables.  I had a case last week where a temporary "scratch" table with just 100 rows was joined to two more tables of 6 and 12 million rows.  You might think that a 100-row table wouldn't need to be analyzed, but it does: without the ANALYZE, Postgres generated a horrible plan that took many minutes to run; with the ANALYZE, it took milliseconds.  Any time a table's contents change dramatically, ANALYZE it, ESPECIALLY if it's a small table.  After all, changing 20 rows in a 100-row table has a much larger effect on its statistics than changing 20 rows in a million-row table.
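
A minimal sketch of that pattern (table and column names here are hypothetical):

   CREATE TEMP TABLE scratch (id integer PRIMARY KEY);
   -- ... populate scratch with ~100 rows ...
   ANALYZE scratch;   -- refresh planner statistics before the big joins
   SELECT s.id
     FROM scratch s
     JOIN big_a a ON a.id = s.id    -- ~6 million rows
     JOIN big_b b ON b.id = s.id;   -- ~12 million rows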

Postgres functions like count() and max() are "plug-ins", which has huge architectural advantages.  But in pre-8.1 releases, there was a big speed penalty for this: functions like count() were very, very slow, requiring a full table scan.  I think this is vastly improved from 8.0.x to 8.1 and forward; others might be able to comment on whether count() is now as fast in Postgres as in Oracle.  The "idiom" to replace max() was "select col from tbl order by col desc limit 1".  It worked miracles for my app.
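
For example, assuming an index on "col" and no NULLs in it (names hypothetical):

   -- In pre-8.1 releases this forced a full table scan:
   SELECT max(col) FROM tbl;
   -- The replacement idiom, which can walk the index instead:
   SELECT col FROM tbl ORDER BY col DESC LIMIT 1;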

Postgres has explicit garbage collection via VACUUM, and you have to design your application with this in mind.  In Postgres, an update is a delete+insert, meaning updates create garbage.  If you have very "wide" tables, but only a subset of the columns are updated frequently, put those columns in a separate table with an index to join the two tables.  For example, my original design was something like this:

   integer primary key
   very large text column
   ... a bunch of integer columns, float columns, and small text columns

The properties were updated by the application, but the large text column never changed.  This led to huge garbage-collection problems as the large text field was repeatedly deleted and reinserted by the updates.  By separating these into two tables, one with the large text column and the other with the dynamic but smaller columns, garbage is massively reduced and performance increased, both immediately (smaller data set to update) and long term (smaller vacuums).  You can use views to recreate your original combined columns, so the changes to your app are limited to where updates occur.
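
Here's a rough sketch of that split (all names hypothetical):

   -- Static data: written once, almost never updated
   CREATE TABLE item_static (
       id       integer PRIMARY KEY,
       big_text text
   );

   -- Small, frequently updated properties
   CREATE TABLE item_props (
       id     integer PRIMARY KEY REFERENCES item_static(id),
       score  float,
       flags  integer,
       label  text
   );

   -- A view that recreates the original combined table
   CREATE VIEW item AS
       SELECT s.id, s.big_text, p.score, p.flags, p.label
         FROM item_static s
         JOIN item_props p ON p.id = s.id;

   -- Updates now touch only the narrow table:
   UPDATE item_props SET score = 0.9 WHERE id = 42;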

If you have a column that is *frequently* updated (say, for example, a user's last-access timestamp each time s/he hits your web server), then you definitely want it in its own table, not mixed in with the user's name, address, etc.

Partitioning in Postgres is more powerful than in Oracle.  Use it if you can.
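
In the 8.1/8.2 era this means inheritance plus constraint exclusion; a minimal sketch (names and date ranges hypothetical):

   CREATE TABLE events (id integer, ts timestamp);
   -- Each child carries a CHECK constraint so the planner can
   -- skip partitions that can't match a query:
   CREATE TABLE events_2007_01
       (CHECK (ts >= '2007-01-01' AND ts < '2007-02-01')) INHERITS (events);
   CREATE TABLE events_2007_02
       (CHECK (ts >= '2007-02-01' AND ts < '2007-03-01')) INHERITS (events);
   SET constraint_exclusion = on;
   -- Scans only events_2007_01:
   SELECT count(*) FROM events
    WHERE ts >= '2007-01-15' AND ts < '2007-01-20';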

Partial indexes are VERY nice in Postgres, if your data is poorly distributed (for example, a mostly-NULL column with a small percentage of very important values).
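
For example (hypothetical names):

   -- Index only the rare, important non-NULL values;
   -- the index stays tiny even if the table is huge:
   CREATE INDEX idx_priority ON tasks (priority)
       WHERE priority IS NOT NULL;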

I'm sure there are more things that others can contribute.

Craig

