Thread: optimization ideas for frequent, large(ish) updates in frequently accessed DB?

optimization ideas for frequent, large(ish) updates in frequently accessed DB?

From
"Marinos J. Yannikos"
Date:
Hi,

one of our tables has to be updated frequently, but concurrently running
SELECT-queries must also have low latency times (it's being accessed
through a web interface).

I'm looking for ideas that might improve the interactive performance of
the system, without slowing down the updates too much. Here are the
characteristics of the table and its use:

- approx. 2 million rows

- approx. 4-5 million rows per day are replaced in short bursts of
1-200k rows (average ~3000 rows per update)

- the table needs 6 indexes (not all indexes are used all the time, but
keeping them all the time slows the system down less than re-creating
some of them just before they're needed and dropping them afterwards)

- an "update" means that 1-200k rows with a common value in a particular
field are replaced with an arbitrary number of new rows (with the same
value in that field), i.e.:

begin transaction;
   delete from t where id=5;
   insert into t (id,...) values (5,...);
   ... [1-200k rows]
end;

The problem is that a large update of this kind can delay SELECT
queries running in parallel for several seconds, so the web interface
used by several people will be unusable for a short while.

Currently, I'm using temporary tables:

create table xyz as select * from t limit 0;
insert into xyz ...
...
begin transaction;
delete from t where id=5;
insert into t select * from xyz;
end;
drop table xyz;

This is slightly faster than inserting directly into t (and probably
faster than using COPY, even though using that might reduce the overall
load on the database).
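
For reference, the COPY variant I had in mind would simply replace the
inserts into the temporary table, roughly like this (the file name is
made up; any tab-separated dump of the new rows would do):

create table xyz as select * from t limit 0;
copy xyz from '/tmp/new_rows_id5.tab';  -- server-side file; \copy in psql is the client-side equivalent
-- then the same delete / insert ... select / drop transaction as above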

What other possibilities are there, short of splitting the 15 columns of
that table into several smaller tables, which is something I'd like to
avoid? Would replication help? (I doubt it, but I haven't tried it
yet...) Writing to a separate table (without indexes), then creating the
indexes and renaming it to the "read table" periodically in a
double-buffering-like fashion probably wouldn't work either(?), since
the views and triggers would have to be re-created every time as well,
and other problems might arise.

The postgresql.conf options are already reasonably tweaked for
performance (IMHO), but perhaps some settings are particularly critical:

shared_buffers=100000
(I tried many values, this seems to work well for us - 12GB RAM)
wal_buffers=500
sort_mem=800000
checkpoint_segments=16
effective_cache_size=1000000
etc.

Any help/suggestions would be greatly appreciated... Even if it's
something like "you need a faster db box, there's no other way" ;-)

Regards,
  Marinos

Marinos,

> shared_buffers=100000
> (I tried many values, this seems to work well for us - 12GB RAM)
> wal_buffers=500
> sort_mem=800000
> checkpoint_segments=16
> effective_cache_size=1000000
> etc.

800MB for sort mem?   Are you sure you typed that correctly?   You must be
counting on not having a lot of concurrent queries.  It sure will speed up
index updating, though!

I think you might do well to experiment with using the checkpoint_delay and
checkpoint_siblings settings in order to get more efficient batch processing
of updates while selects are going on.  I would also suggest increasing
checkpoint_segments as much as your disk space will allow; one reporting
database I run that does batch loads uses 128 (which is about a gig of disk,
I think).

What have you set max_fsm_relations and max_fsm_pages to?  The latter should
be very high for you, like 10,000,000
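
For example, in postgresql.conf (illustrative numbers, not gospel):

max_fsm_relations=10000
max_fsm_pages=10000000
checkpoint_segments=128    # each WAL segment file is 16MB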

For that matter, what *version* of PostgreSQL are you running?

Also, make sure that your tables get vacuumed regularly.

> Any help/suggestions would be greatly appreciated... Even if it's
> something like "you need a faster db box, there's no other way" ;-)

Well, a battery-backed RAID controller with a fast cache would certainly help.

You'll also be glad to know that a *lot* of the improvements in the upcoming
PostgreSQL 7.5 are aimed at giving better performance on large, high-activity
databases like yours.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: optimization ideas for frequent, large(ish) updates in frequently accessed DB?

From
Manfred Koizar
Date:
On Fri, 13 Feb 2004 01:58:34 +0100, "Marinos J. Yannikos"
<mjy@geizhals.at> wrote:
>I'm looking for ideas that might improve the interactive performance of
>the system, without slowing down the updates too much.

IOW, you could accept slower updates.  Did you actually try and throttle
down the insert rate?

> Here are the
>characteristics of the table and its use:
>
>- approx. 2 million rows

Doesn't sound worrying.  What's the min/max/average size of these rows?
How large is this table?
    SELECT relpages FROM pg_class WHERE relname='...';

What else is in this database, how many tables, how large is the
database (du $PGDATA)?
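
Something like this shows where the bulk of it is (relpages are 8k blocks):
    SELECT relname, relkind, relpages
      FROM pg_class
     ORDER BY relpages DESC
     LIMIT 10;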

>- approx. 4-5 million rows per day are replaced in short bursts of
>1-200k rows (average ~3000 rows per update)

How often do you VACUUM [ANALYSE]?

>- the table needs 6 indexes (not all indexes are used all the time, but
>keeping them all the time slows the system down less than re-creating
>some of them just before they're needed and dropping them afterwards)

I agree.

>- an "update" means that 1-200k rows with a common value in a particular
>field are replaced with an arbitrary number of new rows (with the same
>value in that field), i.e.:
>
>begin transaction;
>   delete from t where id=5;
>   insert into t (id,...) values (5,...);
>   ... [1-200k rows]
>end;

This is a wide variation in the number of rows.  You told us the average
batch size is 3000.  Is this also a *typical* batch size?  And what is
the number of rows where you start to get the feeling that it slows down
other sessions?

Where do the new values come from?  I don't think they are typed in :-)
Do they come from external sources or from the same database?  If the
latter, INSERT INTO ... SELECT ... might help.

>The problem is that a large update of this kind can delay SELECT
>queries running in parallel for several seconds, so the web interface
>used by several people will be unusable for a short while.

Silly question:  By SELECT you mean pure SELECT transactions and not
some transaction that *mostly* reads from the database?  I mean, are you
sure your SELECT queries are slowed down and not blocked by the
"updates"?

Show us the EXPLAIN ANALYSE output for the same SELECT, once when it is
fast and once when it is slow.  BTW, what is fast and what is slow?

>Currently, I'm using temporary tables:
> [...]
>This is slightly faster than inserting directly into t (and probably
>faster than using COPY, even though using that might reduce the overall
>load on the database).

You might try using a prepared INSERT statement or COPY.
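
For example (the column names are invented, use whatever t really has):

    PREPARE ins_t (int, text, numeric) AS
        INSERT INTO t (id, name, price) VALUES ($1, $2, $3);

    EXECUTE ins_t (5, 'foo', 12.34);
    -- one EXECUTE per row, then
    DEALLOCATE ins_t;

COPY, with all rows of one batch in a single command, should be cheaper still.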

>shared_buffers=100000
>(I tried many values, this seems to work well for us - 12GB RAM)
>wal_buffers=500
>sort_mem=800000
>checkpoint_segments=16
>effective_cache_size=1000000

See Josh's comments.

>Any help/suggestions would be greatly appreciated... Even if it's
>something like "you need a faster db box, there's no other way" ;-)

We have to find out, what is the bottleneck.  Tell us about your
environment (hardware, OS, ...).  Run top and/or vmstat and look for
significant differences between times of normal processing and slow
phases.  Post top/vmstat output here if you need help.

Servus
 Manfred

Re: optimization ideas for frequent, large(ish) updates in frequently accessed DB?

From
Manfred Koizar
Date:
Marinos, while you are busy answering my first set of questions :-),
here is an idea that might help even out resource consumption.

On Fri, 13 Feb 2004 01:58:34 +0100, "Marinos J. Yannikos"
<mjy@geizhals.at> wrote:
>begin transaction;
>   delete from t where id=5;
>   insert into t (id,...) values (5,...);
>   ... [1-200k rows]
>end;
>
>The problem is that a large update of this kind can delay SELECT
>queries running in parallel for several seconds, so the web interface
>used by several people will be unusable for a short while.

    CREATE TABLE idmap (
        internalid int NOT NULL PRIMARY KEY,
        visibleid int NOT NULL,
        active bool NOT NULL
    );
    CREATE INDEX idmap_visible ON idmap(visibleid);

Populate this table with
    INSERT INTO idmap
    SELECT id, id, true
      FROM t;

Change
    SELECT ...
      FROM t
     WHERE t.id = 5;

to
    SELECT ...
      FROM t INNER JOIN idmap ON (idmap.internalid = t.id AND
                                  idmap.active)
     WHERE idmap.visibleid = 5;

When you have to replace the rows in t for id=5, start by

    INSERT INTO idmap VALUES (12345, 5, false);

Then repeatedly
    INSERT INTO t (id, ...) VALUES (12345, ...);
at a rate as slow as you can accept.  You don't have to wrap all INSERTs
into a single transaction, but batching together a few hundred to a few
thousand INSERTs will improve performance.

When all the new values are in the database, you switch to the new id in
one short transaction:
    BEGIN;
    UPDATE idmap SET active = false WHERE visibleid = 5 AND active;
    UPDATE idmap SET active = true WHERE internalid = 12345;
    COMMIT;

Do the cleanup in off-peak hours (pseudocode):

    FOR delid IN (SELECT internalid FROM idmap WHERE NOT active)
    BEGIN
        DELETE FROM t WHERE id = delid;
        DELETE FROM idmap WHERE internalid = delid;
    END;
    VACUUM ANALYSE t;
    VACUUM ANALYSE idmap;
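
In PL/pgSQL that loop could look roughly like this (untested; needs the
plpgsql language installed with createlang; VACUUM cannot run inside a
function, so keep it as a separate step):

    CREATE OR REPLACE FUNCTION cleanup_idmap() RETURNS integer AS '
    DECLARE
        r RECORD;
        n integer := 0;
    BEGIN
        FOR r IN SELECT internalid FROM idmap WHERE NOT active LOOP
            DELETE FROM t WHERE id = r.internalid;
            DELETE FROM idmap WHERE internalid = r.internalid;
            n := n + 1;
        END LOOP;
        RETURN n;
    END;
    ' LANGUAGE plpgsql;

    SELECT cleanup_idmap();
    VACUUM ANALYSE t;
    VACUUM ANALYSE idmap;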

To prevent this cleanup from interfering with INSERTs in progress, you
might want to add a "beinginserted" flag to idmap.
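
A minimal sketch of that flag (IIRC in 7.4 ADD COLUMN cannot take a
DEFAULT directly, hence the three steps):

    ALTER TABLE idmap ADD COLUMN beinginserted bool;
    ALTER TABLE idmap ALTER COLUMN beinginserted SET DEFAULT false;
    UPDATE idmap SET beinginserted = false;

Set it to true in the INSERT INTO idmap that starts a batch, clear it in
the switch-over transaction, and make the cleanup select only rows
WHERE NOT active AND NOT beinginserted.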

HTH.
Servus
 Manfred

Re: optimization ideas for frequent, large(ish) updates in frequently accessed DB?

From
Manfred Koizar
Date:
On Fri, 13 Feb 2004 16:21:29 +0100, I wrote:
>Populate this table with
>    INSERT INTO idmap
>    SELECT id, id, true
>      FROM t;

This should be
    INSERT INTO idmap
    SELECT DISTINCT id, id, true
      FROM t;

Servus
 Manfred

Re: optimization ideas for frequent, large(ish) updates

From
"Marinos J. Yannikos"
Date:
Josh Berkus wrote:

> 800MB for sort mem?   Are you sure you typed that correctly?   You must be
> counting on not having a lot of concurrent queries.  It sure will speed up
> index updating, though!

800MB is correct, yes... There are usually only 10-30 postgres processes
  active (imagine 5-10 people working on the web front-end while cron
jobs access the db occasionally). Very few queries can use such large
amounts of memory for sorting, but they do exist.

> I think you might do well to experiment with using the checkpoint_delay and
> checkpoint_siblings settings in order to get more efficient batch processing
> of updates while selects are going on.
[commit_*?]

I thought that could improve only concurrent transactions...

> What have you set max_fsm_relations and max_fsm_pages to?  The latter should
> be very high for you, like 10,000,000

good guess ;-) the former is set to 10,000 (I'm not sure how useful this
is for those temporary tables)

> For that matter, what *version* of PostgreSQL are you running?

7.4.1

> Also, make sure that your tables get vacuumed regularly.

There is a noticeable difference between a properly vacuumed db (nightly
"vacuum full") and a non-vacuumed one and people will start complaining
immediately if something goes wrong there...

> Well, a battery-backed RAID controller with a fast cache would certainly help.

http://www.lsilogic.com/products/ultra320_scsi_megaraid_storage_adapters/320x4128t.html
(RAID-5 with 9 15k rpm drives; in hindsight, perhaps we should have
tried a 0+1)

> You'll also be glad to know that a *lot* of the improvements in the upcoming
> PostgreSQL 7.5 are aimed at giving better performance on large, high-activity
> databases like yours.

That's good to hear...

Regards,
  Marinos

Re: optimization ideas for frequent, large(ish) updates

From
Christopher Kings-Lynne
Date:
> 800MB is correct, yes... There are usually only 10-30 postgres processes
>  active (imagine 5-10 people working on the web front-end while cron
> jobs access the db occasionally). Very few queries can use such large
> amounts of memory for sorting, but they do exist.

But remember that means that if you have 4 people doing 2 sorts each at
the same time, postgres will use up to 6.4GB RAM.  The sort_mem
parameter means that if a sort needs more than that, it spills to
temporary files on disk.

Chris


Re: optimization ideas for frequent, large(ish) updates

From
Jeff Trout
Date:
On Feb 14, 2004, at 9:02 PM, Marinos J. Yannikos wrote:

> Josh Berkus wrote:
>
>> 800MB for sort mem?   Are you sure you typed that correctly?   You
>> must be counting on not having a lot of concurrent queries.  It sure
>> will speed up index updating, though!
>
> 800MB is correct, yes... There are usually only 10-30 postgres
> processes  active (imagine 5-10 people working on the web front-end
> while cron jobs access the db occasionally). Very few queries can use
> such large amounts of memory for sorting, but they do exist.
>

Remember that it is going to allocate 800MB per sort.  It is not "you
can allocate up to 800MB, so if you need 1 meg, use one meg".  Some
queries may end up having a few sort steps.

In terms of sort_mem it is best to set the system default to a good
value for most queries, and then in your reporting queries (or other
heavy ones) set sort_mem for that session (SET sort_mem = 800000); then
only that session will use the looney sort_mem.
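
E.g. something along these lines in the reporting session only (sort_mem
is in kB):

    SET sort_mem = 800000;   -- this backend only
    -- run the big query here
    RESET sort_mem;          -- back to the server default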

It would be interesting to know if your machine is swapping.
--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/


Re: optimization ideas for frequent, large(ish) updates

From
"Marinos J. Yannikos"
Date:
Jeff Trout wrote:

> Remember that it is going to allocate 800MB per sort.  It is not "you
> can allocate up to 800MB, so if you need 1 meg, use one meg".  Some
> queries may end up having a few sort steps.

I didn't know that it always allocates the full amount of memory
specified in the configuration (e.g. the annotated configuration guide
says: "Note that for a complex query, several sorts might be running in
parallel, and each one _will be allowed to use_ as much memory as this
value specifies before it starts to put data into temporary files.").
The individual postgres processes don't look like they're using the full
amount either (but that could be because the memory isn't written to).

> In terms of sort_mem it is best to set the system default to a good
> value for most queries, and then in your reporting queries (or other
> heavy ones) set sort_mem for that session (SET sort_mem = 800000); then
> only that session will use the looney sort_mem.

Queries from the web front-end use up to ~130MB sort memory (according
to pgsql_tmp), so I set this to 150MB - thanks.

> It would be interesting to know if your machine is swapping.

It's not being monitored closely (other than with the occasional "top"),
  but it's highly unlikely:

Mem:  12441864k total, 10860648k used,  1581216k free,    84552k buffers
Swap:  4008176k total,     2828k used,  4005348k free,  9762628k cached

(that's a typical situation - the "2828k used" are probably some rarely
used processes that have lower priority than the cache ...)

Regards,
  Marinos


Re: optimization ideas for frequent, large(ish) updates

From
Tom Lane
Date:
"Marinos J. Yannikos" <mjy@geizhals.at> writes:
> Jeff Trout wrote:
>> Remember that it is going to allocate 800MB per sort.

> I didn't know that it always allocates the full amount of memory
> specified in the configuration

It doesn't ... but it could use *up to* that much before starting to
spill to disk.  If you are certain your sorts won't use that much,
then you could set the limit lower, hm?

Also keep in mind that sort_mem controls hash table size as well as sort
size.  The hashtable code is not nearly as accurate as the sort code
about honoring the specified limit exactly.  So you really oughta figure
that you could need some multiple of sort_mem per active backend.

            regards, tom lane