Re: performance of insert/delete/update - Mailing list pgsql-performance

From scott.marlowe
Subject Re: performance of insert/delete/update
Date
Msg-id Pine.LNX.4.33.0211211542260.23979-100000@css120.ihs.com
In response to Re: performance of insert/delete/update  ("Josh Berkus" <josh@agliodbs.com>)
Responses Re: performance of insert/delete/update  (Josh Berkus <josh@agliodbs.com>)
Re: performance of insert/delete/update  ("Josh Berkus" <josh@agliodbs.com>)
List pgsql-performance
On Thu, 21 Nov 2002, Josh Berkus wrote:

> Scott,
>
> > The absolutely most important thing to do to speed up inserts and
> > updates is to squeeze as many as you can into one transaction.
> > Within reason, of course.  There's no great gain in putting more
> > than a few thousand together at a time.  If your application is
> > only doing one or two updates in a transaction, it's going to be
> > slower in terms of records written per second than an application
> > that is updating 100 rows in a transaction.
>
> This only works up to the limit of the memory you have available for
> Postgres.  If the updates in one transaction exceed your available
> memory, you'll see a lot of swapping to disk that will slow things
> down by a factor of 10-50 times.

Sorry, but that isn't true.  MVCC means we don't have to hold all the data
in memory; we can have multiple versions of the same tuples on disk, and
use memory for what it's meant for: buffering.

The performance gain comes from the fact that postgresql doesn't have to
perform the data consistency checks needed during an insert until after
all the rows are inserted, and it can "gang check" them.
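
In practice that just means wrapping the statements in an explicit
transaction.  A minimal sketch (hypothetical table name):

begin;
insert into log (msg) values ('one');
insert into log (msg) values ('two');
-- ... a few thousand more ...
commit;

Outside a transaction every statement is its own commit (and its own
fsync); batched like this, the whole group gets flushed once.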

> > Reducing triggers and foreign keys on the inserted tables to a
> > minimum helps.
>
> ... provided that this will not jeopardize your data integrity.  If you
> have indispensable triggers in PL/pgSQL, re-writing them in C will make
> them, and thus updates on their tables, faster.

Agreed.  But you've probably seen the occasional "I wasn't sure if we
needed that check or not, so I threw it in just in case" kind of database
design.  :-)

I definitely don't advocate just tossing all your FKs to make it run
faster.

Also note that many folks have replaced foreign keys with triggers and
gained in performance, as fks in pgsql still have some deadlock issues to
be worked out.
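
For the curious, here's a rough sketch of that approach (hypothetical
parent and child tables; note it only covers the insert/update side,
not deletes from the parent, so it's weaker than a real FK):

create function check_parent_exists() returns opaque as '
begin
    -- fail if the referenced parent row is missing
    perform 1 from parent where id = new.parent_id;
    if not found then
        raise exception ''parent row % does not exist'', new.parent_id;
    end if;
    return new;
end;
' language 'plpgsql';

create trigger child_fk_check before insert or update on child
    for each row execute procedure check_parent_exists();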

> Also, for foreign keys, it speeds up inserts and updates on parent
> tables with many child records if the foreign key column in the child
> table is indexed.

Absolutely.
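
For example, with hypothetical parent/child tables where child.parent_id
references parent(id):

create index child_parent_id_idx on child (parent_id);

Without that index, every update or delete on parent forces a sequential
scan of child to run the FK check.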

> > Putting WAL (e.g. $PGDATA/pg_xlog directory) on its own drive(s).
> >
> > Putting indexes that have to be updated during inserts onto their
> > own drive(s).
> >
> > Performing regular vacuums on heavily updated tables.
> >
> > Also, if your hardware is reliable, you can turn off fsync in
> > postgresql.conf.  That can increase performance by anywhere from
> > 2 to 10 times, depending on your application.
>
> It can be dangerous though ... in the event of a power outage, for
> example, your database could be corrupted and difficult to recover.  So
> ... "at your own risk".

No, the database will not be corrupted, at least not in my experience.
However, you MAY lose data from transactions that you thought were
committed.  I think Tom posted something about this a few days back.

> I've found that switching from fsync to fdatasync on Linux yields
> marginal performance gain ... about 10-20%.

I'll have to try that.
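
Both knobs live in postgresql.conf, for anyone who wants to experiment
(7.2-era spellings; check the docs for your version):

fsync = false                  # only on trusted hardware / UPS
wal_sync_method = fdatasync    # default varies by platform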

> Also, if you are doing large updates (many records at once) you may
> want to increase WAL_FILES and CHECKPOINT_BUFFER in postgresql.conf to
> allow for large transactions.

Actually, postgresql will create more WAL files if it needs to in order
to handle the size of a transaction.  BUT, it won't create extra ones for
heavier parallel load without being told to.  I've inserted 100,000 rows
at a time with no problem on a machine with only 1 WAL file specified,
and it didn't burp.  It does run faster having multiple WAL files when
under parallel load.
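
For reference, the 7.2-era settings are spelled wal_files and
checkpoint_segments in postgresql.conf; something like this for a heavy
parallel write load (the numbers are just a starting point):

wal_files = 8              # preallocate extra WAL segments
checkpoint_segments = 6    # allow more WAL between checkpoints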

> Finally, you want to structure your queries so that you do the minimum
> number of insert or update writes possible.  For example, a procedure
> that inserts a row, does some calculations, and then modifies several
> fields in that row is going to slow stuff down significantly compared
> to doing the calculations as variables and only a single insert.
> Certainly don't hit a table with 8 updates, each updating one field,
> instead of a single update statement.

This is critical, and bites many people moving from a row-level locking
database to an MVCC database.  In MVCC, every update creates a new
on-disk tuple.  I think someone on the list a while back was updating
their database something like this:

update table set field1='abc' where id=1;
update table set field2='def' where id=1;
update table set field3='ghi' where id=1;
update table set field4='jkl' where id=1;
update table set field5='mno' where id=1;
update table set field6='pqr' where id=1;

and they had to vacuum something like every 5 minutes.
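
The fix is to fold all of that into one statement, which creates one new
tuple instead of six:

update table set field1='abc', field2='def', field3='ghi',
    field4='jkl', field5='mno', field6='pqr' where id=1;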

Also, things like:

update table set field1=field1+1

are killers in an MVCC database as well.

