Re: Performance tuning - Mailing list pgsql-performance

From: Richard Huxton
Subject: Re: Performance tuning
Msg-id: 42355C40.1070207@archonet.com
In response to: Performance tuning (Jacques Caron <jc@directinfos.com>)
List: pgsql-performance

Jacques Caron wrote:
> I'm preparing a set of servers which will eventually need to handle a
> high volume of queries (both reads and writes, but most reads are very
> simple index-based queries returning a limited set of rows, when not
> just one), and I would like to optimize things as much as possible, so I
> have a few questions on the exact way PostgreSQL's MVCC works, and how
> transactions, updates and vacuuming interact. I hope someone will be
> able to point me in the right direction (feel free to give pointers if I
> missed the places where this is described).
>
> From what I understand (and testing confirms it), bundling many queries
> in one single transaction is more efficient than having each query be a
> separate transaction (like with autocommit on). However, I wonder about
> the limits of this:
>
> - are there any drawbacks to grouping hundreds or thousands of queries
> (inserts/updates) over several minutes in one single transaction? Other
> than the fact that the inserts/updates will not be visible until
> committed, of course. Essentially turning autocommit off, and doing a
> commit once in a while.

1. If any locks are held then they will be held for much longer, causing
other processes to block.
2. PG needs to be able to roll back the changes - thousands of simple
inserts are fine, millions will probably not be.
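
For example, instead of one transaction per statement you can batch a
few hundred at a time (sketch only - table and column names invented):

  BEGIN;
  INSERT INTO log_entries (ts, msg) VALUES (now(), 'first');
  INSERT INTO log_entries (ts, msg) VALUES (now(), 'second');
  -- ... a few hundred more ...
  COMMIT;

Each COMMIT normally forces a WAL flush (with fsync on), so batching
pays for itself mostly by saving that per-statement sync.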

> - does this apply only to inserts/updates or also to selects?
> Another way to put this is: does a transaction with only one select
> actually have much transaction-related work to do? Or, does a
> transaction with only selects actually have any impact anywhere? Does it
> really leave a trace anywhere? Again, I understand that selects grouped
> in a transaction will not see updates done after the start of the
> transaction (unless done by the same process).

There are implications if a SELECT has side-effects: a SELECT can call
a function, and that function might do anything.
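
For instance (the function name is invented - imagine a user-defined
function that writes to an audit table):

  SELECT log_access(id), balance FROM accounts WHERE id = 42;

The transaction is nominally read-only, but that SELECT just did an
INSERT behind your back, with all the usual MVCC consequences.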

> - if during a single transaction several UPDATEs affect the same row,
> will MVCC generate as many row versions as there are updates (like would
> be the case with autocommit) or will they be grouped into one single row
> version?

I believe there will be many versions. Certainly for 8.0 that must be
the case to support savepoints within a transaction.
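
You can see why with something like this (table name invented):

  BEGIN;
  UPDATE counters SET n = n + 1 WHERE id = 1;
  SAVEPOINT s1;
  UPDATE counters SET n = n + 1 WHERE id = 1;
  ROLLBACK TO SAVEPOINT s1;  -- must restore the pre-s1 version
  COMMIT;

The ROLLBACK TO can only work if the row version from before the
savepoint is still there, so the updates can't be collapsed into one.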

> Another related issue is that many of the tables are indexed on a date
> field, and one process does a lot of updates on "recent" rows (which
> lead to many dead tuples), but after that "older" rows tend to remain
> pretty much unchanged for quite a while. Other than splitting the tables
> into "old" and "recent" tables, is there any way to make vacuum more
> efficient? Scanning the whole table for dead tuples when only a small
> portion of the table actually has any does not feel very efficient in
> this situation.

Not really.
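
The nearest workaround is to vacuum the busy tables individually and
frequently (plain VACUUM doesn't block normal reads and writes), and
leave the database-wide vacuum for quieter periods - e.g. from cron
(table name invented):

  VACUUM ANALYZE recent_data;

It still scans the whole table, but at least you're not paying that
cost for every table in the database.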

> Other issue: every five minutes or so, I see a noticeable performance
> drop as PostgreSQL checkpoints. This is 7.4.3 with pretty lousy
> hardware, I know 8.0 with decent hardware and separate disk(s) for
> pg_xlog will definitely help, but I really wonder if there is any way to
> reduce the amount of work that needs to be done at that point (I'm a
> strong believer in fixing software before hardware). I have already
> bumped checkpoint_segments to 8, but I'm not quite sure I understand how
> this helps (or doesn't help) things. Logs show 3 to 6 "recycled
> transaction log file" lines at that time, that seems quite a lot of work
> for a load that's still pretty low. Does grouping of more queries in
> transactions help with this? Are there other parameters that can affect
> things, or is just a matter of how much inserts/updates/deletes are
> done, and the amount of data that was changed?

You might be better off reducing the number of checkpoint segments and
decreasing the checkpoint timeout. There is a balance between doing a
lot of work in one go and the overhead of many smaller bursts of
activity.
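
For example, in postgresql.conf (numbers are only a starting point to
experiment with):

  checkpoint_segments = 3    # back down from 8
  checkpoint_timeout = 180   # seconds; the default is 300

Smaller, more frequent checkpoints each have less dirty data to flush,
so the stall every five minutes should become several milder ones.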

> Last point: some of the servers have expendable data (and will be
> replicated with Slony-I) and will run with fsync off. I have read
> conflicting statements as to what exactly this does: some sources
> indicate that setting fsync off actually switches off WAL/checkpointing,
> others that it just prevents the fsync (or equivalent) system calls.
> Since I still see checkpointing in that case, I guess it's not exactly
> the former, but I would love to understand more about it. Really, I
> would love to be able to set some tables or databases to "go as fast as
> you can and don't worry about transactions, MVCC or anything like that",
> but I'm not sure that option exists...

Setting fsync=false means the sync isn't done, so data might still be
cached below PG's level. I'm not sure it will ever be possible to mark
a table as "ignore transactions" - that would be a lot of work, and it
would mean you couldn't guarantee any transaction that touched that
table.

--
   Richard Huxton
   Archonet Ltd
