Re: Excessive (and slow) fsync() within single transaction - Mailing list pgsql-general

From Merlin Moncure
Subject Re: Excessive (and slow) fsync() within single transaction
Date
Msg-id b42b73150912082140r4f892144n9a4921c433cadf95@mail.gmail.com
Whole thread Raw
In response to Excessive (and slow) fsync() within single transaction  (Stephen Tyler <stephen@stephen-tyler.com>)
List pgsql-general
On Tue, Dec 8, 2009 at 7:28 PM, Stephen Tyler <stephen@stephen-tyler.com> wrote:
> I've been trying to track down a performance issue I have.  In simple terms,
> my select performance is very good (generally either CPU limited, or disk
> limited, depending upon the query), and small updates seem OK.
>
> But a huge UPDATE is incredibly slow.  CPU is on average below 1%, and disk
> IO is generally below 0.1% of what I would expect.  It seems to be doing
> tiny bursts of activity, interspersed with long periods of inactivity.  I've
> observed the same behaviour in many other cases where the amount of writing
> is large.
>
> My system:
>
> Mac Pro 2009 Quad 2.93 with 16G of ECC RAM
> Snow Leopard 10.6.2 in 64bit mode, fully patched
> Database on RAID 0 array of Intel X-25M SSDs
> Postgres 8.4.1, 64 bit, compiled from source
>
> From outside of postgres, I can sustain over 400MB/s read, and 140MB/s write
> indefinitely.  And around 10K random read I/Os/sec, and 4K random write
> I/Os.  With no hint of pauses.  I have never witnessed anything other than
> speedy performance from the SSDs except from within postgresql.
>
> An example of a problematic SQL command is:
> update link_relurl as u set sid = m.sid from link_meta as m where u.link =
> m.link;
>
> Both tables are around 20Gbytes, and 200M rows, with link as the primary
> integer key.  So in this example, I am joining 2 tables or 200M rows and
> writing a new value to a column of one of those tables.
>
> The command is issued via psql.  There are no other connections to the
> database.
>
> My expectation would be that postgresql would issue an fsync() (or perhaps a
> few such calls) at the end of the transaction.  But this does not seem to be
> the case:
>
> Using the DTrace utility dtruss to time all calls to fsync being made by
> postgres:
> $> sudo dtruss -n postgres -t fsync -e
>   107/0x2dd:        7 fsync(0x3, 0xD, 0x100805400)         = 0 0
>   107/0x2dd:        5 fsync(0x5, 0x8, 0x100805400)         = 0 0
>   107/0x2dd:        4 fsync(0x21, 0x1A, 0x100805400)         = 0 0
>   107/0x2dd:        5 fsync(0x1A, 0x8, 0x100805400)         = 0 0
>   107/0x2dd:        4 fsync(0x19, 0x12, 0x100805400)         = 0 0
>   107/0x2dd:        4 fsync(0x4, 0x2, 0x100805400)         = 0 0
>   107/0x2dd:        5 fsync(0x7, 0x12, 0x100805400)         = 0 0
>   107/0x2dd:        4 fsync(0x1C, 0xB, 0x100805400)         = 0 0
>   107/0x2dd:        4 fsync(0x1E, 0xB, 0x100805400)         = 0 0
>   107/0x2dd:      174 fsync(0x24, 0x103102AD8, 0x0)         = 0 0
>   107/0x2dd:     2018 fsync(0x20, 0x1B, 0x100805400)         = 0 0
>   107/0x2dd:        7 fsync(0x12, 0xC, 0x100805400)         = 0 0
>   107/0x2dd:      987 fsync(0x23, 0x1B, 0x100805400)         = 0 0
> .....
>
> With thousands of such lines.  The columns are PID, elapsed time in
> microseconds, and the system call with arguments.  In the lines above, the
> longest fsync is only 2 milliseconds.  Most of the lines are similar to the
> above.
>
> So firstly, why are there so many calls to fsync()?
>
> Unfortunately, some of the calls to fsync are incredibly slow:
>   107/0x2dd:  22526373 fsync(0x19, 0x1, 0x100805400)         = 0 0
>   107/0x2dd:  22637723 fsync(0x10, 0x9, 0x100805400)         = 0 0
>   107/0x2dd:  61745976 fsync(0xA, 0xF, 0x100805400)         = 0 0
>
> So in this triplet of consecutive calls we have elapsed times of 22, 22 and
> 61 seconds!
>
> I'm no expert on DTrace, nor the inner workings of postgresql, but if
> anybody has any ideas on what might be causing this behaviour, how to track
> it down, or how to fix or avoid it, I would be very grateful.

curious, have you checked the log for frequent checkpoint warnings?
Are you logging checkpoints and seeing if they line up with the slow
fsync times? my guess here is that you overflowing the write back
cache on the x25-m and the long fsync times are waiting for the disk
to flush the cache, or some background housecleaning is going on.
This behavior is very similar to what I've observed on certain raid
controllers when the cache overflows.

possible resolutions might include:
*) you might luck out and be dealing with known hopefully corrected
issue with intel drives
*) tweak checkpoints/bgwriter to try and reduce the penalty of fsync
*) disable fsync...risky, but I bet will make the problem go away
*) optimize your query. maybe try rewriting as delete/insert select
from scratch table?

merlin

pgsql-general by date:

Previous
From: Merlin Moncure
Date:
Subject: Re: how to ensure a client waits for a previous transaction to finish?
Next
From: Guillaume Lelarge
Date:
Subject: Re: Rules and conditions