Re: Hint bits vs. OS readahead - Mailing list pgsql-hackers

From Decibel!
Subject Re: Hint bits vs. OS readahead
Date
Msg-id 810239FB-A1F6-48D2-99DD-9EDBBD8F6BFC@decibel.org
Whole thread Raw
In response to Hint bits vs. OS readahead  (Andrew Gierth <andrew@tao11.riddles.org.uk>)
List pgsql-hackers
On Dec 5, 2008, at 7:50 PM, Andrew Gierth wrote:
> While waiting for a large restore to complete (and investigating why
> parts of it were so slow), I came across this scenario. This isn't
> quite the same as some previous discussion of hint bits, but I thought
> it was something that could probably be taken into account in future.
> This also may be relevent to the tuplestore discussion.
>
> The environment: pg 8.3.5 on FreeBSD/amd64 7.1-prerelease; 32GB RAM;
> 16 cores of 2.93GHz Xeon 7350; 4x300GB 15krpm SAS data drives in
> software RAID10. shared_buffers=1700MB, maintenance_work_mem=1GB
>
> The scenario: pg_restore of a dump containing a large partitioned  
> table
> (a dozen partitions of ~10GB each). The actual loading of the data
> proceeds as expected, the interesting part is the creation of indexes
> afterwards.
>
> Watching the progress of the backend, a large proportion of the time
> is taken up by the heap scan to retrieve the data. The problem is, of
> course, that the backend settles down into an access pattern like  
> this:
>
>   lseek(0x64,0x3ef7c000,SEEK_SET)
>   read(0x64,0x864123340,0x2000) => 8192/0x2000
>   lseek(0x64,0x3ef3e000,SEEK_SET)
>   write(0x64,0x864125340,0x2000) => 8192/0x2000
>
> where fd 0x64 is the table heap file; the read is obvious, the  
> write is
> caused by writing a previously hinted page back to disk when the  
> backend
> wants to reuse the buffer. Notice that this write is happening in the
> same backend (and on the same fd).
>
> At least on unpatched FreeBSD this access pattern destroys OS-level
> readahead, though lower-level readahead on the actual disk drives
> themselves hides this fact to a significant extent (each read() call
> forces a SCSI transaction, but this transaction completes quite
> quickly due to read caching on the drive).
>
> In order to test how bad the effect was, I patched FreeBSD to use
> separate sequential-behaviour tracking for reads and writes (this
> patch turns out to be trivial, affecting only a couple of dozen
> lines). The effect was fairly dramatic; the total time taken for
> CREATE INDEX was cut by a factor of slightly better than 2 (typically
> from ~700 seconds per partition to ~320 seconds on my data).
>
> [for the patch see http://www.rhodiumtoad.org.uk/junk/seq.patch.txt ]
>
> The obvious question is whether this is something which should be left
> as the OS'es problem, or whether it would be worth having pg do some
> special handling of file opens to distinguish read and write accesses,
> or sequential from random accesses when both are likely to be  
> happening
> at the same time. I've so far had conflicting answers about how well
> Linux handles this case (and not being a Linux user I have no easy way
> to test it myself).

We don't do restores very often, but we have noticed that recovery  
mode is painfully slow for us, either from a crash or to bring up a  
PITR snapshot. We're running on 16 core IA64 machines with 96GB  
hitting iSCSI SANs (some SATA, some SAS). Under ideal conditions, PG  
can read or write at 100+MB/s. Typically, we're pushing ~5MB/s, but  
during recovery we'll only do 600-700kB/s. I've never straced a  
backend to see exactly what's going on.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828




pgsql-hackers by date:

Previous
From: Decibel!
Date:
Subject: Re: WIP: default values for function parameters
Next
From: Andrew Gierth
Date:
Subject: Re: Regexps vs. locale