Re: Hint bits vs. OS readahead - Mailing list pgsql-hackers
From | Decibel!
Subject | Re: Hint bits vs. OS readahead
Date |
Msg-id | 810239FB-A1F6-48D2-99DD-9EDBBD8F6BFC@decibel.org
In response to | Hint bits vs. OS readahead (Andrew Gierth <andrew@tao11.riddles.org.uk>)
List | pgsql-hackers
On Dec 5, 2008, at 7:50 PM, Andrew Gierth wrote:

> While waiting for a large restore to complete (and investigating why
> parts of it were so slow), I came across this scenario. This isn't
> quite the same as some previous discussion of hint bits, but I thought
> it was something that could probably be taken into account in future.
> This may also be relevant to the tuplestore discussion.
>
> The environment: pg 8.3.5 on FreeBSD/amd64 7.1-prerelease; 32GB RAM;
> 16 cores of 2.93GHz Xeon 7350; 4x300GB 15krpm SAS data drives in
> software RAID10. shared_buffers=1700MB, maintenance_work_mem=1GB.
>
> The scenario: pg_restore of a dump containing a large partitioned
> table (a dozen partitions of ~10GB each). The actual loading of the
> data proceeds as expected; the interesting part is the creation of
> indexes afterwards.
>
> Watching the progress of the backend, a large proportion of the time
> is taken up by the heap scan to retrieve the data. The problem is, of
> course, that the backend settles down into an access pattern like
> this:
>
>   lseek(0x64,0x3ef7c000,SEEK_SET)
>   read(0x64,0x864123340,0x2000) => 8192/0x2000
>   lseek(0x64,0x3ef3e000,SEEK_SET)
>   write(0x64,0x864125340,0x2000) => 8192/0x2000
>
> where fd 0x64 is the table heap file. The read is obvious; the write
> is caused by writing a previously hinted page back to disk when the
> backend wants to reuse the buffer. Notice that this write is happening
> in the same backend (and on the same fd).
>
> At least on unpatched FreeBSD this access pattern destroys OS-level
> readahead, though lower-level readahead on the actual disk drives
> themselves hides this fact to a significant extent (each read() call
> forces a SCSI transaction, but this transaction completes quite
> quickly due to read caching on the drive).
> In order to test how bad the effect was, I patched FreeBSD to use
> separate sequential-behaviour tracking for reads and writes (this
> patch turns out to be trivial, affecting only a couple of dozen
> lines). The effect was fairly dramatic; the total time taken for
> CREATE INDEX was cut by a factor of slightly better than 2 (typically
> from ~700 seconds per partition to ~320 seconds on my data).
>
> [for the patch see http://www.rhodiumtoad.org.uk/junk/seq.patch.txt]
>
> The obvious question is whether this is something which should be
> left as the OS's problem, or whether it would be worth having pg do
> some special handling of file opens to distinguish read and write
> accesses, or sequential from random accesses when both are likely to
> be happening at the same time. I've so far had conflicting answers
> about how well Linux handles this case (and, not being a Linux user,
> I have no easy way to test it myself).

We don't do restores very often, but we have noticed that recovery mode is painfully slow for us, whether recovering from a crash or bringing up a PITR snapshot. We're running on 16-core IA64 machines with 96GB of RAM hitting iSCSI SANs (some SATA, some SAS). Under ideal conditions, PG can read or write at 100+MB/s. Typically we're pushing ~5MB/s, but during recovery we'll only do 600-700kB/s. I've never straced a backend to see exactly what's going on.

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828