Linux 2.6.6 also - Mailing list pgsql-hackers

From Gregory Stark
Subject Linux 2.6.6 also
Date
Msg-id 874qqos3pq.fsf@stark.xeocode.com
Responses Re: Linux 2.6.6 also  (Manfred Spraul <manfred@colorfullife.com>)
List pgsql-hackers
This patch also looks relevant to Postgres for two reasons. 

This part seems like it might expose some bugs that otherwise might have
remained hidden:
  This affects I/O scheduling potentially quite significantly. It is no
  longer the case that the kernel will submit pages for I/O in the order in
  which the application dirtied them. We instead submit them in file-offset
  order all the time.
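
To make the ordering change concrete, here is a small user-space toy, not kernel code. It assumes a sorted list with bisect as a stand-in for the kernel's tagged radix tree, and the names DirtyPages, mark_dirty, next_dirty and writeback_order are made up for illustration. It only shows that pages dirtied in arbitrary order come back in file-offset order when writeback asks for "the next few dirty pages starting at index N".

```python
import bisect


class DirtyPages:
    """Toy stand-in for a tagged radix tree: dirty page indexes kept sorted."""

    def __init__(self):
        self._sorted = []        # dirty page indexes in file-offset order
        self.dirtied_order = []  # the order in which the application dirtied them

    def mark_dirty(self, index):
        # Insert the index at its sorted position unless it is already dirty.
        pos = bisect.bisect_left(self._sorted, index)
        if pos == len(self._sorted) or self._sorted[pos] != index:
            self._sorted.insert(pos, index)
            self.dirtied_order.append(index)

    def next_dirty(self, start, max_pages):
        # "Find me the next max_pages dirty pages starting at index start."
        pos = bisect.bisect_left(self._sorted, start)
        return self._sorted[pos:pos + max_pages]


def writeback_order(pages, batch):
    """Walk the dirty set in batches and return the submission order."""
    order = []
    start = 0
    while True:
        found = pages.next_dirty(start, batch)
        if not found:
            break
        order.extend(found)
        start = found[-1] + 1  # resume just past the last page submitted
    return order
```

Marking pages 40, 3, 17, 99 and 8 dirty in that order and then calling writeback_order() yields them sorted by index, which is the behaviour that could change what an application observes if it was relying on dirtied order.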

The part about part-file fdatasync calls seems like it could be really useful.
It seems like that's just speculation about future directions, though?



<akpm@osdl.org>
[PATCH] make the pagecache lock irq-safe.

Intro to these patches:

- Major surgery against the pagecache, radix-tree and writeback code. This
  work is to address the O_DIRECT-vs-buffered data exposure horrors which
  we've been struggling with for months.

As a side-effect, 32 bytes are saved from struct inode and eight bytes are
removed from struct page. At a cost of approximately 2.5 bits per page in
the radix tree nodes on 4k pagesize, assuming the pagecache is densely
populated. Not all pages are pagecache; other pages gain the full 8 byte
saving.

This change will break any arch code which is using page->list and will
also break any arch code which is using page->lru of memory which was
obtained from slab.

The basic problem which we (mainly Daniel McNeil) have been struggling with
is in getting a really reliable fsync() across the page lists while other
processes are performing writeback against the same file. It's like
juggling four bars of wet soap with your eyes shut while someone is
whacking you with a baseball bat. Daniel pretty much has the problem
plugged but I suspect that's just because we don't have testcases to
trigger the remaining problems. The complexity and additional locking which
those patches add is worrisome.

So the approach taken here is to remove the page lists altogether and
replace the list-based writeback and wait operations with in-order
radix-tree walks.

The radix-tree code has been enhanced to support "tagging" of pages, for
later searches for pages which have a particular tag set. This means that
we can ask the radix tree code "find me the next 16 dirty pages starting at
pagecache index N" and it will do that in O(log64(N)) time.

This affects I/O scheduling potentially quite significantly. It is no
longer the case that the kernel will submit pages for I/O in the order in
which the application dirtied them. We instead submit them in file-offset
order all the time.

This is likely to be advantageous when applications are seeking all over a
large file randomly writing small amounts of data. I haven't performed much
benchmarking, but tiobench random write throughput seems to be increased by
30%. Other tests appear to be unaltered. dbench may have got 10-20%
quicker, but it's variable.

There is one large file which everyone seeks all over randomly writing
small amounts of data: the blockdev mapping which caches filesystem
metadata. The kernel's IO submission patterns for this are now ideal.

Because writeback and wait-for-writeback use a tree walk instead of a list
walk they are no longer livelockable.

This probably means that we no longer need to hold i_sem across O_SYNC
writes and perhaps fsync() and fdatasync(). This may be beneficial for
databases: multiple processes writing and syncing different parts of the
same file at the same time can now all submit and wait upon writes to just
their own little bit of the file, so we can get a lot more data into the
queues.

It is trivial to implement a part-file-fdatasync() as well, so applications
can say "sync the file from byte N to byte M", and multiple applications
can do this concurrently. This is easy for ext2 filesystems, but probably
needs lots of work for data-journalled filesystems and XFS and it probably
doesn't offer much benefit over an i_semless O_SYNC write.

These patches can end up making ext3 (even) slower:

    for i in 1 2 3 4
    do
        dd if=/dev/zero of=$i bs=1M count=2000 &
    done

runs awfully slow on SMP. This is, yet again, because all the file blocks
are jumbled up and the per-file linear writeout causes tons of seeking. The
above test runs sweetly on UP because on UP we don't allocate blocks to
different files in parallel.

Mingming and Badari are working on getting block reservation working for
ext3 (preallocation on steroids). That should fix ext3 up.

This patch:

- Later, we'll need to access the radix trees from inside disk I/O
  completion handlers. So make mapping->page_lock irq-safe. And rename it
  to tree_lock to reliably break any missed conversions.
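
The database case the changelog mentions (several writers each writing and syncing their own part of one file) can be sketched from user space roughly as below. This is only an illustration under assumptions: threads stand in for separate processes, REGION and the function names are hypothetical, and since the part-file fdatasync is only speculated about above, each writer falls back to an ordinary whole-file os.fdatasync() (or os.fsync() where fdatasync is unavailable).

```python
import os
import tempfile
import threading

REGION = 8192  # bytes owned by each writer (hypothetical size)

# os.fdatasync is missing on some platforms; fall back to os.fsync there.
_sync = getattr(os, "fdatasync", os.fsync)


def write_and_sync(path, index, payload_byte):
    """Write one region of the shared file, then wait for it to reach disk."""
    fd = os.open(path, os.O_WRONLY)
    try:
        data = bytes([payload_byte]) * REGION
        offset = index * REGION
        written = 0
        while written < REGION:  # pwrite may write less than requested
            written += os.pwrite(fd, data[written:], offset + written)
        _sync(fd)  # whole-file sync; no byte-range sync is assumed to exist
    finally:
        os.close(fd)


def run_writers(path, count):
    """Start one writer per region and wait for all of them to finish."""
    with open(path, "wb") as f:
        f.truncate(count * REGION)  # preallocate the file size
    threads = [
        threading.Thread(target=write_and_sync, args=(path, i, ord("A") + i))
        for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def demo(count=4):
    """Run the writers against a temporary file and return its contents."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        run_writers(path, count)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

Whether the concurrent syncs actually overlap in the kernel depends on the i_sem behaviour discussed in the changelog; the sketch only shows the access pattern, not any performance result.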
 


-- 
greg


