Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers

From Jan Kara
Subject Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Msg-id 20140114100040.GB21327@quack.suse.cz
In response to Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance  (Hannu Krosing <hannu@2ndQuadrant.com>)
List pgsql-hackers
On Tue 14-01-14 09:08:40, Hannu Krosing wrote:
> >>> Effectively you end up with buffered read/write that's also mapped into
> >>> the page cache.  It's a pretty awful way to hack around mmap.
> >> Well, the problem is that you can't really use mmap() for the things we
> >> do. Postgres' durability works by guaranteeing that our journal entries
> >> (called WAL := Write Ahead Log) are written & synced to disk before the
> >> corresponding entries of tables and indexes reach the disk. That also
> >> allows to group together many random-writes into a few contiguous writes
> >> fdatasync()ed at once. Only during a checkpointing phase the big bulk of
> >> the data is then (slowly, in the background) synced to disk.
> > Which is the exact algorithm most journalling filesystems use for
> > ensuring durability of their metadata updates.  Indeed, here's an
> > interesting piece of architecture that you might like to consider:
> >
> > * Neither XFS nor BTRFS uses the kernel page cache to back their
> >   metadata transaction engines.
> But file system code is supposed to know much more about the
> underlying disk than a mere application program like postgresql.
> 
> We do not want to start duplicating OS if we can avoid it.
> 
> What we would like is to have a way to tell the kernel
> 
> 1) "here is the modified copy of file page, it is now safe to write
>     it back" - the current 'lazy' write
> 
> 2) "here is the page, write it back now, before returning success
>     to me" - unbuffered write or write + sync
> 
> but we also would like to have
> 
> 3) "here is the page as it is currently on disk, I may need it soon,
>     so keep it together with your other clean pages accessed at time X"
>     - this is the non-dirtying write discussed
>    
>     the page may be in buffer cache, in which case just update its LRU
>     position (to either current time or time provided by postgresql), or
>     it may not be there, in which case put it there if reasonable by its
>     LRU position.
> 
> And we would like all this to work together with other current linux
> kernel goodness of managing the whole disk-side interaction of
> efficient reading and writing and managing the buffers :)

So when I was speaking about the proposed vrange() syscall in this thread,
I thought that instead of injecting pages into pagecache for aging as you
describe in 3), you would mark pages as volatile (i.e. eligible for reclaim
by the kernel) through the vrange() syscall. The next time you need a page,
you check whether the kernel has reclaimed it. If it has, you reload it
from disk; if not, you unmark it and use it.

Now, the aging of pages marked as volatile, as currently implemented, may
not be ideal for your needs, but you still have time to influence what
gets implemented. In fact, the developers of the vrange() syscall were
specifically looking for ideas about what to base the aging on. Currently,
I think it is first marked, first evicted.
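
To make the workflow concrete, here is a minimal user-space sketch of the
mark / reclaim / check cycle with first-marked, first-evicted aging. It is
only a model: vrange() is a proposed syscall, so nothing below calls a real
kernel interface, and every name (struct pool, mark_volatile(),
reclaim_one(), unmark_volatile(), NSLOTS, PAGE_SZ) is hypothetical and
chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NSLOTS  4      /* hypothetical: tiny buffer pool */
#define PAGE_SZ 8192   /* assumption: one 8 kB page per slot */

/* Hypothetical model of one buffer slot; not a kernel structure. */
struct slot {
    char data[PAGE_SZ];
    bool is_volatile;        /* marked as reclaimable */
    bool purged;             /* contents were reclaimed while volatile */
    unsigned long mark_seq;  /* order in which the slot was marked */
};

struct pool {
    struct slot slots[NSLOTS];
    unsigned long next_seq;
};

void pool_init(struct pool *p)
{
    memset(p, 0, sizeof *p);
    p->next_seq = 1;
}

/* Stand-in for marking a range volatile through the proposed vrange(). */
void mark_volatile(struct pool *p, size_t i)
{
    struct slot *s = &p->slots[i];
    if (s->is_volatile)
        return;              /* already marked: keep its original age */
    s->is_volatile = true;
    s->purged = false;
    s->mark_seq = p->next_seq++;
}

/*
 * Stand-in for kernel reclaim under memory pressure: purge the volatile
 * slot that was marked earliest. Returns its index, or -1 if no volatile,
 * unpurged slot exists.
 */
int reclaim_one(struct pool *p)
{
    int victim = -1;
    for (int i = 0; i < NSLOTS; i++) {
        struct slot *s = &p->slots[i];
        if (!s->is_volatile || s->purged)
            continue;
        if (victim < 0 || s->mark_seq < p->slots[victim].mark_seq)
            victim = i;
    }
    if (victim >= 0) {
        p->slots[victim].purged = true;
        memset(p->slots[victim].data, 0, PAGE_SZ);
    }
    return victim;
}

/*
 * Stand-in for unmarking a range. Returns true if the contents survived
 * and can be used directly; false means the caller must reload from disk.
 */
bool unmark_volatile(struct pool *p, size_t i)
{
    struct slot *s = &p->slots[i];
    assert(s->is_volatile);
    bool intact = !s->purged;
    s->is_volatile = false;
    s->purged = false;
    return intact;
}
```

In this model the caller would mark a clean buffer with mark_volatile()
when it stops using it, and later call unmark_volatile() before touching it
again, reloading the page only when that returns false. A different aging
policy would only change how reclaim_one() picks its victim.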
                            Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


