Re: checkpoint writeback via sync_file_range - Mailing list pgsql-hackers

From Robert Haas
Subject Re: checkpoint writeback via sync_file_range
Date
Msg-id CA+TgmobXuvgwNpp3y0vMf6_1n_wDO3SV=DuZC75KM0avEkZ5PA@mail.gmail.com
In response to Re: checkpoint writeback via sync_file_range  (Greg Smith <greg@2ndQuadrant.com>)
List pgsql-hackers
On Tue, Jan 10, 2012 at 11:38 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> What you're doing here doesn't care though, and I hadn't considered that
> SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its
> docs.  Used this way, it's basically fsync without the wait or guarantee; it
> just tries to push what's already dirty further ahead of the write queue
> than those writes would otherwise be.

Well, my goal was to make sure they got into the write queue rather
than just sitting in memory while the kernel twiddles its thumbs.  My
hope is that the kernel is smart enough that, when you put something
under write-out, the kernel writes it out as quickly as it can without
causing too much degradation in foreground activity.  If that turns
out to be an incorrect assumption, we'll need a different approach,
but I thought it might be worth trying something simple first and
seeing what happens.
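
To make that concrete, here is a minimal sketch of the kind of call I have in mind. It assumes Linux and glibc; hint_writeback is a made-up name for illustration, not something in the tree, and this is not the patch itself.

```c
#define _GNU_SOURCE             /* exposes sync_file_range() in <fcntl.h> (Linux/glibc only) */
#include <fcntl.h>
#include <sys/types.h>

/*
 * hint_writeback: hypothetical helper, not a PostgreSQL function.
 *
 * Asks the kernel to start write-out of the dirty pages of fd in the range
 * [offset, offset + nbytes).  Passing nbytes == 0 means "through the end of
 * the file".  With only SYNC_FILE_RANGE_WRITE set, the call does not wait
 * for the I/O to finish and gives no durability guarantee, so a later
 * fsync() is still required.
 *
 * Returns 0 on success, -1 with errno set on failure.
 */
int
hint_writeback(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}
```

A checkpoint would presumably call something like this once for each file that already has a pending fsync request, before it starts the real fsync pass. Whether the kernel then schedules the write-out gently is exactly the open question above.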

> One idea I was thinking about here was building a little hash table inside
> of the fsync absorb code, tracking how many absorb operations have happened
> for whatever the most popular relation files are.  The idea is that we might
> say "use sync_file_range every time <N> calls for a relation have come in",
> just to keep from ever accumulating too many writes to any one file before
> trying to nudge some of it out of there. The bat that keeps hitting me in
> the head here is that right now, a single fsync might have a full 1GB of
> writes to flush out, perhaps because it extended a table and then wrote more
> than that to it.  And in everything but an SSD or giant SAN cache situation,
> 1GB of I/O is just too much to fsync at a time without the OS choking a
> little on it.

That's not a bad idea, but there's definitely some potential
downside: you might end up reducing write-combining quite significantly if
you keep pushing things out to files when it isn't really needed yet.
I was aiming to only push things out when we're 100% sure that they're
going to have to be fsync'd, and certainly any already-written buffers
that are in the OS cache at the start of a checkpoint fall into that
category.  That having been said, experimental evidence is king.
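
For anyone who wants to experiment with the counter idea quoted above, it could be sketched roughly as follows. Everything here is an assumption made for illustration: the names NudgeTable, nudge_init and nudge_absorb, the fixed 64-slot table, and the use of a plain integer as the file identifier. The real fsync absorb code tracks requests differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUDGE_SLOTS 64          /* assumed table size; must be a power of two */

typedef struct
{
    uint32_t    file_id;        /* hypothetical identifier for one relation file */
    uint32_t    pending;        /* requests absorbed since the last nudge */
    int         used;           /* nonzero once the slot has been claimed */
} NudgeSlot;

typedef struct
{
    NudgeSlot   slots[NUDGE_SLOTS];
    uint32_t    threshold;      /* the "<N>" from the quoted proposal */
} NudgeTable;

void
nudge_init(NudgeTable *t, uint32_t threshold)
{
    assert(threshold > 0);
    memset(t, 0, sizeof(*t));
    t->threshold = threshold;
}

/*
 * Record one absorbed fsync request for file_id.  Returns 1 when the caller
 * should issue a writeback hint for that file (its counter is then reset),
 * and 0 otherwise.  If the table is full, untracked files always return 0.
 */
int
nudge_absorb(NudgeTable *t, uint32_t file_id)
{
    uint32_t    start = file_id & (NUDGE_SLOTS - 1);

    for (uint32_t i = 0; i < NUDGE_SLOTS; i++)
    {
        /* linear probing from the file's home slot */
        NudgeSlot  *s = &t->slots[(start + i) & (NUDGE_SLOTS - 1)];

        if (!s->used)
        {
            s->used = 1;
            s->file_id = file_id;
            s->pending = 0;
        }
        if (s->file_id == file_id)
        {
            if (++s->pending >= t->threshold)
            {
                s->pending = 0;
                return 1;
            }
            return 0;
        }
    }
    return 0;
}
```

The threshold is the knob that trades write-combining against burst size: a small value nudges early and often, a large one approaches today's behavior of leaving everything for the final fsync.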

> I'll put this into my testing queue after the upcoming CF starts.

Thanks!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

