Re: Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Linux kernel impact on PostgreSQL performance
Date
Msg-id CA+TgmoZLmYBVpEKy3b+5s+7DbqxOCs66U_FnEWwXrGEVJTK0+g@mail.gmail.com
In response to Re: Linux kernel impact on PostgreSQL performance  (Mel Gorman <mgorman@suse.de>)
Responses Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance  (Dave Chinner <david@fromorbit.com>)
List pgsql-hackers
On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman <mgorman@suse.de> wrote:
>> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
>> setting zone_reclaim_mode; is there some other problem besides that?
>
> Really?
>
> zone_reclaim_mode is often a complete disaster unless the workload is
> partitioned to fit within NUMA nodes. On older kernels enabling it would
> sometimes cause massive stalls. I'm actually very surprised to hear it
> fixes anything and would be interested in hearing more about what sort
> of circumstances would convince you to enable that thing.

By "set" I mean "set to zero".  We've seen multiple of instances of
people complaining about large amounts of system memory going unused
because this setting defaulted to 1.

>> The other thing that comes to mind is the kernel's caching behavior.
>> We've talked a lot over the years about the difficulties of getting
>> the kernel to write data out when we want it to and to not write data
>> out when we don't want it to.
>
> Is sync_file_range() broke?

I don't know.  I think a few of us have played with it and not been
able to achieve a clear win.  Whether the problem is with the system
call or the programmer is harder to determine.  I think the problem is
in part that it's not exactly clear when we should call it.  So
suppose we want to do a checkpoint.  What we used to do a long time
ago is write everything, and then fsync it all, and then call it good.
But that produced horrible I/O storms.  So what we do now is do the
writes over a period of time, with sleeps in between, and then fsync
it all at the end, hoping that the kernel will write some of it before
the fsyncs arrive so that we don't get a huge I/O spike.
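
To make that concrete, the pattern looks roughly like this sketch
(everything collapsed to a single file, with made-up batch sizes and
sleep lengths; this is not our actual code):

    #include <unistd.h>

    #define BLCKSZ 8192

    static void
    spread_checkpoint_writes(int fd, const char *blocks, int nblocks)
    {
        for (int i = 0; i < nblocks; i++)
        {
            /* write() only dirties the OS page cache; nothing is durable yet */
            pwrite(fd, blocks + (size_t) i * BLCKSZ, BLCKSZ, (off_t) i * BLCKSZ);

            /* sleep periodically so the writes trickle out over the whole
             * checkpoint interval instead of arriving in one burst */
            if (i % 128 == 0)
                usleep(10000);
        }

        /* hope the kernel has already written most of it back; whatever
         * is left gets forced out here, and this is where the I/O storm
         * shows up if it hasn't */
        fsync(fd);
    }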

And that sorta works, and it's definitely better than doing it all at
full speed, but it's pretty imprecise.  If the kernel doesn't write
enough of the data out in advance, then there's still a huge I/O storm
when we do the fsyncs and everything grinds to a halt.  If it writes
out more data than needed in advance, it increases the total number of
physical writes because we get less write-combining, and that hurts
performance, too.  I basically feel like the I/O scheduler sucks,
though whether it sucks because it's not theoretically possible to do
any better or whether it sucks because of some more tractable reason
is not clear to me.  As things stand, when I call fsync() a bunch of
times from one process, other processes on the same machine begin to
observe 30+-second (or sometimes 300+-second) times for a read or
write of an 8kB block.  Imagine a hypothetical UNIX-like system where,
when one process starts running at 100% CPU, every other process on
the machine gets a timeslice only once per minute.  That's
obviously ridiculous, and yet it's pretty much exactly what happens
with I/O.

>> When it writes data back to disk too
>> aggressively, we get lousy throughput because the same page can get
>> written more than once when caching it for longer would have allowed
>> write-combining.
>
> Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
> If it's dirty_writeback_centisecs then that would be particularly tricky
> because poor interactions there would come down to luck basically.

See above; I think it's related to fsync.

>> When it doesn't write data to disk aggressively
>> enough, we get huge latency spikes at checkpoint time when we call
>> fsync() and the kernel says "uh, what? you wanted that data *on the
>> disk*? sorry boss!" and then proceeds to destroy the world by starving
>> the rest of the system for I/O for many seconds or minutes at a time.
>
> Ok, parts of that are somewhat expected. It *may* depend on the
> underlying filesystem. Some of them handle fsync better than others. If
> you are syncing the whole file though when you call fsync then you are
> potentially burned by having to writeback dirty_ratio amounts of memory
> which could take a substantial amount of time.

Yeah.  ext3 apparently fsyncs the whole filesystem, which is terrible
for throughput, but if you happen to have xlog (which is flushed
regularly) on the same filesystem as the data files (which are flushed
only periodically) then at least you don't have the problem of the
write queue getting too large.  But I think most of our users are on
ext4 at this point, with some on xfs and other filesystems.

We track the number of un-fsync'd blocks we've written to each file,
and have gotten desperate enough to think of approaches like - ok,
well if the total number of un-fsync'd blocks in the system exceeds
some threshold, then fsync the file with the most such blocks, not
because we really need the data on disk just yet but so that the write
queue won't get too large for the kernel to deal with.  And I think
there may even be some test results from such crocks showing some
benefit.  But really, I don't understand why we have to baby the
kernel like this.  Ensuring scheduling fairness is a basic job of the
kernel; if we wanted to have to control caching behavior manually, we
could use direct I/O.  Having accepted the double buffering that comes
with NOT using direct I/O, ideally we could let the kernel handle
scheduling and call it good.
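
In rough pseudo-C, that crock amounts to something like the following
(every name and the threshold are invented for illustration, not
actual PostgreSQL code):

    #include <unistd.h>

    #define UNSYNCED_BLOCK_LIMIT 65536      /* arbitrary threshold */

    /* hypothetical bookkeeping helpers, assumed to exist for the sketch */
    extern int  total_unsynced_blocks(void);
    extern int  fd_with_most_unsynced_blocks(void);
    extern void mark_fd_synced(int fd);

    static void
    maybe_relieve_write_queue(void)
    {
        /* If too many written-but-not-yet-fsync'd blocks have piled up,
         * sync the single worst offender -- not because the data has to
         * be on disk yet, but to keep the kernel's dirty queue from
         * growing until the eventual fsync becomes catastrophic. */
        if (total_unsynced_blocks() > UNSYNCED_BLOCK_LIMIT)
        {
            int fd = fd_with_most_unsynced_blocks();

            fsync(fd);
            mark_fd_synced(fd);
        }
    }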

>> We've made some desultory attempts to use sync_file_range() to improve
>> things here, but I'm not sure that's really the right tool, and if it
>> is we don't know how to use it well enough to obtain consistent
>> positive results.
>
> That implies that either sync_file_range() is broken in some fashion we
> (or at least I) are not aware of and that needs kicking.

So the problem is - when do you call it?  What happens is: before a
checkpoint, we may have already written some blocks to a file.  During
the checkpoint, we're going to write some more.  At the end of the
checkpoint, we'll need all blocks written before and during the
checkpoint to be on disk.  If we call sync_file_range() at the
beginning of the checkpoint, then in theory that should get the ball
rolling, but we may be about to rewrite some of those blocks, or at
least throw some more on the pile.  If we call sync_file_range() near
the end of the checkpoint, just before calling fsync, there's not
enough time for the kernel to reorder I/O to a sufficient degree to do
any good.  What we want, sorta, is to have the kernel start writing it
out just at the right time to get it on disk by the time we're aiming
to complete the checkpoint, but it's not clear exactly how to do that.
We can't just write all the blocks, sync_file_range(), wait, and then
fsync() because the "write all the blocks" step can trigger an I/O
storm if the kernel decides there's too much dirty data.
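
If there is a right way to use it, I assume it looks something like
calling it on each batch of blocks as the checkpoint writes them,
rather than once at the start or the end -- roughly this sketch
(whether SYNC_FILE_RANGE_WRITE is actually gentle enough under load is
exactly the open question):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Sketch: after the checkpointer has written one batch of blocks to
     * a file, ask the kernel to start writeback on just that range.  The
     * call does not wait for completion and does not flush the drive
     * cache, so the final fsync() is still needed for durability. */
    static void
    start_writeback_of_batch(int fd, off_t batch_start, off_t batch_len)
    {
        sync_file_range(fd, batch_start, batch_len, SYNC_FILE_RANGE_WRITE);
    }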

I suppose what we really want to do during a checkpoint is write data
into the O/S cache at a rate that matches what the kernel can
physically get down to the disk, and have the kernel schedule those
writes in as timely a fashion as it can without disrupting overall
system throughput too much.  But the feedback mechanisms that exist
today are just too crude for that.  You can easily write() to the
point where the whole system freezes up, or equally wait between
write()s when the system could easily have handled more right away.
And it's very hard to tell how much you can fsync() at once before
performance falls off a cliff.  A certain number of writes get
absorbed by various layers of caching between us and the physical
hardware - and then at some point, they're all full, and further
writes lead to disaster.  But I don't know of any way to assess how
close we are to that point at any given time except to cross it, and at
that point, it's too late.
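
About the most direct probe the existing interfaces seem to offer is
to deliberately block on writeback of one range and time it -- purely
a sketch, and itself disruptive enough that I'm not claiming it as a
solution:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <time.h>

    /* Sketch only: measure how long the kernel takes to complete
     * writeback of one range.  A long wait suggests the caches and
     * queues below us are already full -- though by blocking here we
     * become part of the problem we are trying to measure. */
    static double
    probe_writeback_seconds(int fd, off_t start, off_t nbytes)
    {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        sync_file_range(fd, start, nbytes,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }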

>> On a related note, there's also the problem of double-buffering.  When
>> we read a page into shared_buffers, we leave a copy behind in the OS
>> buffers, and similarly on write-out.  It's very unclear what to do
>> about this, since the kernel and PostgreSQL don't have intimate
>> knowledge of what each other are doing, but it would be nice to solve
>> somehow.
>
> If it's mapped and clean and you do not need it any more, then
> madvise(MADV_DONTNEED). If you are accessing the data via a file handle,
> then I would expect posix_fadvise(POSIX_FADV_DONTNEED). Offhand, I do
> not know how it behaved historically but right now it will usually sync
> the data and then discard the pages. I say usually because it will not
> necessarily sync if the storage is congested and there is no guarantee it
> will be discarded. In older kernels, there was a bug where small calls to
> posix_fadvise() would not work at all. This was fixed in 3.9.
>
> The flipside is also meant to hold true. If you know data will be needed
> in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at
> the implementation it does a forced read-ahead on the range of pages of
> interest. It doesn't look like it would block.
>
> The completely different approach for double buffering is direct IO but
> there may be reasons why you are avoiding that and are unhappy with the
> interfaces that are meant to work.
>
> Just from the start, it looks like there are a number of problem areas.
> Some may be fixed -- in which case we should identify what fixed it, what
> kernel version and see can it be verified with a test case or did we
> manage to break something else in the process. Other bugs may still
> exist because we believe some interface works how users want when it is
> in fact unfit for purpose for some reason.

It's all read, not mapped, because we need to prevent pages from
being written back to their backing files until the WAL is fsync'd,
and there's no way to map a file and modify the page but not let it be
written back to disk until some other event happens.  We've
experimented with don't-need but it's tricky.

Here's an example.  Our write-ahead log (WAL) files are all 16MB;
older files eventually cease to be needed for any purpose, but there's
a continued demand for new files driven by database modifications.
Experimentation some years ago
revealed that it's faster to rename and overwrite the old files than
to remove them and create new ones, so that's what we do.  Ideally
this means that at steady state we're just recycling the files over
and over and never creating or destroying any, though I'm not sure
whether we ever actually achieve that ideal.  However, benchmarking
has shown that making the wrong decision about whether to don't-need
those files has a significant effect on performance.  If there's
enough cache around to keep all the files in memory, then we don't
want to don't-need them because then access will be slow when the old
files are recycled.  If however there is cache pressure then we want
to don't-need them as quickly as possible to make room for other,
higher priority data.
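
For concreteness, the don't-need in question is just this sort of call
against a whole 16MB segment once we've decided it isn't worth caching
any longer (a sketch; deciding *when* to make the call is the hard
part):

    #include <fcntl.h>

    #define WAL_SEGMENT_SIZE (16 * 1024 * 1024)

    /* Sketch: drop an old WAL segment from the OS cache.  A win under
     * cache pressure, a loss if the segment would have stayed cached
     * and is about to be renamed, recycled, and rewritten. */
    static void
    dontneed_wal_segment(int fd)
    {
        posix_fadvise(fd, 0, WAL_SEGMENT_SIZE, POSIX_FADV_DONTNEED);
    }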

Now that may not really be the kernel's fault; it's a general property
of ring buffers that you want an LRU policy if they fit in cache
and immediate eviction of everything but the active page if they
don't.  But I think it demonstrates the general difficulty of using
posix_fadvise.  Similar cases arise for prefetching: gee, we'd like to
prefetch this data because we're going to use it soon, but if the
system is under enough pressure, the data may get evicted again before
"soon" actually arrives.

Thanks for taking the time to write all of these comments, and listen
to our concerns.  I really appreciate it, whether anything tangible
comes of it or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


