Re: Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers

From Mel Gorman
Subject Re: Linux kernel impact on PostgreSQL performance
Date
Msg-id 20140113222645.GM27046@suse.de
In response to Re: Linux kernel impact on PostgreSQL performance  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote:
> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> > I notice, Josh, that you didn't mention the problems many people
> > have run into with Transparent Huge Page defrag and with NUMA
> > access.
> 

Ok, there are at least three potential problems there that you may or
may not have run into.

First, THP when it was first introduced was a bit of a disaster. In 3.0,
it was *very* heavy handed and would trash the system reclaiming memory
to satisfy an allocation. When it did this, it would also write back a
bunch of data and block on it to boot. It was not the smartest move of
all time, but it was improved over time and in some cases the fixes were
also backported to the 3.0-stable series by 3.0.101. This is a problem
that should have been alleviated over time.

The general symptoms of the problem would be massive stalls and
monitoring the /proc/PID/stack of interesting processes would show it to
be somewhere in do_huge_pmd_anonymous_page -> alloc_pages_nodemask ->
try_to_free_pages -> migrate_pages or something similar. You may have
worked around it by disabling THP with a command line switch or
/sys/kernel/mm/transparent_hugepage/enabled in the past.

This is "not meant to happen" any more or at least it has been a while
since a bug was filed against me in this area. There are corner cases
though. If the underlying filesystem is NFS, the problem might still be
experienced.

That is the simple case.

You might have also hit the case where THP pages filled with zeros did not
use the zero page. That would have looked like a larger memory footprint
than anticipated and led to another range of problems. This has also since
been addressed, but maybe not recently enough. It's less likely this is
your problem though, as I expect you actually use your buffers rather than
leave them filled with zeros.

You mention NUMA, but that problem is trickier to figure out without more
context. THP can cause unexpected interleaving between NUMA nodes. Memory
that would have been local on a 4K page boundary becomes remote accesses
when THP is enabled and performance would be hit (maybe 3-5% depending on
the machine). It's not the only possibility though. If memory was being
used sparsely and THP was in use then the overall memory footprint may be
higher than it should be. This potentially would cause allocations to spill
over to remote nodes while kswapd wakes up to reclaim local memory. That
would lead to weird buffer aging inversion problems. This is a hell of a
lot of guessing though and we'd need a better handle on the reproduction
case to pin it down.

> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
> setting zone_reclaim_mode; is there some other problem besides that?
> 

Really?

zone_reclaim_mode is often a complete disaster unless the workload is
partitioned to fit within NUMA nodes. On older kernels enabling it would
sometimes cause massive stalls. I'm actually very surprised to hear it
fixes anything and would be interested in hearing more about what sort
of circumstances would convince you to enable that thing.

> The other thing that comes to mind is the kernel's caching behavior.
> We've talked a lot over the years about the difficulties of getting
> the kernel to write data out when we want it to and to not write data
> out when we don't want it to. 

Is sync_file_range() broken?

> When it writes data back to disk too
> aggressively, we get lousy throughput because the same page can get
> written more than once when caching it for longer would have allowed
> write-combining. 

Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
If it's dirty_writeback_centisecs then that would be particularly tricky
because poor interactions there would come down to luck basically.

> When it doesn't write data to disk aggressively
> enough, we get huge latency spikes at checkpoint time when we call
> fsync() and the kernel says "uh, what? you wanted that data *on the
> disk*? sorry boss!" and then proceeds to destroy the world by starving
> the rest of the system for I/O for many seconds or minutes at a time.

Ok, parts of that are somewhat expected. It *may* depend on the
underlying filesystem. Some of them handle fsync better than others. If
you are syncing the whole file though when you call fsync then you are
potentially burned by having to writeback dirty_ratio amounts of memory
which could take a substantial amount of time.

> We've made some desultory attempts to use sync_file_range() to improve
> things here, but I'm not sure that's really the right tool, and if it
> is we don't know how to use it well enough to obtain consistent
> positive results.
> 

That implies that sync_file_range() is broken in some fashion we (or at
least I) are not aware of, and that needs kicking.

> On a related note, there's also the problem of double-buffering.  When
> we read a page into shared_buffers, we leave a copy behind in the OS
> buffers, and similarly on write-out.  It's very unclear what to do
> about this, since the kernel and PostgreSQL don't have intimate
> knowledge of what each other are doing, but it would be nice to solve
> somehow.
> 

If the data is mapped and clean, you do not need any more than
madvise(MADV_DONTNEED). If you are accessing the data via a file handle,
then I would expect posix_fadvise(POSIX_FADV_DONTNEED). Offhand, I do
not know how it behaved historically but right now it will usually sync
the data and then discard the pages. I say usually because it will not
necessarily sync if the storage is congested and there is no guarantee it
will be discarded. In older kernels, there was a bug where small calls to
posix_fadvise() would not work at all. This was fixed in 3.9.

The flipside is also meant to hold true. If you know data will be needed
in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at
the implementation it does a forced read-ahead on the range of pages of
interest. It doesn't look like it would block.

The completely different approach to double buffering is direct IO, but
there may be reasons why you are avoiding that and are unhappy with the
interfaces that are meant to work.

Just from the start, it looks like there are a number of problem areas.
Some may be fixed -- in which case we should identify what fixed it and in
what kernel version, and see whether it can be verified with a test case
or whether we managed to break something else in the process. Other bugs
may still exist because we believe some interface works the way users want
when it is in fact unfit for purpose for some reason.

-- 
Mel Gorman
SUSE Labs


