Re: Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers

From: Mel Gorman
Subject: Re: Linux kernel impact on PostgreSQL performance
Date: 2014-01-14
Msg-id: 20140114102143.GA4963@suse.de
In response to: Re: Linux kernel impact on PostgreSQL performance (Josh Berkus <josh@agliodbs.com>)
Responses: Re: Linux kernel impact on PostgreSQL performance
List: pgsql-hackers
On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote:
> On 01/13/2014 02:26 PM, Mel Gorman wrote:
> > Really?
> >
> > zone_reclaim_mode is often a complete disaster unless the workload is
> > partitioned to fit within NUMA nodes. On older kernels enabling it
> > would sometimes cause massive stalls. I'm actually very surprised to
> > hear it fixes anything and would be interested in hearing more about
> > what sort of circumstances would convince you to enable that thing.
>
> So the problem with the default setting is that it pretty much isolates
> all FS cache for PostgreSQL to whichever socket the postmaster is
> running on, and makes the other FS cache unavailable.

I'm not being pedantic, but the default depends on the NUMA
characteristics of the machine, so I need to know whether it was enabled
or disabled. Some machines default zone_reclaim_mode to 0 and others
default it to 1. In my experience the majority of bugs involving
zone_reclaim_mode were due to it being enabled by default. If I see a bug
involving a file-based workload on a NUMA machine with stalls and/or
excessive IO while there is plenty of memory free, then zone_reclaim_mode
is the first thing I check (a sketch of such a check is at the end of
this mail).

I'm guessing from context that in your experience it gets enabled by
default on the machines you care about. This would indeed limit FS cache
usage to the node where the process initiating the IO runs (the
postmaster, I guess).

> This means that, for example, if you have two memory banks, then only
> one of them is available for PostgreSQL filesystem caching ...
> essentially cutting your available cache in half.
>
> And however slow moving cached pages between memory banks is, it's an
> order of magnitude faster than moving them from disk. But this isn't
> how the NUMA stuff is configured; it seems to assume that it's less
> expensive to get pages from disk than to move them between banks, so

Yes, this is right. The history behind this "logic" is that it was
assumed NUMA machines would only ever be used for HPC and that the
workloads would always be partitioned to run within NUMA nodes. This has
not been the case for a long time, and I would argue that we should leave
that thing disabled by default in all cases. The last time I tried, this
was met with resistance, but maybe it's time to try again.

> whatever you've got cached on the other bank, it flushes it to disk as
> fast as possible. I understand the goal was to make memory usage local
> to the processors stuff was running on, but that includes an implicit
> assumption that no individual process will ever want more than one
> memory bank worth of cache.
>
> So disabling all of the NUMA optimizations is the way to go for any
> workload I personally deal with.

I would hesitate to recommend disabling "all" of them on the grounds that
zone_reclaim_mode is brain damage, and I'd hate to lump all the tuning
parameters into the same box.

There is an interesting side-line here. If all IO is initiated by one
process in postgres, then memory locality will be sub-optimal: the
consumer of the data may or may not be running on the same node as the
process that read the data from disk. It is possible to migrate the data
from user space, but the interface is clumsy and assumes the data is
mapped. Automatic NUMA balancing does not help here, because it also
depends on the data being mapped; it does nothing for data accessed via
read/write. There is nothing fundamental that prevents this; it was not
implemented because it was not deemed important enough.
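As promised above, a minimal sketch of checking the NUMA layout and the
current zone_reclaim_mode from userspace. This is an illustration, not
something from the original mail: it assumes libnuma and its headers are
installed (link with -lnuma), and that /proc is mounted in the usual
place.

    /*
     * Diagnostic sketch: report the NUMA node count and the current
     * zone_reclaim_mode. Build with: cc -o numacheck numacheck.c -lnuma
     */
    #include <stdio.h>
    #include <numa.h>

    int
    main(void)
    {
        FILE   *f;
        int     mode;

        if (numa_available() < 0)
        {
            printf("NUMA is not available on this machine\n");
            return 0;
        }
        printf("NUMA nodes: %d\n", numa_max_node() + 1);

        /* The kernel exports the setting via procfs; 0 means disabled. */
        f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");
        if (f == NULL)
        {
            perror("fopen /proc/sys/vm/zone_reclaim_mode");
            return 1;
        }
        if (fscanf(f, "%d", &mode) == 1)
            printf("zone_reclaim_mode: %d (%s)\n", mode,
                   mode == 0 ? "disabled" : "enabled");
        fclose(f);
        return 0;
    }

Writing 0 to the same proc file disables it at runtime, and setting
vm.zone_reclaim_mode = 0 in /etc/sysctl.conf makes that persistent.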
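To make the clumsiness concrete, here is a rough sketch of what
user-space migration looks like with move_pages(2). Again a sketch, not
from the original mail: the buffer, its size, and the target node are
invented for illustration, error handling is minimal, and it only works
at all because the caller has the buffer mapped and faulted in.

    /*
     * Sketch of migrating an already-mapped buffer to another NUMA node
     * with move_pages(2). Build with -lnuma.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <numaif.h>

    static void
    migrate_buffer(void *buf, size_t len, int target_node)
    {
        long            pagesize = sysconf(_SC_PAGESIZE);
        unsigned long   npages = (len + pagesize - 1) / pagesize;
        void          **pages = malloc(npages * sizeof(void *));
        int            *nodes = malloc(npages * sizeof(int));
        int            *status = malloc(npages * sizeof(int));
        unsigned long   i;

        for (i = 0; i < npages; i++)
        {
            pages[i] = (char *) buf + i * pagesize;  /* must be mapped */
            nodes[i] = target_node;
        }

        /* pid 0 == self; MPOL_MF_MOVE moves pages used only by us */
        if (move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE) != 0)
            perror("move_pages");
        else
            printf("page 0 now on node %d\n", status[0]);

        free(pages);
        free(nodes);
        free(status);
    }

    int
    main(void)
    {
        size_t  len = 16 * sysconf(_SC_PAGESIZE);
        char   *buf = malloc(len);
        size_t  i;

        for (i = 0; i < len; i++)   /* fault the pages in first */
            buf[i] = 0;
        migrate_buffer(buf, len, 1);    /* node 1 is an assumption */
        free(buf);
        return 0;
    }

Note what this cannot do: data that only ever lives in the page cache via
read()/write() is never mapped into the process, so there is nothing for
move_pages() to operate on. That is exactly the gap described above.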
The amount of effort spent on addressing this would depend on how
important NUMA locality is for postgres performance.

-- 
Mel Gorman
SUSE Labs