Re: Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers

From Jim Nasby
Subject Re: Linux kernel impact on PostgreSQL performance
Date
Msg-id 52D601B2.70806@nasby.net
In response to Re: Linux kernel impact on PostgreSQL performance  (Mel Gorman <mgorman@suse.de>)
Responses Re: Linux kernel impact on PostgreSQL performance
List pgsql-hackers
On 1/14/14, 4:21 AM, Mel Gorman wrote:
> There is an interesting side-line here. If all IO is initiated by one
> process in postgres then the memory locality will be sub-optimal.
> The consumer of the data may or may not be running on the same
> node as the process that read the data from disk. It is possible to
> migrate this from user space but the interface is clumsy and assumes the
> data is mapped.

That's really not the case in Postgres. There are essentially three main areas for IO requests to come from:

- Individual "backends". These are processes forked off of our startup process (postmaster) for the purpose of serving user connections (a rough sketch of that fork-per-connection model follows this list). This is always "foreground" IO and should be avoided as much as possible (but is still a large percentage).
- autovacuum. This is a set of "clean-up" processes, meant to be low impact and background only. Similar to garbage collection in GC languages.
 
- bgwriter. This process is meant to greatly reduce the need for user backends to write data out.
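
Just to make the first bullet a bit more concrete, here's a very rough fork-per-connection sketch in C. This is NOT Postgres source (the port number and the trivial handler are invented); it only shows why each backend ends up issuing its own "foreground" IO from its own forked process:

/* A very rough fork-per-connection sketch -- not Postgres source.  The
 * port number and the trivial handler are made up; the point is just
 * that each "backend" is a separate forked process doing its own IO. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/wait.h>

static void handle_connection(int fd)
{
    /* A real backend would speak the query protocol here; all of its
     * reads and writes are "foreground" IO issued by this child. */
    const char msg[] = "hello from backend\n";
    (void) write(fd, msg, sizeof(msg) - 1);
    close(fd);
}

int main(void)
{
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(54321);              /* arbitrary example port */

    if (listen_fd < 0 ||
        bind(listen_fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(listen_fd, 16) < 0)
    {
        perror("socket/bind/listen");
        return 1;
    }

    for (;;)
    {
        int conn_fd = accept(listen_fd, NULL, NULL);

        if (conn_fd < 0)
            continue;
        if (fork() == 0)                       /* child: the "backend" */
        {
            close(listen_fd);
            handle_connection(conn_fd);
            _exit(0);
        }
        close(conn_fd);                        /* parent: the "postmaster" */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;                                  /* reap finished children */
    }
}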

Generally speaking, read requests are most likely to come from user backends. autovacuum can issue them too, but it's got a throttling mechanism, so it generally shouldn't be that much of the workload.
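
Roughly, the throttling works like this (toy C, not the real implementation; the constants just echo the spirit of the vacuum_cost_limit and vacuum_cost_delay settings):

/* Toy cost-based throttling loop; the constants only mirror the spirit
 * of vacuum_cost_limit / vacuum_cost_delay, this is not the real code. */
#include <unistd.h>

static int cost_balance = 0;

static const int cost_page_miss = 10;   /* page had to come from disk */
static const int cost_limit     = 200;  /* cf. vacuum_cost_limit */
static const int delay_ms       = 20;   /* cf. vacuum_cost_delay */

/* Called after each page the (auto)vacuum worker touches. */
static void vacuum_throttle(void)
{
    cost_balance += cost_page_miss;
    if (cost_balance >= cost_limit)
    {
        usleep(delay_ms * 1000);        /* sleep so the IO gets spread out */
        cost_balance = 0;
    }
}

int main(void)
{
    for (int page = 0; page < 1000; page++)
        vacuum_throttle();              /* simulate scanning 1000 pages */
    return 0;
}

The net effect is that autovacuum's reads and writes trickle out instead of arriving in one burst.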
 

Ideally most write traffic would come from bgwriter (and autovacuum, though again we don't care too much about it). In reality though, that's going to depend very heavily on a user's actual workload. To start, backends normally must write all write-ahead log (WAL) traffic before they finalize (COMMIT) a transaction for the user. COMMIT is sort of similar in idea to fsync... "When this returns, I guarantee I've permanently stored your data."
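
A toy sketch of that ordering (file name and record contents invented; the real WAL machinery is far more involved):

/* Toy version of that rule: write the WAL, fsync it, and only then tell
 * the client its commit is durable. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int wal_fd = -1;

static void wal_append(const char *rec, size_t len)
{
    (void) write(wal_fd, rec, len);     /* may still sit in the OS page cache */
}

static void commit_transaction(void)
{
    wal_append("COMMIT\n", 7);
    if (fsync(wal_fd) != 0)             /* the "COMMIT is like fsync" moment */
        perror("fsync");
    printf("COMMIT acknowledged\n");    /* only now is the client told */
}

int main(void)
{
    wal_fd = open("example_wal.seg", O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (wal_fd < 0)
    {
        perror("open");
        return 1;
    }
    wal_append("INSERT ...\n", 11);     /* WAL for the actual data change */
    commit_transaction();
    close(wal_fd);
    return 0;
}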
 

The amount of WAL data generated for a transaction will vary enormously, even as a percentage of raw page data written. In some cases a very small (10s-100s of bytes) amount of WAL data will cover 1 or more base data pages (8k by default, up to 64k). But to protect against torn page writes, by default we write a complete copy of a data page to WAL the first time the page is dirtied after a checkpoint. So the opposite scenario is we actually write slightly MORE data to WAL than we do to the data pages.
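
A simplified illustration of that decision (types, names and LSN values are all made up; it just shows when a full 8k image goes into WAL versus a small delta record):

/* The first time a page is dirtied after a checkpoint, the whole 8k page
 * goes into WAL; later changes to the same page only need a small delta
 * record.  Everything here is invented for the sketch. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 8192

typedef uint64_t lsn_t;

typedef struct
{
    lsn_t page_lsn;                        /* LSN of last WAL record touching it */
    char  data[PAGE_SIZE];
} page_t;

static lsn_t checkpoint_redo_lsn = 1000;   /* set by the last checkpoint */

static size_t wal_bytes_for_change(page_t *page, size_t delta_len, lsn_t now)
{
    size_t bytes;

    if (page->page_lsn < checkpoint_redo_lsn)
        bytes = PAGE_SIZE + delta_len;     /* torn-page protection: full image */
    else
        bytes = delta_len;                 /* just the few bytes that changed */

    page->page_lsn = now;
    return bytes;
}

int main(void)
{
    page_t page;
    memset(&page, 0, sizeof(page));
    page.page_lsn = 500;                   /* last touched before the checkpoint */

    printf("first change after checkpoint: %zu WAL bytes\n",
           wal_bytes_for_change(&page, 64, 2000));
    printf("second change, same page:      %zu WAL bytes\n",
           wal_bytes_for_change(&page, 64, 2001));
    return 0;
}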
 

What makes WAL even trickier is that bgwriter tries to write WAL data out before backends need to. In a system with a fairly low transaction rate that can work... but with a higher rate, most WAL data will be written by a backend trying to issue a COMMIT. Note however that COMMIT needs to write ALL WAL data up to a given point, so one backend that only needs to write 100 bytes can easily end up flushing (and fsync'ing) megabytes of data written by some other backend.
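
In made-up C, with the fsync of the WAL segment reduced to plain LSN bookkeeping so the amplification is easy to see:

/* "Flush everything up to my commit": the backend with the tiny commit
 * record pays for the other backend's megabytes.  All names invented. */
#include <stdio.h>
#include <stdint.h>

typedef uint64_t lsn_t;

static lsn_t wal_insert_lsn  = 0;   /* how much WAL has been generated  */
static lsn_t wal_flushed_lsn = 0;   /* how much has been fsync'd so far */

static lsn_t wal_insert(size_t nbytes)
{
    wal_insert_lsn += nbytes;
    return wal_insert_lsn;
}

static void wal_flush_upto(lsn_t target)
{
    if (target > wal_flushed_lsn)
    {
        printf("flushing %llu bytes to reach LSN %llu\n",
               (unsigned long long) (target - wal_flushed_lsn),
               (unsigned long long) target);
        wal_flushed_lsn = target;   /* stand-in for write() + fsync() */
    }
}

int main(void)
{
    wal_insert(4 * 1024 * 1024);          /* bulk load by another backend */
    lsn_t my_commit = wal_insert(100);    /* my tiny 100-byte commit record */

    /* My COMMIT can't return until everything up to my record is durable,
     * so I end up flushing the other backend's megabytes too. */
    wal_flush_upto(my_commit);
    return 0;
}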
 

Further complicating things is temporary storage, either in the form of user-defined temporary tables, or temporary storage needed by the database itself. It's hard to characterize these workloads other than to say that typically reading and writing to them will want to move a relatively large amount of data at once.
 

BTW, because Postgres doesn't have terribly sophisticated memory management, it's very common to create temporary file data that will never, ever, ever actually NEED to hit disk. Where I work, being able to tell the kernel to avoid flushing those files unless it thinks it's got better things to do with that memory would be EXTREMELY valuable, because it's all temp data anyway: if the database or server crashes it's just going to get thrown away. It might be a good idea for Postgres to look at simply putting this data into plain memory now and relying on the OS to swap it as needed. That'd be more problematic for temp tables, but in that case mmap might work very well, because that data is currently never shared by other processes, though if we start doing parallel query execution that will change.
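
For what it's worth, the "plain memory" idea could look something like this (purely illustrative; size and usage made up):

/* Back the temp area with anonymous mmap instead of a file.  Anonymous
 * memory is swap-backed, so the kernel never *has* to write it out unless
 * it's under memory pressure, and it simply vanishes if the process dies. */
#define _DEFAULT_SOURCE                       /* MAP_ANONYMOUS on glibc */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL * 1024 * 1024;          /* 64MB of scratch space */

    char *scratch = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (scratch == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    memset(scratch, 0x5a, len);               /* pretend this is a sort spill */
    printf("temp area at %p, %zu bytes, never fsync'd\n",
           (void *) scratch, len);

    munmap(scratch, len);
    return 0;
}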
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net


