Thread: O_DIRECT for relations and SLRUs (Prototype)
Hi all,
(Added Kevin in CC.)

There have been discussions over the years about getting better O_DIRECT support to close the gap with other players in the database market, but I have not actually seen on these lists a patch which makes use of O_DIRECT for relations and SLRUs (perhaps I missed it; anyway such a patch would most likely conflict now). Attached is a toy patch that I have begun using for tests in this area. It is nothing really serious at this stage, but you can use it if you would like to see the impact of O_DIRECT. Of course, things get significantly slower. The patch compiles, passes regression tests, and looks stable, so it is usable for experiments. The patch uses a GUC called direct_io, set to true to ease regression testing when applying it. Note that pg_attribute_aligned() cannot be used, as that's not an option with clang and a couple of other compilers as far as I know, so the patch uses a simple set of placeholder buffers large enough to be aligned with the OS pages — the alignment should be 4kB on Linux by the way, and not BLCKSZ — but for WAL's O_DIRECT we don't really care much about such details.

If there is interest in such things, perhaps we could get a patch sorted out, with some angles of attack like:
- Move to page-aligned buffers for relations and SLRUs.
- Split the use of O_DIRECT for SLRUs and relations into separate GUCs.
- Perhaps other things.

However this is a large and very controversial topic, and of course more complex than the experiment attached; still, this prototype is fun to play with.

Thanks for reading!
--
Michael
Attachment
Hi!

> On 12 Jan 2019, at 9:46, Michael Paquier <michael@paquier.xyz> wrote:
>
> Attached is a toy patch that I have begun using for tests in this
> area. That's nothing really serious at this stage, but you can use
> that if you would like to see the impact of O_DIRECT. Of course,
> things get significantly slower.

Cool! I've just gathered a group of students to task them with experimenting with shared buffer eviction algorithms during their February internship at the Yandex-Sirius edu project. Your patch seems very handy for benchmarks in this area. Thanks!

Best regards, Andrey Borodin.
On Sun, Jan 13, 2019 at 5:13 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Hi!
>
> > On 12 Jan 2019, at 9:46, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > Attached is a toy patch that I have begun using for tests in this
> > area. That's nothing really serious at this stage, but you can use
> > that if you would like to see the impact of O_DIRECT. Of course,
> > things get significantly slower.
>
> Cool!
> I've just gathered a group of students to task them with experimenting with shared buffer eviction algorithms during their February internship at the Yandex-Sirius edu project. Your patch seems very handy for benchmarks in this area.

+1, thanks for sharing the patch. Even though just turning on O_DIRECT is the trivial part of this project, it's good to encourage discussion. We may indeed become more sensitive to the quality of buffer eviction algorithms, but it seems like the main work to regain lost performance will be the background IO scheduling piece:

1. We need a new "bgreader" process to do read-ahead. I think you'd want a way to tell it with explicit hints (for example, perhaps sequential scans would advertise that they're reading sequentially so that it starts to slurp future blocks into the buffer pool, and streaming replicas might look ahead in the WAL and tell it what's coming). In theory this might be better than the heuristics OSes use to guess our access pattern and pre-fetch into the page cache, since we have better information (and of course we're skipping a buffer layer).

2. We need a new kind of bgwriter/syncer that aggressively creates clean pages so that foreground processes rarely have to evict (since that is now super slow), but also efficiently finds ranges of dirty blocks that it can write in big sequential chunks.

3. We probably want SLRUs to use the main buffer pool, instead of their own mini-pools, so they can benefit from the above.

Whether we need multiple bgreader and bgwriter processes or perhaps a general IO scheduler process may depend on whether we also want to switch to async (multiplexing from a single process). Starting simple with traditional sync IO and N processes seems OK to me.
--
Thomas Munro
http://www.enterprisedb.com
On Sun, Jan 13, 2019 at 10:35:55AM +1300, Thomas Munro wrote:
> 1. We need a new "bgreader" process to do read-ahead. I think you'd
> want a way to tell it with explicit hints (for example, perhaps
> sequential scans would advertise that they're reading sequentially so
> that it starts to slurp future blocks into the buffer pool, and
> streaming replicas might look ahead in the WAL and tell it what's
> coming). In theory this might be better than the heuristics OSes use
> to guess our access pattern and pre-fetch into the page cache, since
> we have better information (and of course we're skipping a buffer
> layer).

Yes, that could be interesting mainly for analytics, by being able to snipe better than the OS readahead.

> 2. We need a new kind of bgwriter/syncer that aggressively creates
> clean pages so that foreground processes rarely have to evict (since
> that is now super slow), but also efficiently finds ranges of dirty
> blocks that it can write in big sequential chunks.

Okay, that's a new idea. A bgwriter able to do syncs in chunks would also be interesting with O_DIRECT, no?

> 3. We probably want SLRUs to use the main buffer pool, instead of
> their own mini-pools, so they can benefit from the above.

Wasn't there a thread about that on -hackers actually? I cannot see any reference to it.

> Whether we need multiple bgreader and bgwriter processes or perhaps a
> general IO scheduler process may depend on whether we also want to
> switch to async (multiplexing from a single process). Starting simple
> with a traditional sync IO and N processes seems OK to me.

So you mean that we could just have a simple switch as a first step? Or I misunderstood you :)

One of the reasons why I have begun this thread is that, since we have heard about the fsync issues on Linux, I think there is room for giving our user base more control of their fate without relying on the Linux community's decisions to potentially eat data and corrupt a cluster with a page's dirty bit cleared without its data actually flushed. Even the latest kernels are not fixing all the patterns with open fds across processes, switching the problem from one corner of the table to another, and there are folks patching the Linux kernel to make Postgres more reliable from this perspective, and living happily with this option. As long as the option can be controlled and defaults to false, it seems to me that we could do something. Even if the performance is bad, this gives users control of how they want things to be done.
--
Michael
Attachment
> On 13 Jan 2019, at 14:02, Michael Paquier <michael@paquier.xyz> wrote:
>
>> 3. We probably want SLRUs to use the main buffer pool, instead of
>> their own mini-pools, so they can benefit from the above.
>
> Wasn't there a thread about that on -hackers actually? I cannot see
> any reference to it.

I think it's here: https://www.postgresql.org/message-id/flat/CAEepm%3D0o-%3Dd8QPO%3DYGFiBSqq2p6KOvPVKG3bggZi5Pv4nQw8nw%40mail.gmail.com#bacee3e6612c53c31658b18650e7ffd9

> As long as the option can be controlled and
> defaults to false, it seems to me that we could do something. Even if
> the performance is bad, this gives users control of how they
> want things to be done.

I like the idea of having this switch; I believe it will make development in this direction easier. But I think there will be complaints from users like "this feature is done wrong" due to really bad performance.

Best regards, Andrey Borodin.
On Sun, Jan 13, 2019 at 10:02 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Sun, Jan 13, 2019 at 10:35:55AM +1300, Thomas Munro wrote:
> > 1. We need a new "bgreader" process to do read-ahead. I think you'd
> > want a way to tell it with explicit hints (for example, perhaps
> > sequential scans would advertise that they're reading sequentially so
> > that it starts to slurp future blocks into the buffer pool, and
> > streaming replicas might look ahead in the WAL and tell it what's
> > coming). In theory this might be better than the heuristics OSes use
> > to guess our access pattern and pre-fetch into the page cache, since
> > we have better information (and of course we're skipping a buffer
> > layer).
>
> Yes, that could be interesting mainly for analytics by being able to
> snipe better than the OS readahead.
>
> > 2. We need a new kind of bgwriter/syncer that aggressively creates
> > clean pages so that foreground processes rarely have to evict (since
> > that is now super slow), but also efficiently finds ranges of dirty
> > blocks that it can write in big sequential chunks.
>
> Okay, that's a new idea. A bgwriter able to do syncs in chunks would
> be also interesting with O_DIRECT, no?

Well, I'm just describing the stuff that the OS is doing for us in another layer. Evicting dirty buffers currently consists of a buffered pwrite(), which we can do a huge number of per second (given enough spare RAM), but with O_DIRECT | O_SYNC we'll be limited by storage device random IOPS, so workloads that evict dirty buffers in foreground processes regularly will suffer. bgwriter should make sure we always find clean buffers without waiting when we need them. Yeah, I think pwrite() larger than 8KB at a time would be a goal, to get large IO request sizes all the way down to the storage.

> > 3. We probably want SLRUs to use the main buffer pool, instead of
> > their own mini-pools, so they can benefit from the above.
>
> Wasn't there a thread about that on -hackers actually? I cannot see
> any reference to it.

https://www.postgresql.org/message-id/flat/20180814213500.GA74618%4060f81dc409fc.ant.amazon.com

> > Whether we need multiple bgreader and bgwriter processes or perhaps a
> > general IO scheduler process may depend on whether we also want to
> > switch to async (multiplexing from a single process). Starting simple
> > with a traditional sync IO and N processes seems OK to me.
>
> So you mean that we could just have a simple switch as a first step?
> Or I misunderstood you :)

I just meant that if we take over all the read-ahead and write-behind work and use classic synchronous IO syscalls like pread()/pwrite(), we'll probably need multiple processes to do it, depending on how much IO concurrency the storage layer can take.
--
Thomas Munro
http://www.enterprisedb.com
From: Michael Paquier [mailto:michael@paquier.xyz]
> One of the reasons why I have begun this thread is that since we have heard
> about the fsync issues on Linux, I think that there is room for giving our
> user base more control of their fate without relying on the Linux community
> decisions to potentially eat data and corrupt a cluster with a page dirty
> bit cleared without its data actually flushed. Even the latest kernels
> are not fixing all the patterns with open fds across processes, switching
> the problem from one corner of the table to another, and there are folks
> patching the Linux kernel to make Postgres more reliable from this
> perspective, and living happily with this option. As long as the option
> can be controlled and defaults to false, it seems to me that we could do
> something. Even if the performance is bad, this gives the user control
> of how he/she wants things to be done.

Thank you for starting an interesting topic. We probably want direct I/O. On an INSERT- and UPDATE-heavy system with PostgreSQL 9.2, we suffered from occasional high response times due to Linux page cache activity. Postgres processes competed for the page cache to read/write the data files, write online and archive WAL files, and write the server log files (auto_explain and autovacuum workers emitted a lot of logs). A user with Oracle experience asked why PostgreSQL doesn't handle database I/O by itself...

And I wonder how useful direct I/O would be for low-latency devices like persistent memory. The overhead of the page cache may become relatively higher.

Regards
Takayuki Tsunakawa
> On 12 Jan 2019, at 9:46, Michael Paquier <michael@paquier.xyz> wrote:
>
> Note that pg_attribute_aligned() cannot be used as that's not an
> option with clang and a couple of other compilers as far as I know, so
> the patch uses a simple set of placeholder buffers large enough to be
> aligned with the OS pages, which should be 4k for Linux by the way,
> and not set to BLCKSZ, but for WAL's O_DIRECT we don't really care
> much with such details.

Is it possible to avoid those memcpy's by aligning the available buffers instead? I couldn't understand this from the patch and this thread.

Best regards, Andrey Borodin.
On Tue, Jan 15, 2019 at 11:19:48AM +0500, Andrey Borodin wrote:
> Is it possible to avoid those memcpy's by aligning available buffers
> instead? I couldn't understand this from the patch and this thread.

Sure, it had better do that. That's just a lazy implementation.
--
Michael
Attachment
On 1/15/19 11:28 AM, Michael Paquier wrote:
> On Tue, Jan 15, 2019 at 11:19:48AM +0500, Andrey Borodin wrote:
>> Is it possible to avoid those memcpy's by aligning available buffers
>> instead? I couldn't understand this from the patch and this thread.
> Sure, it had better do that. That's just a lazy implementation.

Hi! Could you specify all the cases in which buffers will not be aligned with BLCKSZ? AFAIK shared and temp buffers are aligned. Which ones are not?

--
Regards, Maksim Milyutin
On Tue, Jan 15, 2019 at 07:40:12PM +0300, Maksim Milyutin wrote:
> Could you specify all the cases in which buffers will not be aligned with BLCKSZ?
>
> AFAIK shared and temp buffers are aligned. Which ones are not?

SLRU buffers are not aligned with the OS pages (that is, not aligned to at least 4096 bytes). There are also a bunch of code paths where the callers of mdread() or mdwrite() don't do that, which makes a correct patch more invasive.
--
Michael
Attachment
On Sat, Jan 12, 2019 at 4:36 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> 1. We need a new "bgreader" process to do read-ahead. I think you'd
> want a way to tell it with explicit hints (for example, perhaps
> sequential scans would advertise that they're reading sequentially so
> that it starts to slurp future blocks into the buffer pool, and
> streaming replicas might look ahead in the WAL and tell it what's
> coming). In theory this might be better than the heuristics OSes use
> to guess our access pattern and pre-fetch into the page cache, since
> we have better information (and of course we're skipping a buffer
> layer).

Right, like if we're reading the end of relation file 16384, we can prefetch the beginning of 16384.1, but the OS won't know to do that.

> 2. We need a new kind of bgwriter/syncer that aggressively creates
> clean pages so that foreground processes rarely have to evict (since
> that is now super slow), but also efficiently finds ranges of dirty
> blocks that it can write in big sequential chunks.

Yeah.

> 3. We probably want SLRUs to use the main buffer pool, instead of
> their own mini-pools, so they can benefit from the above.

Right. I think this is important, and it makes me think that maybe Michael's patch won't help us much in the end. I believe that the number of pages needed for clog data, at least, can vary significantly depending on workload and machine size, so there's no one number there that is going to work for everybody, and the algorithms the SLRU code uses for page management have O(n) stuff in them, so they don't scale well to large numbers of SLRU buffers anyway. I think we should try to unify the SLRU stuff with shared_buffers, and then have a test patch like Michael's (not for commit) which we can use to see the impact of that, and then try to reduce that impact with the stuff you mention under #1 and #2.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company