Thread: Raw device on PostgreSQL
Hey, for a university project I'm currently doing some research on PostgreSQL. I was wondering whether it would hypothetically be possible to implement a raw-device storage system in PostgreSQL. I know that the disadvantages would probably outweigh the advantages compared to working with the file system. Just hypothetically: would it be possible to change the source code of PostgreSQL so that a raw-device system could be implemented, or would that cause a chain reaction such that one would basically have to rewrite almost the entire code base, because too many elements of PostgreSQL rely on the file system?

Best regards
Greetings,

* Benjamin Schaller (benjamin.schaller@s2018.tu-chemnitz.de) wrote:
> for an university project I'm currently doing some research on PostgreSQL.
> I was wondering if hypothetically it would be possible to implement a raw
> device system to PostgreSQL. I know that the disadvantages would probably
> be higher than the advantages compared to working with the file system.
> Just hypothetically: Would it be possible to change the source code of
> PostgreSQL so a raw device system could be implemented, or would that
> cause a chain reaction so that basically one would have to rewrite almost
> the entire code, because too many elements of PostgreSQL rely on the file
> system?

Yes, it'd be possible, and no, you wouldn't have to rewrite all of PG.
Instead, if you want it to be performant at all, you'd have to write lots
of new code to do all the things the filesystem and kernel do for us today.

Thanks,

Stephen
On 4/28/20 10:43 AM, Benjamin Schaller wrote:
> for an university project I'm currently doing some research on
> PostgreSQL. I was wondering if hypothetically it would be possible to
> implement a raw device system to PostgreSQL. I know that the
> disadvantages would probably be higher than the advantages compared to
> working with the file system. Just hypothetically: Would it be possible
> to change the source code of PostgreSQL so a raw device system could be
> implemented, or would that cause a chain reaction so that basically one
> would have to rewrite almost the entire code, because too many elements
> of PostgreSQL rely on the file system?

It would require quite a bit of work, since 1) PostgreSQL stores its data
in multiple files and 2) PostgreSQL currently supports only synchronous
buffered IO.

To get the performance benefits from using raw devices, I think you would
want to add support for asynchronous IO to PostgreSQL rather than
implementing your own layer to emulate the kernel's buffered IO.

Andres Freund did a talk on async IO in PostgreSQL earlier this year. It
was not recorded, but the slides are available:

https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/

Andreas
On Tue, Apr 28, 2020 at 02:10:51PM +0200, Andreas Karlsson wrote:
> On 4/28/20 10:43 AM, Benjamin Schaller wrote:
>> for an university project I'm currently doing some research on
>> PostgreSQL. I was wondering if hypothetically it would be possible
>> to implement a raw device system to PostgreSQL. [snip]
>
> It would require quite a bit of work since 1) PostgreSQL stores its
> data in multiple files and 2) PostgreSQL currently supports only
> synchronous buffered IO.

Not sure how that's related to raw devices, which is what Benjamin was
asking about. AFAICS most of the changes would be in smgr.c and md.c, but
I might be wrong. I'd imagine supporting raw devices would require
implementing some sort of custom file system on the device, and I'd
expect it to work with relation segments just fine. So why would that be
a problem?

The synchronous buffered I/O is a bigger challenge, I guess, but then
again - you could continue using synchronous I/O even with raw devices.

> To get the performance benefits from using raw devices I think you
> would want to add support for asynchronous IO to PostgreSQL rather
> than implementing your own layer to emulate the kernel's buffered IO.
>
> Andres Freund did a talk on async IO in PostgreSQL earlier this year.
> It was not recorded but the slides are available.
>
> https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/

Yeah, I think the question is what are the expected benefits of using raw
devices. It might be an interesting exercise / experiment, but my
understanding is that most of the benefits can be achieved by using file
systems but with direct I/O and async I/O, which would allow us to
continue reusing the existing filesystem code with much less disruption
to our code base.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
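[Editor's note: a minimal sketch of what the direct I/O discussed above means in practice, in Python for brevity. The O_DIRECT flag, alignment rule, and fallback behaviour are real Linux semantics, but the block size, function name, and fallback strategy here are illustrative assumptions, not anything from PostgreSQL.]

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed filesystem block size; O_DIRECT transfers must be multiples of it

def write_direct(path, data):
    """Write data bypassing the kernel page cache where O_DIRECT is
    available, falling back to ordinary buffered I/O (e.g. on macOS or
    tmpfs, which reject O_DIRECT at open time)."""
    assert len(data) % BLOCK == 0, "O_DIRECT transfers must be block-sized"
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_DIRECT", 0)
    try:
        fd = os.open(path, flags, 0o600)
    except OSError:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)  # buffered fallback
    try:
        buf = mmap.mmap(-1, len(data))  # anonymous mmap gives a page-aligned buffer
        buf.write(data)
        os.pwrite(fd, buf, 0)
        os.fsync(fd)  # O_DIRECT skips the page cache, not the device write cache
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "datafile")
write_direct(path, b"\x00" * (2 * BLOCK))
```

The alignment assertion and the page-aligned buffer are exactly the discipline the kernel's buffered path normally hides; taking it on is part of the cost Tomas is weighing.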
On Tue, Apr 28, 2020 at 8:10 AM Andreas Karlsson <andreas@proxel.se> wrote:
> It would require quite a bit of work since 1) PostgreSQL stores its data
> in multiple files and 2) PostgreSQL currently supports only synchronous
> buffered IO.
>
> To get the performance benefits from using raw devices I think you would
> want to add support for asynchronous IO to PostgreSQL rather than
> implementing your own layer to emulate the kernel's buffered IO.
>
> Andres Freund did a talk on async IO in PostgreSQL earlier this year. It
> was not recorded but the slides are available.
>
> https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/
FWIW, in 2007/2008, when I was at EnterpriseDB, Inaam Rana and I
implemented a benchmarkable proof-of-concept patch for direct I/O and
asynchronous I/O (for libaio and POSIX). We made that patch public, so it
should be on the list somewhere. But, we began to run into performance
issues related to buffer manager scaling in terms of locking and,
specifically, replacement. We began prototyping alternate buffer managers
(going back to the old MRU/LRU model with midpoint insertion and testing a
2Q variant), but that wasn't public.

I had also prototyped raw device support, which is a good amount of work
and required implementing a custom filesystem (similar to Oracle's ASM)
within the storage manager. It's probably a bit harder now than it was
then, given the number of different types of file access.
--
Jonah H. Harris
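[Editor's note: the midpoint-insertion replacement policy Jonah mentions can be illustrated with a toy sketch. The class name, the old/young split ratio, and the list-based implementation are all invented for illustration and have nothing to do with the actual EnterpriseDB prototypes.]

```python
class MidpointLRU:
    """Toy LRU with midpoint insertion: a newly read page enters a cold
    'old' sublist, and only a second access promotes it to the hot 'young'
    sublist, so one large sequential scan cannot flush the working set."""

    def __init__(self, capacity, old_fraction=3 / 8):
        self.capacity = capacity
        self.old_target = max(1, int(capacity * old_fraction))
        self.young = []  # hot pages, most recently used first
        self.old = []    # cold pages; eviction pops from the tail

    def access(self, page):
        if page in self.young:
            self.young.remove(page)
            self.young.insert(0, page)  # already hot: move to front
        elif page in self.old:
            self.old.remove(page)
            self.young.insert(0, page)  # second touch: promote to young
        else:
            self.old.insert(0, page)    # first touch: midpoint insertion
        # spill young's tail into old's head if young outgrew its share
        while len(self.young) > self.capacity - self.old_target:
            self.old.insert(0, self.young.pop())
        # evict the least-recently-used cold page when over capacity
        while len(self.young) + len(self.old) > self.capacity:
            self.old.pop()

cache = MidpointLRU(capacity=4)
for page in [1, 2, 1]:     # page 1 is touched twice and becomes hot
    cache.access(page)
for page in [3, 4, 5, 6]:  # a sequential scan of cold pages
    cache.access(page)
```

After the scan, page 1 survives in the young sublist while the scanned pages churn through the old sublist, which is the property these policies aim for.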
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Yeah, I think the question is what are the expected benefits of using
> raw devices. It might be an interesting exercise / experiment, but my
> understanding is that most of the benefits can be achieved by using file
> systems but with direct I/O and async I/O, which would allow us to
> continue reusing the existing filesystem code with much less disruption
> to our code base.

There's another very large problem with using raw devices: on pretty much
every platform, you don't get to do that without running as root. It is
not easy to express how hard a sell it would be to even consider allowing
Postgres to run as root.

Between the security issues, and the generally poor return-on-investment
we'd get from reinventing our own filesystem and I/O scheduler, I just
don't see this sort of thing ever going forward. Direct and/or async I/O
seems a lot more plausible.

regards, tom lane
On Thu, Apr 30, 2020 at 12:26 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Yeah, I think the question is what are the expected benefits of using
> raw devices. It might be an interesting exercise / experiment, but my
> understanding is that most of the benefits can be achieved by using file
> systems but with direct I/O and async I/O, which would allow us to
> continue reusing the existing filesystem code with much less disruption
> to our code base.

Agreed. I've often wondered if the RDBMSs that supported raw devices did
so *because* there was no other way to get unbuffered I/O on some systems
at the time (for example, it looks like Solaris didn't have direct I/O
until 2.6 in 1997?). Last I heard, raw devices weren't recommended anymore
on the system I'm thinking of, because they're more painful to manage than
regular filesystems and there's little to no gain.

Back in ancient times, before BSD 4.2 introduced it in 1983, there was
apparently no fsync() system call on any strain of Unix, so I guess
database reliability must have been an uphill battle on early Unix
buffered I/O (I wonder if the Ingres/Postgres people asked them to add
that?!). It must have been very appealing to sidestep the whole thing for
multiple reasons.

One key thing to note is that the well known RDBMSs that can use raw
devices also deal with regular filesystems by creating one or more large
data files, and then manage the space inside those to hold all their
tables and indexes. That is, they already have their own system to manage
separate database objects and allocate space etc., and don't have to do
any regular filesystem meta-data manipulation during transactions (which
has all kinds of problems). That means they already have the complicated
code that you need to do that, but we don't: we have one (or more) file
per table or index, so our database relies on the filesystem as a kind of
lower-level database of relfilenode->blocks.

That's probably the main work required to make this work, and might be a
valuable thing to have independently of whether you stick it on a raw
device, a big data file, NV RAM or some other kind of storage system --
but it's a really difficult project.
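[Editor's note: the "lower-level database of relfilenode->blocks" role described above can be sketched as a toy space manager that packs every relation into one big file in extents. BLCKSZ matches PostgreSQL's default page size, but the class, the extent size, and the in-memory map are invented for illustration.]

```python
import os
import tempfile

BLCKSZ = 8192       # PostgreSQL's default page size
EXTENT_BLOCKS = 16  # pages handed out per extent (illustrative value)

class SingleFileStorage:
    """Toy replacement for the filesystem's relfilenode->blocks role: all
    relations share one big data file, and this map tracks which extents
    of that file belong to which relation."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        self.extents = {}    # relfilenode -> list of starting page numbers
        self.next_page = 0   # next unallocated page in the big file

    def _offset(self, relfilenode, blockno):
        # translate a relation-relative block number into a file offset,
        # allocating new extents on demand
        ext_idx, within = divmod(blockno, EXTENT_BLOCKS)
        extents = self.extents.setdefault(relfilenode, [])
        while ext_idx >= len(extents):
            extents.append(self.next_page)
            self.next_page += EXTENT_BLOCKS
        return (extents[ext_idx] + within) * BLCKSZ

    def write_block(self, relfilenode, blockno, page):
        assert len(page) == BLCKSZ
        os.pwrite(self.fd, page, self._offset(relfilenode, blockno))

    def read_block(self, relfilenode, blockno):
        return os.pread(self.fd, BLCKSZ, self._offset(relfilenode, blockno))

store = SingleFileStorage(os.path.join(tempfile.mkdtemp(), "bigfile"))
store.write_block(16384, 0, b"a" * BLCKSZ)  # two relations, same block number,
store.write_block(16385, 0, b"b" * BLCKSZ)  # mapped to different extents
```

A real version would also need free-space tracking, crash-safe metadata, and locking; this sketch only shows the address translation the filesystem currently does for free.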
On Wed, Apr 29, 2020 at 8:34 PM Jonah H. Harris <jonah.harris@gmail.com> wrote:
> On Tue, Apr 28, 2020 at 8:10 AM Andreas Karlsson <andreas@proxel.se> wrote:
>> To get the performance benefits from using raw devices I think you would
>> want to add support for asynchronous IO to PostgreSQL rather than
>> implementing your own layer to emulate the kernel's buffered IO.
>>
>> Andres Freund did a talk on async IO in PostgreSQL earlier this year. It
>> was not recorded but the slides are available.
>>
>> https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/
>
> FWIW, in 2007/2008, when I was at EnterpriseDB, Inaam Rana and I
> implemented a benchmarkable proof-of-concept patch for direct I/O and
> asynchronous I/O (for libaio and POSIX). We made that patch public, so it
> should be on the list somewhere. But, we began to run into performance
> issues related to buffer manager scaling in terms of locking and,
> specifically, replacement. We began prototyping alternate buffer managers
> (going back to the old MRU/LRU model with midpoint insertion and testing
> a 2Q variant) but that wasn't public. I had also prototyped raw device
> support, which is a good amount of work and required implementing a
> custom filesystem (similar to Oracle's ASM) within the storage manager.
> It's probably a bit harder now than it was then, given the number of
> different types of file access.
Here's a hack-job merge of that preliminary PoC AIO/DIO patch against
13devel. This was designed to keep the buffer manager clean using AIO and
is write-only. I'll have to dig through some of my other old Postgres 8.x
patches to find the AIO-based prefetching version with aio_req_t modified
to handle read vs. write in FileAIO.

Also, this will likely have an issue with O_DIRECT, as additional buffer
manager alignment is needed and I haven't tracked it down in 13 yet. As my
default development is on a Mac, I have POSIX AIO only. As such, I can't
natively play with the O_DIRECT or libaio paths to see if they work
without going into Docker or VirtualBox - and I don't care that much
right now :)
The code is nasty, but maybe it will give someone ideas. If I get some time to work on it, I'll rewrite it properly.
Jonah H. Harris
Attachment
On 30/4/20 6:22, Thomas Munro wrote:
> On Thu, Apr 30, 2020 at 12:26 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Yeah, I think the question is what are the expected benefits of using
>> raw devices. It might be an interesting exercise / experiment, but my
>> understanding is that most of the benefits can be achieved by using file
>> systems but with direct I/O and async I/O, which would allow us to
>> continue reusing the existing filesystem code with much less disruption
>> to our code base.
>
> Agreed.
>
> [snip] That's probably the main work
> required to make this work, and might be a valuable thing to have
> independently of whether you stick it on a raw device, a big data
> file, NV RAM
        ^^^^^^
THIS, with NV DIMMs / PMEM (persistent memory) possibly becoming a hot
topic in the not-too-distant future.

> or some other kind of storage system -- but it's a really
> difficult project.

Indeed... But you might have already pointed out the *only* required
feature for this to work: a "database" of relfilenode ---which is
actually an int, or rather, a tuple (relfilenode, segment) where both
components are 32-bit currently: that is, a 64-bit "object ID" of
sorts--- to "set of extents" ---yes, extents, not blocks: sequential I/O
is still faster in all known storage/persistent (vs RAM) systems--- where
the current I/O primitives would be able to write. Some conversion from
"absolute" (within the "file") to "relative" (within the "tablespace")
offsets would need to happen before delegating to the kernel... or even
dereferencing a pointer to an mmap'd region!, but not much more, ISTM
(but I'm far from an expert in this area).

Off the top of my head:

CREATE TABLESPACE tblspcname [other_options]
    LOCATION '/dev/nvme1n2'
    WITH (kind=raw, extent_min=4MB);

or something similar to that approach might do it. Please note that I
have purposefully specified "namespace 2" on an "enterprise" NVMe device,
to show the possibility. OR use some filesystem (e.g. XFS) with DAX[1]
(mount -o dax) where available, along with something equivalent to
WITH (kind=mmaped) ... though the locking we currently get "for free"
from the kernel would need to be replaced by something else.

Indeed it seems like an enormous amount of work... but it may well pay
off. I can't fully assess the effort, though.

Just my .02€

[1] https://www.kernel.org/doc/Documentation/filesystems/dax.txt

Thanks,

/ J.L.
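[Editor's note: the mmap'd-region idea above can be sketched with an ordinary file-backed mapping. Real DAX needs persistent memory and a dax mount, so this only mimics the programming model; the segment size, offsets, and file layout are arbitrary assumptions.]

```python
import mmap
import os
import tempfile

SEGSZ = 1 << 20  # size of one illustrative "tablespace" segment
BLCKSZ = 8192

path = os.path.join(tempfile.mkdtemp(), "segment")
with open(path, "wb") as f:
    f.truncate(SEGSZ)  # pre-size the segment, as a raw partition would be

fd = os.open(path, os.O_RDWR)
seg = mmap.mmap(fd, SEGSZ)

# a "write" is just a memcpy into the mapping at a tablespace-relative offset
page = b"hello page".ljust(BLCKSZ, b"\x00")
seg[1 * BLCKSZ : 2 * BLCKSZ] = page  # block 1 of the segment

# durability is an explicit msync of the dirtied range, not write()+fsync()
seg.flush(1 * BLCKSZ, BLCKSZ)
```

Note what disappears in this model: there is no per-write syscall, but also no kernel-arbitrated write ordering, which is exactly the locking concern raised above.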
On Fri, May 1, 2020 at 12:28 PM Jonah H. Harris <jonah.harris@gmail.com> wrote:
> Also, this will likely have an issue with O_DIRECT as additional buffer
> manager alignment is needed and I haven't tracked it down in 13 yet. As
> my default development is on a Mac, I have POSIX AIO only. As such, I
> can't natively play with the O_DIRECT or libaio paths to see if they
> work without going into Docker or VirtualBox - and I don't care that
> much right now :)

Andres is prototyping with io_uring, which supersedes libaio and can do
much more stuff, notably buffered and unbuffered I/O; there's no point in
looking at libaio. I agree that we should definitely support POSIX AIO,
because that gets you macOS, FreeBSD, NetBSD, AIX and HP-UX with one
effort (those are the systems that use either kernel threads or true
async I/O down to the driver; Solaris and Linux also provide POSIX AIO,
but it's emulated with user space threads, which probably wouldn't work
well for our multi-process design). The third API that we'd want to
support is Windows overlapped I/O with completion ports. With those three
APIs you can hit all systems in our build farm except Solaris and
OpenBSD, so they'd still use synchronous I/O (though we could do our own
emulation with worker processes pretty easily).
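[Editor's note: the "emulation with worker processes" fallback mentioned at the end can be sketched with a pool of workers that turn blocking reads into completions. This uses threads rather than processes for brevity, and every name here is invented for illustration.]

```python
import os
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor

class EmulatedAIO:
    """Async-looking reads on top of purely synchronous I/O: submit()
    returns immediately and a worker performs the blocking pread, which is
    roughly the fallback shape for platforms with no native AIO."""

    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def read(self, fd, nbytes, offset) -> Future:
        # returns a Future; the caller can issue many reads before waiting
        return self.pool.submit(os.pread, fd, nbytes, offset)

path = os.path.join(tempfile.mkdtemp(), "data")
with open(path, "wb") as f:
    f.write(b"A" * 8192 + b"B" * 8192)

fd = os.open(path, os.O_RDONLY)
aio = EmulatedAIO()
f1 = aio.read(fd, 8192, 0)     # both reads are now "in flight";
f2 = aio.read(fd, 8192, 8192)  # the caller blocks only at result()
```

The real APIs in the paragraph above (io_uring, POSIX AIO, overlapped I/O) all follow this submit/complete shape; only the completion mechanism differs.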
On Fri, May 1, 2020 at 4:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, May 1, 2020 at 12:28 PM Jonah H. Harris <jonah.harris@gmail.com> wrote:
>> Also, this will likely have an issue with O_DIRECT as additional buffer
>> manager alignment is needed and I haven't tracked it down in 13 yet. As
>> my default development is on a Mac, I have POSIX AIO only. As such, I
>> can't natively play with the O_DIRECT or libaio paths to see if they
>> work without going into Docker or VirtualBox - and I don't care that
>> much right now :)
>
> Andres is prototyping with io_uring, which supersedes libaio and can
> do much more stuff, notably buffered and unbuffered I/O; there's no
> point in looking at libaio. I agree that we should definitely support
> POSIX AIO, because that gets you macOS, FreeBSD, NetBSD, AIX, HPUX
> with one effort (those are the systems that use either kernel threads
> or true async I/O down to the driver; Solaris and Linux also provide
> POSIX AIO, but it's emulated with user space threads, which probably
> wouldn't work well for our multi process design). The third API that
> we'd want to support is Windows overlapped I/O with completion ports.
> With those three APIs you can hit all systems in our build farm except
> Solaris and OpenBSD, so they'd still use synchronous I/O (though we
> could do our own emulation with worker processes pretty easily).
Is it public? I saw the presentations, but couldn't find that patch anywhere.
Jonah H. Harris