Thread: Raw device on PostgreSQL

Raw device on PostgreSQL

From
Benjamin Schaller
Date:
Hey,

for a university project I'm currently doing some research on 
PostgreSQL. I was wondering if hypothetically it would be possible to 
implement a raw device system to PostgreSQL. I know that the 
disadvantages would probably be higher than the advantages compared to 
working with the file system. Just hypothetically: Would it be possible 
to change the source code of PostgreSQL so a raw device system could be 
implemented, or would that cause a chain reaction so that basically one 
would have to rewrite almost the entire code, because too many elements 
of PostgreSQL rely on the file system?

Best regards




Re: Raw device on PostgreSQL

From
Stephen Frost
Date:
Greetings,

* Benjamin Schaller (benjamin.schaller@s2018.tu-chemnitz.de) wrote:
> for a university project I'm currently doing some research on PostgreSQL. I
> was wondering if hypothetically it would be possible to implement a raw
> device system to PostgreSQL. I know that the disadvantages would probably be
> higher than the advantages compared to working with the file system. Just
> hypothetically: Would it be possible to change the source code of PostgreSQL
> so a raw device system could be implemented, or would that cause a chain
> reaction so that basically one would have to rewrite almost the entire code,
> because too many elements of PostgreSQL rely on the file system?

yes, it'd be possible, no, you wouldn't have to rewrite all of PG.
Instead, if you want it to be performant at all, you'd have to write
lots of new code to do all the things the filesystem and kernel do for
us today.

Thanks,

Stephen


Re: Raw device on PostgreSQL

From
Andreas Karlsson
Date:
On 4/28/20 10:43 AM, Benjamin Schaller wrote:
> for a university project I'm currently doing some research on 
> PostgreSQL. I was wondering if hypothetically it would be possible to 
> implement a raw device system to PostgreSQL. I know that the 
> disadvantages would probably be higher than the advantages compared to 
> working with the file system. Just hypothetically: Would it be possible 
> to change the source code of PostgreSQL so a raw device system could be 
> implemented, or would that cause a chain reaction so that basically one 
> would have to rewrite almost the entire code, because too many elements 
> of PostgreSQL rely on the file system?

It would require quite a bit of work since 1) PostgreSQL stores its data 
in multiple files and 2) PostgreSQL currently supports only synchronous 
buffered IO.

To get the performance benefits from using raw devices I think you would 
want to add support for asynchronous IO to PostgreSQL rather than 
implementing your own layer to emulate the kernel's buffered IO.

Andres Freund did a talk on async IO in PostgreSQL earlier this year. It 
was not recorded but the slides are available.

https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/

Andreas



Re: Raw device on PostgreSQL

From
Tomas Vondra
Date:
On Tue, Apr 28, 2020 at 02:10:51PM +0200, Andreas Karlsson wrote:
>On 4/28/20 10:43 AM, Benjamin Schaller wrote:
>>for a university project I'm currently doing some research on 
>>PostgreSQL. I was wondering if hypothetically it would be possible 
>>to implement a raw device system to PostgreSQL. I know that the 
>>disadvantages would probably be higher than the advantages compared 
>>to working with the file system. Just hypothetically: Would it be 
>>possible to change the source code of PostgreSQL so a raw device 
>>system could be implemented, or would that cause a chain reaction so 
>>that basically one would have to rewrite almost the entire code, 
>>because too many elements of PostgreSQL rely on the file system?
>
>It would require quite a bit of work since 1) PostgreSQL stores its 
>data in multiple files and 2) PostgreSQL currently supports only 
>synchronous buffered IO.
>

Not sure how that's related to raw devices, which is what Benjamin was
asking about. AFAICS most of the changes would be in smgr.c and md.c,
but I might be wrong.

I'd imagine supporting raw devices would require implementing some sort
of custom file system on the device, and I'd expect it to work with
relation segments just fine. So why would that be a problem?

The synchronous buffered I/O is a bigger challenge, I guess, but then
again - you could continue using synchronous I/O even with raw devices.

>To get the performance benefits from using raw devices I think you 
>would want to add support for asynchronous IO to PostgreSQL rather 
>than implementing your own layer to emulate the kernel's buffered IO.
>
>Andres Freund did a talk on async IO in PostgreSQL earlier this year. 
>It was not recorded but the slides are available.
>
>https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/
>

Yeah, I think the question is what are the expected benefits of using
raw devices. It might be an interesting exercise / experiment, but my
understanding is that most of the benefits can be achieved by using file
systems but with direct I/O and async I/O, which would allow us to
continue reusing the existing filesystem code with much less disruption
to our code base.
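
To make the direct I/O idea concrete, here is a minimal Python sketch (not PostgreSQL code) of reading one block from an ordinary file while asking the kernel to bypass its page cache. BLCKSZ matches PostgreSQL's default 8 kB block size; read_block is a hypothetical helper invented for illustration. O_DIRECT generally requires aligned buffers and offsets, so the sketch uses an anonymous mmap to get a page-aligned buffer. The flag is not available on every platform (macOS lacks it) and some filesystems reject it, so the sketch falls back to ordinary buffered I/O in those cases.

```python
import mmap
import os

BLCKSZ = 8192  # PostgreSQL's default block size


def read_block(path, blockno, blcksz=BLCKSZ):
    """Read one block, bypassing the page cache where the platform allows it."""
    direct = getattr(os, "O_DIRECT", 0)  # 0 where the flag does not exist
    last_error = None
    # First try direct I/O, then fall back to plain buffered I/O.
    for flags in (os.O_RDONLY | direct, os.O_RDONLY):
        try:
            fd = os.open(path, flags)
        except OSError as exc:
            last_error = exc
            continue
        buf = mmap.mmap(-1, blcksz)  # anonymous mapping, page-aligned
        try:
            with os.fdopen(fd, "rb", buffering=0) as f:
                f.seek(blockno * blcksz)  # offset is a multiple of the block size
                n = f.readinto(buf)
            return buf[:n]  # slicing an mmap copies the data into bytes
        except OSError as exc:
            last_error = exc  # e.g. EINVAL from a filesystem that rejects O_DIRECT
        finally:
            buf.close()
    raise last_error
```

The file, its directory entries and its space allocation are still handled by the filesystem here; only the caching behaviour changes, which is the point being made above.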

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Raw device on PostgreSQL

From
"Jonah H. Harris"
Date:
On Tue, Apr 28, 2020 at 8:10 AM Andreas Karlsson <andreas@proxel.se> wrote:
> It would require quite a bit of work since 1) PostgreSQL stores its data
> in multiple files and 2) PostgreSQL currently supports only synchronous
> buffered IO.
>
> To get the performance benefits from using raw devices I think you would
> want to add support for asynchronous IO to PostgreSQL rather than
> implementing your own layer to emulate the kernel's buffered IO.
>
> Andres Freund did a talk on async IO in PostgreSQL earlier this year. It
> was not recorded but the slides are available.
>
> https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/

FWIW, in 2007/2008, when I was at EnterpriseDB, Inaam Rana and I implemented a benchmarkable proof-of-concept patch for direct I/O and asynchronous I/O (for libaio and POSIX). We made that patch public, so it should be on the list somewhere. But, we began to run into performance issues related to buffer manager scaling in terms of locking and, specifically, replacement. We began prototyping alternate buffer managers (going back to the old MRU/LRU model with midpoint insertion and testing a 2Q variant) but that wasn't public. I had also prototyped raw device support, which is a good amount of work and required implementing a custom filesystem (similar to Oracle's ASM) within the storage manager. It's probably a bit harder now than it was then, given the number of different types of file access.
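
For readers unfamiliar with the replacement policy mentioned above, here is a minimal Python sketch of LRU with midpoint insertion. It is not the unpublished EnterpriseDB prototype and not PostgreSQL's buffer manager; the class name MidpointLRU and the old_fraction parameter are assumptions made for illustration. New pages enter at the boundary between a "young" (hot) and an "old" (cold) sublist rather than at the hot end, and only a second access promotes them, so a single sequential scan cannot flush the hot pages.

```python
from collections import OrderedDict


class MidpointLRU:
    """LRU cache of page ids with midpoint insertion (illustrative only)."""

    def __init__(self, capacity, old_fraction=0.375):
        self.capacity = capacity
        old_target = max(1, int(capacity * old_fraction))
        self.young_cap = capacity - old_target
        # In both sublists the least recently used entry comes first.
        self.young = OrderedDict()
        self.old = OrderedDict()

    def __contains__(self, page):
        return page in self.young or page in self.old

    def access(self, page):
        """Touch a page; return the evicted page id, or None."""
        if page in self.young:
            self.young.move_to_end(page)
            return None
        if page in self.old:
            # Second touch: promote from the old sublist to the hot end.
            del self.old[page]
            self.young[page] = True
            if len(self.young) > self.young_cap:
                # Keep the young sublist bounded by demoting its coldest page.
                demoted, _ = self.young.popitem(last=False)
                self.old[demoted] = True
            return None
        # Miss: evict from the cold end if full, then insert at the midpoint.
        evicted = None
        if len(self.young) + len(self.old) >= self.capacity:
            evicted, _ = self.old.popitem(last=False)
        self.old[page] = True
        return evicted
```

For example, with capacity=4 and old_fraction=0.5, touching pages "a" and "b" twice each and then scanning "c" through "g" once evicts only scanned pages, while "a" and "b" stay cached.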

-- 
Jonah H. Harris

Re: Raw device on PostgreSQL

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Yeah, I think the question is what are the expected benefits of using
> raw devices. It might be an interesting exercise / experiment, but my
> understanding is that most of the benefits can be achieved by using file
> systems but with direct I/O and async I/O, which would allow us to
> continue reusing the existing filesystem code with much less disruption
> to our code base.

There's another very large problem with using raw devices: on pretty
much every platform, you don't get to do that without running as root.
It is not easy to express how hard a sell it would be to even consider
allowing Postgres to run as root.  Between the security issues, and
the generally poor return-on-investment we'd get from reinventing
our own filesystem and I/O scheduler, I just don't see this sort of
thing ever going forward.  Direct and/or async I/O seems a lot more
plausible.

            regards, tom lane



Re: Raw device on PostgreSQL

From
Thomas Munro
Date:
On Thu, Apr 30, 2020 at 12:26 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Yeah, I think the question is what are the expected benefits of using
> raw devices. It might be an interesting exercise / experiment, but my
> understanding is that most of the benefits can be achieved by using file
> systems but with direct I/O and async I/O, which would allow us to
> continue reusing the existing filesystem code with much less disruption
> to our code base.

Agreed.

I've often wondered if the RDBMSs that supported raw devices did so
*because* there was no other way to get unbuffered I/O on some systems
at the time (for example it looks like Solaris didn't have direct I/O
until 2.6 in 1997?).  Last I heard, raw devices weren't recommended
anymore on the system I'm thinking of because they're more painful to
manage than regular filesystems and there's little to no gain.  Back
in ancient times, before BSD4.2 introduced it in 1983 there was
apparently no fsync() system call on any strain of Unix, so I guess
database reliability must have been an uphill battle on early Unix
buffered I/O (I wonder if the Ingres/Postgres people asked them to add
that?!).  It must have been very appealing to sidestep the whole thing
for multiple reasons.  One key thing to note is that the well known
RDBMSs that can use raw devices also deal with regular filesystems by
creating one or more large data files, and then manage the space
inside those to hold all their tables and indexes.  That is, they
already have their own system to manage separate database objects and
allocate space etc, and don't have to do any regular filesystem
meta-data manipulation during transactions (which has all kinds of
problems).  That means they already have the complicated code that you
need to do that, but we don't: we have one (or more) file per table or
index, so our database relies on the filesystem as kind of lower level
database of relfilenode->blocks.  That's probably the main work
required to make this work, and might be a valuable thing to have
independently of whether you stick it on a raw device, a big data
file, NV RAM or some other kind of storage system -- but it's a really
difficult project.



Re: Raw device on PostgreSQL

From
"Jonah H. Harris"
Date:
On Wed, Apr 29, 2020 at 8:34 PM Jonah H. Harris <jonah.harris@gmail.com> wrote:
> On Tue, Apr 28, 2020 at 8:10 AM Andreas Karlsson <andreas@proxel.se> wrote:
>> To get the performance benefits from using raw devices I think you would
>> want to add support for asynchronous IO to PostgreSQL rather than
>> implementing your own layer to emulate the kernel's buffered IO.
>>
>> Andres Freund did a talk on async IO in PostgreSQL earlier this year. It
>> was not recorded but the slides are available.
>>
>> https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/
>
> FWIW, in 2007/2008, when I was at EnterpriseDB, Inaam Rana and I implemented a benchmarkable proof-of-concept patch for direct I/O and asynchronous I/O (for libaio and POSIX). We made that patch public, so it should be on the list somewhere. But, we began to run into performance issues related to buffer manager scaling in terms of locking and, specifically, replacement. We began prototyping alternate buffer managers (going back to the old MRU/LRU model with midpoint insertion and testing a 2Q variant) but that wasn't public. I had also prototyped raw device support, which is a good amount of work and required implementing a custom filesystem (similar to Oracle's ASM) within the storage manager. It's probably a bit harder now than it was then, given the number of different types of file access.

Here's a hack job merge of that preliminary PoC AIO/DIO patch against 13devel. This was designed to keep the buffer manager clean using AIO and is write-only. I'll have to dig through some of my other old Postgres 8.x patches to find the AIO-based prefetching version with aio_req_t modified to handle read vs. write in FileAIO. Also, this will likely have an issue with O_DIRECT as additional buffer manager alignment is needed and I haven't tracked it down in 13 yet. As my default development is on a Mac, I have POSIX AIO only. As such, I can't natively play with the O_DIRECT or libaio paths to see if they work without going into Docker or VirtualBox - and I don't care that much right now :)

The code is nasty, but maybe it will give someone ideas. If I get some time to work on it, I'll rewrite it properly.

--
Jonah H. Harris

Attachment

Re: Raw device on PostgreSQL

From
Jose Luis Tallon
Date:
On 30/4/20 6:22, Thomas Munro wrote:
> On Thu, Apr 30, 2020 at 12:26 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Yeah, I think the question is what are the expected benefits of using
>> raw devices. It might be an interesting exercise / experiment, but my
>> understanding is that most of the benefits can be achieved by using file
>> systems but with direct I/O and async I/O, which would allow us to
>> continue reusing the existing filesystem code with much less disruption
>> to our code base.
> Agreed.
>
> [snip] That's probably the main work
> required to make this work, and might be a valuable thing to have
> independently of whether you stick it on a raw device, a big data
> file, NV RAM
    ^^^^^^  THIS, with NV DIMMs / PMEM (persistent memory) possibly 
becoming a hot topic in the not-too-distant future
> or some other kind of storage system -- but it's a really
> difficult project.

Indeed.... But you might have already pointed out the *only* required 
feature for this to work: a "database" of relfilenode ---which is 
actually an int, or rather, a tuple (relfilenode,segment) where both 
components are 32-bit currently: that is, a 64bit "objectID" of sorts--- 
to "set of extents" ---yes, extents, not blocks: sequential I/O is still 
faster in all known storage/persistent (vs RAM) systems---- where the 
current I/O primitives would be able to write.

Some conversion from "absolute" (within the "file") to "relative" 
(within the "tablespace") offsets would need to happen before delegating 
to the kernel... or even dereferencing a pointer to an mmap'd region !, 
but not much more, ISTM (but I'm far from an expert in this area).
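
The mapping described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: ExtentMap, EXTENT_BLOCKS and device_offset are hypothetical names, the 4 MB extent size simply mirrors the example value used further down, and the allocator is a trivial bump allocator that never frees space. A real implementation would also need crash-safe metadata and free-space tracking, which is where most of the difficulty lies.

```python
BLCKSZ = 8192        # PostgreSQL's default block size
EXTENT_BLOCKS = 512  # 512 blocks * 8 kB = 4 MB per extent (assumed value)


class ExtentMap:
    """Maps (relfilenode, segment) to a list of extents on one flat device."""

    def __init__(self, device_blocks):
        self.device_blocks = device_blocks
        self.next_free = 0  # first unallocated block on the device
        self.extents = {}   # (relfilenode, segment) -> [extent start block, ...]

    def _allocate_extent(self):
        if self.next_free + EXTENT_BLOCKS > self.device_blocks:
            raise OSError("device full")
        start = self.next_free
        self.next_free += EXTENT_BLOCKS
        return start

    def device_offset(self, relfilenode, segment, blockno, extend=False):
        """Translate a block number within a segment to a device byte offset."""
        index, within = divmod(blockno, EXTENT_BLOCKS)
        key = (relfilenode, segment)
        starts = self.extents.get(key, [])
        if extend:
            # Grow the object extent by extent until the block is covered.
            while index >= len(starts):
                starts.append(self._allocate_extent())
            self.extents[key] = starts
        if index >= len(starts):
            raise KeyError("block %d of %r is not allocated" % (blockno, key))
        return (starts[index] + within) * BLCKSZ
```

The returned offset is what the existing read and write primitives would then be given, in place of an offset within a per-relation file.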

Off the top of my head:

CREATE TABLESPACE tblspcname [other_options] LOCATION '/dev/nvme1n2' 
WITH (kind=raw, extent_min=4MB);

   or something similar to that approach might do it.

     Please note that I have purposefully specified "namespace 2" in an 
"enterprise" NVME device, to show the possibility.

OR

   use some filesystem (e.g. XFS) with DAX[1] (mount -o dax) where 
available, along with something equivalent to WITH (kind=mmaped)


... though the locking we currently get "for free" from the kernel would 
need to be replaced by something else.


Indeed it seems like an enormous amount of work.... but it may well pay 
off. I can't fully assess the effort, though


Just my .02€

[1] https://www.kernel.org/doc/Documentation/filesystems/dax.txt


Thanks,

     / J.L.





Re: Raw device on PostgreSQL

From
Thomas Munro
Date:
On Fri, May 1, 2020 at 12:28 PM Jonah H. Harris <jonah.harris@gmail.com> wrote:
> Also, this will likely have an issue with O_DIRECT as additional buffer
> manager alignment is needed and I haven't tracked it down in 13 yet. As my
> default development is on a Mac, I have POSIX AIO only. As such, I can't
> natively play with the O_DIRECT or libaio paths to see if they work without
> going into Docker or VirtualBox - and I don't care that much right now :)

Andres is prototyping with io_uring, which supersedes libaio and can
do much more stuff, notably buffered and unbuffered I/O; there's no
point in looking at libaio.  I agree that we should definitely support
POSIX AIO, because that gets you macOS, FreeBSD, NetBSD, AIX, HPUX
with one effort (those are the systems that use either kernel threads
or true async I/O down to the driver; Solaris and Linux also provide
POSIX AIO, but it's emulated with user space threads, which probably
wouldn't work well for our multi process design).  The third API that
we'd want to support is Windows overlapped I/O with completion ports.
With those three APIs you can hit all systems in our build farm except
Solaris and OpenBSD, so they'd still use synchronous I/O (though we
could do our own emulation with worker processes pretty easily).
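
As a rough illustration of the emulation idea in the last sentence, here is a minimal Python sketch of a submit/complete interface backed by workers that issue ordinary synchronous reads. It is an assumption-laden toy, not a proposal for PostgreSQL: WorkerAIO and submit_read are hypothetical names, and worker threads stand in for the worker processes mentioned above only because that keeps the sketch short and self-contained. A process-based version would additionally need shared memory for the buffers and a request queue.

```python
import os
from concurrent.futures import ThreadPoolExecutor


class WorkerAIO:
    """Emulates asynchronous reads with a pool of workers doing pread()."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit_read(self, fd, nbytes, offset):
        # Returns immediately; the read runs synchronously in a worker.
        # The caller collects the data later through future.result().
        return self._pool.submit(os.pread, fd, nbytes, offset)

    def close(self):
        # Wait for outstanding requests, then stop the workers.
        self._pool.shutdown(wait=True)
```

A caller would submit several reads up front, continue with other work, and only block in result() when the data is actually needed.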



Re: Raw device on PostgreSQL

From
"Jonah H. Harris"
Date:
On Fri, May 1, 2020 at 4:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, May 1, 2020 at 12:28 PM Jonah H. Harris <jonah.harris@gmail.com> wrote:
>> Also, this will likely have an issue with O_DIRECT as additional buffer manager alignment is needed and I haven't tracked it down in 13 yet. As my default development is on a Mac, I have POSIX AIO only. As such, I can't natively play with the O_DIRECT or libaio paths to see if they work without going into Docker or VirtualBox - and I don't care that much right now :)
>
> Andres is prototyping with io_uring, which supersedes libaio and can
> do much more stuff, notably buffered and unbuffered I/O; there's no
> point in looking at libaio.  I agree that we should definitely support
> POSIX AIO, because that gets you macOS, FreeBSD, NetBSD, AIX, HPUX
> with one effort (those are the systems that use either kernel threads
> or true async I/O down to the driver; Solaris and Linux also provide
> POSIX AIO, but it's emulated with user space threads, which probably
> wouldn't work well for our multi process design).  The third API that
> we'd want to support is Windows overlapped I/O with completion ports.
> With those three APIs you can hit all systems in our build farm except
> Solaris and OpenBSD, so they'd still use synchronous I/O (though we
> could do our own emulation with worker processes pretty easily).

Is it public? I saw the presentations, but couldn't find that patch anywhere. 

--
Jonah H. Harris