Re: Raw device on PostgreSQL - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Raw device on PostgreSQL
Date
Msg-id CA+hUKG+dwjX-o+68sppbX1-3zS6t0b4nmhXQxcv3=qAiViK4yw@mail.gmail.com
In response to Re: Raw device on PostgreSQL  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Raw device on PostgreSQL  (Jose Luis Tallon <jltallon@adv-solutions.net>)
List pgsql-hackers
On Thu, Apr 30, 2020 at 12:26 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Yeah, I think the question is what are the expected benefits of using
> raw devices. It might be an interesting exercise / experiment, but my
> understanding is that most of the benefits can be achieved by using file
> systems but with direct I/O and async I/O, which would allow us to
> continue reusing the existing filesystem code with much less disruption
> to our code base.

Agreed.

I've often wondered if the RDBMSs that supported raw devices did so
*because* there was no other way to get unbuffered I/O on some systems
at the time (for example it looks like Solaris didn't have direct I/O
until 2.6 in 1997?).  Last I heard, raw devices weren't recommended
anymore on the system I'm thinking of because they're more painful to
manage than regular filesystems and there's little to no gain.  Back
in ancient times, before 4.2BSD introduced it in 1983, there was
apparently no fsync() system call on any strain of Unix, so I guess
database reliability must have been an uphill battle on early Unix
buffered I/O (I wonder if the Ingres/Postgres people asked them to add
that?!).  It must have been very appealing to sidestep the whole thing
for multiple reasons.  One key thing to note is that the well-known
RDBMSs that can use raw devices also deal with regular filesystems by
creating one or more large data files, and then manage the space
inside those to hold all their tables and indexes.  That is, they
already have their own system to manage separate database objects and
allocate space etc, and don't have to do any regular filesystem
meta-data manipulation during transactions (which has all kinds of
problems).  That means they already have the complicated code that you
need to do that, but we don't: we have one (or more) file per table or
index, so our database relies on the filesystem as a kind of lower-level
database of relfilenode->blocks.  That's probably the main work
required to make this work, and might be a valuable thing to have
independently of whether you stick it on a raw device, a big data
file, NV RAM or some other kind of storage system -- but it's a really
difficult project.
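
To make the "relfilenode->blocks" idea concrete, here is a rough toy sketch
(in Python, for brevity) of what managing space inside one big data file
might look like.  Everything in it is hypothetical: BigFileStore,
EXTENT_BLOCKS and the in-memory extent map are illustrative names, not
anything that exists in PostgreSQL.  It also skips the genuinely hard parts:
the map is not persisted or crash-safe, space is never freed, and there is
no concurrency control.

```python
import os

BLOCK_SIZE = 8192      # PostgreSQL's default block size
EXTENT_BLOCKS = 16     # hypothetical allocation unit, in blocks


class BigFileStore:
    """Toy space manager: many relations stored inside one large file."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        self.extent_map = {}   # relfilenode -> list of extent numbers
        self.next_extent = 0   # next never-used extent in the big file

    def offset_of(self, relfilenode, blockno, extend=False):
        """Translate (relfilenode, block number) to a byte offset."""
        extents = self.extent_map.get(relfilenode, [])
        idx, within = divmod(blockno, EXTENT_BLOCKS)
        if extend:
            # Allocate extents until the requested block is covered.
            extents = self.extent_map.setdefault(relfilenode, [])
            while idx >= len(extents):
                extents.append(self.next_extent)
                self.next_extent += 1
        if idx >= len(extents):
            raise KeyError((relfilenode, blockno))
        return (extents[idx] * EXTENT_BLOCKS + within) * BLOCK_SIZE

    def write_block(self, relfilenode, blockno, data):
        assert len(data) == BLOCK_SIZE
        os.pwrite(self.fd, data, self.offset_of(relfilenode, blockno, True))

    def read_block(self, relfilenode, blockno):
        return os.pread(self.fd, BLOCK_SIZE,
                        self.offset_of(relfilenode, blockno))

    def close(self):
        os.close(self.fd)
```

The point of the sketch is only that extending a relation becomes an update
to the store's own map rather than a filesystem metadata operation, and the
same mapping would work whether the path names a regular file or a block
device.  Making that map durable and transactional is where the real
difficulty lies.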


