Re: Raw device on PostgreSQL - Mailing list pgsql-hackers

From Jose Luis Tallon
Subject Re: Raw device on PostgreSQL
Date
Msg-id 435d05a4-acd6-856c-3050-4dae70b85d00@adv-solutions.net
In response to Re: Raw device on PostgreSQL  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On 30/4/20 6:22, Thomas Munro wrote:
> On Thu, Apr 30, 2020 at 12:26 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Yeah, I think the question is what are the expected benefits of using
>> raw devices. It might be an interesting exercise / experiment, but my
>> understanding is that most of the benefits can be achieved by using file
>> systems but with direct I/O and async I/O, which would allow us to
>> continue reusing the existing filesystem code with much less disruption
>> to our code base.
> Agreed.
>
> [snip] That's probably the main work
> required to make this work, and might be a valuable thing to have
> independently of whether you stick it on a raw device, a big data
> file, NV RAM
    ^^^^^^  THIS, with NV DIMMs / PMEM (persistent memory) possibly 
becoming a hot topic in the not-too-distant future
> or some other kind of storage system -- but it's a really
> difficult project.

Indeed... But you might have already pointed out the *only* required
feature for this to work: a "database" mapping each relfilenode ---which
is actually an int, or rather, a tuple (relfilenode, segment) where both
components are currently 32-bit: that is, a 64-bit "objectID" of sorts---
to a "set of extents" ---yes, extents, not blocks: sequential I/O is still
faster in all known storage/persistent (vs RAM) systems--- where the
current I/O primitives would be able to write.
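
As a rough sketch of such a mapping: the (relfilenode, segment) pair is packed into one 64-bit key, and each key points at an ordered list of extents on the device. Every name here (Extent, ExtentMapEntry, make_object_id, extent_map_lookup) is hypothetical and not existing PostgreSQL code, and the linear lookup only stands in for whatever persistent, crash-safe structure a real design would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not part of PostgreSQL. */

/* One contiguous run of bytes on the raw device. */
typedef struct Extent
{
    uint64_t dev_offset;    /* start of the extent, in bytes from device start */
    uint64_t length;        /* extent length in bytes */
} Extent;

/* All extents backing one (relfilenode, segment) pair, in logical order. */
typedef struct ExtentMapEntry
{
    uint64_t      object_id;    /* packed (relfilenode, segment) */
    size_t        nextents;
    const Extent *extents;
} ExtentMapEntry;

/* Pack the two 32-bit components into a single 64-bit key. */
static uint64_t
make_object_id(uint32_t relfilenode, uint32_t segment)
{
    return ((uint64_t) relfilenode << 32) | (uint64_t) segment;
}

/* Linear search over an in-memory map; returns NULL when the key is absent. */
static const ExtentMapEntry *
extent_map_lookup(const ExtentMapEntry *map, size_t nentries,
                  uint64_t object_id)
{
    assert(map != NULL || nentries == 0);

    for (size_t i = 0; i < nentries; i++)
    {
        if (map[i].object_id == object_id)
            return &map[i];
    }
    return NULL;
}
```

Keeping the relfilenode in the high 32 bits means all segments of one relation sort next to each other, which would suit an ordered index over the keys.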

Some conversion from "absolute" (within the "file") to "relative"
(within the "tablespace") offsets would need to happen before delegating
to the kernel... or even before dereferencing a pointer into an mmap'd
region! But not much more, ISTM (though I'm far from an expert in this area).
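
The offset conversion could look roughly like the sketch below: walk the extent list of one "file", find the extent containing the requested offset, and return the matching device offset plus the number of bytes that remain contiguous from there. The Extent type and translate_offset are hypothetical names (the type is repeated so the sketch stands alone), and the code assumes the extents are stored in logical order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent descriptor; not part of PostgreSQL. */
typedef struct Extent
{
    uint64_t dev_offset;    /* start of the extent, in bytes from device start */
    uint64_t length;        /* extent length in bytes */
} Extent;

/*
 * Translate an offset within the logical "file" into an offset on the
 * device. On success returns 0 and sets *dev_offset and *contig_len (the
 * bytes available from that point to the end of the containing extent).
 * Returns -1 when file_offset lies beyond the last extent.
 */
static int
translate_offset(const Extent *extents, size_t nextents,
                 uint64_t file_offset,
                 uint64_t *dev_offset, uint64_t *contig_len)
{
    uint64_t logical_start = 0; /* logical offset where extent i begins */

    assert(dev_offset != NULL && contig_len != NULL);

    for (size_t i = 0; i < nextents; i++)
    {
        /* file_offset >= logical_start always holds here, so no underflow */
        uint64_t within = file_offset - logical_start;

        if (within < extents[i].length)
        {
            *dev_offset = extents[i].dev_offset + within;
            *contig_len = extents[i].length - within;
            return 0;
        }
        logical_start += extents[i].length;
    }
    return -1;
}
```

A caller would clamp each I/O to contig_len and then hand dev_offset to something like pread(2) on the device's file descriptor, or add it to the base address of an mmap'd region.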

Off the top of my head:

CREATE TABLESPACE tblspcname [other_options] LOCATION '/dev/nvme1n2' 
WITH (kind=raw, extent_min=4MB);

   or something similar to that approach might do it.

     Please note that I have purposefully specified "namespace 2" of an
"enterprise" NVMe device, to show the possibility.

OR

   use some filesystem (e.g. XFS) with DAX[1] (mount -o dax) where
available, along with something equivalent to WITH (kind=mmaped)


... though the locking we currently get "for free" from the kernel would 
need to be replaced by something else.


Indeed, it seems like an enormous amount of work... but it may well pay
off. I can't fully assess the effort, though.


Just my .02€

[1] https://www.kernel.org/doc/Documentation/filesystems/dax.txt


Thanks,

     / J.L.




