Re: Temporary file access API - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Temporary file access API
Msg-id: CA+TgmoZWP8UtkNVLd75Qqoh9VGOVy_0xBK+SHZAdNvnfaikKsQ@mail.gmail.com
In response to: Re: Temporary file access API (Antonin Houska <ah@cybertec.at>)
Responses: Re: Temporary file access API
List: pgsql-hackers
On Fri, Jul 29, 2022 at 11:36 AM Antonin Houska <ah@cybertec.at> wrote:
> Attached is a new version. It allows the user to set "elevel" (i.e. ERROR is
> not necessarily thrown on I/O failure, if the user prefers to check the number
> of bytes read/written) and to specify the buffer size. Also, 0015 adds a
> function to copy data from one file descriptor to another.
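
If I understand correctly, the interface is now roughly of this shape
(the names and signatures below are my guesses for the sake of
illustration, not the patch's actual API):

/* Hypothetical paraphrase of the interface described above. */
typedef struct BufFileStream BufFileStream;

/* The caller picks the buffer size and the elevel used on I/O failure. */
extern BufFileStream *BufFileStreamOpen(const char *path, int flags,
                                        size_t bufsize, int elevel);
extern size_t BufFileStreamWrite(BufFileStream *stream,
                                 const void *data, size_t len);

/* With elevel < ERROR, a failure shows up as a short byte count that
 * the caller is expected to check, instead of an error being thrown. */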

I basically agree with Peter Eisentraut's earlier comment: "I don't
understand what this patch set is supposed to do." The intended
advantages, as documented in buffile.c, are:

+ * 1. Less code is needed to access the file.
+ *
+ * 2. Buffering reduces the number of I/O system calls.
+ *
+ * 3. The buffer size can be controlled by the user. (The larger the buffer,
+ * the fewer I/O system calls are needed, but the more data needs to be
+ * written to the buffer before the user recognizes that the file access
+ * failed.)
+ *
+ * 4. It should make features like Transparent Data Encryption less invasive.

It does look to me like there is a modest reduction in the total
amount of code from these patches, so I do think they might be
achieving (1).

I'm not so sure about (2), though. A number of these places are cases
where we only do a single write to the file, as in patches 0010,
0011, 0012, and 0014. And even in 0013, where we do
multiple I/Os, it makes no sense to introduce a second buffering
layer. We're reading from a file into a buffer several kilobytes in
size and then shipping the data out. The places where you want
buffering are places where we might be doing system calls for a few
bytes of I/O at a time, not places where we're already doing
relatively large I/Os. An extra layer of buffering probably makes
things slower, not faster.
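
To illustrate with a hypothetical example (this is not code from the
patch set), buffering pays off when the caller produces data a few
bytes at a time:

#include <string.h>
#include <unistd.h>

#define IOBUF_SIZE 8192

/*
 * Unbuffered, each 4-byte item would cost one write() syscall; with
 * the staging buffer, roughly 2048 items share a single syscall.
 */
static int
write_items_buffered(int fd, const int *items, size_t nitems)
{
    char    buf[IOBUF_SIZE];
    size_t  used = 0;
    size_t  i;

    for (i = 0; i < nitems; i++)
    {
        if (used + sizeof(int) > IOBUF_SIZE)
        {
            if (write(fd, buf, used) != (ssize_t) used)
                return -1;
            used = 0;
        }
        memcpy(buf + used, &items[i], sizeof(int));
        used += sizeof(int);
    }
    if (used > 0 && write(fd, buf, used) != (ssize_t) used)
        return -1;
    return 0;
}

But in a case like 0013, the caller is already handling several
kilobytes per call, so a layer like this just adds a memcpy per chunk
without reducing the number of syscalls.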

I don't think that (3) is a separate advantage from (1) and (2), so I
don't have anything separate to say about it.

I find it really unclear whether, or how, it achieves (4). In general,
I don't think that the large number of raw calls to read() and write()
spread across the backend is a particularly good thing. TDE is an
example of a feature that might affect every byte we read or write,
but you can also imagine instrumentation and tracing facilities that
might want to be able to do something every time we do an I/O without
having to modify a zillion separate call sites. Such features can
certainly benefit from consolidation, but I suspect it isn't helpful
to treat every I/O in exactly the same way. For instance, let's say we
settle on a design where everything in the cluster is encrypted but,
just to make things simple for the sake of discussion, let's say there
are no stored nonces anywhere. Can we just route every I/O in the
backend through a wrapper layer that does encryption and decryption?

Probably not. I imagine postgresql.conf and pg_hba.conf aren't going
to be encrypted, for example, because they're intended to be edited by
users. And Bruce has previously proposed designs where the WAL would
be encrypted with a different key than the data files. Another idea,
which I kind of like, is to encrypt WAL on a record level rather than
a block level. Either way, you can't just encrypt every single byte
the same way. There are probably other relevant distinctions: some
data is temporary and need not survive a server restart or be shared
with other backends (except maybe in the case of parallel query) and
some data is permanent. Some data is already block-structured using
blocks of size BLCKSZ; some data is block-structured using some other
block size; and some data is not block-structured. Some data is in
standard-format pages that are part of the buffer pool and some data
isn't. I feel like some of those distinctions probably matter to TDE,
and maybe to other things that we might want to do.
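
To sketch what I mean (hypothetical names, not a concrete design
proposal): call sites could declare what kind of data they are
writing, and one central wrapper could then pick the treatment:

#include <stddef.h>
#include <unistd.h>

/* Hypothetical I/O categories; the point is only that they exist. */
typedef enum FileIOKind
{
    FILEIO_CONFIG,      /* postgresql.conf etc.: never encrypted */
    FILEIO_WAL,         /* WAL: separate key, maybe per-record */
    FILEIO_REL_BLOCK,   /* BLCKSZ pages in the buffer pool */
    FILEIO_TEMP         /* temp data: ephemeral in-memory key */
} FileIOKind;

static ssize_t
tagged_write(int fd, FileIOKind kind, const void *data, size_t len)
{
    switch (kind)
    {
        case FILEIO_CONFIG:
            /* user-editable, stays plaintext */
            return write(fd, data, len);
        default:
            /*
             * Real code would pick the key, nonce scheme, and
             * buffering policy based on "kind" before writing.
             */
            return write(fd, data, len);
    }
}

The details here don't matter; the point is just that the wrapper is
told which case it is in, rather than treating every I/O identically.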

For instance, to go back to my earlier comments, if the data is
already block-structured, we do not need to insert a second layer of
buffering; if it isn't, maybe we should, not just for TDE but for
other reasons as well. If the data is temporary, TDE might want to
encrypt it using a temporary key which is generated on the fly and
only stored in server memory, but if it's data that needs to be
readable after a server restart, we're going to need to use a key that
we'll still have access to in the future. Or maybe that particular
thing isn't relevant, but I just can't believe that every single I/O
needs exactly the same treatment, so there have got to be some
patterns here that do matter. And I can't really tell whether this
patch set has a theory about that and, if so, what it is.
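
To make the temporary-key case concrete, in backend terms it might
look something like this (pg_strong_random() already exists in
src/port; the rest is a hypothetical sketch):

#include "postgres.h"

/* Never persisted; dies with the server, just like the temp files. */
static uint8 temp_file_key[32];

static void
init_temp_file_key(void)
{
    if (!pg_strong_random(temp_file_key, sizeof(temp_file_key)))
        elog(ERROR, "could not generate temporary file key");
}

Permanent data can't work that way: its key has to be recoverable
after a restart, and that is exactly the sort of distinction I'd want
this patch set to take a position on.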

-- 
Robert Haas
EDB: http://www.enterprisedb.com


