Re: Temporary file access API - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Temporary file access API
Msg-id: CA+TgmoZWP8UtkNVLd75Qqoh9VGOVy_0xBK+SHZAdNvnfaikKsQ@mail.gmail.com
In response to: Re: Temporary file access API (Antonin Houska <ah@cybertec.at>)
List: pgsql-hackers
On Fri, Jul 29, 2022 at 11:36 AM Antonin Houska <ah@cybertec.at> wrote:
> Attached is a new version. It allows the user to set "elevel" (i.e. ERROR is
> not necessarily thrown on I/O failure, if the user prefers to check the number
> of bytes read/written) and to specify the buffer size. Also, 0015 adds a
> function to copy data from one file descriptor to another.

I basically agree with Peter Eisentraut's earlier comment: "I don't understand what this patch set is supposed to do." The intended advantages, as documented in buffile.c, are:

+ * 1. Less code is needed to access the file.
+ *
+ * 2. Buffering reduces the number of I/O system calls.
+ *
+ * 3. The buffer size can be controlled by the user. (The larger the buffer,
+ * the fewer I/O system calls are needed, but the more data needs to be
+ * written to the buffer before the user recognizes that the file access
+ * failed.)
+ *
+ * 4. It should make features like Transparent Data Encryption less invasive.

It does look to me like there is a modest reduction in the total amount of code from these patches, so I do think they might be achieving (1).

I'm not so sure about (2), though. A number of these places are cases where we only do a single write to the file - like in patches 0010, 0011, 0012, and 0014, for example. And even in 0013, where we do multiple I/Os, it makes no sense to introduce a second buffering layer: we're already reading from the file into a buffer several kilobytes in size and then shipping the data out. The places where you want buffering are places where we might be doing system calls for a few bytes of I/O at a time, not places where we're already doing relatively large I/Os. An extra layer of buffering probably makes things slower, not faster.

I don't think that (3) is a separate advantage from (1) and (2), so I don't have anything separate to say about it. I find it really unclear whether, or how, it achieves (4).
In general, I don't think that the large number of raw calls to read() and write() spread across the backend is a particularly good thing. TDE is an example of a feature that might affect every byte we read or write, but you can also imagine instrumentation and tracing facilities that might want to be able to do something every time we do an I/O without having to modify a zillion separate call sites.

Such features can certainly benefit from consolidation, but I suspect it isn't helpful to treat every I/O in exactly the same way. For instance, let's say we settle on a design where everything in the cluster is encrypted but, just to make things simple for the sake of discussion, let's say there are no stored nonces anywhere. Can we just route every I/O in the backend through a wrapper layer that does encryption and decryption? Probably not. I imagine postgresql.conf and pg_hba.conf aren't going to be encrypted, for example, because they're intended to be edited by users. And Bruce has previously proposed designs where the WAL would be encrypted with a different key than the data files. Another idea, which I kind of like, is to encrypt WAL on a record level rather than a block level. Either way, you can't just encrypt every single byte the same way.

There are probably other relevant distinctions: some data is temporary and need not survive a server restart or be shared with other backends (except maybe in the case of parallel query) and some data is permanent. Some data is already block-structured using blocks of size BLCKSZ; some data is block-structured using some other block size; and some data is not block-structured. Some data is in standard-format pages that are part of the buffer pool and some data isn't. I feel like some of those distinctions probably matter to TDE, and maybe to other things that we might want to do.
For instance, to go back to my earlier comments, if the data is already block-structured, we do not need to insert a second layer of buffering; if it isn't, maybe we should, not just for TDE but for other reasons as well. If the data is temporary, TDE might want to encrypt it using a temporary key which is generated on the fly and only stored in server memory, but if it's data that needs to be readable after a server restart, we're going to need to use a key that we'll still have access to in the future.

Or maybe that particular thing isn't relevant, but I just can't believe that every single I/O needs exactly the same treatment, so there have got to be some patterns here that do matter. And I can't really tell whether this patch set has a theory about that and, if so, what it is.

--
Robert Haas
EDB: http://www.enterprisedb.com