Thread: XTS cipher mode for cluster file encryption
As you might have seen from my email in another thread, thanks to Stephen and Cybertec staff, I am back working on cluster file encryption/TDE.

Stephen was going to research if XTS cipher mode would be a good fit for this since it was determined that CTR cipher mode was too vulnerable to IV reuse, and the LSN provides insufficient uniqueness. Stephen reported having trouble finding a definitive answer, so I figured I would research it myself.

Of course, I found the same lack of information that Stephen did. ;-) None of my classic cryptographic books cover XTS or the XEX encryption mode it is based on, since XTS was only standardized in 2007 and recommended in 2010. (Yeah, don't get me started on poor cryptographic documentation.)

Therefore, I decided to go backward and look at CTR and CBC to see how the nonce is used there, and whether XTS fixes the problems with nonce reuse.

First, I originally chose CTR mode since it is effectively a stream cipher, and we therefore could skip certain page fields like the LSN. However, CTR is very sensitive to LSN reuse since the input bits generate encrypted bits in exactly the same locations on the page. (It is a simple XOR of the plaintext against a keystream.) Since pages with different contents are sometimes encrypted with the same LSN, especially on replicas, this method failed.

Second is CBC mode, which is a block cipher mode. I thought that meant you could only encrypt 16-byte chunks, meaning you couldn't skip encryption of certain page fields unless they are 16-byte chunks. However, there is something called ciphertext stealing (https://en.wikipedia.org/wiki/Ciphertext_stealing#CBC_ciphertext_stealing) which allows that. I am not sure if OpenSSL supports this, but looking at my OpenSSL 1.1.1d manual entry for EVP_aes, ciphertext stealing is only mentioned for XTS.

Anyway, CBC mode still needs a nonce for the first 16-byte block, and then feeds the encrypted output of the first block as an IV to the second block, etc. This gives us the same problem of finding a nonce per page. However, since it is a block mode, the encrypted bits don't come out in the same locations as the input bits, so that is less of a problem. There is also the problem that the encrypted output from one 16-byte block could repeat, causing leakage.

So, let's look at how XTS is designed. First, it uses two keys. If you are using AES128, you need _two_ 128-bit keys. If using AES256, you need two 256-bit keys. The first of the two keys is used like normal, to encrypt the data. The second key, which is also secret, is used to encrypt the values used for the IV of the first 16-byte block (in our case dboid, relfilenode, blocknum, maybe LSN). This is most clearly explained here:

https://www.kingston.com/unitedstates/en/solutions/data-security/xts-encryption

That IV is XOR'ed against both the input value and the encryption output value, as explained here as key tweaking:

https://crossbowerbt.github.io/xts_mode_tweaking.html

The purpose of using it before and after encryption is explained here:

https://crypto.stackexchange.com/questions/24431/what-is-the-benefit-of-applying-the-tweak-a-second-time-using-xts

The second 16-byte block gets an IV that is the first block's IV multiplied by an alpha value raised to the power of the block index, with the multiplication done in a finite field (a Galois field, reduced modulo an irreducible polynomial). This effectively means an attacker has _no_ idea what the IV is since it involves a secret key, and each 16-byte block uses a different, unpredictable IV value. XTS also supports ciphertext stealing by default so we can use the LSN if we want, but we aren't sure we need to.

Finally, there is an interesting web page about when not to use XTS:

https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/

Basically, what XTS does is to make the IV unknown to attackers and non-repeating except for multiple writes to a specific 16-byte block (with no LSN change). What isn't clear is whether repeated encryption of different data in the same 16-byte block can leak data.

This probably needs more research, and maybe we need to write something up like the above and let security researchers review it, since there doesn't seem to be enough documentation for us to decide ourselves.

--
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB  https://enterprisedb.com

If only the physical world exists, free will is an illusion.
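For concreteness, here is a minimal sketch of what a per-page AES-XTS call could look like through OpenSSL's EVP interface. The helper name and the tweak layout (relfilenode and block number packed into the 16-byte tweak) are illustrative assumptions only, not the format of any actual patch; with EVP_aes_256_xts() the key argument carries both 256-bit keys concatenated.

#include <string.h>
#include <stdint.h>
#include <openssl/evp.h>

#define PAGE_SIZE 8192

/* key points to 64 bytes: the two concatenated 256-bit keys (K1 || K2). */
static int
encrypt_page_xts(const unsigned char *key,
                 uint32_t relfilenode, uint32_t blocknum,
                 const unsigned char *plain, unsigned char *cipher)
{
    unsigned char   tweak[16] = {0};
    EVP_CIPHER_CTX *ctx;
    int             len, ok = 0;

    /* Pack identifying values into the 16-byte tweak (the "IV"). */
    memcpy(tweak, &relfilenode, sizeof(relfilenode));
    memcpy(tweak + 4, &blocknum, sizeof(blocknum));

    if ((ctx = EVP_CIPHER_CTX_new()) == NULL)
        return 0;
    if (EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak) &&
        EVP_EncryptUpdate(ctx, cipher, &len, plain, PAGE_SIZE) &&
        EVP_EncryptFinal_ex(ctx, cipher + len, &len))
        ok = 1;

    EVP_CIPHER_CTX_free(ctx);
    return ok;
}

XTS produces no expansion and needs no padding, so the ciphertext is exactly PAGE_SIZE bytes; the second half of the key is the one used to encrypt the tweak, while the first half encrypts the data.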
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > As you might have seen from my email in another thread, thanks to > Stephen and Cybertec staff, I am back working on cluster file > encryption/TDE. > > Stephen was going to research if XTS cipher mode would be a good fit for > this since it was determined that CTR cipher mode was too vulnerable to > IV reuse, and the LSN provides insufficient uniqueness. Stephen > reported having trouble finding a definitive answer, so I figured I > would research it myself. > > Of course, I found the same lack of information that Stephen did. ;-) > None of my classic cryptographic books cover XTS or the XEX encryption > mode it is based on, since XTS was only standardized in 2007 and > recommended in 2010. (Yeah, don't get me started on poor cryptographic > documentation.) > > Therefore, I decide to go backward and look at CTR and CBC to see how > the nonce is used there, and if XTS fixes problems with nonce reuse. > > First, I originally chose CTR mode since it was a streaming cipher, and > we therefore could skip certain page fields like the LSN. However, CTR > is very sensitive to LSN reuse since the input bits generate encrypted > bits in exactly the same locations on the page. (It uses a simple XOR > against a cipher). Since sometimes pages with different page contents > are encrypted with the same LSN, especially on replicas, this method > failed. > > Second is CBC mode. which is a block cipher. I thought that meant that > you could only encrypt 16-byte chunks, meaning you couldn't skip > encryption of certain page fields unless they are 16-byte chunks. > However, there is something called ciphertext stealing > (https://en.wikipedia.org/wiki/Ciphertext_stealing#CBC_ciphertext_stealing) > which allows that. I am not sure if OpenSSL supports this, but looking > at my OpenSSL 1.1.1d manual entry for EVP_aes, cipher stealing is only > mentioned for XTS. > > Anyway, CBC mode still needs a nonce for the first 16-byte block, and > then feeds the encrypted output of the first block as a IV to the second > block, etc. This gives us the same problem with finding a nonce per > page. However, since it is a block cipher, the bits don't output in the > same locations they have on input, so that is less of a problem. There > is also the problem that the encrypted output from one 16-byte block > could repeat, causing leakage. > > So, let's look how XTS is designed. First, it uses two keys. If you > are using AES128, you need _two_ 128-bit keys. If using AES256, you > need two 256-bit keys. The first of the two keys is used like normal, > to encrypt the data. The second key, which is also secret, is used to > encrypt the values used for the IV for the first 16-byte block (in our > case dboid, relfilenode, blocknum, maybe LSN). This is most clearly > explained here: > > https://www.kingston.com/unitedstates/en/solutions/data-security/xts-encryption > > That IV is XOR'ed against both the input value and the encryption output > value, as explained here as key tweaking: > > https://crossbowerbt.github.io/xts_mode_tweaking.html > > The purpose of using it before and after encryption is explained here: > > https://crypto.stackexchange.com/questions/24431/what-is-the-benefit-of-applying-the-tweak-a-second-time-using-xts > > The second 16-byte block gets an IV that is the multiplication of the > first IV and an alpha value raised to the second power but mapped to a > finite field (Galois field, modulus a prime). 
This effectively means an > attacker has _no_ idea what the IV is since it involves a secret key, > and each 16-byte block uses a different, unpredictable IV value. XTS > also supports ciphertext stealing by default so we can use the LSN if we > want, but we aren't sure we need to. Yeah, this all seems to be about where I got to too. > Finally, there is an interesting web page about when not to use XTS: > > https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/ This particular article always struck me as more of a reason for us, at least, to use XTS than to not- in particular the very first comment it makes, which seems to be pretty well supported, is: "XTS is the de-facto standard disk encryption mode." Much of the rest of it is the well trodden discussion we've had about how FDE (or TDE in our case) doesn't protect against all the attack vectors that sometimes people think it does. Another point is that XTS isn't authenticated- something else we know quite well around here and isn't news. > Basically, what XTS does is to make the IV unknown to attackers and > non-repeating except for multiple writes to a specific 16-byte block > (with no LSN change). What isn't clear is if repeated encryption of > different data in the same 16-byte block can leak data. Any time a subset of the data is changed but the rest of it isn't, there's a leak of information. This is a really good example of exactly what that looks like: https://github.com/robertdavidgraham/ecb-penguin In our case, if/when this happens (no LSN change, repeated encryption of the same block), someone might be able to deduce that hint bits were being updated/changed, and where some of those are in the block. That said, I don't think that's really a huge issue or something that's a show stopper or a reason to hold off on using XTS. Note that what those bits actually *are* isn't leaked, just that they changed in some fashion inside of that 16-byte cipher text block. That they're directly leaked with CTR is why there was concern raised about using that method, as discussed above and previously. > This probably needs more research and maybe we need to write something > up like the above and let security researchers review it since there > doesn't seem to be enough documentation for us to decide ourselves. The one issue identified here is hopefully answered above and given that what you've found matches what I found, I'd argue that moving forward with XTS makes sense. The other bit of research that I wanted to do, and thanks for sending this and prodding me to go do so, was to look at other implementations and see what they do for the IV when it comes to using XTS, and this is what I found: https://wiki.gentoo.org/wiki/Dm-crypt_full_disk_encryption Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 and then this: https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt where plain64 is defined as: plain64: the initial vector is the 64-bit little-endian version of the sector number, padded with zeros if necessary That is, the default for LUKS is AES, XTS, with a simple IV. That strikes me as a pretty ringing endorsement. Now, to address the concern around re-encrypting a block with the same key+IV but different data and leaking what parts of the page changed, I do think we should use the LSN and have it change regularly (including unlogged tables) but that's just because it's relatively easy for us to do and means an attacker wouldn't be able to tell what part of the page changed when the LSN was also changed. 
That was also recommended by NIST, which is a pretty strong endorsement as well. I'm all for getting security folks and whomever else to come and review this thread and chime in with their thoughts, but I don't think it's a reason to hold off on moving forward with the approach that we've been converging towards. Thanks! Stephen
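To make the plain64 convention above concrete, the tweak it describes is just the 64-bit sector (or, in our case, block) number in little-endian order, zero-padded to 16 bytes. A tiny sketch, with a made-up helper name:

#include <stdint.h>
#include <string.h>

static void
plain64_tweak(uint64_t blocknum, unsigned char tweak[16])
{
    /* 64-bit little-endian block number, zero-padded to 16 bytes. */
    memset(tweak, 0, 16);
    for (int i = 0; i < 8; i++)
        tweak[i] = (unsigned char) (blocknum >> (8 * i));
}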
On 10/15/21 21:22, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: >> As you might have seen from my email in another thread, thanks to >> Stephen and Cybertec staff, I am back working on cluster file >> encryption/TDE. >> >> Stephen was going to research if XTS cipher mode would be a good fit for >> this since it was determined that CTR cipher mode was too vulnerable to >> IV reuse, and the LSN provides insufficient uniqueness. Stephen >> reported having trouble finding a definitive answer, so I figured I >> would research it myself. >> >> Of course, I found the same lack of information that Stephen did. ;-) >> None of my classic cryptographic books cover XTS or the XEX encryption >> mode it is based on, since XTS was only standardized in 2007 and >> recommended in 2010. (Yeah, don't get me started on poor cryptographic >> documentation.) >> >> Therefore, I decide to go backward and look at CTR and CBC to see how >> the nonce is used there, and if XTS fixes problems with nonce reuse. >> >> First, I originally chose CTR mode since it was a streaming cipher, and >> we therefore could skip certain page fields like the LSN. However, CTR >> is very sensitive to LSN reuse since the input bits generate encrypted >> bits in exactly the same locations on the page. (It uses a simple XOR >> against a cipher). Since sometimes pages with different page contents >> are encrypted with the same LSN, especially on replicas, this method >> failed. >> >> Second is CBC mode. which is a block cipher. I thought that meant that >> you could only encrypt 16-byte chunks, meaning you couldn't skip >> encryption of certain page fields unless they are 16-byte chunks. >> However, there is something called ciphertext stealing >> (https://en.wikipedia.org/wiki/Ciphertext_stealing#CBC_ciphertext_stealing) >> which allows that. I am not sure if OpenSSL supports this, but looking >> at my OpenSSL 1.1.1d manual entry for EVP_aes, cipher stealing is only >> mentioned for XTS. >> >> Anyway, CBC mode still needs a nonce for the first 16-byte block, and >> then feeds the encrypted output of the first block as a IV to the second >> block, etc. This gives us the same problem with finding a nonce per >> page. However, since it is a block cipher, the bits don't output in the >> same locations they have on input, so that is less of a problem. There >> is also the problem that the encrypted output from one 16-byte block >> could repeat, causing leakage. >> >> So, let's look how XTS is designed. First, it uses two keys. If you >> are using AES128, you need _two_ 128-bit keys. If using AES256, you >> need two 256-bit keys. The first of the two keys is used like normal, >> to encrypt the data. The second key, which is also secret, is used to >> encrypt the values used for the IV for the first 16-byte block (in our >> case dboid, relfilenode, blocknum, maybe LSN). 
This is most clearly >> explained here: >> >> https://www.kingston.com/unitedstates/en/solutions/data-security/xts-encryption >> >> That IV is XOR'ed against both the input value and the encryption output >> value, as explained here as key tweaking: >> >> https://crossbowerbt.github.io/xts_mode_tweaking.html >> >> The purpose of using it before and after encryption is explained here: >> >> https://crypto.stackexchange.com/questions/24431/what-is-the-benefit-of-applying-the-tweak-a-second-time-using-xts >> >> The second 16-byte block gets an IV that is the multiplication of the >> first IV and an alpha value raised to the second power but mapped to a >> finite field (Galois field, modulus a prime). This effectively means an >> attacker has _no_ idea what the IV is since it involves a secret key, >> and each 16-byte block uses a different, unpredictable IV value. XTS >> also supports ciphertext stealing by default so we can use the LSN if we >> want, but we aren't sure we need to. > > Yeah, this all seems to be about where I got to too. > >> Finally, there is an interesting web page about when not to use XTS: >> >> https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/ > > This particular article always struck me as more of a reason for us, at > least, to use XTS than to not- in particular the very first comment it > makes, which seems to be pretty well supported, is: "XTS is the de-facto > standard disk encryption mode." Much of the rest of it is the well > trodden discussion we've had about how FDE (or TDE in our case) doesn't > protect against all the attack vectors that sometimes people think it > does. Another point is that XTS isn't authenticated- something else we > know quite well around here and isn't news. > >> Basically, what XTS does is to make the IV unknown to attackers and >> non-repeating except for multiple writes to a specific 16-byte block >> (with no LSN change). What isn't clear is if repeated encryption of >> different data in the same 16-byte block can leak data. > > Any time a subset of the data is changed but the rest of it isn't, > there's a leak of information. This is a really good example of exactly > what that looks like: > > https://github.com/robertdavidgraham/ecb-penguin > > In our case, if/when this happens (no LSN change, repeated encryption > of the same block), someone might be able to deduce that hint bits were > being updated/changed, and where some of those are in the block. > > That said, I don't think that's really a huge issue or something that's > a show stopper or a reason to hold off on using XTS. Note that what > those bits actually *are* isn't leaked, just that they changed in some > fashion inside of that 16-byte cipher text block. That they're directly > leaked with CTR is why there was concern raised about using that method, > as discussed above and previously. > Yeah. With CTR you pretty much learn where the hint bits are exactly, while with XTS the whole ciphertext changes. This also means CTR is much more malleable, i.e. you can tweak the ciphertext bits to flip the plaintext, while with XTS that's not really possible - it's pretty much guaranteed to break the block structure. Not sure if that's an issue for our use case, but if it is then neither of the two modes is a solution. >> This probably needs more research and maybe we need to write something >> up like the above and let security researchers review it since there >> doesn't seem to be enough documentation for us to decide ourselves.
> > The one issue identified here is hopefully answered above and given that > what you've found matches what I found, I'd argue that moving forward > with XTS makes sense. > +1 > The other bit of research that I wanted to do, and thanks for sending > this and prodding me to go do so, was to look at other implementations > and see what they do for the IV when it comes to using XTS, and this is > what I found: > > https://wiki.gentoo.org/wiki/Dm-crypt_full_disk_encryption > > Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 > > and then this: > > https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt > > where plain64 is defined as: > > plain64: the initial vector is the 64-bit little-endian version of the > sector number, padded with zeros if necessary > > That is, the default for LUKS is AES, XTS, with a simple IV. That > strikes me as a pretty ringing endorsement. > Seems reasonable, on the assumption the threat models are the same. > Now, to address the concern around re-encrypting a block with the same > key+IV but different data and leaking what parts of the page changed, I > do think we should use the LSN and have it change regularly (including > unlogged tables) but that's just because it's relatively easy for us to > do and means an attacker wouldn't be able to tell what part of the page > changed when the LSN was also changed. That was also recommended by > NIST and that's a pretty strong endorsement also. > Not sure - it seems a bit weird to force LSN change even in cases that don't generate any WAL. I was not following the encryption thread and maybe it was discussed/rejected there, but I've always imagined we'd have a global nonce generator (similar to a sequence) and we'd store it at the end of each block, or something like that. > I'm all for getting security folks and whomever else to come and review > this thread and chime in with their thoughts, but I don't think it's a > reason to hold off on moving forward with the approach that we've been > converging towards. > +1 regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
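To illustrate the malleability point Tomas makes above: with CTR the ciphertext is plaintext XOR keystream, so flipping one ciphertext bit flips exactly the corresponding plaintext bit after decryption, whereas with XTS the same change garbles the whole 16-byte block. A trivial sketch (the helper name is hypothetical):

static void
flip_bit_in_ctr_ciphertext(unsigned char *ciphertext, int byte_offset, int bit)
{
    /*
     * Because C = P XOR keystream, this flips the same bit of the
     * plaintext that the legitimate key holder will later decrypt.
     */
    ciphertext[byte_offset] ^= (unsigned char) (1 << bit);
}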
On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: > Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 > > and then this: > > https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt > > where plain64 is defined as: > > plain64: the initial vector is the 64-bit little-endian version of the > sector number, padded with zeros if necessary > > That is, the default for LUKS is AES, XTS, with a simple IV. That > strikes me as a pretty ringing endorsement. Yes, that sounds promising. It might not hurt to check for other precedents as well, but that seems like a pretty good one. I'm not very convinced that using the LSN for any of this is a good idea. Something that changes most of the time but not all the time seems more like it could hurt by masking fuzzy thinking more than it helps anything. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2021-10-15 15:22:48 -0400, Stephen Frost wrote: > * Bruce Momjian (bruce@momjian.us) wrote: > > Finally, there is an interesting web page about when not to use XTS: > > > > https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/ > > This particular article always struck me as more of a reason for us, at > least, to use XTS than to not- in particular the very first comment it > makes, which seems to be pretty well supported, is: "XTS is the de-facto > standard disk encryption mode." I don't find that line of argument *that* convincing. The reason XTS is the de-facto standard for generic block layer encryption is that you can't add additional data for each block without very significant overhead (basically needing journaling to ensure that the data doesn't get out of sync). But we don't really face the same situation - we *can* add additional data. With something like AES-GCM-SIV we can use the additional data to get IV reuse resistance *and* authentication. And while perhaps we are ok with the IV reuse guarantees XTS has, it seems pretty clear that we'll want guaranteed authenticity at some point. And then we'll need extra data anyway. Thus, to me, it doesn't seem worth going down the XTS route, just to temporarily save a bit of implementation effort. We'll have to endure that pain anyway. Greetings, Andres Freund
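As a point of reference for the "additional data" mentioned above: OpenSSL does not ship AES-GCM-SIV, so the sketch below uses ordinary AES-GCM purely to show where a stored per-page IV, additional authenticated data (relfilenode, block number, ...) and the 16-byte tag would fit. The helper name and layout are assumptions, and with plain GCM (unlike GCM-SIV) the IV must never repeat for a given key.

#include <openssl/evp.h>

#define PAGE_SIZE 8192

static int
encrypt_page_gcm(const unsigned char *key,           /* 32 bytes */
                 const unsigned char *iv,            /* 12 bytes, unique per key */
                 const unsigned char *aad, int aadlen,
                 const unsigned char *plain, unsigned char *cipher,
                 unsigned char tag[16])
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len, ok = 0;

    if (ctx == NULL)
        return 0;
    if (EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv) &&
        EVP_EncryptUpdate(ctx, NULL, &len, aad, aadlen) &&   /* authenticated only, not encrypted */
        EVP_EncryptUpdate(ctx, cipher, &len, plain, PAGE_SIZE) &&
        EVP_EncryptFinal_ex(ctx, cipher + len, &len) &&
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag))
        ok = 1;

    EVP_CIPHER_CTX_free(ctx);
    return ok;
}

Both the IV and the tag would then have to live somewhere on disk, which is exactly the extra per-page space being discussed.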
On 10/15/21 23:02, Robert Haas wrote: > On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: >> Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 >> >> and then this: >> >> https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt >> >> where plain64 is defined as: >> >> plain64: the initial vector is the 64-bit little-endian version of the >> sector number, padded with zeros if necessary >> >> That is, the default for LUKS is AES, XTS, with a simple IV. That >> strikes me as a pretty ringing endorsement. > > Yes, that sounds promising. It might not hurt to check for other > precedents as well, but that seems like a pretty good one. > TrueCrypt/VeraCrypt uses XTS too, I think. There's an overview of other FDE products at [1], and some of them use XTS, but I would take that with a grain of salt - some of the products are somewhat obscure, very old, or both. What is probably more interesting is that there's an IEEE standard [2] dealing with encrypted shared storage, and that uses XTS too. I'd bet there's a bunch of smart cryptographers involved. [1] https://en.wikipedia.org/wiki/Comparison_of_disk_encryption_software [2] https://en.wikipedia.org/wiki/IEEE_P1619 > I'm not very convinced that using the LSN for any of this is a good > idea. Something that changes most of the time but not all the time > seems more like it could hurt by masking fuzzy thinking more than it > helps anything. > I haven't been following the discussion about using LSN, but I agree that while using it seems convenient, the consequences of some changes not incrementing LSN seem potentially disastrous, depending on the encryption mode. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 15, 2021 at 10:57:03PM +0200, Tomas Vondra wrote: > > That said, I don't think that's really a huge issue or something that's > > a show stopper or a reason to hold off on using XTS. Note that what > > those bits actually *are* isn't leaked, just that they changed in some > > fashion inside of that 16-byte cipher text block. That they're directly > > leaked with CTR is why there was concern raised about using that method, > > as discussed above and previously. > > > > Yeah. With CTR you pretty learn where the hint bits are exactly, while with > XTS the whole ciphertext changes. > > This also means CTR is much more malleable, i.e. you can tweak the > ciphertext bits to flip the plaintext, while with XTS that's not really > possible - it's pretty much guaranteed to break the block structure. Not > sure if that's an issue for our use case, but if it is then neither of the > two modes is a solution. Yes, this is a very good point. Let's look at the impact of _not_ using the LSN. For CTR (already rejected) bit changes would be visible by comparing old/new page contents. For CBC (also not under consideration) the first 16-byte block would show a change, and all later 16-byte blocks would show a change. For CBC, you see the 16-byte blocks change, but you have no idea how many bits were changed, and in what locations in the 16-byte block (AES uses substitution and diffusion). For XTS, because earlier blocks don't change the IV used by later blocks like CBC, you would be able to see each 16-byte block that changed in the 8k page. Again, you would not know the number of bits changed or their locations. Do we think knowing which 16-byte blocks on an 8k page change would leak useful information? If so, we should use the LSN and just accept that some cases might leak as described above. If we don't care, then we can skip the use of the LSN and simplify the patch. > Not sure - it seems a bit weird to force LSN change even in cases that don't > generate any WAL. I was not following the encryption thread and maybe it was > discussed/rejected there, but I've always imagined we'd have a global nonce > generator (similar to a sequence) and we'd store it at the end of each > block, or something like that. Storing the nonce in the page means more code complexity, possible performance impact, and the inability to create standbys via binary replication that use cluster file encryption. As a final comment to Andres's email, adding a GCM has the problems above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which could also affect the integrity of the data. Someone could also restore an old copy of a page to revert a change, and that would not be detected even by GCM. I consider this a checkbox feature and making it too complex will cause it to be rightly rejected. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
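To make the "which 16-byte blocks changed" leak concrete: with XTS and an unchanged tweak, comparing two encrypted images of the same page tells an attacker only which 16-byte blocks differ, not how many bits changed or where inside each block. A minimal sketch of that comparison (hypothetical helper):

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 8192
#define AES_BLOCK 16

static void
report_changed_blocks(const unsigned char *old_cipher,
                      const unsigned char *new_cipher)
{
    for (int off = 0; off < PAGE_SIZE; off += AES_BLOCK)
        if (memcmp(old_cipher + off, new_cipher + off, AES_BLOCK) != 0)
            printf("16-byte block at offset %d changed\n", off);
}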
Hi, On 2021-10-16 10:16:25 -0400, Bruce Momjian wrote: > As a final comment to Andres's email, adding a GCM has the problems > above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which > could also affect the integrity of the data. Someone could also restore > and old copy of a patch to revert a change, and that would not be > detected even by GCM. > I consider this a checkbox feature and making it too complex will cause > it to be rightly rejected. You're just deferring / hiding the complexity. For one, we'll need integrity before long if we add encryption support. Then we'll deal with a more complex on-disk format because there will be two different ways of encrypting. For another, you're spreading out the security analysis to a lot of places in the code and more importantly to future changes affecting on-disk data. If it's really just a checkbox feature without a real use case, then we should just reject requests for it and use our energy for useful things. Greetings, Andres Freund
On Sat, Oct 16, 2021 at 09:15:05AM -0700, Andres Freund wrote: > Hi, > > On 2021-10-16 10:16:25 -0400, Bruce Momjian wrote: > > As a final comment to Andres's email, adding a GCM has the problems > > above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which > > could also affect the integrity of the data. Someone could also restore > > and old copy of a patch to revert a change, and that would not be > > detected even by GCM. > > > I consider this a checkbox feature and making it too complex will cause > > it to be rightly rejected. > > You're just deferring / hiding the complexity. For one, we'll need integrity > before long if we add encryption support. Then we'll deal with a more complex > on-disk format because there will be two different ways of encrypting. For > another, you're spreading out the security analysis to a lot of places in the > code and more importantly to future changes affecting on-disk data. > > If it's really just a checkbox feature without a real use case, then we should > just reject requests for it and use our energy for useful things. Agreed. That is the conclusion I came to in May: https://www.postgresql.org/message-id/20210526210201.GZ3048%40momjian.us https://www.postgresql.org/message-id/20210527160003.GF5646%40momjian.us -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On 10/16/21 16:16, Bruce Momjian wrote: > On Fri, Oct 15, 2021 at 10:57:03PM +0200, Tomas Vondra wrote: >>> That said, I don't think that's really a huge issue or something that's >>> a show stopper or a reason to hold off on using XTS. Note that what >>> those bits actually *are* isn't leaked, just that they changed in some >>> fashion inside of that 16-byte cipher text block. That they're directly >>> leaked with CTR is why there was concern raised about using that method, >>> as discussed above and previously. >>> >> >> Yeah. With CTR you pretty learn where the hint bits are exactly, while with >> XTS the whole ciphertext changes. >> >> This also means CTR is much more malleable, i.e. you can tweak the >> ciphertext bits to flip the plaintext, while with XTS that's not really >> possible - it's pretty much guaranteed to break the block structure. Not >> sure if that's an issue for our use case, but if it is then neither of the >> two modes is a solution. > > Yes, this is a vary good point. Let's look at the impact of _not_ using > the LSN. For CTR (already rejected) bit changes would be visible by > comparing old/new page contents. For CBC (also not under consideration) > the first 16-byte block would show a change, and all later 16-byte > blocks would show a change. For CBC, you see the 16-byte blocks change, > but you have no idea how many bits were changed, and in what locations > in the 16-byte block (AES uses substitution and diffusion). For XTS, > because earlier blocks don't change the IV used by later blocks like > CBC, you would be able to see each 16-byte block that changed in the 8k > page. Again, you would not know the number of bits changed or their > locations. > > Do we think knowing which 16-byte blocks on an 8k page change would leak > useful information? If so, we should use the LSN and just accept that > some cases might leak as described above. If we don't care, then we can > skip the use of the LSN and simplify the patch. > >> Not sure - it seems a bit weird to force LSN change even in cases that don't > generate any WAL. I was not following the encryption thread and maybe it was >> discussed/rejected there, but I've always imagined we'd have a global nonce >> generator (similar to a sequence) and we'd store it at the end of each >> block, or something like that. > > Storing the nonce in the page means more code complexity, possible > performance impact, and the inability to create standbys via binary > replication that use cluster file encryption. > Would it really be that complex? Reserving a bunch of bytes at the end of each encrypted page (a bit like the "special" space, but after encryption) seems fairly straightforward. And I don't quite see why this would have a measurable impact, given the nonce is 16B at most. The encryption is likely way more expensive. Moreover, it seems fairly reasonable to accept a bit of code complexity instead of something LSN-based, which seems simpler but apparently has various weak points and is much harder to reason about. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
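A sketch of what reserving a trailer at the end of each page for a stored nonce (and possibly a tag) might look like; the names and sizes are assumptions for illustration, not an actual on-disk format:

#define BLCKSZ               8192
#define ENCRYPTION_NONCE_LEN 16    /* per-page nonce, stored in the clear */
#define ENCRYPTION_TAG_LEN   16    /* optional authentication tag */

/* Space left for normal page contents once the trailer is reserved. */
#define USABLE_PAGE_SIZE     (BLCKSZ - ENCRYPTION_NONCE_LEN - ENCRYPTION_TAG_LEN)

/* Locations of the trailer fields within the on-disk block. */
#define PageEncryptionNonce(page) ((char *) (page) + USABLE_PAGE_SIZE)
#define PageEncryptionTag(page)   ((char *) (page) + USABLE_PAGE_SIZE + ENCRYPTION_NONCE_LEN)

The trade-off being discussed is then visible directly: every page loses some usable space and the on-disk format differs from an unencrypted cluster, but the nonce no longer depends on the LSN at all.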
On 10/16/21 18:28, Bruce Momjian wrote: > On Sat, Oct 16, 2021 at 09:15:05AM -0700, Andres Freund wrote: >> Hi, >> >> On 2021-10-16 10:16:25 -0400, Bruce Momjian wrote: >>> As a final comment to Andres's email, adding a GCM has the problems >>> above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which >>> could also affect the integrity of the data. Someone could also restore >>> and old copy of a patch to revert a change, and that would not be >>> detected even by GCM. >> >>> I consider this a checkbox feature and making it too complex will cause >>> it to be rightly rejected. >> >> You're just deferring / hiding the complexity. For one, we'll need integrity >> before long if we add encryption support. Then we'll deal with a more complex >> on-disk format because there will be two different ways of encrypting. For >> another, you're spreading out the security analysis to a lot of places in the >> code and more importantly to future changes affecting on-disk data. >> I've argued for storing the nonce, but I don't quite see why we would need integrity guarantees? AFAICS the threat model the patch aims to address is an attacker who can observe the data (e.g. a low-privileged OS user), but can't modify the files. Which seems like a reasonable model for shared environments. IMO extending this to cases where the attacker can modify the data moves the goalposts quite significantly. And it's quite possible authenticated encryption would not be enough to prevent that, because that still works only at block level, and you can probably do a lot of harm with replay attacks (e.g. replacing blocks with older versions). And if you can modify the data directory / config files, what are the chances you can't just get access to the database, trace the processes or whatever? We already have a way to check integrity by storing page checksums, but I'm not sure if that's good enough (there are a lot of subtle issues with building a proper AEAD scheme). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Just a mention: the HMAC (or AE/AD) can be disabled in AES-GCM. The HMAC in AES-GCM is an encrypt-then-hash MAC. CRC-32 is not a crypto-safe hash (technically, CRC-32 is not even a hash function). Cryptographers may be unhappy with CRC-32. I think whether it is CRC or SHA is not that important. If the IV can be stored, I believe there should be enough space to store an HMAC as well. On 2021/10/18 05:23, Tomas Vondra wrote: > > I've argued for storing the nonce, but I don't quite see why would we > need integrity guarantees?
On 2021/10/16 04:57, Tomas Vondra wrote: > > Seems reasonable, on the assumption the threat models are the same. On 2021/10/16 03:22, Stephen Frost wrote: > plain64: the initial vector is the 64-bit little-endian version of the > sector number, padded with zeros if necessary > > That is, the default for LUKS is AES, XTS, with a simple IV. That > strikes me as a pretty ringing endorsement On 2021/10/18 05:23, Tomas Vondra wrote: > > AFAICS the threat model the patch aims to address is an attacker who can > observe the data (e.g. a low-privileged OS user), but can't modify the > files. Which seems like a reasonable model for shared environments. I agree with this threat model. And if PostgreSQL uses XTS, there is no difference from dm-crypt; the user can just use dm-crypt directly.
Greetings, * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 10/15/21 21:22, Stephen Frost wrote: > >Now, to address the concern around re-encrypting a block with the same > >key+IV but different data and leaking what parts of the page changed, I > >do think we should use the LSN and have it change regularly (including > >unlogged tables) but that's just because it's relatively easy for us to > >do and means an attacker wouldn't be able to tell what part of the page > >changed when the LSN was also changed. That was also recommended by > >NIST and that's a pretty strong endorsement also. > > Not sure - it seems a bit weird to force LSN change even in cases that don't > generate any WAL. I was not following the encryption thread and maybe it was > discussed/rejected there, but I've always imagined we'd have a global nonce > generator (similar to a sequence) and we'd store it at the end of each > block, or something like that. The 'LSN' being referred to here isn't the regular LSN that is associated with the WAL but rather the separate FakeLSN counter which we already have. I wasn't suggesting having the regular LSN change in cases that don't generate WAL. * Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: > > Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 > > > > and then this: > > > > https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt > > > > where plain64 is defined as: > > > > plain64: the initial vector is the 64-bit little-endian version of the > > sector number, padded with zeros if necessary > > > > That is, the default for LUKS is AES, XTS, with a simple IV. That > > strikes me as a pretty ringing endorsement. > > Yes, that sounds promising. It might not hurt to check for other > precedents as well, but that seems like a pretty good one. > > I'm not very convinced that using the LSN for any of this is a good > idea. Something that changes most of the time but not all the time > seems more like it could hurt by masking fuzzy thinking more than it > helps anything. This argument doesn't come across as very strong at all to me, particularly when we have explicit recommendations from NIST that having the IV vary more is beneficial. While this would be using the LSN, the fact that the LSN changes most of the time but not all of the time isn't new and is something we already have to deal with. I'd think we'd address the concern about mis-thinking around how this works by providing a README and/or an appropriate set of comments around what's being done and why. * Andres Freund (andres@anarazel.de) wrote: > On 2021-10-15 15:22:48 -0400, Stephen Frost wrote: > > * Bruce Momjian (bruce@momjian.us) wrote: > > > Finally, there is an interesting web page about when not to use XTS: > > > > > > https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/ > > > > This particular article always struck me as more of a reason for us, at > > least, to use XTS than to not- in particular the very first comment it > > makes, which seems to be pretty well supported, is: "XTS is the de-facto > > standard disk encryption mode." > > I don't find that line of argument *that* convincing. The reason XTS is the > de-facto standard is that for generic block layer encryption is that you can't > add additional data for each block without very significant overhead > (basically needing journaling to ensure that the data doesn't get out of > sync). 
But we don't really face the same situation - we *can* add additional > data. No, we can't always add additional data, and that's part of the consideration for an XTS option- there are things we can do if we use XTS that we can't with GCM or another solution. Specifically, being able to perform physical replication from an unencrypted cluster to an encrypted one is a worthwhile use-case that we shouldn't be just tossing out. > With something like AES-GCM-SIV we can use the additional data to get IV reuse > resistance *and* authentication. And while perhaps we are ok with the IV reuse > guarantees XTS has, it seems pretty clear that we'll want want guaranteed > authenticity at some point. And then we'll need extra data anyway. I agree that it'd be useful to have an authenticated encryption option. Implementing XTS doesn't preclude us from adding that capability down the road and it's simpler with fewer dependencies. These all strike me as good reasons to add XTS first. > Thus, to me, it doesn't seem worth going down the XTS route, just to > temporarily save a bit of implementation effort. We'll have to endure that > pain anyway. This isn't a valid argument as it isn't just about implementation but about the capabilities we will have once it's done. * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 10/15/21 23:02, Robert Haas wrote: > >On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: > >>That is, the default for LUKS is AES, XTS, with a simple IV. That > >>strikes me as a pretty ringing endorsement. > > > >Yes, that sounds promising. It might not hurt to check for other > >precedents as well, but that seems like a pretty good one. > > TrueCrypt/VeraCrypt uses XTS too, I think. There's an overview of other FDE > products at [1], and some of them use XTS, but I would take that with a > grain of salt - some of the products are somewhat obscure, very old, or > both. > > What is probably more interesting is that there's an IEEE standard [2] > dealing with encrypted shared storage, and that uses XTS too. I'd bet > there's a bunch of smart cryptographers involved. Thanks for finding those and linking to them, that's helpful. > >I'm not very convinced that using the LSN for any of this is a good > >idea. Something that changes most of the time but not all the time > >seems more like it could hurt by masking fuzzy thinking more than it > >helps anything. > > I haven't been following the discussion about using LSN, but I agree that > while using it seems convenient, the consequences of some changes not > incrementing LSN seem potentially disastrous, depending on the encryption > mode. Yes, this depends on the encryption mode, and is why we are specifically talking about XTS here as it's an encryption mode that doesn't suffer from this risk and therefore it's perfectly fine to use the LSN/FakeLSN with XTS (and would also be alright for AES-GCM-SIV as it's specifically designed to be resistant to IV reuse). * Bruce Momjian (bruce@momjian.us) wrote: > On Fri, Oct 15, 2021 at 10:57:03PM +0200, Tomas Vondra wrote: > > > That said, I don't think that's really a huge issue or something that's > > > a show stopper or a reason to hold off on using XTS. Note that what > > > those bits actually *are* isn't leaked, just that they changed in some > > > fashion inside of that 16-byte cipher text block. That they're directly > > > leaked with CTR is why there was concern raised about using that method, > > > as discussed above and previously. > > > > Yeah. 
With CTR you pretty learn where the hint bits are exactly, while with > > XTS the whole ciphertext changes. > > > > This also means CTR is much more malleable, i.e. you can tweak the > > ciphertext bits to flip the plaintext, while with XTS that's not really > > possible - it's pretty much guaranteed to break the block structure. Not > > sure if that's an issue for our use case, but if it is then neither of the > > two modes is a solution. > > Yes, this is a vary good point. Let's look at the impact of _not_ using > the LSN. For CTR (already rejected) bit changes would be visible by > comparing old/new page contents. For CBC (also not under consideration) > the first 16-byte block would show a change, and all later 16-byte > blocks would show a change. For CBC, you see the 16-byte blocks change, > but you have no idea how many bits were changed, and in what locations > in the 16-byte block (AES uses substitution and diffusion). For XTS, > because earlier blocks don't change the IV used by later blocks like > CBC, you would be able to see each 16-byte block that changed in the 8k > page. Again, you would not know the number of bits changed or their > locations. > > Do we think knowing which 16-byte blocks on an 8k page change would leak > useful information? If so, we should use the LSN and just accept that > some cases might leak as described above. If we don't care, then we can > skip the use of the LSN and simplify the patch. While there may not be an active attack against PG that leverages such a leak, I have a hard time seeing why we would intentionally design this in when we have a great option that's directly available to us and doesn't cause such a leak with nearly such regularity as not using the LSN would, and also follows recommendations of using XTS from NIST. Further, not using the LSN wouldn't really be an option if we did eventually implement AES-GCM-SIV, so why not have the two cases be consistent? > > Not sure - it seems a bit weird to force LSN change even in cases that don't > > generate any WAL. I was not following the encryption thread and maybe it was > > discussed/rejected there, but I've always imagined we'd have a global nonce > > generator (similar to a sequence) and we'd store it at the end of each > > block, or something like that. > > Storing the nonce in the page means more code complexity, possible > performance impact, and the inability to create standbys via binary > replication that use cluster file encryption. Right- the point of XTS is that we can do things that we can't with GCM and that it's simpler. > As a final comment to Andres's email, adding a GCM has the problems > above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which > could also affect the integrity of the data. Someone could also restore > and old copy of a patch to revert a change, and that would not be > detected even by GCM. That's an argument based on how things stand today. I appreciate that it's no small thing to consider changes to those other systems but I would argue that having authentication of the heap is still better than not (but also agree that XTS is simpler to implement and therefore makes sense to do first and see how things stand after that's done). Surely, we would want progress made here to be done so incrementally as a patch that attempted to change all of those other systems to be encrypted and authenticated with AES-GCM-SIV would be far too large to consider in one shot anyway. 
> I consider this a checkbox feature and making it too complex will cause > it to be rightly rejected. Presuming that 'checkbox feature' here means "we need it to please $someone but no one will ever use it" or something along those lines, this is very clearly not the case and therefore we shouldn't be describing it or treating it as such. Even if the meaning here is "there's other ways people could get this capability" the reality is that those other methods are simply not always available and in those cases, people will choose to not use PostgreSQL. Nearly every other database system which we might compare ourselves to has a solution in this area and people actively use those solutions in a lot of deployments. * Andres Freund (andres@anarazel.de) wrote: > On 2021-10-16 10:16:25 -0400, Bruce Momjian wrote: > > As a final comment to Andres's email, adding a GCM has the problems > > above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which > > could also affect the integrity of the data. Someone could also restore > > and old copy of a patch to revert a change, and that would not be > > detected even by GCM. > > > I consider this a checkbox feature and making it too complex will cause > > it to be rightly rejected. > > You're just deferring / hiding the complexity. For one, we'll need integrity > before long if we add encryption support. Then we'll deal with a more complex > on-disk format because there will be two different ways of encrypting. For > another, you're spreading out the security analysis to a lot of places in the > code and more importantly to future changes affecting on-disk data. I don't follow this argument. The XTS approach is explicitly the same on-disk format as what we have unencrypted today, just encrypted, and that's the whole point of going with that approach. If we were to implement AES-GCM-SIV, then that would introduce a new on-disk format and then we'd have two- one which has space on each page for the authentication information, and one which doesn't. > If it's really just a checkbox feature without a real use case, then we should > just reject requests for it and use our energy for useful things. This capability certainly has a real use-case and it's one that a lot of organizations are looking for PG to provide a solution for. That we don't today is keeping those organizations from using PG in at least some cases, and for some organizations, it prevents them from using PG at all, as they understandably would rather not deal with a hybrid of using PG for some things and having to use another solution for other things. * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 10/16/21 16:16, Bruce Momjian wrote: > >Storing the nonce in the page means more code complexity, possible > >performance impact, and the inability to create standbys via binary > >replication that use cluster file encryption. > > Would it really be that complex? Reserving a bunch of bytes at the end of > each encrypted page (a bit like the "special" space, but after encryption) > seems fairly straightforward. And I don't quite see why would this have a > measurable impact, given the nonce is 16B at most. The encryption is likely > way more expensive. There's a patch to do exactly this- make space available towards the end of the page. If we go down the route of using a different page format then we lose the ability to do physical replication between an unencrypted cluster and an encrypted one. 
That's certainly a nice capability to have and it will help people migrate to an encrypted PG instance, plus it's overall simpler to work with, which is also an advantage. > Moreover, it seems fairly reasonable to trade a bit of code complexity for > something LSN-based which seems simpler but apparently has various weak > points and is much harder to reason about. This isn't just about code complexity but is also about the resulting capabilities from these different approaches. * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 10/16/21 18:28, Bruce Momjian wrote: > >On Sat, Oct 16, 2021 at 09:15:05AM -0700, Andres Freund wrote: > >>On 2021-10-16 10:16:25 -0400, Bruce Momjian wrote: > >>>As a final comment to Andres's email, adding a GCM has the problems > >>>above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which > >>>could also affect the integrity of the data. Someone could also restore > >>>and old copy of a patch to revert a change, and that would not be > >>>detected even by GCM. > >> > >>>I consider this a checkbox feature and making it too complex will cause > >>>it to be rightly rejected. > >> > >>You're just deferring / hiding the complexity. For one, we'll need integrity > >>before long if we add encryption support. Then we'll deal with a more complex > >>on-disk format because there will be two different ways of encrypting. For > >>another, you're spreading out the security analysis to a lot of places in the > >>code and more importantly to future changes affecting on-disk data. > > I've argued for storing the nonce, but I don't quite see why would we need > integrity guarantees? > > AFAICS the threat model the patch aims to address is an attacker who can > observe the data (e.g. a low-privileged OS user), but can't modify the > files. Which seems like a reasonable model for shared environments. There are multiple threat models which we should be considering and that's why we may want to eventually add integrity. > IMO extending this to cases where the attacker can modify the data moves the > goalposts quite significantly. And it's quite possible authenticated > encryption would not be enough to prevent that, because that still works > only at block level, and you can probably do a lot of harm with replay > attacks (e.g. replacing blocks with older versions). And if you can modify > the data directory / config files, what are the chances you can't just get > access to the database, trace the processes or whatever? I agree that working towards an authenticated solution is a larger task. I don't agree that we should throw out the possibility that we may want to implement it eventually as there are certainly threat models where an attacker might have access to the storage but not to the database or the system on which the database is running. Implementing a system to address such an attack vector would take more consideration than just having authenticated encryption provided by PG, but it certainly couldn't be done without that either. > We already have a way to check integrity by storing page checksum, but I'm > not sure if that's good enough (there's a lot of subtle issues with building > proper AEAD scheme). No, it isn't good enough. * Sasasu (i@sasa.su) wrote: > Just a mention. the HMAC (or AE/AD) can be disabled in AES-GCM. HMAC in > AES-GCM is an encrypt-then-hash MAC. Not sure why you would though. > CRC-32 is not a crypto-safe hash (technically CRC-32 is not a hash > function). Cryptographers may unhappy with CRC-32.
Yes, that's correct (and it isn't even CRC-32 that we have, heh). > I think CRC or SHA is not such important. If IV can be stored, I believe > there should have enough space to store HMAC. This would be the case, yes. If we can find a way to make room for an IV then we could make room to store the tag too, and we certainly should (and we should include a variety of additional data in the AEAD- block number, relfileno, etc). * Sasasu (i@sasa.su) wrote: > On 2021/10/16 04:57, Tomas Vondra wrote: > > Seems reasonable, on the assumption the threat models are the same. > > On 2021/10/16 03:22, Stephen Frost wrote: > >plain64: the initial vector is the 64-bit little-endian version of the > >sector number, padded with zeros if necessary > > > >That is, the default for LUKS is AES, XTS, with a simple IV. That > >strikes me as a pretty ringing endorsement > On 2021/10/18 05:23, Tomas Vondra wrote: > > AFAICS the threat model the patch aims to address is an attacker who can > > observe the data (e.g. a low-privileged OS user), but can't modify the > > files. Which seems like a reasonable model for shared environments. > > I agree this threat model. > > And if PostgreSQL is using XTS, there is no different with dm-encrypt. > The user can use dm-encrypt directly. dm-crypt is not always an option and it doesn't actually address the threat-model that Tomas brought up here anyway, as it would be below the level that the low-privileged OS user would be looking at. That's not the only threat model to consider, but it is one which could potentially be addressed by either XTS or AES-GCM-SIV. There are threat models which dm-crypt would address, of course, such as data-at-rest (hard drive theft, improper disposal of storage media, backups which don't have their own encryption, etc), but, again, dm-crypt isn't always an option that is available and so I don't agree that we should throw this out just because dm-crypt exists and may be useable in some cases. Thanks, Stephen
On Fri, Oct 15, 2021 at 5:21 PM Andres Freund <andres@anarazel.de> wrote: > I don't find that line of argument *that* convincing. The reason XTS is the > de-facto standard is that for generic block layer encryption is that you can't > add additional data for each block without very significant overhead > (basically needing journaling to ensure that the data doesn't get out of > sync). But we don't really face the same situation - we *can* add additional > data. Yes. The downside is that there is some code complexity, and also some runtime overhead even for cases that don't use encryption, because some things that now are compile time constants might need to be computed at runtime. That can probably be made pretty small, though. > With something like AES-GCM-SIV we can use the additional data to get IV reuse > resistance *and* authentication. And while perhaps we are ok with the IV reuse > guarantees XTS has, it seems pretty clear that we'll want want guaranteed > authenticity at some point. And then we'll need extra data anyway. > > Thus, to me, it doesn't seem worth going down the XTS route, just to > temporarily save a bit of implementation effort. We'll have to endure that > pain anyway. I agree up to a point, but I do also kind of feel like we should be leaving it up to whoever is working on an implementation to decide what they want to implement. I don't particularly like this discussion where it feels like people are trying to tell other people what they have to do because "the community has decided X." It's pretty clear that there are multiple opinions here, and I don't really see any of them to be without merit, nor do I see why Bruce or Stephen or you or anyone else should get to say "what the community has decided" in the absence of a clear consensus. I do really like the idea of using AES-GCM-SIV not because I know anything about it, but because the integrity checking seems cool, and storing the nonce seems like it would improve security. However, based on what I know now, I would not vote to reject an XTS-based patch and, as Stephen and Bruce have said, maybe with the right design it permits upgrades from non-encrypted clusters to encrypted clusters. I'm actually kind of doubtful about that, because that seems to require some pretty specific and somewhat painful implementation decisions. For example, I think if your solution for rotating keys is to shut down the standby, re-encrypt it with a new key, start it up again, and fail over to it, then you probably can't ever do key rotation in any other way. The keys now have to be non-WAL-logged so that the standby can be different, which means you can't add a new key on the master and run around re-encrypting everything with it, WAL-logging those changes as you go. Now I realize that implementing that is really challenging anyway so maybe some people wouldn't like to go that way, but then maybe other people would. Another thing you probably can't do in this model is encrypt different parts of the database with different keys, because how would you keep track of that? Certainly not in the system catalogs, if none of that can show up in the WAL stream. But, you know, still: if somebody showed up with a fully-working XTS patch with everything in good working order, I don't see that we have enough evidence to reject it just because it's XTS. And I would hope that the people favoring XTS would not vote to reject a fully working GCM patch just because it's GCM.
I think what we ought to be doing at this point is combining our efforts to try to get some things committed which make future work in this area easier - like that patch to preserve relfilenode and database OID, or maybe some patches to drive all of our I/O through a smaller number of code paths instead of having every different type of temporary file we write reinvent the wheel. These discussions about what encryption type we ought to use are useful for ruling out options that we know are bad, but beyond that I'm not sure they have much value. AES in any mode could seem like a much less safe choice by the time we get a committed feature here than it does today - even if somehow that were to happen for v15. I expect there are people out there trying to break it even as I write these words, and it seems likely that they will eventually succeed, but as to when, who can say? -- Robert Haas EDB: http://www.enterprisedb.com
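To illustrate the earlier point about compile-time constants becoming run-time values once a cluster may or may not reserve per-page space for encryption metadata, a toy example (the names and the 24-byte header size are simplified stand-ins, not PostgreSQL's real definitions):

    #include <stdio.h>

    /*
     * Hypothetical illustration only, not from any posted patch: with an
     * optional per-page trailer for encryption metadata, a value that is a
     * compile-time constant today becomes a run-time one.
     */
    #define BLCKSZ                  8192
    #define SizeOfPageHeaderData    24      /* simplified stand-in */

    static int reserved_page_size = 0;      /* would be read from pg_control */

    static int
    page_usable_space(void)
    {
        return BLCKSZ - SizeOfPageHeaderData - reserved_page_size;
    }

    int
    main(void)
    {
        printf("usable: %d\n", page_usable_space());
        reserved_page_size = 32;            /* e.g. 16-byte IV + 16-byte tag */
        printf("usable with trailer: %d\n", page_usable_space());
        return 0;
    }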
On 10/18/21 04:19, Sasasu wrote: > Just a mention. the HMAC (or AE/AD) can be disabled in AES-GCM. HMAC in > AES-GCM is an encrypt-then-hash MAC. > > CRC-32 is not a crypto-safe hash (technically CRC-32 is not a hash > function). Cryptographers may unhappy with CRC-32. > True. If you can flip enough bits in the page, it probably is not very hard to generate a page with the desired checksum. It's probably harder with XTS, but likely not much more. > I think CRC or SHA is not such important. If IV can be stored, I believe > there should have enough space to store HMAC. > Right, I agree. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10/18/21 17:56, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: >> On 10/15/21 21:22, Stephen Frost wrote: >>> Now, to address the concern around re-encrypting a block with the same >>> key+IV but different data and leaking what parts of the page changed, I >>> do think we should use the LSN and have it change regularly (including >>> unlogged tables) but that's just because it's relatively easy for us to >>> do and means an attacker wouldn't be able to tell what part of the page >>> changed when the LSN was also changed. That was also recommended by >>> NIST and that's a pretty strong endorsement also. >> >> Not sure - it seems a bit weird to force LSN change even in cases that don't >> generate any WAL. I was not following the encryption thread and maybe it was >> discussed/rejected there, but I've always imagined we'd have a global nonce >> generator (similar to a sequence) and we'd store it at the end of each >> block, or something like that. > > The 'LSN' being referred to here isn't the regular LSN that is > associated with the WAL but rather the separate FakeLSN counter which we > already have. I wasn't suggesting having the regular LSN change in > cases that don't generate WAL. > I'm not very familiar with FakeLSN, but isn't that just about unlogged tables? How does that help cases like setting hint bits, which may not generate WAL? > * Robert Haas (robertmhaas@gmail.com) wrote: >> On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: >>> Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 >>> >>> and then this: >>> >>> https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt >>> >>> where plain64 is defined as: >>> >>> plain64: the initial vector is the 64-bit little-endian version of the >>> sector number, padded with zeros if necessary >>> >>> That is, the default for LUKS is AES, XTS, with a simple IV. That >>> strikes me as a pretty ringing endorsement. >> >> Yes, that sounds promising. It might not hurt to check for other >> precedents as well, but that seems like a pretty good one. >> >> I'm not very convinced that using the LSN for any of this is a good >> idea. Something that changes most of the time but not all the time >> seems more like it could hurt by masking fuzzy thinking more than it >> helps anything. > > This argument doesn't come across as very strong at all to me, > particularly when we have explicit recommendations from NIST that having > the IV vary more is beneficial. While this would be using the LSN, the > fact that the LSN changes most of the time but not all of the time isn't > new and is something we already have to deal with. I'd think we'd > address the concern about mis-thinking around how this works by > providing a README and/or an appropriate set of comments around what's > being done and why. > I don't think anyone objects to varying IV more, as recommended by NIST. AFAICS the issue at hand is exactly the opposite - maybe not varying it enough, in some cases. It might be enough for MVCC purposes yet it might result in fatal failure of the encryption scheme. That's my concern, at least, and I assume it's what Robert meant by "fuzzy thinking" too. FWIW I think we seem to be mixing nonces, IVs and tweak values. Although various encryption schemes place different requirements on those anyway. 
> * Andres Freund (andres@anarazel.de) wrote: >> On 2021-10-15 15:22:48 -0400, Stephen Frost wrote: >>> * Bruce Momjian (bruce@momjian.us) wrote: >>>> Finally, there is an interesting web page about when not to use XTS: >>>> >>>> https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/ >>> >>> This particular article always struck me as more of a reason for us, at >>> least, to use XTS than to not- in particular the very first comment it >>> makes, which seems to be pretty well supported, is: "XTS is the de-facto >>> standard disk encryption mode." >> >> I don't find that line of argument *that* convincing. The reason XTS is the >> de-facto standard is that for generic block layer encryption is that you can't >> add additional data for each block without very significant overhead >> (basically needing journaling to ensure that the data doesn't get out of >> sync). But we don't really face the same situation - we *can* add additional >> data. > > No, we can't always add additional data, and that's part of the > consideration for an XTS option- there are things we can do if we use > XTS that we can't with GCM or another solution. Specifically, being > able to perform physical replication from an unencrypted cluster to an > encrypted one is a worthwhile use-case that we shouldn't be just tossing > out. > Yeah, XTS seems like a reasonable first step, both because it doesn't require storing extra data and it's widespread use in FDE software (of course, there's a link between those). And I suspect replication between encrypted and unencrypted clusters is going to be a huge can of worms, even with XTS. It's probably a silly / ugly idea, but can't we simply store a special "page format" flag in controldat - when set to 'true' during initdb, each page would have a bit of space (at the end) reserved for additional encryption data. Say, ~64B should be enough. On the encrypted cluster this would store the nonce/IV/... and on the unencrypted cluster it'd be simply unused. 64B seems like a negligible amount of data. And when set to 'false' the cluster would not allow encryption. >> With something like AES-GCM-SIV we can use the additional data to get IV reuse >> resistance *and* authentication. And while perhaps we are ok with the IV reuse >> guarantees XTS has, it seems pretty clear that we'll want want guaranteed >> authenticity at some point. And then we'll need extra data anyway. > > I agree that it'd be useful to have an authenticated encryption option. > Implementing XTS doesn't preclude us from adding that capability down > the road and it's simpler with fewer dependencies. These all strike me > as good reasons to add XTS first. > True. If XTS addresses the threat model we aimed to solve ... >> Thus, to me, it doesn't seem worth going down the XTS route, just to >> temporarily save a bit of implementation effort. We'll have to endure that >> pain anyway. > > This isn't a valid argument as it isn't just about implementation but > about the capabilities we will have once it's done. > > * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: >> On 10/15/21 23:02, Robert Haas wrote: >>> On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: >>>> That is, the default for LUKS is AES, XTS, with a simple IV. That >>>> strikes me as a pretty ringing endorsement. >>> >>> Yes, that sounds promising. It might not hurt to check for other >>> precedents as well, but that seems like a pretty good one. >> >> TrueCrypt/VeraCrypt uses XTS too, I think. 
There's an overview of other FDE >> products at [1], and some of them use XTS, but I would take that with a >> grain of salt - some of the products are somewhat obscure, very old, or >> both. >> >> What is probably more interesting is that there's an IEEE standard [2] >> dealing with encrypted shared storage, and that uses XTS too. I'd bet >> there's a bunch of smart cryptographers involved. > > Thanks for finding those and linking to them, that's helpful. > >>> I'm not very convinced that using the LSN for any of this is a good >>> idea. Something that changes most of the time but not all the time >>> seems more like it could hurt by masking fuzzy thinking more than it >>> helps anything. >> >> I haven't been following the discussion about using LSN, but I agree that >> while using it seems convenient, the consequences of some changes not >> incrementing LSN seem potentially disastrous, depending on the encryption >> mode. > > Yes, this depends on the encryption mode, and is why we are specifically > talking about XTS here as it's an encryption mode that doesn't suffer > from this risk and therefore it's perfectly fine to use the LSN/FakeLSN > with XTS (and would also be alright for AES-GCM-SIV as it's specifically > designed to be resistant to IV reuse). > I'm not quite sure about the "perfectly fine" bit, as it's making XTS vulnerable to traffic analysis attacks (comparing multiple copies of an encrypted block). It may be a reasonable trade-off, of course. > * Bruce Momjian (bruce@momjian.us) wrote: >> On Fri, Oct 15, 2021 at 10:57:03PM +0200, Tomas Vondra wrote: >>>> That said, I don't think that's really a huge issue or something that's >>>> a show stopper or a reason to hold off on using XTS. Note that what >>>> those bits actually *are* isn't leaked, just that they changed in some >>>> fashion inside of that 16-byte cipher text block. That they're directly >>>> leaked with CTR is why there was concern raised about using that method, >>>> as discussed above and previously. >>> >>> Yeah. With CTR you pretty learn where the hint bits are exactly, while with >>> XTS the whole ciphertext changes. >>> >>> This also means CTR is much more malleable, i.e. you can tweak the >>> ciphertext bits to flip the plaintext, while with XTS that's not really >>> possible - it's pretty much guaranteed to break the block structure. Not >>> sure if that's an issue for our use case, but if it is then neither of the >>> two modes is a solution. >> >> Yes, this is a vary good point. Let's look at the impact of _not_ using >> the LSN. For CTR (already rejected) bit changes would be visible by >> comparing old/new page contents. For CBC (also not under consideration) >> the first 16-byte block would show a change, and all later 16-byte >> blocks would show a change. For CBC, you see the 16-byte blocks change, >> but you have no idea how many bits were changed, and in what locations >> in the 16-byte block (AES uses substitution and diffusion). For XTS, >> because earlier blocks don't change the IV used by later blocks like >> CBC, you would be able to see each 16-byte block that changed in the 8k >> page. Again, you would not know the number of bits changed or their >> locations. >> >> Do we think knowing which 16-byte blocks on an 8k page change would leak >> useful information? If so, we should use the LSN and just accept that >> some cases might leak as described above. If we don't care, then we can >> skip the use of the LSN and simplify the patch. 
> > While there may not be an active attack against PG that leverages such a > leak, I have a hard time seeing why we would intentionally design this > in when we have a great option that's directly available to us and > doesn't cause such a leak with nearly such regularity as not using the > LSN would, and also follows recommendations of using XTS from NIST. > Further, not using the LSN wouldn't really be an option if we did > eventually implement AES-GCM-SIV, so why not have the two cases be > consistent? > I'm a bit confused, because the question was what happens if we encrypt the page twice with the same LSN or any tweak value in general. It certainly does not matter when it comes to malleability or replay attacks, because in that case the attacker is the one who modifies the block (and obviously won't change the LSN). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
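Here is a small self-contained demonstration of the property being discussed: re-encrypting a page with the same key and tweak shows which 16-byte blocks changed, but not how many bits changed or where within the block. It uses OpenSSL's aes-256-xts with placeholder key and tweak values, purely as an illustration:

    #include <openssl/evp.h>
    #include <stdio.h>
    #include <string.h>

    #define BLCKSZ 8192

    /* Encrypt buf with AES-256-XTS; key is 64 bytes (two 256-bit keys),
     * tweak is 16 bytes.  Illustrative only. */
    static void
    xts_encrypt(const unsigned char *key, const unsigned char *tweak,
                const unsigned char *in, unsigned char *out)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int len;

        EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak);
        EVP_EncryptUpdate(ctx, out, &len, in, BLCKSZ);
        EVP_EncryptFinal_ex(ctx, out + len, &len);
        EVP_CIPHER_CTX_free(ctx);
    }

    int
    main(void)
    {
        unsigned char key[64] = {1}, tweak[16] = {2};
        unsigned char page[BLCKSZ] = {0}, c1[BLCKSZ], c2[BLCKSZ];
        int changed = 0;

        xts_encrypt(key, tweak, page, c1);
        page[5000] ^= 0x01;                 /* flip one "hint bit" */
        xts_encrypt(key, tweak, page, c2);

        for (int i = 0; i < BLCKSZ; i += 16)
            if (memcmp(c1 + i, c2 + i, 16) != 0)
                changed++;

        /* prints 1: exactly one 16-byte block differs, and only its position
         * (not the number or location of the flipped bits) is visible */
        printf("16-byte blocks that differ: %d\n", changed);
        return 0;
    }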
On 10/18/21 17:56, Stephen Frost wrote: >> ... >> I've argued for storing the nonce, but I don't quite see why would we need >> integrity guarantees? >> >> AFAICS the threat model the patch aims to address is an attacker who can >> observe the data (e.g. a low-privileged OS user), but can't modify the >> files. Which seems like a reasonable model for shared environments. > > There are multiple threat models which we should be considering and > that's why we may want to eventually add integrity. > So what are these threat models? If we should be considering them it'd be nice to have a description, explaining what capabilities must the attacker have ... My (perhaps naive) understanding is that the authentication / integrity provides (partial) protection against attackers that may modify instance data - modify files, etc. But I'd guess an attacker with such capability can do various other (simpler) things to extract data. Say, modify the config to load an extension that dumps keys from memory, or whatever. So what's a plausible / practical threat model that would be mitigated by the authenticated encryption? It'd be a bit silly to add complexity to allow AEAD, only to find out there are ways around it. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2021/10/19 00:37, Robert Haas wrote: > I think what we ought to be doing at > this point is combining our efforts to try to get some things > committed which make future work in this area easier - like that > patch to preserve relfilenode and database OID, or maybe some patches > to drive all of our I/O through a smaller number of code paths instead > of having every different type of temporary file we write reinvent the > wheel. A unified block-based I/O API sounds great. Has anyone tried to do this before? It would be nice if the front-end tools could also use this API. As there are so many threat models, I propose doing the TDE feature as a set of hooks. Those hooks would sit on the critical path of I/O operations and let an extension replace the I/O API; the extension would also be loaded during initdb, in single-user mode, and in the front-end tools. This sounds like using $LD_PRELOAD to replace pread(2) and pwrite(2), which is widely used for folder-based encryption, but the hook would pass more context (relfilenode, table OID, block size, and so on) down to the lower layer; the hook API would look like object_access_hook. Then implement the simplest AES-XTS and put it in contrib, and provide a tool to deactivate AES-XTS so that PostgreSQL remains upgradeable. I think this is the most peaceful method: the GCM people will not reject it just because it is XTS, and the XTS people will (maybe?) be satisfied with the complexity. For performance, it is just one more long jump compared with the current AES-XTS code.
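Purely to illustrate the shape of the hook being proposed here (hypothetical; no such hook exists in core and every name below is invented):

    #include <stdint.h>
    #include <sys/types.h>

    /* Hypothetical sketch of the proposed hook shape; nothing like this
     * exists in core, and the names are invented for illustration. */
    typedef struct PageIOContext
    {
        uint32_t    dboid;
        uint32_t    tableoid;
        uint32_t    relfilenode;
        uint32_t    blocknum;
        size_t      blocksize;
    } PageIOContext;

    /* Called instead of pread(2)/pwrite(2) on the data-file path. */
    typedef ssize_t (*page_read_hook_type) (const PageIOContext *ctx,
                                            int fd, void *buf, off_t offset);
    typedef ssize_t (*page_write_hook_type) (const PageIOContext *ctx,
                                             int fd, const void *buf, off_t offset);

    extern page_read_hook_type  page_read_hook;    /* NULL = plain pread() */
    extern page_write_hook_type page_write_hook;   /* NULL = plain pwrite() */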
On Tue, Oct 19, 2021 at 11:46 AM Sasasu <i@sasa.su> wrote: > As there are so many threat models, I propose doing the TDE feature as a > set of hooks. This is too invasive to do using hooks. We are inevitably going to need to make significant changes in core. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 10/18/21 17:56, Stephen Frost wrote: > >* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > >>On 10/15/21 21:22, Stephen Frost wrote: > >>>Now, to address the concern around re-encrypting a block with the same > >>>key+IV but different data and leaking what parts of the page changed, I > >>>do think we should use the LSN and have it change regularly (including > >>>unlogged tables) but that's just because it's relatively easy for us to > >>>do and means an attacker wouldn't be able to tell what part of the page > >>>changed when the LSN was also changed. That was also recommended by > >>>NIST and that's a pretty strong endorsement also. > >> > >>Not sure - it seems a bit weird to force LSN change even in cases that don't > >>generate any WAL. I was not following the encryption thread and maybe it was > >>discussed/rejected there, but I've always imagined we'd have a global nonce > >>generator (similar to a sequence) and we'd store it at the end of each > >>block, or something like that. > > > >The 'LSN' being referred to here isn't the regular LSN that is > >associated with the WAL but rather the separate FakeLSN counter which we > >already have. I wasn't suggesting having the regular LSN change in > >cases that don't generate WAL. > > I'm not very familiar with FakeLSN, but isn't that just about unlogged > tables? How does that help cases like setting hint bits, which may not > generate WAL? Errr, there seems to have been some confusion in this thread between the different ideas being tossed around. The point of using FakeLSN for unlogged tables consistently is to provide variability in the value used as the IV. I wasn't suggesting to use FakeLSN to provide a uniqueness guarantee- the reason we're talking about using XTS is specifically because, unlike CTR, unique IVs are not required. Further, we don't need to find any additional space on the page if we use XTS, meaning we can do things like go from an unencrypted to an encrypted system with a basebackup+physical replication. > >* Robert Haas (robertmhaas@gmail.com) wrote: > >>On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: > >>>Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 > >>> > >>>and then this: > >>> > >>>https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt > >>> > >>>where plain64 is defined as: > >>> > >>>plain64: the initial vector is the 64-bit little-endian version of the > >>>sector number, padded with zeros if necessary > >>> > >>>That is, the default for LUKS is AES, XTS, with a simple IV. That > >>>strikes me as a pretty ringing endorsement. > >> > >>Yes, that sounds promising. It might not hurt to check for other > >>precedents as well, but that seems like a pretty good one. > >> > >>I'm not very convinced that using the LSN for any of this is a good > >>idea. Something that changes most of the time but not all the time > >>seems more like it could hurt by masking fuzzy thinking more than it > >>helps anything. > > > >This argument doesn't come across as very strong at all to me, > >particularly when we have explicit recommendations from NIST that having > >the IV vary more is beneficial. While this would be using the LSN, the > >fact that the LSN changes most of the time but not all of the time isn't > >new and is something we already have to deal with. 
I'd think we'd > >address the concern about mis-thinking around how this works by > >providing a README and/or an appropriate set of comments around what's > >being done and why. > > I don't think anyone objects to varying IV more, as recommended by NIST. Great. > AFAICS the issue at hand is exactly the opposite - maybe not varying it > enough, in some cases. It might be enough for MVCC purposes yet it might > result in fatal failure of the encryption scheme. That's my concern, at > least, and I assume it's what Robert meant by "fuzzy thinking" too. XTS does not require the IV to be unique for every invocation, and indeed other systems like dm-crypt don't use a unique IV for XTS. We really can't divorce the encryption methodology from the parameters that are being used. CTR and GCM are the methods that require a unique IV (or, at least, a very very very likely unique one if you can't actually implement a proper counter) but that's *not* what we're talking about here. The methods being discussed are XTS and GCM-SIV, the former explicitly doesn't require a unique IV and the latter is specifically designed to reduce the impact of an IV being re-used. Both are, as NIST points out, better off with a varying IV, but having the IV be reused from time to time in either XTS or GCM-SIV does not result in a fatal failure of the encryption scheme. > FWIW I think we seem to be mixing nonces, IVs and tweak values. Although > various encryption schemes place different requirements on those anyway. The differences between those are pretty subtle and it gets a bit confusing when things like OpenSSL accept an IV but then use it as the 'tweak', such as with XTS. In general though, most methods accept a key, a nonce/IV/tweak, and then produce ciphertext and possbily a tag, so I don't think we've been incorrect in usage but rather perhaps a bit sloppy by not using the right term for the specific methodology. > >* Andres Freund (andres@anarazel.de) wrote: > >>On 2021-10-15 15:22:48 -0400, Stephen Frost wrote: > >>>* Bruce Momjian (bruce@momjian.us) wrote: > >>>>Finally, there is an interesting web page about when not to use XTS: > >>>> > >>>> https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/ > >>> > >>>This particular article always struck me as more of a reason for us, at > >>>least, to use XTS than to not- in particular the very first comment it > >>>makes, which seems to be pretty well supported, is: "XTS is the de-facto > >>>standard disk encryption mode." > >> > >>I don't find that line of argument *that* convincing. The reason XTS is the > >>de-facto standard is that for generic block layer encryption is that you can't > >>add additional data for each block without very significant overhead > >>(basically needing journaling to ensure that the data doesn't get out of > >>sync). But we don't really face the same situation - we *can* add additional > >>data. > > > >No, we can't always add additional data, and that's part of the > >consideration for an XTS option- there are things we can do if we use > >XTS that we can't with GCM or another solution. Specifically, being > >able to perform physical replication from an unencrypted cluster to an > >encrypted one is a worthwhile use-case that we shouldn't be just tossing > >out. > > Yeah, XTS seems like a reasonable first step, both because it doesn't > require storing extra data and it's widespread use in FDE software (of > course, there's a link between those). 
And I suspect replication between > encrypted and unencrypted clusters is going to be a huge can of worms, even > with XTS. Glad you agree with XTS being a reasonable first step. XTS will at least make physical replication possible- other methods (such as below) simply won't work. > It's probably a silly / ugly idea, but can't we simply store a special "page > format" flag in controldat - when set to 'true' during initdb, each page > would have a bit of space (at the end) reserved for additional encryption > data. Say, ~64B should be enough. On the encrypted cluster this would store > the nonce/IV/... and on the unencrypted cluster it'd be simply unused. 64B > seems like a negligible amount of data. And when set to 'false' the cluster > would not allow encryption. This is essentially what the patch that was posted does, but the problem is that you can't do physical replication when 64B have been stolen off of a page because the page in the unencrypted database might be entirely full and not able to physically fit those extra 64B, and then what? > >>With something like AES-GCM-SIV we can use the additional data to get IV reuse > >>resistance *and* authentication. And while perhaps we are ok with the IV reuse > >>guarantees XTS has, it seems pretty clear that we'll want want guaranteed > >>authenticity at some point. And then we'll need extra data anyway. > > > >I agree that it'd be useful to have an authenticated encryption option. > >Implementing XTS doesn't preclude us from adding that capability down > >the road and it's simpler with fewer dependencies. These all strike me > >as good reasons to add XTS first. > > True. If XTS addresses the threat model we aimed to solve ... It addresses a valid threat model that people are interested in PG having a solution for, yes. > >>Thus, to me, it doesn't seem worth going down the XTS route, just to > >>temporarily save a bit of implementation effort. We'll have to endure that > >>pain anyway. > > > >This isn't a valid argument as it isn't just about implementation but > >about the capabilities we will have once it's done. > > > >* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > >>On 10/15/21 23:02, Robert Haas wrote: > >>>On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: > >>>>That is, the default for LUKS is AES, XTS, with a simple IV. That > >>>>strikes me as a pretty ringing endorsement. > >>> > >>>Yes, that sounds promising. It might not hurt to check for other > >>>precedents as well, but that seems like a pretty good one. > >> > >>TrueCrypt/VeraCrypt uses XTS too, I think. There's an overview of other FDE > >>products at [1], and some of them use XTS, but I would take that with a > >>grain of salt - some of the products are somewhat obscure, very old, or > >>both. > >> > >>What is probably more interesting is that there's an IEEE standard [2] > >>dealing with encrypted shared storage, and that uses XTS too. I'd bet > >>there's a bunch of smart cryptographers involved. > > > >Thanks for finding those and linking to them, that's helpful. > > > >>>I'm not very convinced that using the LSN for any of this is a good > >>>idea. Something that changes most of the time but not all the time > >>>seems more like it could hurt by masking fuzzy thinking more than it > >>>helps anything. 
> >> > >>I haven't been following the discussion about using LSN, but I agree that > >>while using it seems convenient, the consequences of some changes not > >>incrementing LSN seem potentially disastrous, depending on the encryption > >>mode. > > > >Yes, this depends on the encryption mode, and is why we are specifically > >talking about XTS here as it's an encryption mode that doesn't suffer > >from this risk and therefore it's perfectly fine to use the LSN/FakeLSN > >with XTS (and would also be alright for AES-GCM-SIV as it's specifically > >designed to be resistant to IV reuse). > > I'm not quite sure about the "perfectly fine" bit, as it's making XTS > vulnerable to traffic analysis attacks (comparing multiple copies of an > encrypted block). It may be a reasonable trade-off, of course. Considering the default usage in dmcrypt isn't even varying the IV as much as what we're talking about here, I'd say that, yes, it's quite reasonable for our use-case and allows us to vary the IV quite a bit which reduces the attack surface in a meaningful way. That dmcrypt has this risk and it isn't considered enough of an issue for them to use something other than plain64 as the default makes it certainly seem reasonable to me. > >* Bruce Momjian (bruce@momjian.us) wrote: > >>On Fri, Oct 15, 2021 at 10:57:03PM +0200, Tomas Vondra wrote: > >>>>That said, I don't think that's really a huge issue or something that's > >>>>a show stopper or a reason to hold off on using XTS. Note that what > >>>>those bits actually *are* isn't leaked, just that they changed in some > >>>>fashion inside of that 16-byte cipher text block. That they're directly > >>>>leaked with CTR is why there was concern raised about using that method, > >>>>as discussed above and previously. > >>> > >>>Yeah. With CTR you pretty learn where the hint bits are exactly, while with > >>>XTS the whole ciphertext changes. > >>> > >>>This also means CTR is much more malleable, i.e. you can tweak the > >>>ciphertext bits to flip the plaintext, while with XTS that's not really > >>>possible - it's pretty much guaranteed to break the block structure. Not > >>>sure if that's an issue for our use case, but if it is then neither of the > >>>two modes is a solution. > >> > >>Yes, this is a vary good point. Let's look at the impact of _not_ using > >>the LSN. For CTR (already rejected) bit changes would be visible by > >>comparing old/new page contents. For CBC (also not under consideration) > >>the first 16-byte block would show a change, and all later 16-byte > >>blocks would show a change. For CBC, you see the 16-byte blocks change, > >>but you have no idea how many bits were changed, and in what locations > >>in the 16-byte block (AES uses substitution and diffusion). For XTS, > >>because earlier blocks don't change the IV used by later blocks like > >>CBC, you would be able to see each 16-byte block that changed in the 8k > >>page. Again, you would not know the number of bits changed or their > >>locations. > >> > >>Do we think knowing which 16-byte blocks on an 8k page change would leak > >>useful information? If so, we should use the LSN and just accept that > >>some cases might leak as described above. If we don't care, then we can > >>skip the use of the LSN and simplify the patch. 
> > > >While there may not be an active attack against PG that leverages such a > >leak, I have a hard time seeing why we would intentionally design this > >in when we have a great option that's directly available to us and > >doesn't cause such a leak with nearly such regularity as not using the > >LSN would, and also follows recommendations of using XTS from NIST. > >Further, not using the LSN wouldn't really be an option if we did > >eventually implement AES-GCM-SIV, so why not have the two cases be > >consistent? > > I'm a bit confused, because the question was what happens if we encrypt the > page twice with the same LSN or any tweak value in general. It certainly > does not matter when it comes to malleability or replay attacks, because in > that case the attacker is the one who modifies the block (and obviously > won't change the LSN). The question that I was responding to above was specifically about if knowing which 16-byte blocks on an 8K page changes was an issue or not and that's what I was addressing. As for if we encrypt the page twice with the same LSN/tweak/IV or not- that depends on the specific encryption methodology being used as to how much of an issue that is, as discussed above. * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 10/18/21 17:56, Stephen Frost wrote: > >> ... > >>I've argued for storing the nonce, but I don't quite see why would we need > >>integrity guarantees? > >> > >>AFAICS the threat model the patch aims to address is an attacker who can > >>observe the data (e.g. a low-privileged OS user), but can't modify the > >>files. Which seems like a reasonable model for shared environments. > > > >There are multiple threat models which we should be considering and > >that's why we may want to eventually add integrity. > > So what are these threat models? If we should be considering them it'd be > nice to have a description, explaining what capabilities must the attacker > have ... The first and simplest to consider is the basic "data at rest" threat model, where a hard drive is not properly wiped or a backup isn't properly encrypted, or a laptop or other removable media is stolen, etc. These are well addressed through XTS as in those cases what is needed is confidentiality, not integrity, as the attacker isn't able to modify the existing system in such cases. Another threat model to consider is if the attacker has read-only access to the data directory through, say, unix group read privileges or maybe the ability to monitor the traffic on the SAN, or the ability to read-only mount the LUN on to another system. This might be obtained by attacking a backup process where the system was configured to run physical backups using an unprivileged OS user who only has group read access to the cluster (and the necessary but non-superuser privleges in the database system to start/stop the backup), or various potential attacks at the storage layer. This is similar to the "data at rest" case above in that XTS works well to address this, but because the attacker would have ongoing access (rather than just one-time, such as in the first case), information such as which blocks are being changed inside of a given 8k page might be able to be determined and that could be useful information, though a point here: they would already be able to see clearly which 8k pages are being changed and which aren't, and there's not really any way for us to prevent that reasonably. 
As such, I'd argue that using XTS is reasonable and we can mitigate some of this concern by using the LSN in the tweak instead of just the block number as the 'plain64' option in dmcrypt does. That doing so would mean that we'd also be able to reasonably use the same IV for both XTS and AES-GCM-SIV, should we choose to implement that at some point down the road, is a nice perk to this approach. > My (perhaps naive) understanding is that the authentication / integrity > provides (partial) protection against attackers that may modify instance > data - modify files, etc. But I'd guess an attacker with such capability can > do various other (simpler) things to extract data. Say, modify the config to > load an extension that dumps keys from memory, or whatever. Another attack vector to consider is an attacker who is able to actively modify files on the system in some way. The specific files they are able to modify matter in such a case. There's no doubt that it's more difficult to address such an attack vector and it's unlikely we'd be able to provide a 100% solution in all cases, but that doesn't mean we should throw out the idea entirely. > So what's a plausible / practical threat model that would be mitigated by > the authenticated encryption? In a similar vein to above- consider a storage-level attacker who is able to gain read/write access to a particular volume. If that volume is used only for a tablespace, a TDE implementation which uses AES-GCM-SIV would go a long way towards protecting the system. We may even go so far as to encourage users to ensure that their 'main' PG data directory, where configuration, transaction log, et al, are stored be kept on a more secure (perhaps local) volume. This isn't a 100% solution, of course, due to the visibility map and the free space map being stored with the tables in the tablespaces, but attacks on those would be much more coarse and difficult to mount effectively. We should also be thinking about ways to address those other subsystems and improve the situation around them (and not just for encryption but also for detection of corruption) but trying to get all of that done in one patch or even one major release would make this a massive change that would almost certainly be rejected as too destablizing. > It'd be a bit silly to add complexity to allow AEAD, only to find out there > are ways around it. There are ways around it. There likely always will be. We need to be clear about what it provides and what it doesn't. We need to stop telling ourselves that the only answer is a 100% solution and therefore it's impossible to do. Users who care about these capabilities will understand that it's not 100% and they will still happily use it because it's better than 0% which is where we are today and is why they are going with other solutions. Yes, if it's trivial to get around then perhaps it's not much better than 0% and if that's the case then it doesn't make sense to do it, but none of what has been discussed here thus far has made me feel like either the XTS or the GCM-SIV approaches would be trivial to to circumvent for the threat models they're intended to address, though it certainly takes more care and more thought when we're trying to address someone who has write access to part of the system and that we need to be clear what is addressed and what isn't in all of these cases. Thanks, Stephen
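As a concrete sketch of "block number plus LSN in the tweak", the call could look roughly like this with OpenSSL's aes-256-xts; the exact tweak layout below is an assumption for illustration, not a settled design:

    #include <openssl/evp.h>
    #include <stdint.h>
    #include <string.h>

    #define BLCKSZ 8192

    /*
     * Sketch only: one possible way to build the 16-byte XTS tweak from the
     * block address plus the page LSN, then encrypt the page.  The tweak
     * layout here is an illustrative assumption.
     */
    static int
    encrypt_page_xts(const unsigned char *key,   /* 64 bytes for AES-256-XTS */
                     uint64_t lsn, uint32_t relfilenode, uint32_t blocknum,
                     const unsigned char *plain, unsigned char *cipher)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        unsigned char   tweak[16];
        int             len, ok;

        memcpy(tweak, &lsn, 8);              /* varies on most page changes */
        memcpy(tweak + 8, &relfilenode, 4);  /* distinguishes relations */
        memcpy(tweak + 12, &blocknum, 4);    /* distinguishes blocks */

        ok = ctx != NULL &&
            EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak) &&
            EVP_EncryptUpdate(ctx, cipher, &len, plain, BLCKSZ) &&
            EVP_EncryptFinal_ex(ctx, cipher + len, &len);

        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }

Since the tweak is only 16 bytes, whatever combination of LSN, relfilenode and block number is chosen has to fit in (or be reduced to) that space; the same inputs could later feed the nonce for AES-GCM-SIV if that mode were added.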
Greetings, * Sasasu (i@sasa.su) wrote: > On 2021/10/19 00:37, Robert Haas wrote: > >I think what we ought to be doing at > >this point is combining our efforts to try to get some things > >committed which make future work in this area easier - like that > >patch to preserve relfilenode and database OID, or maybe some patches > >to drive all of our I/O through a smaller number of code paths instead > >of having every different type of temporary file we write reinvent the > >wheel. > > A unified block-based I/O API sounds great. Has anyone tried to do this > before? It would be nice if the front-end tools could also use this API. The TDE patch from Cybertec did go down this route, but the API ended up being rather different, which meant a lot of changes in other parts of the system. If we can get a block-based temporary file method that maintains more-or-less the same API, that'd be great, but I'm not sure that we can really do so and I am not entirely convinced that we should make the TDE effort depend on an otherwise quite independent effort of making all temp file usage block based. > As there are so many threat models, I propose doing the TDE feature as a set > of hooks. Those hooks would sit on the critical path of I/O operations and > let an extension replace the I/O API; the extension would also be loaded > during initdb, in single-user mode, and in the front-end tools. > This sounds like using $LD_PRELOAD to replace pread(2) and pwrite(2), which > is widely used for folder-based encryption, but the hook would pass more > context (relfilenode, table OID, block size, and so on) down to the lower > layer; the hook API would look like object_access_hook. > Then implement the simplest AES-XTS and put it in contrib, and provide a > tool to deactivate AES-XTS so that PostgreSQL remains upgradeable. > > I think this is the most peaceful method: the GCM people will not reject it > just because it is XTS, and the XTS people will (maybe?) be satisfied with > the complexity. For performance, it is just one more long jump compared with > the current AES-XTS code. I agree with Robert- using hooks for this really isn't realistic. Where would you store the tag for GCM without changes in core, for starters..? Certainly wouldn't make sense to provide GCM only to throw the tag away. Even between XTS and GCM, to say nothing of other possible methods, there are going to be some serious differences that a single hook-based API wouldn't be able to address. Thanks, Stephen
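For reference, this is the kind of per-page trailer that authenticated encryption would need somewhere to live, along the lines of the reserve-space-at-initdb idea upthread; the struct and macro are hypothetical, and none of these names exist in core:

    #include <stdint.h>

    /*
     * Hypothetical layout only: one way a per-page trailer for authenticated
     * encryption could look if space were reserved at initdb time.
     */
    typedef struct PageEncryptionTrailer
    {
        uint8_t     iv[12];         /* nonce fed to AES-GCM / GCM-SIV */
        uint8_t     tag[16];        /* authentication tag */
        uint32_t    reserved;       /* padding / future use */
    } PageEncryptionTrailer;

    #define PageGetEncryptionTrailer(page, blcksz) \
        ((PageEncryptionTrailer *) \
         ((char *) (page) + (blcksz) - sizeof(PageEncryptionTrailer)))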
On 2021/10/20 02:54, Stephen Frost wrote: > I agree with Robert- using hooks for this really isn't realistic. OK, I agree that using hooks is too invasive. Just a whim, never mind. But if PG had a clear block-based I/O API, TDE would be much easier to understand. Security people may lack database knowledge, but they can understand block I/O. This would allow more people to join the PG community. On 2021/10/20 02:54, Stephen Frost wrote: > Where would you store the tag for GCM without changes in core? Adding a 32-bit reserved field (for GCM it is 28 bits) would be best. The data file size would increase by 0.048% (if BLCKSZ = 8KiB), which I think is acceptable even for users who do not use TDE, but it needs an on-disk format change. To do GCM without modifying anything in core, the lower layer could write out a key fork and fsync(2) the key fork with the same strategy as the main fork; this is crash-safe, and the consistency is ensured by WAL (it means wal_log_hints needs to be set to on). Or the lower layer could restructure the I/O request and insert one IV block per 2730 (= BLCKSZ/IV_SIZE) data blocks, similar to how _mdfd_getseg() in md.c splits files at 1GiB, with no perception in the upper layers. Both approaches can use a cache to reduce the performance downgrade. For WAL encryption, the Cybertec implementation is correct: we cannot write any extra data without adding a reserved field in core, because consistency cannot be guaranteed. If GCM is used for WAL encryption, the HMAC verification must be disabled. (* This only shows the possibility; it does not mean anyone should implement TDE that way. *)
Greetings, * Sasasu (i@sasa.su) wrote: > But if PG had a clear block-based I/O API, TDE would be much easier to understand. PG does have a block-based IO API, it's just not exposed as hooks. In particular, take a look at md.c, though perhaps you'd be more interested in the higher level bufmgr.c routines. For the specific places where encryption may hook in, looking at the DataChecksumsEnabled() call sites may be informative (there aren't that many of them). > Security people may lack database knowledge, but they can understand block > I/O. This would allow more people to join the PG community. We'd certainly welcome them. I don't think we're going to try to redesign our entire IO subsystem in the hopes that they'll show up though. > On 2021/10/20 02:54, Stephen Frost wrote: > > Where would you store the tag for GCM without changes in core? > > Adding a 32-bit reserved field (for GCM it is 28 bits) would be best. That's the idea that's been discussed, but the approach put forward is to do it in a manner that allows the same binaries to work with a TDE-enabled cluster and a non-TDE cluster which means two different formats on disk. This is still a pretty big deal and would require logical replication or pg_dump/restore to go from unencrypted to encrypted. > The data file size would increase by 0.048% (if BLCKSZ = 8KiB), which I > think is acceptable even for users who do not use TDE, but it needs an > on-disk format change. Breaking our ondisk format explicitly means that pg_upgrade won't work any longer and folks won't be able to do in-place upgrades. That's a pretty huge deal and it's something we've not done in over a decade. I doubt that's going to fly. > To do GCM without modifying anything in core, the lower layer could write > out a key fork and fsync(2) the key fork with the same strategy as the main > fork; this is crash-safe, and the consistency is ensured by WAL (it means > wal_log_hints needs to be set to on). > Or the lower layer could restructure the I/O request and insert one IV block > per 2730 (= BLCKSZ/IV_SIZE) data blocks, similar to how _mdfd_getseg() in > md.c splits files at 1GiB, with no perception in the upper layers. > Both approaches can use a cache to reduce the performance downgrade. Yes, using another fork for this is something that's been considered but it's not without its own drawbacks, in particular having to do another write and later fsync when a page changes. Further, none of this is necessary for XTS, but only for GCM. This is why it was put forward that GCM involves a lot more changes to the system and means that we won't be able to do things like binary replication to switch from an unencrypted to encrypted cluster. Those are good reasons to consider an XTS implementation first and then later, perhaps, implement GCM. > For WAL encryption, the Cybertec implementation is correct: we cannot write > any extra data without adding a reserved field in core, because consistency > cannot be guaranteed. If GCM is used for WAL encryption, the HMAC > verification must be disabled. What's the point of using GCM if we aren't going to actually verify the tag? Also, the Cybertec patch didn't add an extra reserved field to the page format, and it used CTR anyway.. Thanks, Stephen
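To make the DataChecksumsEnabled()/checksum-call-site analogy concrete, here is a rough sketch of where an encryption call could slot into the block write path; it assumes the usual PostgreSQL headers, and data_encryption_enabled and encrypt_page() are stand-ins rather than real symbols:

    /*
     * Illustrative only: mirrors how PageSetChecksumCopy() is used on the
     * write path today (see bufmgr.c), with a hypothetical encryption step
     * added.  Assumes PostgreSQL's storage/bufpage.h et al.
     */
    static char *
    prepare_block_for_write(char *page, BlockNumber blocknum)
    {
        char   *towrite;

        /* checksum is computed over the plaintext copy, as today */
        towrite = PageSetChecksumCopy((Page) page, blocknum);

        if (data_encryption_enabled)    /* analogous to DataChecksumsEnabled() */
            towrite = encrypt_page(towrite, blocknum);  /* hypothetical */

        return towrite;
    }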
On 2021/10/20 20:24, Stephen Frost wrote: > PG does have a block-based IO API, it's just not exposed as hooks. In > particular, take a look at md.c, though perhaps you'd be more interested > in the higher level bufmgr.c routines. For the specific places where > encryption may hook in, looking at the DataChecksumsEnabled() call sites > may be informative (there aren't that many of them). md.c is great, easy to understand. but PG does not have a unified API. There has many unexpected pread()/pwrite() in many corners. md.c only for heap table, bufmgr.c only for a buffered heap table. eg: XLogWrite() looks like a block API, but is a range write. equivalent to the append(2) eg: ALTER DATABASE SET TABLESPACE , the movedb() call. use copy_file() on heap table. which is just pread() pwrite() with 8*BLCKSZ. eg: all front-end tools use pread() to read heap table. in particular, pg_rewind write heap table by offset. eg: in contrib, pg_standby use system("cp") to copy WAL. On 2021/10/20 20:24, Stephen Frost wrote: > Breaking our ondisk format explicitly means that pg_upgrade won't work > any longer and folks won't be able to do in-place upgrades. That's a > pretty huge deal and it's something we've not done in over a decade. > I doubt that's going to fly. I completely agree. On 2021/10/20 20:24, Stephen Frost wrote: > Yes, using another fork for this is something that's been considered but > it's not without its own drawbacks, in particular having to do another > write and later fsync when a page changes. > > Further, none of this is necessary for XTS, but only for GCM. This is > why it was put forward that GCM involves a lot more changes to the > system and means that we won't be able to do things like binary > replication to switch from an unencrypted to encrypted cluster. Those > are good reasons to consider an XTS implementation first and then later, > perhaps, implement GCM. same as Robert Haas. I wish PG can do some infrastructure first. add more abstract layers like md.c (maybe a block-based API with ondisk format version field). so people can dive in without understanding the things which isolated by the abstract layer. On 2021/10/20 20:24, Stephen Frost wrote: > What's the point of using GCM if we aren't going to actually verify the > tag? Also, the Cybertec patch didn't add an extra reserved field to the > page format, and it used CTR anyway.. Oh, I am wrong, Cybertec patch can not use XTS, because WAL may not be aligned to 16bytes. for WAL need a stream cipher. The CTR implement is still correct. CTR with hash(offset) as IV is basically equal to XTS. if use another AES key to encrypt the hash(offset), and block size is 16bytes it is XTS. The point is that can not save random IV for WAL without adding a reserved field, no matter use GCM or CTR. Because WAL only does append to the end, using CTR for WAL is safer than using XTS for heap table. If you want more security for WAL encryption, add HKDF[1]. [1]: https://en.wikipedia.org/wiki/HKDF
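For reference, here is a simplified sketch of what the XTS construction does per 16-byte block (following IEEE P1619), which is where it differs from a plain CTR construction with an encrypted offset as IV; no ciphertext stealing, no error handling, illustration only:

    #include <openssl/evp.h>
    #include <stddef.h>

    /*
     * Simplified sketch of the XTS block structure (IEEE P1619):
     *   T0 = AES_K2(tweak);  Cj = AES_K1(Pj XOR Tj) XOR Tj;  Tj+1 = Tj * alpha
     */
    static void
    aes_ecb_block(const unsigned char *key16, const unsigned char *in,
                  unsigned char *out)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int len;

        EVP_EncryptInit_ex(ctx, EVP_aes_128_ecb(), NULL, key16, NULL);
        EVP_CIPHER_CTX_set_padding(ctx, 0);
        EVP_EncryptUpdate(ctx, out, &len, in, 16);
        EVP_CIPHER_CTX_free(ctx);
    }

    /* multiply the running tweak by alpha in GF(2^128), P1619 convention */
    static void
    gf_mul_alpha(unsigned char t[16])
    {
        unsigned char carry = 0;

        for (int i = 0; i < 16; i++)
        {
            unsigned char msb = t[i] >> 7;

            t[i] = (unsigned char) ((t[i] << 1) | carry);
            carry = msb;
        }
        if (carry)
            t[0] ^= 0x87;
    }

    static void
    xts_encrypt_sketch(const unsigned char *key1, const unsigned char *key2,
                       const unsigned char tweak[16],
                       const unsigned char *plain, unsigned char *cipher,
                       size_t nblocks)
    {
        unsigned char t[16], x[16];

        aes_ecb_block(key2, tweak, t);            /* tweak encrypted with K2 */
        for (size_t j = 0; j < nblocks; j++)
        {
            for (int i = 0; i < 16; i++)
                x[i] = plain[j * 16 + i] ^ t[i];  /* XOR the tweak in before ... */
            aes_ecb_block(key1, x, x);
            for (int i = 0; i < 16; i++)
                cipher[j * 16 + i] = x[i] ^ t[i]; /* ... and after the block cipher */
            gf_mul_alpha(t);                      /* each block gets a different tweak */
        }
    }

The per-block tweak progression and the XOR applied before and after the block cipher are the parts a CTR construction does not have.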
Greetings, * Sasasu (i@sasa.su) wrote: > On 2021/10/20 20:24, Stephen Frost wrote: > > PG does have a block-based IO API, it's just not exposed as hooks. In > > particular, take a look at md.c, though perhaps you'd be more interested > > in the higher level bufmgr.c routines. For the specific places where > > encryption may hook in, looking at the DataChecksumsEnabled() call sites > > may be informative (there aren't that many of them). > > md.c is great, easy to understand. but PG does not have a unified API. There > has many unexpected pread()/pwrite() in many corners. md.c only for heap > table, bufmgr.c only for a buffered heap table. There's certainly other calls out there, yes. > eg: XLogWrite() looks like a block API, but is a range write. equivalent to > the append(2) XLog is certainly another thing that has to be dealt with, of course, but I don't see us trying to shoehorn that into using md.c somehow. > eg: ALTER DATABASE SET TABLESPACE , the movedb() call. use copy_file() on > heap table. which is just pread() pwrite() with 8*BLCKSZ. > eg: all front-end tools use pread() to read heap table. in particular, > pg_rewind write heap table by offset. > eg: in contrib, pg_standby use system("cp") to copy WAL. None of these are actually working with or changing the data though, they're just copying it. I don't think we'd actually want these to decrypt and reencrypt the data. > On 2021/10/20 20:24, Stephen Frost wrote: > > Breaking our ondisk format explicitly means that pg_upgrade won't work > > any longer and folks won't be able to do in-place upgrades. That's a > > pretty huge deal and it's something we've not done in over a decade. > > I doubt that's going to fly. > > I completely agree. Great. > On 2021/10/20 20:24, Stephen Frost wrote: > > Yes, using another fork for this is something that's been considered but > > it's not without its own drawbacks, in particular having to do another > > write and later fsync when a page changes. > > > > Further, none of this is necessary for XTS, but only for GCM. This is > > why it was put forward that GCM involves a lot more changes to the > > system and means that we won't be able to do things like binary > > replication to switch from an unencrypted to encrypted cluster. Those > > are good reasons to consider an XTS implementation first and then later, > > perhaps, implement GCM. > > same as Robert Haas. I wish PG can do some infrastructure first. add more > abstract layers like md.c (maybe a block-based API with ondisk format > version field). so people can dive in without understanding the things which > isolated by the abstract layer. I really don't think this is necessary. Similar to PageSetChecksumCopy and PageSetChecksumInplace, I'm sure we would have functions which are called in the appropriate spots to do encryption (such as 'encrypt_page' and 'encrypt_block' in the Cybertec patch) and folks could review those in relative isolation to the rest. Dealing with blocks in PG is already pretty well handled, the infrastructure that needs to be added is around handling temporary files and is being actively worked on ... if we could move past this debate around if we should be adding support for XTS or if only GCM-SIV would be accepted. > On 2021/10/20 20:24, Stephen Frost wrote: > > What's the point of using GCM if we aren't going to actually verify the > > tag? Also, the Cybertec patch didn't add an extra reserved field to the > > page format, and it used CTR anyway.. 
> > Oh, I am wrong, Cybertec patch can not use XTS, because WAL may not be > aligned to 16bytes. for WAL need a stream cipher. The CTR implement is still > correct. No, the CTR approach isn't great because, as has been discussed quite a bit already, using the LSN as the IV means that different data might be re-encrypted with the same LSN and that's not an acceptable thing to have happen with CTR. > CTR with hash(offset) as IV is basically equal to XTS. if use another AES > key to encrypt the hash(offset), and block size is 16bytes it is XTS. I don't understand why we're talking about CTR+other-stuff. Maybe if you use CTR and then do other things then it's equivilant in some fashion to XTS ... but then it's not CTR anymore and we shouldn't be calling it that. Saying that we should do "CTR+other stuff" (which happens to make it equivilant to XTS) instead of just saying we should use "XTS" is very confusing and further means that we're starting down the path of trying to come up with our own hack on existing encryption schemes, and regardless of who shows up on this list to claim that doing so makes sense and is "right", I'm going to be extremely skeptical. > The point is that can not save random IV for WAL without adding a reserved > field, no matter use GCM or CTR. Yes, it's correct that we can't use a random IV for the WAL without figuring out how to keep track of that random IV. Thankfully, for WAL (unlike heap and index blocks) we don't really have that issue- we hopefully aren't going to write different WAL records at the same LSN and so using the LSN there should be alright. There's some odd edge cases around this too (split-brain situations in particular where one of the systems does crash recovery and doesn't actually promote and therefore stays on the same timeline), but that kind of thing ends up breaking other things too (WAL archiving, as an example) and isn't really something we can support. Further, I figure we'll tell people to use different keys for different systems anyway to avoid this risk. > Because WAL only does append to the end, using CTR for WAL is safer than > using XTS for heap table. If you want more security for WAL encryption, add > HKDF[1]. I don't follow how you're making these comparisons as to what is "safer" for which when talking about two different encryption methodologies and two very different systems (the heap vs. WAL) or if you're actually suggesting something here. We've discussed at length how using CTR for heap isn't a good idea even if we're using the LSN for the IV, while if we use XTS then we don't have the issues that CTR has with IV re-use and using the LSN (plus block number and perhaps other things). Nothing in what has been discussed here has really changed anything there that I can see and so it's unclear to me why we continue to go round and round with it. Thanks, Stephen
On 2021/10/22 01:28, Stephen Frost wrote: > None of these are actually working with or changing the data though, > they're just copying it. I don't think we'd actually want these to > decrypt and reencrypt the data. OK, but why ALTER TABLE SET TABLESPACE use smgrread() and smgrextend() instead of copy_file(). TDE needs to modify these code paths, and make the patch bigger. On 2021/10/22 01:28, Stephen Frost wrote: > No, the CTR approach isn't great because, as has been discussed quite a > bit already, using the LSN as the IV means that different data might be > re-encrypted with the same LSN and that's not an acceptable thing to > have happen with CTR. On 2021/10/22 01:28, Stephen Frost wrote: > Thankfully, for WAL > (unlike heap and index blocks) we don't really have that issue- we > hopefully aren't going to write different WAL records at the same LSN > and so using the LSN there should be alright. On 2021/10/22 01:28, Stephen Frost wrote: > We've discussed at length how using CTR for heap isn't a good idea even > if we're using the LSN for the IV, while if we use XTS then we don't > have the issues that CTR has with IV re-use and using the LSN (plus > block number and perhaps other things). Nothing in what has been > discussed here has really changed anything there that I can see and so > it's unclear to me why we continue to go round and round with it. I am not re-discuss using CTR for heap table. I mean use some CTR-like algorithm *only* for WAL encryption. My idea is exactly the same when you are typing "we hopefully aren't going to write different WAL records at the same LSN and so using the LSN there should be alright." The point of disagreement between you and me is only on the block-based API. On 2021/10/22 01:28, Stephen Frost wrote: > it's unclear to me why we continue to go round and round with it. same to me. I am monitoring this thread about 9 months, watching yours discuss key management/CBC/CRT/GCM round and round.
Greetings, * Sasasu (i@sasa.su) wrote: > On 2021/10/22 01:28, Stephen Frost wrote: > >None of these are actually working with or changing the data though, > >they're just copying it. I don't think we'd actually want these to > >decrypt and reencrypt the data. > > OK, but why ALTER TABLE SET TABLESPACE use smgrread() and smgrextend() > instead of copy_file(). > TDE needs to modify these code paths, and make the patch bigger. Tables and databases are handled differently, yes. With ALTER TABLE SET TABLESPACE, we're allocating a new refilenode and WAL'ing the table as FPIs. What happens with databases is fundamentally different- no one is allowed to be connected to the database being moved and we write a single 'database changed tablespace' record in the WAL for this case. When it comes to TDE, this probably is actually helpful as we're going to likely want the relfilenode to be included as part of the IV. > On 2021/10/22 01:28, Stephen Frost wrote: > > No, the CTR approach isn't great because, as has been discussed quite a > > bit already, using the LSN as the IV means that different data might be > > re-encrypted with the same LSN and that's not an acceptable thing to > > have happen with CTR. > On 2021/10/22 01:28, Stephen Frost wrote: > > Thankfully, for WAL > > (unlike heap and index blocks) we don't really have that issue- we > > hopefully aren't going to write different WAL records at the same LSN > > and so using the LSN there should be alright. > On 2021/10/22 01:28, Stephen Frost wrote: > > We've discussed at length how using CTR for heap isn't a good idea even > > if we're using the LSN for the IV, while if we use XTS then we don't > > have the issues that CTR has with IV re-use and using the LSN (plus > > block number and perhaps other things). Nothing in what has been > > discussed here has really changed anything there that I can see and so > > it's unclear to me why we continue to go round and round with it. > > I am not re-discuss using CTR for heap table. I mean use some CTR-like > algorithm *only* for WAL encryption. My idea is exactly the same when you > are typing "we hopefully aren't going to write different WAL records at the > same LSN and so using the LSN there should be alright." I don't like the idea of "CTR-like". What's wrong with using CTR for WAL encryption? Based on the available information, that seems like the exact use-case for CTR. > The point of disagreement between you and me is only on the block-based API. I'm glad to hear that, at least. Thanks, Stephen
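A sketch of what CTR for WAL with an LSN-derived IV might look like; the IV layout (timeline plus starting LSN, with the low bytes left free for the CTR block counter) and the chunk-at-a-time interface are assumptions for illustration only:

    #include <openssl/evp.h>
    #include <stdint.h>
    #include <string.h>

    /*
     * Sketch of CTR for WAL, assuming the start LSN of the chunk never
     * repeats under a given key.  Illustration only.
     */
    static int
    encrypt_wal_chunk(const unsigned char *key,          /* 32 bytes */
                      uint32_t timeline, uint64_t start_lsn,
                      const unsigned char *plain, unsigned char *cipher,
                      size_t len)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        unsigned char   iv[16] = {0};
        int             outlen, ok;

        memcpy(iv, &timeline, 4);
        memcpy(iv + 4, &start_lsn, 8);
        /* low 4 bytes left zero: consumed as the CTR block counter */

        ok = ctx != NULL &&
            EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv) &&
            EVP_EncryptUpdate(ctx, cipher, &outlen, plain, (int) len) &&
            EVP_EncryptFinal_ex(ctx, cipher + outlen, &outlen);

        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }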
On Mon, Oct 18, 2021 at 11:56:03AM -0400, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > > On 10/15/21 21:22, Stephen Frost wrote: > > >Now, to address the concern around re-encrypting a block with the same > > >key+IV but different data and leaking what parts of the page changed, I > > >do think we should use the LSN and have it change regularly (including > > >unlogged tables) but that's just because it's relatively easy for us to > > >do and means an attacker wouldn't be able to tell what part of the page > > >changed when the LSN was also changed. That was also recommended by > > >NIST and that's a pretty strong endorsement also. > > > > Not sure - it seems a bit weird to force LSN change even in cases that don't > > generate any WAL. I was not following the encryption thread and maybe it was > > discussed/rejected there, but I've always imagined we'd have a global nonce > > generator (similar to a sequence) and we'd store it at the end of each > > block, or something like that. > > The 'LSN' being referred to here isn't the regular LSN that is > associated with the WAL but rather the separate FakeLSN counter which we > already have. I wasn't suggesting having the regular LSN change in > cases that don't generate WAL. Yes, my original patch created dummy WAL records for dummy LSNs but that is no longer needed with XTS. > > I'm not very convinced that using the LSN for any of this is a good > > idea. Something that changes most of the time but not all the time > > seems more like it could hurt by masking fuzzy thinking more than it > > helps anything. > > This argument doesn't come across as very strong at all to me, > particularly when we have explicit recommendations from NIST that having > the IV vary more is beneficial. While this would be using the LSN, the > fact that the LSN changes most of the time but not all of the time isn't > new and is something we already have to deal with. I'd think we'd > address the concern about mis-thinking around how this works by > providing a README and/or an appropriate set of comments around what's > being done and why. Agreed. I think we would need to document when we reencrypt a page with the same LSN, and of course write-based attacks. > > Do we think knowing which 16-byte blocks on an 8k page change would leak > > useful information? If so, we should use the LSN and just accept that > > some cases might leak as described above. If we don't care, then we can > > skip the use of the LSN and simplify the patch. > > While there may not be an active attack against PG that leverages such a > leak, I have a hard time seeing why we would intentionally design this > in when we have a great option that's directly available to us and > doesn't cause such a leak with nearly such regularity as not using the > LSN would, and also follows recommendations of using XTS from NIST. Agreed. > > I consider this a checkbox feature and making it too complex will cause > > it to be rightly rejected. > > Presuming that 'checkbox feature' here means "we need it to please > $someone but no one will ever use it" or something along those lines, > this is very clearly not the case and therefore we shouldn't be > describing it or treating it as such. Even if the meaning here is > "there's other ways people could get this capability" the reality is > that those other methods are simply not always available and in those > cases, people will choose to not use PostgreSQL. 
Nearly every other > database system which we might compare ourselves to has a solution in > this area and people actively use those solutions in a lot of > deployments. I think people will use this feature, but I called it a 'checkbox feature' because they usually are not looking for a complex or flexible feature, but rather something that is simple to set up and effective. > > And if PostgreSQL is using XTS, there is no different with dm-encrypt. > > The user can use dm-encrypt directly. > > dm-encrypt is not always an option and it doesn't actually address the > threat-model that Tomas brought up here anyway, as it would be below the > level that the low-privileged OS user would be looking at. That's not > the only threat model to consider, but it is one which could potentially > be addressed by either XTS or AES-GCM-SIV. There are threat models > which dm-crypt would address, of course, such as data-at-rest (hard > drive theft, improper disposal of storage media, backups which don't > have their own encryption, etc), but, again, dm-crypt isn't always an > option that is available and so I don't agree that we should throw this > out just because dm-crypt exists and may be useable in some cases. I actually think a Postgres integrity-check feature would need to create an abstraction layer on top of all writes to PGDATA and tablespaces so the filesystem would look unencrypted to Postgres. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, Oct 19, 2021 at 02:44:26PM -0400, Stephen Frost wrote: > There are ways around it. There likely always will be. We need to be > clear about what it provides and what it doesn't. We need to stop > telling ourselves that the only answer is a 100% solution and therefore > it's impossible to do. Users who care about these capabilities will > understand that it's not 100% and they will still happily use it because > it's better than 0% which is where we are today and is why they are > going with other solutions. Yes, if it's trivial to get around then > perhaps it's not much better than 0% and if that's the case then it > doesn't make sense to do it, but none of what has been discussed here > thus far has made me feel like either the XTS or the GCM-SIV approaches > would be trivial to circumvent for the threat models they're intended > to address, though it certainly takes more care and more thought when > we're trying to address someone who has write access to part of the > system and that we need to be clear what is addressed and what isn't in > all of these cases. Stephen, your emails on this thread have been very helpful and on-topic. I think the distinction above is that it is useful to fully protect against some attack types, even if we don't protect against all attack types. For example, if we protect 100% against read attacks, it doesn't mean that gets reduced to 50% because we don't protect against write attacks --- we are still 100% read-protected and 0% write-protected. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Mon, Oct 18, 2021 at 12:37:39PM -0400, Robert Haas wrote: > > Thus, to me, it doesn't seem worth going down the XTS route, just to > > temporarily save a bit of implementation effort. We'll have to endure that > > pain anyway. > > I agree up to a point, but I do also kind of feel like we should be > leaving it up to whoever is working on an implementation to decide > what they want to implement. I don't particularly like this discussion Uh, our TODO has this list: Desirability -> Design -> Implement -> Test -> Review -> Commit I think we have to agree on Desirability and Design before anyone starts work since it is more likely a patch will be rejected without this. > where it feels like people are trying to tell other people what they > have to do because "the community has decided X." It's pretty clear > that there are multiple opinions here, and I don't really see any of > them to be without merit, nor do I see why Bruce or Stephen or you or > anyone else should get to say "what the community has decided" in the > absence of a clear consensus. I don't see anyone saying we have agreed on anything, but I do hear people say they are willing to work on some things, and not others. > I do really like the idea of using AES-GCM-SIV not because I know > anything about it, but because the integrity checking seems cool, and > storing the nonce seems like it would improve security. However, based > on what I know now, I would not vote to reject an XTS-based patch and, > as Stephen and Bruce have said, maybe with the right design it permits > upgrades from non-encrypted clusters to encrypted clusters. I'm > actually kind of doubtful about that, because that seems to require > some pretty specific and somewhat painful implementation decisions. > For example, I think if your solution for rotating keys is to shut > down the standby, re-encrypt it with a new key, start it up again, and > fail over to it, then you probably can't ever do key rotation in any > other way. The keys now have to be non-WAL-logged so that the standby > can be different, which means you can't add a new key on the master > and run around re-encrypting everything with it, WAL-logging those > changes as you go. Now I realize that implementing that is really > challenging anyway so maybe some people wouldn't like to go that way, > but then maybe other people would. Another thing you probably can't do > in this model is encrypt different parts of the database with > different keys, because how would you keep track of that? Certainly > not in the system catalogs, if none of that can show up in the WAL > stream. The design is to have a heap/index key and a WAL key. You create a binary replica that uses a different heap/index key but the same WAL key, switch over to it, and then change the WAL key. > But, you know, still: if somebody showed up with a fully-working XTS > patch with everything in good working order, I don't see that we have > enough evidence to reject it just because it's XTS. And I would hope > that the people favoring XTS would not vote to reject a fully working > GCM patch just because it's GCM. I think what we ought to be doing at I don't think that would happen, but I do think patch size, code maintenance, and upgradability would be reasonable considerations.
> this point is combining our efforts to try to get some things > committed which make future work in this area easier - like that > patch to preserve relfilenode and database OID, or maybe some patches > to drive all of our I/O through a smaller number of code paths instead > of having every different type of temporary file we write reinvent the > wheel. These discussions about what encryption type we ought to use > are useful for ruling out options that we know are bad, but beyond > that I'm not sure they have much value. AES in any mode could seem > like a much less safe choice by the time we get a committed feature > here than it does today - even if somehow that were to happen for v15. > I expect there are people out there trying to break it even as I write > these words, and it seems likely that they will eventually succeed, > but as to when, who can say? Yes, we should start on things we know we need, but we will have to have these discussions until we have a desirability and design that most people agree on. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Fri, Oct 22, 2021 at 11:36:37AM -0400, Stephen Frost wrote: > > I am not re-discuss using CTR for heap table. I mean use some CTR-like > > algorithm *only* for WAL encryption. My idea is exactly the same when you > > are typing "we hopefully aren't going to write different WAL records at the > > same LSN and so using the LSN there should be alright." > > I don't like the idea of "CTR-like". What's wrong with using CTR for > WAL encryption? Based on the available information, that seems like the > exact use-case for CTR. Agreed. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
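As a sketch of why WAL looks like the natural CTR case, assuming OpenSSL's EVP interface and an IV derived from the WAL location as discussed above (the IV layout and function names below are illustrative only; a real design would also have to decide how to handle things like timelines):

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

/*
 * Illustrative IV layout: the chunk's starting LSN in the high-order bytes,
 * the low-order bytes left zero for the per-16-byte counter that CTR mode
 * increments internally.  On the premise discussed above - a given WAL
 * location is only ever written once - the same (key, IV) pair is never
 * reused for different data.
 */
static void
build_wal_iv(unsigned char iv[16], uint64_t start_lsn)
{
    memset(iv, 0, 16);
    memcpy(iv, &start_lsn, sizeof(start_lsn));
}

static int
encrypt_wal_chunk(unsigned char *out, const unsigned char *in, int len,
                  const unsigned char key[32], uint64_t start_lsn)
{
    unsigned char iv[16];
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int         outlen = 0, ok = 0;

    build_wal_iv(iv, start_lsn);
    if (ctx != NULL &&
        EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv) == 1 &&
        EVP_EncryptUpdate(ctx, out, &outlen, in, len) == 1)
        ok = 1;

    EVP_CIPHER_CTX_free(ctx);
    return ok;
}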
On Tue, Oct 19, 2021 at 02:44:26PM -0400, Stephen Frost wrote: > Another threat model to consider is if the attacker has read-only access > to the data directory through, say, unix group read privileges or maybe > the ability to monitor the traffic on the SAN, or the ability to > read-only mount the LUN on to another system. This might be obtained by > attacking a backup process where the system was configured to run > physical backups using an unprivileged OS user who only has group read > access to the cluster (and the necessary but non-superuser privileges in > the database system to start/stop the backup), or various potential > attacks at the storage layer. This is similar to the "data at rest" > case above in that XTS works well to address this, but because the > attacker would have ongoing access (rather than just one-time, such as > in the first case), information such as which blocks are being changed > inside of a given 8k page might be able to be determined and that could > be useful information, though a point here: they would already be able > to see clearly which 8k pages are being changed and which aren't, and > there's not really any way for us to prevent that reasonably. As such, > I'd argue that using XTS is reasonable and we can mitigate some of this > concern by using the LSN in the tweak instead of just the block number > as the 'plain64' option in dmcrypt does. That doing so would mean that That is an excellent point, and something we should mention in our documentation --- the fact that changes will be visible at 8k granularity, and in certain specified cases, at 16-byte granularity as well. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Mon, Oct 18, 2021 at 12:37:39PM -0400, Robert Haas wrote: > I do really like the idea of using AES-GCM-SIV not because I know > anything about it, but because the integrity checking seems cool, and ---------- > storing the nonce seems like it would improve security. However, based Frankly, I think we need to be cautious about doing anything related to security for "cool" motivations. (This might be how OpenSSL became such a mess.) For non-security features, you can often add a few lines of code to enable some cool use-case. For security features, you have to block its targeted attack methods fully or it is useless. (It doesn't need to block all attack methods.) To fully block attack methods, security features must be thoroughly designed and all potential interactions must be researched. When adding non-security Postgres features, cool features can be more easily implemented because they are built on the solid foundation of Postgres. For security features, you have to assume that attacks can come from anywhere, so the foundation is unclear and caution is wise. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, Oct 19, 2021 at 02:54:56PM -0400, Stephen Frost wrote: > * Sasasu (i@sasa.su) wrote: > > A unified block-based I/O API sounds great. Has anyone tried to do this > > before? It would be nice if the front-end tools could also use these API. > > The TDE patch from Cybertec did go down this route, but the API ended up > being rather different which menat a lot of changes in other parts of > the system. If we can get a block-based temporary file method that > maintains more-or-less the same API, that'd be great, but I'm not sure > that we can really do so and I am not entirely convinced that we should > make the TDE effort depend on an otherwise quite independent effort of > making all temp files usage be block based. Uh, I thought people felt the Cybertec patch was too large and that a unified API for temporary file I/O-encryption was a requirement. Would a CTR-streaming-encryption API for temporary tables be easier to implement? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Thu, 21/10/2021 at 13:28 -0400, Stephen Frost wrote: > Greetings, > > I really don't think this is necessary. Similar to PageSetChecksumCopy > and PageSetChecksumInplace, I'm sure we would have functions which are > called in the appropriate spots to do encryption (such as 'encrypt_page' > and 'encrypt_block' in the Cybertec patch) and folks could review those > in relative isolation to the rest. Dealing with blocks in PG is already > pretty well handled, the infrastructure that needs to be added is around > handling temporary files and is being actively worked on ... if we could > move past this debate around if we should be adding support for XTS or > if only GCM-SIV would be accepted. > > ..... > > No, the CTR approach isn't great because, as has been discussed quite a > bit already, using the LSN as the IV means that different data might be > re-encrypted with the same LSN and that's not an acceptable thing to > have happen with CTR. > > ..... > > We've discussed at length how using CTR for heap isn't a good idea even > if we're using the LSN for the IV, while if we use XTS then we don't > have the issues that CTR has with IV re-use and using the LSN (plus > block number and perhaps other things). Nothing in what has been > discussed here has really changed anything there that I can see and so > it's unclear to me why we continue to go round and round with it. > Instead of debating XTS vs GCM-SIV I'd suggest Google's Adiantum [1][2][3][4]. It was explicitly created to solve the large-block encryption issue - disk encryption. It is used to encrypt 4kb as a whole, but in fact has no (practical) limit on block size: it needs near-zero modification to encrypt 1kb, 8kb, or 32kb. It has the benefits of both XTS and GCM-SIV: - like GCM-SIV, every bit of cipher text depends on every bit of plain text - therefore, like GCM-SIV, it is resistant to IV reuse: it is safe to reuse the LSN+reloid+blocknumber tuple as the IV even for hint-bit changes, since every bit of the block will change. - like XTS, it doesn't need to change the plain text format and doesn't need an additional Nonce/Auth Code. Adiantum stands on "giants' shoulders": AES, Chacha and Poly1305. It has been included in the Linux kernel since 5.0 [5]. The Adiantum/HPolyC approach (hash + cipher + stream-cipher + hash) could be used with other primitives as well. For example, Chacha12 could be replaced with AES-GCM or AES-XTS, with the IV derived from the hash + cipher. [1] https://security.googleblog.com/2019/02/introducing-adiantum-encryption-for.html [2] https://en.wikipedia.org/wiki/Adiantum_(cipher) [3] https://tosc.iacr.org/index.php/ToSC/article/view/7360 [4] https://github.com/google/adiantum [5] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=059c2a4d8e164dccc3078e49e7f286023b019a98 ------- regards Yura Sokolov y.sokolov@postgrespro.ru funny.falcon@gmail.com
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, Oct 19, 2021 at 02:54:56PM -0400, Stephen Frost wrote: > > * Sasasu (i@sasa.su) wrote: > > > A unified block-based I/O API sounds great. Has anyone tried to do this > > > before? It would be nice if the front-end tools could also use these API. > > > > The TDE patch from Cybertec did go down this route, but the API ended up > > being rather different which menat a lot of changes in other parts of > > the system. If we can get a block-based temporary file method that > > maintains more-or-less the same API, that'd be great, but I'm not sure > > that we can really do so and I am not entirely convinced that we should > > make the TDE effort depend on an otherwise quite independent effort of > > making all temp files usage be block based. > > Uh, I thought people felt the Cybertec patch was too large and that a > unified API for temporary file I/O-encryption was a requirement. Would > a CTR-steaming-encryption API for temporary tables be easier to > implement? Having a unified API for temporary file I/O (that could then be extended to provide encryption) would definitely help with reducing the size of a TDE patch. The approach used in the Cybertec patch was to make temporary file access block based, but the way that was implemented was with an API different from pread/pwrite and that meant changing pretty much all of the call sites for temporary file access, which naturally resulted in changes in a lot of otherwise unrelated code. There was an argument put forth that a block-based API for temporary file access would generally be good as it would mean fewer syscalls. If we can get behind that and make it happen in (relatively) short order then we'd certainly be better off when it comes to implementing TDE which also deals with temporary files. I'm a bit concerned about that approach due to the changes needed but I'm also not against it. I do think that an API which was more-or-less the same as what's used today would be a smaller change and therefore might be easier to get in and that it'd also make a TDE patch smaller. Perhaps both could be accomplished (an API that's similar to today, but the actual file access being block-based). Either way, we should get that unification done in the core code first as an independent effort. That's what I hope the Cybertec folks have had a chance to work on. As for the specific encryption method to use, using CTR would be simpler as it doesn't require access to be block-based, though we would need to make sure to not re-use the IV across any of the temporary files being created (potentially concurrently). Probably not that hard to do but just something to make sure we do. Of course, if we arrange for block-based access then we could use XTS or perhaps GCM/GCM-SIV if we wanted to. Thanks, Stephen
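One simple way to get the "don't reuse the IV across (possibly concurrent) temporary files" property mentioned above, sketched with made-up names; a real implementation would draw the nonce from a shared atomic counter rather than a static variable:

#include <stdint.h>
#include <string.h>

/* Hypothetical per-temporary-file crypto state. */
typedef struct TempFileCrypto
{
    uint64_t    file_nonce;     /* unique per temporary file */
} TempFileCrypto;

/* In a real backend this would be a counter in shared memory. */
static uint64_t next_temp_file_nonce = 1;

static void
temp_file_crypto_init(TempFileCrypto *tfc)
{
    tfc->file_nonce = next_temp_file_nonce++;
}

/*
 * CTR IV layout: bytes 0..7 hold the per-file nonce, bytes 8..11 the block
 * number within the file, and bytes 12..15 stay zero for the counter CTR
 * increments per 16 bytes of keystream.  Two different files, or two
 * different blocks of the same file, therefore never share an IV.
 */
static void
build_temp_file_iv(unsigned char iv[16], const TempFileCrypto *tfc,
                   uint32_t blockno)
{
    memset(iv, 0, 16);
    memcpy(iv + 0, &tfc->file_nonce, sizeof(tfc->file_nonce));
    memcpy(iv + 8, &blockno, sizeof(blockno));
}

If temporary file access does end up block-based, the same per-file-nonce-plus-block-number value could just as easily serve as an XTS tweak instead of a CTR IV.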
Greetings, * Yura Sokolov (y.sokolov@postgrespro.ru) wrote: > В Чт, 21/10/2021 в 13:28 -0400, Stephen Frost пишет: > > I really don't think this is necessary. Similar to PageSetChecksumCopy > > and PageSetChecksumInplace, I'm sure we would have functions which are > > called in the appropriate spots to do encryption (such as 'encrypt_page' > > and 'encrypt_block' in the Cybertec patch) and folks could review those > > in relative isolation to the rest. Dealing with blocks in PG is already > > pretty well handled, the infrastructure that needs to be added is around > > handling temporary files and is being actively worked on ... if we could > > move past this debate around if we should be adding support for XTS or > > if only GCM-SIV would be accepted. > > > > ..... > > > > No, the CTR approach isn't great because, as has been discussed quite a > > bit already, using the LSN as the IV means that different data might be > > re-encrypted with the same LSN and that's not an acceptable thing to > > have happen with CTR. > > > > ..... > > > > We've discussed at length how using CTR for heap isn't a good idea even > > if we're using the LSN for the IV, while if we use XTS then we don't > > have the issues that CTR has with IV re-use and using the LSN (plus > > block number and perhaps other things). Nothing in what has been > > discussed here has really changed anything there that I can see and so > > it's unclear to me why we continue to go round and round with it. > > > > Instead of debatting XTS vs GCM-SIV I'd suggest Google's Adiantum [1][2] > [3][4]. That sounds like a great thing to think about adding ... after we get something in that's based on XTS. > It is explicitely created to solve large block encryption issue - disk > encryption. It is used to encrypt 4kb at whole, but in fact has no > (practical) limit on block size: it is near-zero modified to encrypt 1kb > or 8kb or 32kb. > > It has benefits of both XTS and GCM-SIV: > - like GCM-SIV every bit of cipher text depends on every bit of plain text > - therefore like GCM-SIV it is resistant to IV reuse: it is safe to reuse > LSN+reloid+blocknumber tuple as IV even for hint-bit changes since every > block's bit will change. The advantage of GCM-SIV is that it provides integrity as well as confidentiality. > - like XTS it doesn't need to change plain text format and doesn't need in > additional Nonce/Auth Code. Sure, in which case it's something that could potentially be added later as another option in the future. I don't think we'll always have just one encryption method and it's good to generally think about what it might look like to have others but I don't think it makes sense to try and get everything in all at once. Thanks, Stephen
On Mon, Oct 25, 2021 at 11:58:14AM -0400, Stephen Frost wrote: > As for the specific encryption method to use, using CTR would be simpler > as it doesn't require access to be block-based, though we would need to > make sure to not re-use the IV across any of the temporary files being > created (potentially concurrently). Probably not that hard to do but > just something to make sure we do. Of course, if we arrange for > block-based access then we could use XTS or perhaps GCM/GCM-SIV if we > wanted to. Agreed on all points. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Mon, 25/10/2021 at 12:12 -0400, Stephen Frost wrote: > Greetings, > > * Yura Sokolov (y.sokolov@postgrespro.ru) wrote: > > В Чт, 21/10/2021 в 13:28 -0400, Stephen Frost пишет: > > > I really don't think this is necessary. Similar to PageSetChecksumCopy > > > and PageSetChecksumInplace, I'm sure we would have functions which are > > > called in the appropriate spots to do encryption (such as 'encrypt_page' > > > and 'encrypt_block' in the Cybertec patch) and folks could review those > > > in relative isolation to the rest. Dealing with blocks in PG is already > > > pretty well handled, the infrastructure that needs to be added is around > > > handling temporary files and is being actively worked on ... if we could > > > move past this debate around if we should be adding support for XTS or > > > if only GCM-SIV would be accepted. > > > > > > ..... > > > > > > No, the CTR approach isn't great because, as has been discussed quite a > > > bit already, using the LSN as the IV means that different data might be > > > re-encrypted with the same LSN and that's not an acceptable thing to > > > have happen with CTR. > > > > > > ..... > > > > > > We've discussed at length how using CTR for heap isn't a good idea even > > > if we're using the LSN for the IV, while if we use XTS then we don't > > > have the issues that CTR has with IV re-use and using the LSN (plus > > > block number and perhaps other things). Nothing in what has been > > > discussed here has really changed anything there that I can see and so > > > it's unclear to me why we continue to go round and round with it. > > > > > > > Instead of debatting XTS vs GCM-SIV I'd suggest Google's Adiantum [1][2] > > [3][4]. > > That sounds like a great thing to think about adding ... after we get > something in that's based on XTS. Why? I see no point in doing it after. Why not XTS after Adiantum? OK, I see one reason: XTS is standardized. > > It is explicitely created to solve large block encryption issue - disk > > encryption. It is used to encrypt 4kb at whole, but in fact has no > > (practical) limit on block size: it is near-zero modified to encrypt 1kb > > or 8kb or 32kb. > > > > It has benefits of both XTS and GCM-SIV: > > - like GCM-SIV every bit of cipher text depends on every bit of plain text > > - therefore like GCM-SIV it is resistant to IV reuse: it is safe to reuse > > LSN+reloid+blocknumber tuple as IV even for hint-bit changes since every > > block's bit will change. > > The advantage of GCM-SIV is that it provides integrity as well as > confidentiality. Integrity could be based on a simple non-cryptographic checksum, and it could be checked after decryption. It would be impossible to intentionally change an encrypted page in a way that will pass the checksum after decryption. Currently we have a 16-bit checksum, and it is very small. But having a larger checksum is orthogonal to having encryption (one doesn't constrain the other). In fact, Adiantum is easily made close to an SIV construction: just leave the last 8/16 bytes zero. If they are zero after decryption, the integrity check passes. That is because SIV and Adiantum are very similar in their structure:
- SIV:
-- hash
-- then stream cipher
- Adiantum:
-- hash (except the last 16 bytes)
-- then encrypt the last 16 bytes with the hash,
-- then stream cipher
-- then hash.
If the last N (N>16) bytes are a nonce + zero bytes, then "hash, then encrypt the last 16 bytes with the hash" becomes equivalent to just "hash", and Adiantum becomes a logical equivalent of SIV. > > - like XTS it doesn't need to change plain text format and doesn't need in > > additional Nonce/Auth Code. > > Sure, in which case it's something that could potentially be added later > as another option in the future. I don't think we'll always have just > one encryption method and it's good to generally think about what it > might look like to have others but I don't think it makes sense to try > and get everything in all at once. And among the others Adiantum looks best: it is fast even without hardware acceleration, it provides whole-block encryption (i.e. every bit depends on every bit) and it isn't bound to the plain-text format. > Thanks, > > Stephen regards, Yura PS. The construction behind SIV and Adiantum could be used with XTS as well. I.e. instead of AES-GCM-SIV it could be AES-XTS-SIV. And in the same way, AES-XTS could be used instead of Chacha12 in Adiantum.
On 2021/10/26 04:32, Yura Sokolov wrote: > And among others Adiantum looks best: it is fast even without hardware > acceleration, No, AES is fast on modern high-end hardware.

On x86 (AMD 3700X):
type               1024 bytes    8192 bytes    16384 bytes
aes-128-ctr        8963982.50k   11124613.88k  11509149.42k
aes-128-gcm        3978860.44k   4669417.10k   4732070.64k
aes-128-xts        7776628.39k   9073664.63k   9264617.74k
chacha20-poly1305  2043729.73k   2131296.36k   2141002.10k

On ARM (RK3399, a middle-end Cortex-A53 with hardware AES):
type               1024 bytes    8192 bytes    16384 bytes
aes-128-ctr        1663857.66k   1860930.22k   1872991.57k
aes-128-xts        685086.38k    712906.07k    716073.64k
aes-128-gcm        985578.84k    1054818.30k   1056768.00k
chacha20-poly1305  309012.82k    318889.98k    319711.91k

I think the baseline is the speed of the read(2) syscall on /dev/zero (which is 3.6GiB/s on x86, and 980MiB/s on ARM). ChaCha is fast on low-end ARM, but I haven't seen any HTTPS sites using ChaCha, including Cloudflare and Google. On 2021/10/26 04:32, Yura Sokolov wrote: >> That sounds like a great thing to think about adding ... after we get >> something in that's based on XTS. > Why? I see no points to do it after. Why not XTS after Adiantum? > > Ok, I see one: XTS is standartized. :> PostgreSQL hasn't even discussed single-table key rotation or a remote KMS yet. I think it's too hard to use an encryption algorithm which OpenSSL doesn't implement.
On Tue, 26/10/2021 at 11:08 +0800, Sasasu wrote: > On 2021/10/26 04:32, Yura Sokolov wrote: > > And among others Adiantum looks best: it is fast even without hardware > > acceleration, > > No, AES is fast on modern high-end hardware. > > on X86 AMD 3700X > type 1024 bytes 8192 bytes 16384 bytes > aes-128-ctr 8963982.50k 11124613.88k 11509149.42k > aes-128-gcm 3978860.44k 4669417.10k 4732070.64k > aes-128-xts 7776628.39k 9073664.63k 9264617.74k > chacha20-poly1305 2043729.73k 2131296.36k 2141002.10k > > on ARM RK3399, A53 middle-end with AES-NI > type 1024 bytes 8192 bytes 16384 bytes > aes-128-ctr 1663857.66k 1860930.22k 1872991.57k > aes-128-xts 685086.38k 712906.07k 716073.64k > aes-128-gcm 985578.84k 1054818.30k 1056768.00k > chacha20-poly1305 309012.82k 318889.98k 319711.91k > > I think the baseline is the speed when using read(2) syscall on > /dev/zero (which is 3.6GiB/s, on ARM is 980MiB/s) > chacha is fast on the low-end arm, but I haven't seen any HTTPS sites > using chacha, including Cloudflare and Google. 1. Chacha20-poly1305 includes an authentication code (Poly1305), and aes-gcm includes one as well (GCM). But aes-128-(ctr,xts) doesn't. Therefore, plain ChaCha should be compared with ctr/xts, not ChaCha-Poly1305. 2. ChaCha20 has a security margin of x2.8: only 7 rounds out of 20 are broken. AES-128 has a security margin of x1.4: 7 rounds out of 10 are broken. That is why Adiantum uses Chacha12: it is still "more secure" than AES-128. Yes, AES with AES-NI is fastest, but not by that much. And AES-CTR could easily be used instead of ChaCha12 in Adiantum. Adiantum uses ChaCha12 as a stream cipher, and any other stream cipher will be ok as well with minor modifications to the algorithm. > > On 2021/10/26 04:32, Yura Sokolov wrote: > >> That sounds like a great thing to think about adding ... after we get > >> something in that's based on XTS. > > Why? I see no points to do it after. Why not XTS after Adiantum? > > > > Ok, I see one: XTS is standartized. > :> > PostgreSQL even not discuss single-table key rotation or remote KMS. > I think it's too hard to use an encryption algorithm which openssl > doesn't implement. That is a fair argument. But, again, OpenSSL could be used for the primitives: AES + AES-CTR + Poly/GCM. And an Adiantum-like construction could be composed from them quite easily.
On 2021/10/26 17:33, Yura Sokolov wrote: > Yes, AES with AES-NI is fastest. But not so much. > > And, AES-CTR could be easily used instead of ChaCha12 in Adiantum. > Adiantum uses ChaCha12 as a stream cipher, and any other stream cipher will > be ok as well with minor modifications to algorithm. "Not so much" is roughly half the speed. I prefer to use AES on all devices because AES is faster and more power efficient; ChaCha only helps on low-end ARM devices, and most people running a complex program like this do not have such a device. I reserve my opinion on this point, but I agree with you on the rest. And I also agree that we should be able to add more algorithms. The current implementation does not have any reserved fields, which makes any upgrade, like adding a new algorithm, unfeasible. On 2021/10/26 17:33, Yura Sokolov wrote: > That is argument. But, again, openssl could be used for primitives: > AES + AES-CTR + Poly/GCM. And Adiantum like construction could be > composed from them quite easily. Implementing Adiantum is a small problem (although it doesn't look good that it lacks security audits). The real problem is that there are too many code paths that can trigger disk IO. TDE needs to modify them, and each code path has different behavior (aligned or unaligned, written once or more than once). The front-end tools don't even use the VFD layer; they use pread with an offset. TDE needs to fix them all, while at the same time keeping the patch small enough. I still think there needs to be a unified IO API, not just modifications to BufFileDumpBuffer() and BufFileLoadBuffer(), and it would be great if the front-end tools also used this API. With that API, TDE can focus on block IO and produce a small patch. Other work can also benefit from this API, like PostgreSQL with AIO, PostgreSQL on S3 (BLKSZ=4M), PostgreSQL on PMEM (no FS), and many others.
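To illustrate the kind of unified block-oriented I/O API being argued for here, a rough sketch follows. Every name in it is hypothetical; the only point is that the signatures stay pread/pwrite-like while the encryption hooks sit underneath, so backend code and front-end tools could in principle share it:

#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define IO_BLCKSZ 8192

/* Per-block transform hooks; left NULL when encryption is disabled. */
typedef void (*block_transform_fn) (unsigned char *block, uint64_t blockno);

typedef struct BlockFile
{
    int         fd;
    block_transform_fn encrypt;     /* applied just before pwrite() */
    block_transform_fn decrypt;     /* applied just after pread() */
} BlockFile;

static ssize_t
block_file_read(BlockFile *bf, unsigned char *buf, uint64_t blockno)
{
    ssize_t     n = pread(bf->fd, buf, IO_BLCKSZ, (off_t) (blockno * IO_BLCKSZ));

    if (n == IO_BLCKSZ && bf->decrypt)
        bf->decrypt(buf, blockno);
    return n;
}

static ssize_t
block_file_write(BlockFile *bf, const unsigned char *buf, uint64_t blockno)
{
    unsigned char staging[IO_BLCKSZ];

    memcpy(staging, buf, IO_BLCKSZ);
    if (bf->encrypt)
        bf->encrypt(staging, blockno);
    return pwrite(bf->fd, staging, IO_BLCKSZ, (off_t) (blockno * IO_BLCKSZ));
}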
Greetings, * Yura Sokolov (y.sokolov@postgrespro.ru) wrote: > В Пн, 25/10/2021 в 12:12 -0400, Stephen Frost пишет: > > * Yura Sokolov (y.sokolov@postgrespro.ru) wrote: > > > В Чт, 21/10/2021 в 13:28 -0400, Stephen Frost пишет: > > > > I really don't think this is necessary. Similar to PageSetChecksumCopy > > > > and PageSetChecksumInplace, I'm sure we would have functions which are > > > > called in the appropriate spots to do encryption (such as 'encrypt_page' > > > > and 'encrypt_block' in the Cybertec patch) and folks could review those > > > > in relative isolation to the rest. Dealing with blocks in PG is already > > > > pretty well handled, the infrastructure that needs to be added is around > > > > handling temporary files and is being actively worked on ... if we could > > > > move past this debate around if we should be adding support for XTS or > > > > if only GCM-SIV would be accepted. > > > > > > > > ..... > > > > > > > > No, the CTR approach isn't great because, as has been discussed quite a > > > > bit already, using the LSN as the IV means that different data might be > > > > re-encrypted with the same LSN and that's not an acceptable thing to > > > > have happen with CTR. > > > > > > > > ..... > > > > > > > > We've discussed at length how using CTR for heap isn't a good idea even > > > > if we're using the LSN for the IV, while if we use XTS then we don't > > > > have the issues that CTR has with IV re-use and using the LSN (plus > > > > block number and perhaps other things). Nothing in what has been > > > > discussed here has really changed anything there that I can see and so > > > > it's unclear to me why we continue to go round and round with it. > > > > > > > > > > Instead of debatting XTS vs GCM-SIV I'd suggest Google's Adiantum [1][2] > > > [3][4]. > > > > That sounds like a great thing to think about adding ... after we get > > something in that's based on XTS. > > Why? I see no points to do it after. Why not XTS after Adiantum? > > Ok, I see one: XTS is standartized. That's certainly one aspect of it. It's also more easily available to us, and frankly the people working on this and writing the patches have a better understanding of XTS from having looked into it. That it's also more widely used is another reason. > > > It is explicitely created to solve large block encryption issue - disk > > > encryption. It is used to encrypt 4kb at whole, but in fact has no > > > (practical) limit on block size: it is near-zero modified to encrypt 1kb > > > or 8kb or 32kb. > > > > > > It has benefits of both XTS and GCM-SIV: > > > - like GCM-SIV every bit of cipher text depends on every bit of plain text > > > - therefore like GCM-SIV it is resistant to IV reuse: it is safe to reuse > > > LSN+reloid+blocknumber tuple as IV even for hint-bit changes since every > > > block's bit will change. > > > > The advantage of GCM-SIV is that it provides integrity as well as > > confidentiality. > > Integrity could be based on simple non-cryptographic checksum, and it could > be checked after decryption. It would be imposible to intentionally change > encrypted page in a way it will pass checksum after decription. No, it wouldn't be impossible when we're talking about non-cryptographic checksums. That is, in fact, why you'd call them that. If it were impossible (or at least utterly impractical) then you'd be able to claim that it's cryptographic-level integrity validation. > Currently we have 16bit checksum, and it is very small. 
But having larger > checksum is orthogonal (ie doesn't bound) to having encryption. Sure, but that would also require a page-format change. We've pointed out the downsides of that and what it would prevent in terms of use-cases. That's still something that might happen but it would be a different effort from this. > In fact, Adiantum is easily made close to SIV construction: > - just leave last 8/16 bytes zero. If after decription they are zero, > then integrity check passed. > That is because SIV and Adiantum are very similar in its structure: > - SIV: > -- hash > -- then stream cipher > - Adiantum: > -- hash (except last 16bytes) > -- then encrypt last 16bytes with hash, > -- then stream cipher > -- then hash. > If last N (N>16) bytes is nonce + zero bytes, then "hash, then encrypt last > 16bytes with hash" become equivalent to just "hash", and Adiantum became > logical equivalent to SIV. While I appreciate your interest in this, I don't think it makes sense for us to try and implement something of our own- we're not cryptographers. Best is to look at published guidance and what other projects have had success doing, and that's what this thread has been about. > > > - like XTS it doesn't need to change plain text format and doesn't need in > > > additional Nonce/Auth Code. > > > > Sure, in which case it's something that could potentially be added later > > as another option in the future. I don't think we'll always have just > > one encryption method and it's good to generally think about what it > > might look like to have others but I don't think it makes sense to try > > and get everything in all at once. > > And among others Adiantum looks best: it is fast even without hardware > acceleration, it provides whole block encryption (ie every bit depends > on every bit) and it doesn't bound to plain-text format. And it could still be added later as another option if folks really want it to be. I've outlined why it makes sense to go with XTS first but I don't mean that to imply that we'll only ever have that. Indeed, once we've actually got something, adding other methods will almost certainly be simpler. Trying to do everything from the start will make this very difficult to accomplish though. Thanks, Stephen
On 10/26/21 21:43, Stephen Frost wrote: > Greetings, > > * Yura Sokolov (y.sokolov@postgrespro.ru) wrote: >> ... >> >> Integrity could be based on simple non-cryptographic checksum, and it could >> be checked after decryption. It would be imposible to intentionally change >> encrypted page in a way it will pass checksum after decription. > > No, it wouldn't be impossible when we're talking about non-cryptographic > checksums. That is, in fact, why you'd call them that. If it were > impossible (or at least utterly impractical) then you'd be able to claim > that it's cryptographic-level integrity validation. > Yeah, our checksums are probabilistic protection against rare and random bitflips cause by hardware, not against an attacker in the crypto sense. To explain why it's not enough, consider our checksum is uint16, i.e. there are only 64k possible values. In other words, you can try flipping bits in the encrypted page, and after generating 64k you're guaranteed to have at least one collision. Yes, it's harder to get collision with the existing checksum, and compression methods that diffuse bits better makes it harder to get a valid page after decryption, but it's simply not the same thing as a crypto integrity. Let's not try inventing something custom, there's been enough crypto failures due to smart custom stuff in the past already. BTW I'm not sure what the existing patches do, but I wonder if we should calculate the checksum before or after encryption. I'd say it should be after encryption, because checksums were meant as a protection against issues at the storage level, so the checksum should be on what's written to storage, and it'd also allow offline verification of checksums etc. (Of course, that'd make the whole idea of relying on our checksums even more futile.) Note: Maybe there are reasons why the checksum needs to be calculated before encryption, not sure. >> Currently we have 16bit checksum, and it is very small. But having larger >> checksum is orthogonal (ie doesn't bound) to having encryption. > > Sure, but that would also require a page-format change. We've pointed > out the downsides of that and what it would prevent in terms of > use-cases. That's still something that might happen but it would be a > different effort from this. > ... and if such page format ends up happening, it'd be fairly easy to just add some extra crypto data into the page header and not rely on the data checksums at all. >> In fact, Adiantum is easily made close to SIV construction: >> - just leave last 8/16 bytes zero. If after decription they are zero, >> then integrity check passed. >> That is because SIV and Adiantum are very similar in its structure: >> - SIV: >> -- hash >> -- then stream cipher >> - Adiantum: >> -- hash (except last 16bytes) >> -- then encrypt last 16bytes with hash, >> -- then stream cipher >> -- then hash. >> If last N (N>16) bytes is nonce + zero bytes, then "hash, then encrypt last >> 16bytes with hash" become equivalent to just "hash", and Adiantum became >> logical equivalent to SIV. > > While I appreciate your interest in this, I don't think it makes sense > for us to try and implement something of our own- we're not > cryptographers. Best is to look at published guideance and what other > projects have had success doing, and that's what this thread has been > about. > Yeah, I personally don't see much difference between XTS and Adiantum. There are a bunch of benefits, but the main reason why Google developed it seems to be performance on low-end ARM machines (i.e. 
phones). Which is nice, but it's probably not hugely important - very few people run Pg on such machines, especially in performance-sensitive context. It's true Adiantum is probably more resilient to IV reuse etc. but it's not like XTS is suddenly obsolete, and it certainly doesn't solve the integrity issue etc. >>>> - like XTS it doesn't need to change plain text format and doesn't need in >>>> additional Nonce/Auth Code. >>> >>> Sure, in which case it's something that could potentially be added later >>> as another option in the future. I don't think we'll always have just >>> one encryption method and it's good to generally think about what it >>> might look like to have others but I don't think it makes sense to try >>> and get everything in all at once. >> >> And among others Adiantum looks best: it is fast even without hardware >> acceleration, it provides whole block encryption (ie every bit depends >> on every bit) and it doesn't bound to plain-text format. > > And it could still be added later as another option if folks really want > it to be. I've outlined why it makes sense to go with XTS first but I > don't mean that to imply that we'll only ever have that. Indeed, once > we've actually got something, adding other methods will almost certainly > be simpler. Trying to do everything from the start will make this very > difficult to accomplish though. > Yeah. So maybe the best thing is simply to roll with both - design the whole feature in a way that allows selecting the encryption scheme, with two options. That's generally a good engineering practice, as it ensures things are not coupled too much. And it's not like the encryption methods are expected to be super difficult. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
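To put a number on the "only 64k possible values" argument above, here is a toy simulation. The checksum below is a stand-in, not pg_checksum_page(), but the counting argument does not depend on the particular 16-bit function:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 8192

/* Any reasonably mixing 16-bit checksum will do for the illustration. */
static uint16_t
toy_checksum(const unsigned char *page)
{
    uint32_t    h = 2166136261u;        /* FNV-1a, folded to 16 bits */

    for (int i = 0; i < PAGE_SIZE; i++)
        h = (h ^ page[i]) * 16777619u;
    return (uint16_t) (h ^ (h >> 16));
}

int
main(void)
{
    unsigned char page[PAGE_SIZE];
    unsigned char forged[PAGE_SIZE];
    uint16_t    target;
    long        attempts = 0;

    srand(42);
    for (int i = 0; i < PAGE_SIZE; i++)
        page[i] = (unsigned char) rand();
    target = toy_checksum(page);

    /* Keep flipping random bits until the modified page collides. */
    memcpy(forged, page, PAGE_SIZE);
    do
    {
        forged[rand() % PAGE_SIZE] ^= (unsigned char) (1 << (rand() % 8));
        attempts++;
    } while (toy_checksum(forged) != target);

    /* Expect on the order of 2^16 = 65536 attempts. */
    printf("collision after %ld random bit flips\n", attempts);
    return 0;
}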
On Tue, Oct 26, 2021 at 11:11:39PM +0200, Tomas Vondra wrote: > BTW I'm not sure what the existing patches do, but I wonder if we should > calculate the checksum before or after encryption. I'd say it should be > after encryption, because checksums were meant as a protection against > issues at the storage level, so the checksum should be on what's written to > storage, and it'd also allow offline verification of checksums etc. (Of > course, that'd make the whole idea of relying on our checksums even more > futile.) > > Note: Maybe there are reasons why the checksum needs to be calculated before > encryption, not sure. Yes, these are the tradeoffs --- allowing offline checksum checking without requiring the key vs. giving _some_ integrity checking and requiring the key. > Yeah, I personally don't see much difference between XTS and Adiantum. > > There are a bunch of benefits, but the main reason why Google developed it > seems to be performance on low-end ARM machines (i.e. phones). Which is > nice, but it's probably not hugely important - very few people run Pg on > such machines, especially in performance-sensitive context. > > It's true Adiantum is probably more resilient to IV reuse etc. but it's not > like XTS is suddenly obsolete, and it certainly doesn't solve the integrity > issue etc. > > > > > > - like XTS it doesn't need to change plain text format and doesn't need in > > > > > additional Nonce/Auth Code. > > > > > > > > Sure, in which case it's something that could potentially be added later > > > > as another option in the future. I don't think we'll always have just > > > > one encryption method and it's good to generally think about what it > > > > might look like to have others but I don't think it makes sense to try > > > > and get everything in all at once. > > > > > > And among others Adiantum looks best: it is fast even without hardware > > > acceleration, it provides whole block encryption (ie every bit depends > > > on every bit) and it doesn't bound to plain-text format. > > > > And it could still be added later as another option if folks really want > > it to be. I've outlined why it makes sense to go with XTS first but I > > don't mean that to imply that we'll only ever have that. Indeed, once > > we've actually got something, adding other methods will almost certainly > > be simpler. Trying to do everything from the start will make this very > > difficult to accomplish though. > > > ... > So maybe the best thing is simply to roll with both - design the whole > feature in a way that allows selecting the encryption scheme, with two > options. That's generally a good engineering practice, as it ensures things > are not coupled too much. And it's not like the encryption methods are > expected to be super difficult. I am not in favor of adding additional options to this feature unless we can explain why users should choose one over the other. There is also the problem of OpenSSL not supporting Adiantum. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, Oct 26, 2021 at 11:11:39PM +0200, Tomas Vondra wrote: > > BTW I'm not sure what the existing patches do, but I wonder if we should > > calculate the checksum before or after encryption. I'd say it should be > > after encryption, because checksums were meant as a protection against > > issues at the storage level, so the checksum should be on what's written to > > storage, and it'd also allow offline verification of checksums etc. (Of > > course, that'd make the whole idea of relying on our checksums even more > > futile.) > > > > Note: Maybe there are reasons why the checksum needs to be calculated before > > encryption, not sure. > > Yes, these are the tradeoffs --- allowing offline checksum checking > without requiring the key vs. giving _some_ integrity checking and > requiring the key. I'm in favor of calculating the checksum before encrypting as that will still catch the storage level bit-flips that it was implemented to address in the first place and will also make it so that we're very likely to realize we have an incorrect key before trying to do anything with the page. That it might also serve as a deterrent against attackers trying to randomly flip bits in a page has perhaps some value but without a cryptographic-level hash it isn't really enough to prevent against an attacker who has write access to a page. Any tools which include checking the checksum on pages already have to deal with clusters where checksums aren't enabled anyway and I wouldn't expect it to generally be an issue for tools which want to validate checksums on an encrypted cluster to be able to have the appropriate key(s) necessary for doing so and to be able to perform the decryption to do the check. We can certainly make pg_checksums do this and I don't see it as an issue for pgbackrest, as two examples that I've specifically thought about. > > > > > > - like XTS it doesn't need to change plain text format and doesn't need in > > > > > > additional Nonce/Auth Code. > > > > > > > > > > Sure, in which case it's something that could potentially be added later > > > > > as another option in the future. I don't think we'll always have just > > > > > one encryption method and it's good to generally think about what it > > > > > might look like to have others but I don't think it makes sense to try > > > > > and get everything in all at once. > > > > > > > > And among others Adiantum looks best: it is fast even without hardware > > > > acceleration, it provides whole block encryption (ie every bit depends > > > > on every bit) and it doesn't bound to plain-text format. > > > > > > And it could still be added later as another option if folks really want > > > it to be. I've outlined why it makes sense to go with XTS first but I > > > don't mean that to imply that we'll only ever have that. Indeed, once > > > we've actually got something, adding other methods will almost certainly > > > be simpler. Trying to do everything from the start will make this very > > > difficult to accomplish though. > ... > > So maybe the best thing is simply to roll with both - design the whole > > feature in a way that allows selecting the encryption scheme, with two > > options. That's generally a good engineering practice, as it ensures things > > are not coupled too much. And it's not like the encryption methods are > > expected to be super difficult. 
> > I am not in favor of adding additional options to this feature unless we > can explain why users should choose one over the other. There is also > the problem of OpenSSL not supporting Adiantum. I can understand the general idea that we should be sure to engineer this in a way that multiple methods can be used, as surely one day folks will say that AES128 isn't acceptable any more. In terms of what we'll do from the start, I would think providing the options of AES128 and AES256 would be good to ensure that we have the bits covered to support multiple methods and I don't think that would put us into a situation of having to really explain which to use to users (we don't for pgcrypto anyway, as an example). I agree that we shouldn't be looking at adding in a whole new crypto library for this though, that's a large and independent effort (see the work on NSS happening nearby). Thanks, Stephen
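Returning to the ordering question, a minimal sketch of the checksum-before-encryption sequence argued for above. The checksum routine and page layout here are stand-ins, and the encryption call is only a placeholder for whichever mode ends up being used; the point is purely the ordering:

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE     8192
#define CHECKSUM_OFF  8         /* illustrative location of the checksum field */

/* Stand-in for pg_checksum_page(); skips the checksum field itself. */
static uint16_t
page_checksum(const unsigned char *page, uint32_t blkno)
{
    uint32_t    sum = blkno;

    for (int i = 0; i < PAGE_SIZE; i++)
        if (i < CHECKSUM_OFF || i >= CHECKSUM_OFF + 2)
            sum = sum * 31 + page[i];
    return (uint16_t) (sum ^ (sum >> 16));
}

/* Placeholder for the real cipher (e.g. XTS); NOT real encryption. */
static void
encrypt_page_in_place(unsigned char *page, uint32_t blkno)
{
    for (int i = 0; i < PAGE_SIZE; i++)
        page[i] ^= (unsigned char) (blkno + 0x5a);
}

/*
 * Checksum first, then encrypt: after decryption the checksum still catches
 * storage-level bit flips, and decrypting with the wrong key shows up as a
 * checksum failure instead of silently producing garbage.
 */
static void
prepare_page_for_write(unsigned char *page, uint32_t blkno)
{
    uint16_t    sum = page_checksum(page, blkno);

    memcpy(page + CHECKSUM_OFF, &sum, sizeof(sum));
    encrypt_page_in_place(page, blkno);
    /* ... then hand the buffer to the actual write path ... */
}

The trade-off named above follows directly from this ordering: an offline tool can only verify the checksum after it has decrypted the page, so it needs access to the key.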
On 2021/11/2 02:24, Stephen Frost wrote: > I can understand the general idea that we should be sure to engineer > this in a way that multiple methods can be used, as surely one day folks > will say that AES128 isn't acceptable any more. Cheers!
On Mon, Nov 1, 2021 at 02:24:36PM -0400, Stephen Frost wrote: > I can understand the general idea that we should be sure to engineer > this in a way that multiple methods can be used, as surely one day folks > will say that AES128 isn't acceptable any more. In terms of what we'll > do from the start, I would think providing the options of AES128 and > AES256 would be good to ensure that we have the bits covered to support > multiple methods and I don't think that would put us into a situation of My patch supports AES128, AES192, and AES256. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Mon, Nov 1, 2021 at 02:24:36PM -0400, Stephen Frost wrote: > > I can understand the general idea that we should be sure to engineer > > this in a way that multiple methods can be used, as surely one day folks > > will say that AES128 isn't acceptable any more. In terms of what we'll > > do from the start, I would think providing the options of AES128 and > > AES256 would be good to ensure that we have the bits covered to support > > multiple methods and I don't think that would put us into a situation of > > My patch supports AES128, AES192, and AES256. Right, so we're already showing that it's flexible to allow for multiple encryption methods. If folks want more then it's on them to research how they'd work exactly and explain why they'd be useful to add and how users might make an informed choice (though, again, I don't think we need to go *too* deep into that as we don't for, eg, pgcrypto, and I don't believe we've ever heard people complain about that). Thanks, Stephen
On Mon, Nov 1, 2021 at 02:24:36PM -0400, Stephen Frost wrote: > I can understand the general idea that we should be sure to engineer > this in a way that multiple methods can be used, as surely one day folks > will say that AES128 isn't acceptable any more. In terms of what we'll > do from the start, I would think providing the options of AES128 and > AES256 would be good to ensure that we have the bits covered to support > multiple methods and I don't think that would put us into a situation of > having to really explain which to use to users (we don't for pgcrypto > anyway, as an example). I agree that we shouldn't be looking at adding > in a whole new crypto library for this though, that's a large and > independent effort (see the work on NSS happening nearby). Since it has been two weeks since the last activity on this thread, I have updated the TDE wiki to reflect the conclusions and discussions: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Stephen Frost <sfrost@snowman.net> wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Tue, Oct 19, 2021 at 02:54:56PM -0400, Stephen Frost wrote: > > > * Sasasu (i@sasa.su) wrote: > > > > A unified block-based I/O API sounds great. Has anyone tried to do this > > > > before? It would be nice if the front-end tools could also use these API. > > > > > > The TDE patch from Cybertec did go down this route, but the API ended up > > > being rather different which menat a lot of changes in other parts of > > > the system. If we can get a block-based temporary file method that > > > maintains more-or-less the same API, that'd be great, but I'm not sure > > > that we can really do so and I am not entirely convinced that we should > > > make the TDE effort depend on an otherwise quite independent effort of > > > making all temp files usage be block based. > > > > Uh, I thought people felt the Cybertec patch was too large and that a > > unified API for temporary file I/O-encryption was a requirement. Would > > a CTR-steaming-encryption API for temporary tables be easier to > > implement? > > Having a unified API for temporary file I/O (that could then be extended > to provide encryption) would definitely help with reducing the size of a > TDE patch. The approach used in the Cybertec patch was to make > temporary file access block based, but the way that was implemented was > with an API different from pread/pwrite and that meant changing pretty > much all of the call sites for temporary file access, which naturally > resulted in changes in a lot of otherwise unrelated code. The changes to buffile.c are not trivial, but we haven't really changed the API, as long as you mean BufFileCreateTemp(), BufFileWrite(), BufFileRead(). What our patch affects on the caller side is that BufFileOpenTransient(), BufFileCloseTransient(), BufFileWriteTransient() and BufFileReadTransient() replace OpenTransientFile(), CloseTransientFile(), write()/fwrite() and read()/fread() respectively in reorderbuffer.c and in pgstat.c. These changes become a little bit less invasive in TDE 1.1 than they were in 1.0, see [1], see the diffs attached. (I expect that [2] will get committed someday so that the TDE feature won't affect pgstat.c in the future at all.) -- Antonin Houska Web: https://www.cybertec-postgresql.com [1] https://github.com/cybertec-postgresql/postgres/tree/PG_14_TDE_1_1 [2] https://commitfest.postgresql.org/34/1708/ diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index d474ea1d0a..9cb0708e41 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -99,6 +99,7 @@ #include "replication/slot.h" #include "replication/snapbuild.h" /* just for SnapBuildSnapDecRefcount */ #include "storage/bufmgr.h" +#include "storage/buffile.h" #include "storage/fd.h" #include "storage/sinval.h" #include "utils/builtins.h" @@ -131,21 +132,13 @@ typedef struct ReorderBufferTupleCidEnt CommandId combocid; /* just for debugging */ } ReorderBufferTupleCidEnt; -/* Virtual file descriptor with file offset tracking */ -typedef struct TXNEntryFile -{ - File vfd; /* -1 when the file is closed */ - off_t curOffset; /* offset for next write or read. Reset to 0 - * when vfd is opened. 
*/ -} TXNEntryFile; - /* k-way in-order change iteration support structures */ typedef struct ReorderBufferIterTXNEntry { XLogRecPtr lsn; ReorderBufferChange *change; ReorderBufferTXN *txn; - TXNEntryFile file; + TransientBufFile *file; XLogSegNo segno; } ReorderBufferIterTXNEntry; @@ -243,14 +236,22 @@ static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMe * Disk serialization support functions * --------------------------------------- */ +static void ReorderBufferTweakBase(ReorderBufferTXN *txn, + char tweak_base[TWEAK_BASE_SIZE]); static void ReorderBufferCheckMemoryLimit(ReorderBuffer *rb); static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn); static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, - int fd, ReorderBufferChange *change); + TransientBufFile *file, ReorderBufferChange *change); +static void ReorderBufferWriteData(TransientBufFile *file, void *ptr, size_t size, + ReorderBufferTXN *txn); static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, - TXNEntryFile *file, XLogSegNo *segno); + TransientBufFile **file, XLogSegNo *segno); static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, - char *change); + TransientBufFile **file); +static ReorderBufferTupleBuf *ReorderBufferRestoreTuple(ReorderBuffer *rb, + TransientBufFile *file); +static void ReorderBufferReadData(TransientBufFile *file, void *ptr, size_t size, + bool *no_data_p); static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn); static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared); @@ -342,8 +343,6 @@ ReorderBufferAllocate(void) buffer->by_txn_last_xid = InvalidTransactionId; buffer->by_txn_last_txn = NULL; - buffer->outbuf = NULL; - buffer->outbufsize = 0; buffer->size = 0; buffer->spillTxns = 0; @@ -1241,7 +1240,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn, for (off = 0; off < state->nr_txns; off++) { - state->entries[off].file.vfd = -1; + state->entries[off].file = NULL; state->entries[off].segno = 0; } @@ -1423,8 +1422,8 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb, for (off = 0; off < state->nr_txns; off++) { - if (state->entries[off].file.vfd != -1) - FileClose(state->entries[off].file.vfd); + if (state->entries[off].file) + BufFileCloseTransient(state->entries[off].file); } /* free memory we might have "leaked" in the last *Next call */ @@ -3332,21 +3331,39 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid) */ /* - * Ensure the IO buffer is >= sz. + * Initialize the common part of the encryption tweak. */ static void -ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz) +ReorderBufferTweakBase(ReorderBufferTXN *txn, + char tweak_base[TWEAK_BASE_SIZE]) { - if (!rb->outbufsize) - { - rb->outbuf = MemoryContextAlloc(rb->context, sz); - rb->outbufsize = sz; - } - else if (rb->outbufsize < sz) - { - rb->outbuf = repalloc(rb->outbuf, sz); - rb->outbufsize = sz; - } + char *c = tweak_base; + pid_t pid = MyProcPid; + int pid_bytes; + +/* Only this part of the PID fits into the tweak. 
*/ +#define PID_BYTES_USABLE 3 + + StaticAssertStmt(1 + sizeof(TransactionId) + PID_BYTES_USABLE + <= TWEAK_BASE_SIZE, + "tweak components do not fit into TWEAK_BASE_SIZE"); + + memset(tweak_base, 0, TWEAK_BASE_SIZE); + *c = TRANS_BUF_FILE_REORDERBUFFER; + c++; + memcpy(c, &txn->xid, sizeof(TransactionId)); + c += sizeof(TransactionId); + + /* + * There's only room for PID_BYTES_USABLE bytes of the PID. Use the less + * significant part so that PID increment always causes tweak change. + */ + pid_bytes = Min(sizeof(pid), PID_BYTES_USABLE); +#ifdef WORDS_BIGENDIAN + memcpy(c, ((char *) &pid) + sizeof(pid) - pid_bytes, pid_bytes); +#else + memcpy(c, &pid, pid_bytes); +#endif } /* @@ -3517,9 +3534,10 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) { dlist_iter subtxn_i; dlist_mutable_iter change_i; - int fd = -1; + TransientBufFile *file = NULL; XLogSegNo curOpenSegNo = 0; Size spilled = 0; + Size size = txn->size; elog(DEBUG2, "spill %u changes in XID %u to disk", @@ -3545,13 +3563,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) * store in segment in which it belongs by start lsn, don't split over * multiple segments tho */ - if (fd == -1 || + if (file == NULL || !XLByteInSeg(change->lsn, curOpenSegNo, wal_segment_size)) { char path[MAXPGPATH]; + char tweak_base[TWEAK_BASE_SIZE]; - if (fd != -1) - CloseTransientFile(fd); + if (file) + BufFileCloseTransient(file); XLByteToSeg(change->lsn, curOpenSegNo, wal_segment_size); @@ -3562,17 +3581,15 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, curOpenSegNo); + if (data_encrypted) + ReorderBufferTweakBase(txn, tweak_base); /* open segment, create it if necessary */ - fd = OpenTransientFile(path, - O_CREAT | O_WRONLY | O_APPEND | PG_BINARY); - - if (fd < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not open file \"%s\": %m", path))); + file = BufFileOpenTransient(path, + O_CREAT | O_WRONLY | O_APPEND | PG_BINARY, + tweak_base); } - ReorderBufferSerializeChange(rb, txn, fd, change); + ReorderBufferSerializeChange(rb, txn, file, change); dlist_delete(&change->node); ReorderBufferReturnChange(rb, change, true); @@ -3597,8 +3614,8 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) txn->nentries_mem = 0; txn->txn_flags |= RBTXN_IS_SERIALIZED; - if (fd != -1) - CloseTransientFile(fd); + if (file) + BufFileCloseTransient(file); } /* @@ -3606,15 +3623,13 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) */ static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, - int fd, ReorderBufferChange *change) + TransientBufFile *file, ReorderBufferChange *change) { - ReorderBufferDiskChange *ondisk; + ReorderBufferDiskChange hdr; Size sz = sizeof(ReorderBufferDiskChange); - ReorderBufferSerializeReserve(rb, sz); - - ondisk = (ReorderBufferDiskChange *) rb->outbuf; - memcpy(&ondisk->change, change, sizeof(ReorderBufferChange)); + memcpy((char *) &hdr + offsetof(ReorderBufferDiskChange, change), + change, sizeof(ReorderBufferChange)); switch (change->action) { @@ -3624,7 +3639,6 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, case REORDER_BUFFER_CHANGE_DELETE: case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT: { - char *data; ReorderBufferTupleBuf *oldtup, *newtup; Size oldlen = 0; @@ -3647,84 +3661,71 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, sz += newlen; } - /* make sure we have enough space 
*/ - ReorderBufferSerializeReserve(rb, sz); - - data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange); - /* might have been reallocated above */ - ondisk = (ReorderBufferDiskChange *) rb->outbuf; + hdr.size = sz; + ReorderBufferWriteData(file, &hdr, sizeof(ReorderBufferDiskChange), + txn); if (oldlen) { - memcpy(data, &oldtup->tuple, sizeof(HeapTupleData)); - data += sizeof(HeapTupleData); - - memcpy(data, oldtup->tuple.t_data, oldlen); - data += oldlen; + ReorderBufferWriteData(file, &oldtup->tuple, + sizeof(HeapTupleData), txn); + ReorderBufferWriteData(file, oldtup->tuple.t_data, oldlen, + txn); } if (newlen) { - memcpy(data, &newtup->tuple, sizeof(HeapTupleData)); - data += sizeof(HeapTupleData); - - memcpy(data, newtup->tuple.t_data, newlen); - data += newlen; + ReorderBufferWriteData(file, &newtup->tuple, + sizeof(HeapTupleData), txn); + ReorderBufferWriteData(file, newtup->tuple.t_data, newlen, + txn); } break; } case REORDER_BUFFER_CHANGE_MESSAGE: { - char *data; Size prefix_size = strlen(change->data.msg.prefix) + 1; sz += prefix_size + change->data.msg.message_size + sizeof(Size) + sizeof(Size); - ReorderBufferSerializeReserve(rb, sz); - data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange); - - /* might have been reallocated above */ - ondisk = (ReorderBufferDiskChange *) rb->outbuf; + hdr.size = sz; + ReorderBufferWriteData(file, &hdr, + sizeof(ReorderBufferDiskChange), + txn); /* write the prefix including the size */ - memcpy(data, &prefix_size, sizeof(Size)); - data += sizeof(Size); - memcpy(data, change->data.msg.prefix, - prefix_size); - data += prefix_size; + ReorderBufferWriteData(file, &prefix_size, sizeof(Size), txn); + ReorderBufferWriteData(file, change->data.msg.prefix, + prefix_size, txn); /* write the message including the size */ - memcpy(data, &change->data.msg.message_size, sizeof(Size)); - data += sizeof(Size); - memcpy(data, change->data.msg.message, - change->data.msg.message_size); - data += change->data.msg.message_size; + ReorderBufferWriteData(file, &change->data.msg.message_size, + sizeof(Size), txn); + ReorderBufferWriteData(file, change->data.msg.message, + change->data.msg.message_size, txn); break; } case REORDER_BUFFER_CHANGE_INVALIDATION: { - char *data; Size inval_size = sizeof(SharedInvalidationMessage) * change->data.inval.ninvalidations; sz += inval_size; - ReorderBufferSerializeReserve(rb, sz); - data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange); - - /* might have been reallocated above */ - ondisk = (ReorderBufferDiskChange *) rb->outbuf; - memcpy(data, change->data.inval.invalidations, inval_size); - data += inval_size; - + hdr.size = sz; + ReorderBufferWriteData(file, &hdr, + sizeof(ReorderBufferDiskChange), + txn); + ReorderBufferWriteData(file, change->data.inval.invalidations, + inval_size, + txn); break; } case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT: { Snapshot snap; - char *data; snap = change->data.snapshot; @@ -3732,78 +3733,50 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, sizeof(TransactionId) * snap->xcnt + sizeof(TransactionId) * snap->subxcnt; - /* make sure we have enough space */ - ReorderBufferSerializeReserve(rb, sz); - data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange); - /* might have been reallocated above */ - ondisk = (ReorderBufferDiskChange *) rb->outbuf; + hdr.size = sz; + ReorderBufferWriteData(file, &hdr, + sizeof(ReorderBufferDiskChange), txn); - memcpy(data, snap, sizeof(SnapshotData)); - data += sizeof(SnapshotData); + 
ReorderBufferWriteData(file, snap, sizeof(SnapshotData), txn); if (snap->xcnt) - { - memcpy(data, snap->xip, - sizeof(TransactionId) * snap->xcnt); - data += sizeof(TransactionId) * snap->xcnt; - } + ReorderBufferWriteData(file, snap->xip, + sizeof(TransactionId) * snap->xcnt, + txn); if (snap->subxcnt) - { - memcpy(data, snap->subxip, - sizeof(TransactionId) * snap->subxcnt); - data += sizeof(TransactionId) * snap->subxcnt; - } + ReorderBufferWriteData(file, snap->subxip, + sizeof(TransactionId) * snap->subxcnt, + txn); break; } case REORDER_BUFFER_CHANGE_TRUNCATE: { Size size; - char *data; /* account for the OIDs of truncated relations */ size = sizeof(Oid) * change->data.truncate.nrelids; sz += size; - /* make sure we have enough space */ - ReorderBufferSerializeReserve(rb, sz); - - data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange); - /* might have been reallocated above */ - ondisk = (ReorderBufferDiskChange *) rb->outbuf; - - memcpy(data, change->data.truncate.relids, size); - data += size; + hdr.size = sz; + ReorderBufferWriteData(file, &hdr, sizeof(ReorderBufferDiskChange), + txn); + ReorderBufferWriteData(file, change->data.truncate.relids, size, + txn); break; } case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM: case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT: case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID: case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID: + hdr.size = sz; + ReorderBufferWriteData(file, &hdr, sizeof(ReorderBufferDiskChange), + txn); /* ReorderBufferChange contains everything important */ break; } - ondisk->size = sz; - - errno = 0; - pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_WRITE); - if (write(fd, rb->outbuf, ondisk->size) != ondisk->size) - { - int save_errno = errno; - - CloseTransientFile(fd); - - /* if write didn't set errno, assume problem is no disk space */ - errno = save_errno ? save_errno : ENOSPC; - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not write to data file for XID %u: %m", - txn->xid))); - } - pgstat_report_wait_end(); - /* * Keep the transaction's final_lsn up to date with each change we send to * disk, so that ReorderBufferRestoreCleanup works correctly. (We used to @@ -3814,8 +3787,21 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, */ if (txn->final_lsn < change->lsn) txn->final_lsn = change->lsn; +} - Assert(ondisk->change.action == change->action); +/* + * Wrapper for BufFileWriteTransient() that raises ERROR if the whole chunk + * was not written. XXX Should this be a macro? + */ +static void +ReorderBufferWriteData(TransientBufFile *file, void *ptr, size_t size, + ReorderBufferTXN *txn) +{ + if (BufFileWriteTransient(file, ptr, size) != size) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to data file for XID %u: %m", + txn->xid))); } /* Returns true, if the output plugin supports streaming, false, otherwise. 
*/ @@ -4058,12 +4044,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change) */ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, - TXNEntryFile *file, XLogSegNo *segno) + TransientBufFile **file, XLogSegNo *segno) { Size restored = 0; XLogSegNo last_segno; dlist_mutable_iter cleanup_iter; - File *fd = &file->vfd; Assert(txn->first_lsn != InvalidXLogRecPtr); Assert(txn->final_lsn != InvalidXLogRecPtr); @@ -4084,12 +4069,10 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, while (restored < max_changes_in_memory && *segno <= last_segno) { - int readBytes; - ReorderBufferDiskChange *ondisk; - - if (*fd == -1) + if (*file == NULL) { char path[MAXPGPATH]; + char tweak_base[TWEAK_BASE_SIZE]; /* first time in */ if (*segno == 0) @@ -4104,86 +4087,27 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, *segno); - *fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY); - - /* No harm in resetting the offset even in case of failure */ - file->curOffset = 0; - - if (*fd < 0 && errno == ENOENT) + if (data_encrypted) + ReorderBufferTweakBase(txn, tweak_base); + *file = BufFileOpenTransient(path, O_RDONLY | PG_BINARY, + tweak_base); + if (*file == NULL) { - *fd = -1; + Assert(errno == ENOENT); (*segno)++; continue; } - else if (*fd < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not open file \"%s\": %m", - path))); } - /* - * Read the statically sized part of a change which has information - * about the total size. If we couldn't read a record, we're at the - * end of this file. - */ - ReorderBufferSerializeReserve(rb, sizeof(ReorderBufferDiskChange)); - readBytes = FileRead(file->vfd, rb->outbuf, - sizeof(ReorderBufferDiskChange), - file->curOffset, WAIT_EVENT_REORDER_BUFFER_READ); - - /* eof */ - if (readBytes == 0) + ReorderBufferRestoreChange(rb, txn, file); + if (*file) + restored++; + else { - FileClose(*fd); - *fd = -1; + /* No data could be restored. 
*/ (*segno)++; continue; } - else if (readBytes < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not read from reorderbuffer spill file: %m"))); - else if (readBytes != sizeof(ReorderBufferDiskChange)) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes", - readBytes, - (uint32) sizeof(ReorderBufferDiskChange)))); - - file->curOffset += readBytes; - - ondisk = (ReorderBufferDiskChange *) rb->outbuf; - - ReorderBufferSerializeReserve(rb, - sizeof(ReorderBufferDiskChange) + ondisk->size); - ondisk = (ReorderBufferDiskChange *) rb->outbuf; - - readBytes = FileRead(file->vfd, - rb->outbuf + sizeof(ReorderBufferDiskChange), - ondisk->size - sizeof(ReorderBufferDiskChange), - file->curOffset, - WAIT_EVENT_REORDER_BUFFER_READ); - - if (readBytes < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not read from reorderbuffer spill file: %m"))); - else if (readBytes != ondisk->size - sizeof(ReorderBufferDiskChange)) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes", - readBytes, - (uint32) (ondisk->size - sizeof(ReorderBufferDiskChange))))); - - file->curOffset += readBytes; - - /* - * ok, read a full change from disk, now restore it into proper - * in-memory format - */ - ReorderBufferRestoreChange(rb, txn, rb->outbuf); - restored++; } return restored; @@ -4193,25 +4117,36 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, * Convert change from its on-disk format to in-memory format and queue it onto * the TXN's ->changes list. * - * Note: although "data" is declared char*, at entry it points to a - * maxalign'd buffer, making it safe in most of this function to assume - * that the pointed-to data is suitably aligned for direct access. + * If no data was found in the file, close it and set *file to NULL. */ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, - char *data) + TransientBufFile **file) { - ReorderBufferDiskChange *ondisk; + ReorderBufferDiskChange ondisk; + bool no_data; ReorderBufferChange *change; - ondisk = (ReorderBufferDiskChange *) data; + /* + * Read the statically sized part of a change which has information about + * the total size. If we couldn't read a record, we're at the end of this + * file. 
+ */ + ReorderBufferReadData(*file, &ondisk, sizeof(ReorderBufferDiskChange), + &no_data); + + /* eof */ + if (no_data) + { + BufFileCloseTransient(*file); + *file = NULL; + return; + } change = ReorderBufferGetChange(rb); /* copy static part */ - memcpy(change, &ondisk->change, sizeof(ReorderBufferChange)); - - data += sizeof(ReorderBufferDiskChange); + memcpy(change, &ondisk.change, sizeof(ReorderBufferChange)); /* restore individual stuff */ switch (change->action) @@ -4222,50 +4157,10 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, case REORDER_BUFFER_CHANGE_DELETE: case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT: if (change->data.tp.oldtuple) - { - uint32 tuplelen = ((HeapTuple) data)->t_len; - - change->data.tp.oldtuple = - ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader); - - /* restore ->tuple */ - memcpy(&change->data.tp.oldtuple->tuple, data, - sizeof(HeapTupleData)); - data += sizeof(HeapTupleData); - - /* reset t_data pointer into the new tuplebuf */ - change->data.tp.oldtuple->tuple.t_data = - ReorderBufferTupleBufData(change->data.tp.oldtuple); - - /* restore tuple data itself */ - memcpy(change->data.tp.oldtuple->tuple.t_data, data, tuplelen); - data += tuplelen; - } + change->data.tp.oldtuple = ReorderBufferRestoreTuple(rb, *file); if (change->data.tp.newtuple) - { - /* here, data might not be suitably aligned! */ - uint32 tuplelen; - - memcpy(&tuplelen, data + offsetof(HeapTupleData, t_len), - sizeof(uint32)); - - change->data.tp.newtuple = - ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader); - - /* restore ->tuple */ - memcpy(&change->data.tp.newtuple->tuple, data, - sizeof(HeapTupleData)); - data += sizeof(HeapTupleData); - - /* reset t_data pointer into the new tuplebuf */ - change->data.tp.newtuple->tuple.t_data = - ReorderBufferTupleBufData(change->data.tp.newtuple); - - /* restore tuple data itself */ - memcpy(change->data.tp.newtuple->tuple.t_data, data, tuplelen); - data += tuplelen; - } + change->data.tp.newtuple = ReorderBufferRestoreTuple(rb, *file); break; case REORDER_BUFFER_CHANGE_MESSAGE: @@ -4273,22 +4168,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, Size prefix_size; /* read prefix */ - memcpy(&prefix_size, data, sizeof(Size)); - data += sizeof(Size); + ReorderBufferReadData(*file, &prefix_size, sizeof(Size), NULL); change->data.msg.prefix = MemoryContextAlloc(rb->context, prefix_size); - memcpy(change->data.msg.prefix, data, prefix_size); + ReorderBufferReadData(*file, change->data.msg.prefix, + prefix_size, NULL); Assert(change->data.msg.prefix[prefix_size - 1] == '\0'); - data += prefix_size; /* read the message */ - memcpy(&change->data.msg.message_size, data, sizeof(Size)); - data += sizeof(Size); + ReorderBufferReadData(*file, &change->data.msg.message_size, + sizeof(Size), NULL); change->data.msg.message = MemoryContextAlloc(rb->context, change->data.msg.message_size); - memcpy(change->data.msg.message, data, - change->data.msg.message_size); - data += change->data.msg.message_size; + ReorderBufferReadData(*file, change->data.msg.message, + change->data.msg.message_size, NULL); break; } @@ -4301,29 +4194,32 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, MemoryContextAlloc(rb->context, inval_size); /* read the message */ - memcpy(change->data.inval.invalidations, data, inval_size); + ReorderBufferReadData(*file, change->data.inval.invalidations, + inval_size, NULL); break; } case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT: { - Snapshot oldsnap; + 
SnapshotData oldsnap; Snapshot newsnap; Size size; - oldsnap = (Snapshot) data; + ReorderBufferReadData(*file, &oldsnap, sizeof(SnapshotData), NULL); size = sizeof(SnapshotData) + - sizeof(TransactionId) * oldsnap->xcnt + - sizeof(TransactionId) * (oldsnap->subxcnt + 0); + sizeof(TransactionId) * oldsnap.xcnt + + sizeof(TransactionId) * (oldsnap.subxcnt + 0); change->data.snapshot = MemoryContextAllocZero(rb->context, size); newsnap = change->data.snapshot; - memcpy(newsnap, data, size); + memcpy(newsnap, &oldsnap, sizeof(SnapshotData)); newsnap->xip = (TransactionId *) (((char *) newsnap) + sizeof(SnapshotData)); + ReorderBufferReadData(*file, newsnap->xip, + size - sizeof(SnapshotData), NULL); newsnap->subxip = newsnap->xip + newsnap->xcnt; newsnap->copied = true; break; @@ -4335,7 +4231,9 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, relids = ReorderBufferGetRelids(rb, change->data.truncate.nrelids); - memcpy(relids, data, change->data.truncate.nrelids * sizeof(Oid)); + ReorderBufferReadData(*file, relids, + change->data.truncate.nrelids * sizeof(Oid), + NULL); change->data.truncate.relids = relids; break; @@ -4362,6 +4260,77 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, ReorderBufferChangeSize(change)); } +/* + * Convert heap tuple from its on-disk format to in-memory format. + */ +static ReorderBufferTupleBuf * +ReorderBufferRestoreTuple(ReorderBuffer *rb, TransientBufFile *file) +{ + HeapTupleData tupdata; + uint32 tuplelen; + ReorderBufferTupleBuf *result; + + ReorderBufferReadData(file, &tupdata, sizeof(HeapTupleData), NULL); + tuplelen = tupdata.t_len; + + result = ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader); + + /* restore ->tuple */ + memcpy(&result->tuple, &tupdata, sizeof(HeapTupleData)); + + /* reset t_data pointer into the new tuplebuf */ + result->tuple.t_data = ReorderBufferTupleBufData(result); + + /* restore tuple data itself */ + ReorderBufferReadData(file, result->tuple.t_data, tuplelen, NULL); + + return result; +} + +/* + * Wrapper for BufFileReadTransient() that raises ERROR if the expected amount + * of bytes was not read. + * + * If valid pointer is passed for no_data_p, set *no_data_p to indicate + * whether zero bytes was read. If NULL is passed, do not tolerate missing + * data. + */ +static void +ReorderBufferReadData(TransientBufFile *file, void *ptr, size_t size, + bool *no_data_p) +{ + int readBytes; + + /* + * Caller should not request zero bytes. This assumption simplifies + * setting of *no_data_p below. + */ + Assert(size > 0); + + if ((readBytes = BufFileReadTransient(file, ptr, size)) != size) + { + if (no_data_p) + *no_data_p = readBytes == 0; + + /* + * It is o.k. to receive exactly zero bytes if caller passed valid + * no_data_p. + */ + if (no_data_p && *no_data_p) + return; + + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes", + readBytes, (uint32) size))); + } + else if (no_data_p) + { + /* Given that size is non-zero, readBytes must be non-zero too. */ + *no_data_p = false; + } +} + /* * Remove all on-disk stored for the passed in transaction. */ diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index a2f75b23b8..af5b329f8a 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -11,6 +11,7 @@ * - Add a pgstat config column to pg_database, so this * entire thing can be enabled/disabled on a per db basis. 
* + * Portions Copyright (c) 2019-2021, CYBERTEC PostgreSQL International GmbH * Copyright (c) 2001-2021, PostgreSQL Global Development Group * * src/backend/postmaster/pgstat.c @@ -54,6 +55,7 @@ #include "replication/slot.h" #include "replication/walsender.h" #include "storage/backendid.h" +#include "storage/buffile.h" #include "storage/dsm.h" #include "storage/fd.h" #include "storage/ipc.h" @@ -308,10 +310,16 @@ NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_no static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create); static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create); +static void pgstat_tweak_base(bool permanent, bool global, Oid database, + char tweak_base[TWEAK_BASE_SIZE]); static void pgstat_write_statsfiles(bool permanent, bool allDbs); static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent); +static void pgstat_write_bytes(TransientBufFile *file, void *ptr, size_t size, + bool *failed); static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep); static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent); +static void pgstat_read_bytes(TransientBufFile *file, void *ptr, size_t size, + bool *failed); static void backend_read_statsfile(void); static bool pgstat_write_statsfile_needed(void); @@ -3202,6 +3210,9 @@ PgstatCollectorMain(int argc, char *argv[]) MyBackendType = B_STATS_COLLECTOR; init_ps_display(NULL); + /* BufFileOpenTransient() and friends do use VFD. */ + InitFileAccess(); + /* * Read in existing stats files or initialize the stats to zero. */ @@ -3595,6 +3606,26 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create) return result; } +/* + * Initialize the common part of the encryption tweak. + */ +static void +pgstat_tweak_base(bool permanent, bool global, Oid database, + char tweak_base[TWEAK_BASE_SIZE]) +{ + char *c = tweak_base; + + StaticAssertStmt(3 + sizeof(Oid) <= TWEAK_BASE_SIZE, + "tweak components do not fit into TWEAK_BASE_SIZE"); + memset(tweak_base, 0, TWEAK_BASE_SIZE); + *c = TRANS_BUF_FILE_PGSTATS; + c++; + *c = permanent; + c++; + *c = global; + c++; + memcpy(c, &database, sizeof(Oid)); +} /* ---------- * pgstat_write_statsfiles() - @@ -3615,18 +3646,24 @@ pgstat_write_statsfiles(bool permanent, bool allDbs) { HASH_SEQ_STATUS hstat; PgStat_StatDBEntry *dbentry; - FILE *fpout; + TransientBufFile *fpout; + File vfd; int32 format_id; const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname; const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename; - int rc; + bool failed = false; + char tweak_base[TWEAK_BASE_SIZE]; elog(DEBUG2, "writing stats file \"%s\"", statfile); /* * Open the statistics temp file to write out the current values. */ - fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (data_encrypted) + pgstat_tweak_base(permanent, true, InvalidOid, tweak_base); + fpout = BufFileOpenTransient(tmpfile, + O_CREAT | O_WRONLY | O_APPEND | PG_BINARY, + tweak_base); if (fpout == NULL) { ereport(LOG, @@ -3645,32 +3682,27 @@ pgstat_write_statsfiles(bool permanent, bool allDbs) * Write the file header --- currently just a format ID. 
*/ format_id = PGSTAT_FILE_FORMAT_ID; - rc = fwrite(&format_id, sizeof(format_id), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, &format_id, sizeof(format_id), &failed); /* * Write global stats struct */ - rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, &globalStats, sizeof(globalStats), &failed); /* * Write archiver stats struct */ - rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, &archiverStats, sizeof(archiverStats), &failed); /* * Write WAL stats struct */ - rc = fwrite(&walStats, sizeof(walStats), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, &walStats, sizeof(walStats), &failed); /* * Write SLRU stats struct */ - rc = fwrite(slruStats, sizeof(slruStats), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, slruStats, sizeof(slruStats), &failed); /* * Walk through the database table. @@ -3694,9 +3726,11 @@ pgstat_write_statsfiles(bool permanent, bool allDbs) * Write out the DB entry. We don't write the tables or functions * pointers, since they're of no use to any other process. */ - fputc('D', fpout); - rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, "D", 1, &failed); + pgstat_write_bytes(fpout, + dbentry, + offsetof(PgStat_StatDBEntry, tables), + &failed); } /* @@ -3709,29 +3743,36 @@ pgstat_write_statsfiles(bool permanent, bool allDbs) hash_seq_init(&hstat, replSlotStatHash); while ((slotent = (PgStat_StatReplSlotEntry *) hash_seq_search(&hstat)) != NULL) { - fputc('R', fpout); - rc = fwrite(slotent, sizeof(PgStat_StatReplSlotEntry), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, "R", 1, &failed); + pgstat_write_bytes(fpout, slotent, sizeof(PgStat_StatReplSlotEntry), + &failed); } } /* * No more output to be done. Close the temp file and replace the old - * pgstat.stat with it. The ferror() check replaces testing for error - * after each individual fputc or fwrite above. + * pgstat.stat with it. */ - fputc('E', fpout); + pgstat_write_bytes(fpout, "E", 1, &failed); - if (ferror(fpout)) + if (failed) { ereport(LOG, (errcode_for_file_access(), errmsg("could not write temporary statistics file \"%s\": %m", tmpfile))); - FreeFile(fpout); + BufFileCloseTransient(fpout); unlink(tmpfile); + return; } - else if (FreeFile(fpout) < 0) + + /* + * XXX This might PANIC, see FileClose(). Don't we need special behaviour + * for statistics? 
+ */ + vfd = BufFileTransientGetVfd(fpout); + BufFileCloseTransient(fpout); + if (!FileIsClosed(vfd)) { ereport(LOG, (errcode_for_file_access(), @@ -3796,12 +3837,14 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) HASH_SEQ_STATUS fstat; PgStat_StatTabEntry *tabentry; PgStat_StatFuncEntry *funcentry; - FILE *fpout; + TransientBufFile *fpout; + File vfd; int32 format_id; Oid dbid = dbentry->databaseid; - int rc; char tmpfile[MAXPGPATH]; char statfile[MAXPGPATH]; + bool failed = false; + char tweak_base[TWEAK_BASE_SIZE]; get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH); get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH); @@ -3811,7 +3854,11 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) /* * Open the statistics temp file to write out the current values. */ - fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (data_encrypted) + pgstat_tweak_base(permanent, false, dbid, tweak_base); + fpout = BufFileOpenTransient(tmpfile, + O_CREAT | O_WRONLY | O_APPEND | PG_BINARY, + tweak_base); if (fpout == NULL) { ereport(LOG, @@ -3825,8 +3872,7 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) * Write the file header --- currently just a format ID. */ format_id = PGSTAT_FILE_FORMAT_ID; - rc = fwrite(&format_id, sizeof(format_id), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, &format_id, sizeof(format_id), &failed); /* * Walk through the database's access stats per table. @@ -3834,9 +3880,14 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) hash_seq_init(&tstat, dbentry->tables); while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL) { - fputc('T', fpout); - rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, "T", 1, &failed); + if (failed) + break; + + pgstat_write_bytes(fpout, tabentry, sizeof(PgStat_StatTabEntry), + &failed); + if (failed) + break; } /* @@ -3845,28 +3896,42 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) hash_seq_init(&fstat, dbentry->functions); while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL) { - fputc('F', fpout); - rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, "F", 1, &failed); + if (failed) + break; + + pgstat_write_bytes(fpout, + funcentry, + sizeof(PgStat_StatFuncEntry), + &failed); + if (failed) + break; } /* * No more output to be done. Close the temp file and replace the old - * pgstat.stat with it. The ferror() check replaces testing for error - * after each individual fputc or fwrite above. + * pgstat.stat with it. */ - fputc('E', fpout); + pgstat_write_bytes(fpout, "E", 1, &failed); - if (ferror(fpout)) + if (failed) { ereport(LOG, (errcode_for_file_access(), errmsg("could not write temporary statistics file \"%s\": %m", tmpfile))); - FreeFile(fpout); + BufFileCloseTransient(fpout); unlink(tmpfile); + return; } - else if (FreeFile(fpout) < 0) + + /* + * XXX This might PANIC, see FileClose(). Don't we need special behaviour + * for statistics? 
+ */ + vfd = BufFileTransientGetVfd(fpout); + BufFileCloseTransient(fpout); + if (!FileIsClosed(vfd)) { ereport(LOG, (errcode_for_file_access(), @@ -3892,6 +3957,25 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) } } +/* + * Convenience routine to write data to file and check for errors. + */ +static void +pgstat_write_bytes(TransientBufFile *file, void *ptr, size_t size, + bool *failed) +{ + /* Do nothing if any previous write failed. */ + if (*failed) + return; + + /* + * Use BufFileWriteTransient() because it handles encryption + * transparently. + */ + if (BufFileWriteTransient(file, ptr, size) != size) + *failed = true; +} + /* ---------- * pgstat_read_statsfiles() - * @@ -3919,10 +4003,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) PgStat_StatDBEntry dbbuf; HASHCTL hash_ctl; HTAB *dbhash; - FILE *fpin; + TransientBufFile *fpin; int32 format_id; bool found; const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename; + bool failed = false; + char tweak_base[TWEAK_BASE_SIZE]; int i; /* @@ -3971,7 +4057,11 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) * not yet written the stats file the first time. Any other failure * condition is suspicious. */ - if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL) + if (data_encrypted) + pgstat_tweak_base(permanent, true, InvalidOid, tweak_base); + if ((fpin = BufFileOpenTransient(statfile, + O_RDONLY | PG_BINARY, + tweak_base)) == NULL) { if (errno != ENOENT) ereport(pgStatRunningInCollector ? LOG : WARNING, @@ -3984,8 +4074,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) /* * Verify it's of the expected format. */ - if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) || - format_id != PGSTAT_FILE_FORMAT_ID) + pgstat_read_bytes(fpin, &format_id, sizeof(format_id), &failed); + if (failed || format_id != PGSTAT_FILE_FORMAT_ID) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); @@ -3995,7 +4085,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) /* * Read global stats struct */ - if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats)) + pgstat_read_bytes(fpin, &globalStats, sizeof(globalStats), &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); @@ -4016,7 +4107,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) /* * Read archiver stats struct */ - if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats)) + pgstat_read_bytes(fpin, &archiverStats, sizeof(archiverStats), &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); @@ -4027,7 +4119,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) /* * Read WAL stats struct */ - if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats)) + pgstat_read_bytes(fpin, &walStats, sizeof(walStats), &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); @@ -4038,7 +4131,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) /* * Read SLRU stats struct */ - if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats)) + pgstat_read_bytes(fpin, &slruStats, sizeof(slruStats), &failed); + if (failed) { ereport(pgStatRunningInCollector ? 
LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); @@ -4052,15 +4146,22 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) */ for (;;) { - switch (fgetc(fpin)) + char c; + + pgstat_read_bytes(fpin, &c, 1, &failed); + + switch (c) { /* * 'D' A PgStat_StatDBEntry struct describing a database * follows. */ case 'D': - if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables), - fpin) != offsetof(PgStat_StatDBEntry, tables)) + pgstat_read_bytes(fpin, + &dbbuf, + offsetof(PgStat_StatDBEntry, tables), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", @@ -4144,8 +4245,10 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) PgStat_StatReplSlotEntry slotbuf; PgStat_StatReplSlotEntry *slotent; - if (fread(&slotbuf, 1, sizeof(PgStat_StatReplSlotEntry), fpin) - != sizeof(PgStat_StatReplSlotEntry)) + pgstat_read_bytes(fpin, &slotbuf, + sizeof(PgStat_StatReplSlotEntry), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", @@ -4186,7 +4289,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) } done: - FreeFile(fpin); + BufFileCloseTransient(fpin); /* If requested to read the permanent file, also get rid of it. */ if (permanent) @@ -4221,10 +4324,12 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, PgStat_StatTabEntry tabbuf; PgStat_StatFuncEntry funcbuf; PgStat_StatFuncEntry *funcentry; - FILE *fpin; + TransientBufFile *fpin; int32 format_id; bool found; char statfile[MAXPGPATH]; + bool failed = false; + char tweak_base[TWEAK_BASE_SIZE]; get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH); @@ -4237,7 +4342,11 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, * not yet written the stats file the first time. Any other failure * condition is suspicious. */ - if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL) + if (data_encrypted) + pgstat_tweak_base(permanent, false, databaseid, tweak_base); + if ((fpin = BufFileOpenTransient(statfile, + O_RDONLY | PG_BINARY, + tweak_base)) == NULL) { if (errno != ENOENT) ereport(pgStatRunningInCollector ? LOG : WARNING, @@ -4250,8 +4359,8 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, /* * Verify it's of the expected format. */ - if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) || - format_id != PGSTAT_FILE_FORMAT_ID) + pgstat_read_bytes(fpin, &format_id, sizeof(format_id), &failed); + if (failed || format_id != PGSTAT_FILE_FORMAT_ID) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); @@ -4264,14 +4373,19 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, */ for (;;) { - switch (fgetc(fpin)) + char c; + + pgstat_read_bytes(fpin, &c, 1, &failed); + + switch (c) { /* * 'T' A PgStat_StatTabEntry follows. */ case 'T': - if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry), - fpin) != sizeof(PgStat_StatTabEntry)) + pgstat_read_bytes(fpin, &tabbuf, sizeof(PgStat_StatTabEntry), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", @@ -4304,8 +4418,11 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, * 'F' A PgStat_StatFuncEntry follows. 
*/ case 'F': - if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry), - fpin) != sizeof(PgStat_StatFuncEntry)) + pgstat_read_bytes(fpin, + &funcbuf, + sizeof(PgStat_StatFuncEntry), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", @@ -4349,7 +4466,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, } done: - FreeFile(fpin); + BufFileCloseTransient(fpin); if (permanent) { @@ -4386,15 +4503,21 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, PgStat_WalStats myWalStats; PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS]; PgStat_StatReplSlotEntry myReplSlotStats; - FILE *fpin; + TransientBufFile *fpin; int32 format_id; const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename; + bool failed = false; + char tweak_base[TWEAK_BASE_SIZE]; /* * Try to open the stats file. As above, anything but ENOENT is worthy of * complaining about. */ - if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL) + if (data_encrypted) + pgstat_tweak_base(permanent, true, InvalidOid, tweak_base); + if ((fpin = BufFileOpenTransient(statfile, + O_RDONLY | PG_BINARY, + tweak_base)) == NULL) { if (errno != ENOENT) ereport(pgStatRunningInCollector ? LOG : WARNING, @@ -4407,58 +4530,62 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, /* * Verify it's of the expected format. */ - if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) || - format_id != PGSTAT_FILE_FORMAT_ID) + pgstat_read_bytes(fpin, &format_id, sizeof(format_id), &failed); + if (failed || format_id != PGSTAT_FILE_FORMAT_ID) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } /* * Read global stats struct */ - if (fread(&myGlobalStats, 1, sizeof(myGlobalStats), - fpin) != sizeof(myGlobalStats)) + pgstat_read_bytes(fpin, &myGlobalStats, sizeof(myGlobalStats), &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } /* * Read archiver stats struct */ - if (fread(&myArchiverStats, 1, sizeof(myArchiverStats), - fpin) != sizeof(myArchiverStats)) + pgstat_read_bytes(fpin, &myArchiverStats, sizeof(myArchiverStats), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } /* * Read WAL stats struct */ - if (fread(&myWalStats, 1, sizeof(myWalStats), fpin) != sizeof(myWalStats)) + pgstat_read_bytes(fpin, &myWalStats, sizeof(myWalStats), &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } /* * Read SLRU stats struct */ - if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats)) + pgstat_read_bytes(fpin, mySLRUStats, sizeof(mySLRUStats), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? 
LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } @@ -4471,20 +4598,27 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, */ for (;;) { - switch (fgetc(fpin)) + char c; + + pgstat_read_bytes(fpin, &c, 1, &failed); + + switch (c) { /* * 'D' A PgStat_StatDBEntry struct describing a database * follows. */ case 'D': - if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables), - fpin) != offsetof(PgStat_StatDBEntry, tables)) + pgstat_read_bytes(fpin, + &dbentry, + offsetof(PgStat_StatDBEntry, tables), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } @@ -4505,13 +4639,14 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, * replication slot follows. */ case 'R': - if (fread(&myReplSlotStats, 1, sizeof(PgStat_StatReplSlotEntry), fpin) - != sizeof(PgStat_StatReplSlotEntry)) + pgstat_read_bytes(fpin, &myReplSlotStats, + sizeof(PgStat_StatReplSlotEntry), &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } break; @@ -4524,17 +4659,35 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } } } done: - FreeFile(fpin); + BufFileCloseTransient(fpin); return true; } +/* + * Convenience routine to read data from file and check for errors. + */ +static void +pgstat_read_bytes(TransientBufFile *file, void *ptr, size_t size, + bool *failed) +{ + /* Do nothing if any previous read failed. */ + if (*failed) + return; + + /* + * Use BufFileReadTransient() because it handles encryption transparently. + */ + if (BufFileReadTransient(file, ptr, size) != size) + *failed = true; +} + /* * If not already done, read the statistics collector stats file into * some hash tables. The results will be kept until pgstat_clear_snapshot()
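For readers who do not want to wade through the whole diff, the caller-side pattern described above boils down to something like the following sketch (includes and error-message wording are made up, and write_spill_chunk() is a hypothetical caller); TransientBufFile, TWEAK_BASE_SIZE, TRANS_BUF_FILE_REORDERBUFFER and the BufFile*Transient() calls are the API from the attached patch, not from core PostgreSQL.

/*
 * Sketch of the caller-side change: instead of OpenTransientFile() +
 * write() + CloseTransientFile(), a caller opens a TransientBufFile with a
 * tweak base and lets the buffile layer buffer and (if data_encrypted)
 * encrypt the data transparently.
 */
static void
write_spill_chunk(const char *path, void *data, size_t len)
{
	TransientBufFile *file;
	char		tweak_base[TWEAK_BASE_SIZE];

	memset(tweak_base, 0, TWEAK_BASE_SIZE);
	if (data_encrypted)
		tweak_base[0] = TRANS_BUF_FILE_REORDERBUFFER;	/* plus per-file fields */

	file = BufFileOpenTransient(path,
								O_CREAT | O_WRONLY | O_APPEND | PG_BINARY,
								tweak_base);
	if (BufFileWriteTransient(file, data, len) != len)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not write to file \"%s\": %m", path)));

	BufFileCloseTransient(file);
}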
On Mon, Nov 29, 2021 at 08:37:31AM +0100, Antonin Houska wrote: > The changes to buffile.c are not trivial, but we haven't really changed the > API, as long as you mean BufFileCreateTemp(), BufFileWrite(), BufFileRead(). > > What our patch affects on the caller side is that BufFileOpenTransient(), > BufFileCloseTransient(), BufFileWriteTransient() and BufFileReadTransient() > replace OpenTransientFile(), CloseTransientFile(), write()/fwrite() and > read()/fread() respectively in reorderbuffer.c and in pgstat.c. These changes > become a little bit less invasive in TDE 1.1 than they were in 1.0, see [1], > see the diffs attached. With pg_upgrade modified to preserve the relfilenode, tablespace oid, and database oid, we are now closer to implementing cluster file encryption using XTS. I think we have a few steps left: 1. modify temporary file I/O to use a more centralized API 2. modify the existing cluster file encryption patch to use XTS with an IV that uses more than the LSN 3. add XTS regression test code like CTR 4. create WAL encryption code using CTR If we can do #1 in PG 15 I think I can have #2 ready for PG 16 in July. The feature wiki page is: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption Do people want to advance this feature forward? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
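As a rough illustration of what step 2 could mean in practice, the identifiers that pg_upgrade now preserves happen to fill a 16-byte XTS tweak exactly. The sketch below shows one possible layout; XTS_TWEAK_SIZE and make_page_tweak() are hypothetical names, and the field order, padding, and whether anything else (e.g. the LSN) should be mixed in are all still open design questions.

/*
 * Sketch only: one possible layout of a 16-byte XTS tweak for a relation
 * page, built from the identifiers that pg_upgrade now preserves.
 */
#define XTS_TWEAK_SIZE 16

static void
make_page_tweak(unsigned char tweak[XTS_TWEAK_SIZE],
				Oid dbOid, Oid spcOid, Oid relfilenode, BlockNumber blkno)
{
	unsigned char *p = tweak;

	memset(tweak, 0, XTS_TWEAK_SIZE);
	memcpy(p, &dbOid, sizeof(Oid));
	p += sizeof(Oid);
	memcpy(p, &spcOid, sizeof(Oid));
	p += sizeof(Oid);
	memcpy(p, &relfilenode, sizeof(Oid));
	p += sizeof(Oid);
	memcpy(p, &blkno, sizeof(BlockNumber));		/* 4 + 4 + 4 + 4 = 16 bytes */
}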
Bruce Momjian <bruce@momjian.us> wrote: > On Mon, Nov 29, 2021 at 08:37:31AM +0100, Antonin Houska wrote: > > The changes to buffile.c are not trivial, but we haven't really changed the > > API, as long as you mean BufFileCreateTemp(), BufFileWrite(), BufFileRead(). > > > > What our patch affects on the caller side is that BufFileOpenTransient(), > > BufFileCloseTransient(), BufFileWriteTransient() and BufFileReadTransient() > > replace OpenTransientFile(), CloseTransientFile(), write()/fwrite() and > > read()/fread() respectively in reorderbuffer.c and in pgstat.c. These changes > > become a little bit less invasive in TDE 1.1 than they were in 1.0, see [1], > > see the diffs attached. > > With pg_upgrade modified to preserve the relfilenode, tablespace oid, and > database oid, we are now closer to implementing cluster file encryption > using XTS. I think we have a few steps left: > > 1. modify temporary file I/O to use a more centralized API > 2. modify the existing cluster file encryption patch to use XTS with a > IV that uses more than the LSN > 3. add XTS regression test code like CTR > 4. create WAL encryption code using CTR > > If we can do #1 in PG 15 I think I can have #2 ready for PG 16 in July. > The feature wiki page is: > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption > > Do people want to advance this feature forward? I confirm that we (Cybertec) do and that we're ready to spend more time on the community implementation. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Tue, Feb 1, 2022 at 07:45:06AM +0100, Antonin Houska wrote: > > With pg_upgrade modified to preserve the relfilenode, tablespace oid, and > > database oid, we are now closer to implementing cluster file encryption > > using XTS. I think we have a few steps left: > > > > 1. modify temporary file I/O to use a more centralized API > > 2. modify the existing cluster file encryption patch to use XTS with a > > IV that uses more than the LSN > > 3. add XTS regression test code like CTR > > 4. create WAL encryption code using CTR > > > > If we can do #1 in PG 15 I think I can have #2 ready for PG 16 in July. > > The feature wiki page is: > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption > > > > Do people want to advance this feature forward? > > I confirm that we (Cybertec) do and that we're ready to spend more time on the > community implementation. Well, I sent an email a week ago asking if people want to advance this feature forward, and so far you are the only person to reply, which I think means there isn't enough interest in this feature to advance it. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings,
On Tue, Feb 1, 2022 at 12:50 Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Feb 1, 2022 at 07:45:06AM +0100, Antonin Houska wrote:
> > With pg_upgrade modified to preserve the relfilenode, tablespace oid, and
> > database oid, we are now closer to implementing cluster file encryption
> > using XTS. I think we have a few steps left:
> >
> > 1. modify temporary file I/O to use a more centralized API
> > 2. modify the existing cluster file encryption patch to use XTS with a
> > IV that uses more than the LSN
> > 3. add XTS regression test code like CTR
> > 4. create WAL encryption code using CTR
> >
> > If we can do #1 in PG 15 I think I can have #2 ready for PG 16 in July.
> > The feature wiki page is:
> >
> > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption
> >
> > Do people want to advance this feature forward?
>
> I confirm that we (Cybertec) do and that we're ready to spend more time on the
> community implementation.
Well, I sent an email a week ago asking if people want to advance this
feature forward, and so far you are the only person to reply, which I
think means there isn't enough interest in this feature to advance it.
This confuses me. Clearly there’s plenty of interest, but asking on hackers in a deep old sub thread isn’t a terribly good way to judge that. Yet even when there is an active positive response, you argue that there isn’t enough.
In general, I agree that the items you laid out are what the next steps are. There are patches for some of those items already too, and some of them, such as consolidating the temporary file access, are beneficial even without the potential to use them for encryption.
Instead of again asking if people want this feature (many, many, many do), I’d encourage Antonin to start a new thread with the patch to do the temporary file access consolidation, which provides buffered access and reduces the number of syscalls, and to work towards getting that committed, ideally as part of this release.
Thanks,
Stephen
On Tue, Feb 1, 2022 at 01:07:36PM -0500, Stephen Frost wrote: > Well, I sent an email a week ago asking if people want to advance this > feature forward, and so far you are the only person to reply, which I > think means there isn't enough interest in this feature to advance it. > > This confuses me. Clearly there’s plenty of interest, but asking on hackers in > a deep old sub thread isn’t a terribly good way to judge that. Yet even when > there is an active positive response, you argue that there isn’t enough. Uh, I have been led down the path of disinterest/confusion on this feature enough that I am looking for positive feedback on every new step so I don't get stuck out in front with insufficient support. Yes, only one person replying is enough for me to say there isn't interest. I guess I now have two. My email was short and ended with a question, so I thought the people interested in the steps I suggested would give some kind of feedback --- I certainly try to reply to all emails on this topic. > In general, I agree that the items you laid out are what the next steps are. > There are patches for some of those items already too and some of them, such as > consolidating the temporary file access, are beneficial even without the > potential to use them for encryption. Great. I can update my patch for July consideration. > Instead of again asking if people want this feature (many, many, many do), I’d > encourage Antonin to start a new thread with the patch to do the temporary file > access consolidation which then provides a buffered access and reduces the > number of syscalls and work towards getting that committed, ideally as part of > this release. Yes, agreed. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Hi, On 2022-02-01 13:27:03 -0500, Bruce Momjian wrote: > On Tue, Feb 1, 2022 at 01:07:36PM -0500, Stephen Frost wrote: > > Well, I sent an email a week ago asking if people want to advance this > > feature forward, and so far you are the only person to reply, which I > > think means there isn't enough interest in this feature to advance it. > > > > This confuses me. Clearly there’s plenty of interest, but asking on hackers in > > a deep old sub thread isn’t a terribly good way to judge that. Yet even when > > there is an active positive response, you argue that there isn’t enough. > > Uh, I have been lead down the path of disinterest/confusion on this > feature enough that I am looking for positive feedback on every new step > so I don't get stuck out in front with insufficient support. Yes, only > one person replying is enough for me to say there isn't interest. I > guess I now have two. My email was short and ended with a question so I > thought the people interested in the steps I suggested would give some > kind of feedback --- I certainly try to reply to all emails on this > topic. Personally I can't keep up with all threads on -hackers all the time, especially not long and at times very busy threads. So I agree with Stephen that it's not saying much whether or not people react to an email deep in a thread. > > Instead of again asking if people want this feature (many, many, many do), I’d > > encourage Antonin to start a new thread with the patch to do the temporary file > > access consolidation which then provides a buffered access and reduces the > > number of syscalls and work towards getting that committed, ideally as part of > > this release. I think it is quite unlikely that patches of that invasiveness can be merged this release. Greetings, Andres Freund
Hi, On Tue, Feb 01, 2022 at 01:07:36PM -0500, Stephen Frost wrote: > On Tue, Feb 1, 2022 at 12:50 Bruce Momjian <bruce@momjian.us> wrote: > > On Tue, Feb 1, 2022 at 07:45:06AM +0100, Antonin Houska wrote: > > > > With pg_upgrade modified to preserve the relfilenode, tablespace > > > > oid, and database oid, we are now closer to implementing cluster > > > > file encryption using XTS. I think we have a few steps left: > > > > > > > > 1. modify temporary file I/O to use a more centralized API > > > > 2. modify the existing cluster file encryption patch to use XTS with a > > > > IV that uses more than the LSN > > > > 3. add XTS regression test code like CTR > > > > 4. create WAL encryption code using CTR > > > > > > > > If we can do #1 in PG 15 I think I can have #2 ready for PG 16 in July. > > > > The feature wiki page is: > > > > > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption > > > > > > > > Do people want to advance this feature forward? > > > > > > I confirm that we (Cybertec) do and that we're ready to spend more > > > time on the community implementation. > > > > Well, I sent an email a week ago asking if people want to advance this > > feature forward, and so far you are the only person to reply, which I > > think means there isn't enough interest in this feature to advance it. > > This confuses me. Clearly there’s plenty of interest, but asking on hackers > in a deep old sub thread isn’t a terribly good way to judge that. Yet even > when there is an active positive response, you argue that there isn’t > enough. Even more so because Antonin not only replied as an individual, but also in the name of a whole company developing Postgres in general and TDE in particular. Michael -- Michael Banck Team Lead, PostgreSQL Team Project Lead Tel.: +49 2166 9901-171 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 VAT ID number: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Management: Dr. Michael Meskes, Geoff Richardson, Peter Lilley Our handling of personal data is subject to the following provisions: https://www.credativ.de/datenschutz
On 2022/2/2 01:50, Bruce Momjian wrote: > Well, I sent an email a week ago asking if people want to advance this > feature forward, and so far you are the only person to reply, which I > think means there isn't enough interest in this feature to advance it. I am still focused on this thread, and I have a small patch to solve the current buffile problem: https://www.postgresql.org/message-id/a859a753-70f2-bb17-6830-19dbcad11c17%40sasa.su