Thread: XTS cipher mode for cluster file encryption
As you might have seen from my email in another thread, thanks to Stephen and Cybertec staff, I am back working on cluster file encryption/TDE.

Stephen was going to research if XTS cipher mode would be a good fit for this since it was determined that CTR cipher mode was too vulnerable to IV reuse, and the LSN provides insufficient uniqueness. Stephen reported having trouble finding a definitive answer, so I figured I would research it myself.

Of course, I found the same lack of information that Stephen did. ;-) None of my classic cryptographic books cover XTS or the XEX encryption mode it is based on, since XTS was only standardized in 2007 and recommended in 2010. (Yeah, don't get me started on poor cryptographic documentation.)

Therefore, I decided to go backward and look at CTR and CBC to see how the nonce is used there, and whether XTS fixes the problems with nonce reuse.

First, I originally chose CTR mode since it is effectively a stream cipher, and we therefore could skip certain page fields like the LSN. However, CTR is very sensitive to LSN reuse since the input bits generate encrypted bits in exactly the same locations on the page. (It is a simple XOR of the plaintext against a keystream.) Since pages with different contents are sometimes encrypted with the same LSN, especially on replicas, this method failed.

Second is CBC mode, which is a block cipher mode. I thought that meant you could only encrypt 16-byte chunks, meaning you couldn't skip encryption of certain page fields unless they are 16-byte chunks. However, there is something called ciphertext stealing (https://en.wikipedia.org/wiki/Ciphertext_stealing#CBC_ciphertext_stealing) which allows that. I am not sure if OpenSSL supports this, but looking at my OpenSSL 1.1.1d manual entry for EVP_aes, ciphertext stealing is only mentioned for XTS.

Anyway, CBC mode still needs a nonce for the first 16-byte block, and then feeds the encrypted output of the first block as an IV to the second block, etc. This gives us the same problem of finding a nonce per page. However, since it is a block mode, the encrypted bits don't come out in the same locations as the input bits, so that is less of a problem. There is also the problem that the encrypted output from one 16-byte block could repeat, causing leakage.

So, let's look at how XTS is designed. First, it uses two keys. If you are using AES128, you need _two_ 128-bit keys. If using AES256, you need two 256-bit keys. The first of the two keys is used like normal, to encrypt the data. The second key, which is also secret, is used to encrypt the values used for the IV of the first 16-byte block (in our case dboid, relfilenode, blocknum, maybe LSN). This is most clearly explained here:

https://www.kingston.com/unitedstates/en/solutions/data-security/xts-encryption

That IV is XOR'ed against both the input value and the encryption output value, as explained here as key tweaking:

https://crossbowerbt.github.io/xts_mode_tweaking.html

The purpose of using it before and after encryption is explained here:

https://crypto.stackexchange.com/questions/24431/what-is-the-benefit-of-applying-the-tweak-a-second-time-using-xts

The second 16-byte block gets an IV that is the first block's IV multiplied by an alpha value raised to the power of the block index, with the multiplication done in a finite field (a Galois field, reduced modulo an irreducible polynomial). This effectively means an attacker has _no_ idea what the IV is since it involves a secret key, and each 16-byte block uses a different, unpredictable IV value. XTS also supports ciphertext stealing by default so we can use the LSN if we want, but we aren't sure we need to.

Finally, there is an interesting web page about when not to use XTS:

https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/

Basically, what XTS does is to make the IV unknown to attackers and non-repeating except for multiple writes to a specific 16-byte block (with no LSN change). What isn't clear is whether repeated encryption of different data in the same 16-byte block can leak data.

This probably needs more research, and maybe we need to write something up like the above and let security researchers review it, since there doesn't seem to be enough documentation for us to decide ourselves.

--
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB  https://enterprisedb.com

If only the physical world exists, free will is an illusion.
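For concreteness, here is a minimal sketch of what a per-page AES-XTS call could look like through OpenSSL's EVP interface. The helper name and the tweak layout (relfilenode and block number packed into the 16-byte tweak) are illustrative assumptions only, not the format of any actual patch; with EVP_aes_256_xts() the key argument carries both 256-bit keys concatenated.

#include <string.h>
#include <stdint.h>
#include <openssl/evp.h>

#define PAGE_SIZE 8192

/* key points to 64 bytes: the two concatenated 256-bit keys (K1 || K2). */
static int
encrypt_page_xts(const unsigned char *key,
                 uint32_t relfilenode, uint32_t blocknum,
                 const unsigned char *plain, unsigned char *cipher)
{
    unsigned char   tweak[16] = {0};
    EVP_CIPHER_CTX *ctx;
    int             len, ok = 0;

    /* Pack identifying values into the 16-byte tweak (the "IV"). */
    memcpy(tweak, &relfilenode, sizeof(relfilenode));
    memcpy(tweak + 4, &blocknum, sizeof(blocknum));

    if ((ctx = EVP_CIPHER_CTX_new()) == NULL)
        return 0;
    if (EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak) &&
        EVP_EncryptUpdate(ctx, cipher, &len, plain, PAGE_SIZE) &&
        EVP_EncryptFinal_ex(ctx, cipher + len, &len))
        ok = 1;

    EVP_CIPHER_CTX_free(ctx);
    return ok;
}

XTS produces no expansion and needs no padding, so the ciphertext is exactly PAGE_SIZE bytes; the second half of the key is the one used to encrypt the tweak, while the first half encrypts the data.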
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > As you might have seen from my email in another thread, thanks to > Stephen and Cybertec staff, I am back working on cluster file > encryption/TDE. > > Stephen was going to research if XTS cipher mode would be a good fit for > this since it was determined that CTR cipher mode was too vulnerable to > IV reuse, and the LSN provides insufficient uniqueness. Stephen > reported having trouble finding a definitive answer, so I figured I > would research it myself. > > Of course, I found the same lack of information that Stephen did. ;-) > None of my classic cryptographic books cover XTS or the XEX encryption > mode it is based on, since XTS was only standardized in 2007 and > recommended in 2010. (Yeah, don't get me started on poor cryptographic > documentation.) > > Therefore, I decide to go backward and look at CTR and CBC to see how > the nonce is used there, and if XTS fixes problems with nonce reuse. > > First, I originally chose CTR mode since it was a streaming cipher, and > we therefore could skip certain page fields like the LSN. However, CTR > is very sensitive to LSN reuse since the input bits generate encrypted > bits in exactly the same locations on the page. (It uses a simple XOR > against a cipher). Since sometimes pages with different page contents > are encrypted with the same LSN, especially on replicas, this method > failed. > > Second is CBC mode. which is a block cipher. I thought that meant that > you could only encrypt 16-byte chunks, meaning you couldn't skip > encryption of certain page fields unless they are 16-byte chunks. > However, there is something called ciphertext stealing > (https://en.wikipedia.org/wiki/Ciphertext_stealing#CBC_ciphertext_stealing) > which allows that. I am not sure if OpenSSL supports this, but looking > at my OpenSSL 1.1.1d manual entry for EVP_aes, cipher stealing is only > mentioned for XTS. > > Anyway, CBC mode still needs a nonce for the first 16-byte block, and > then feeds the encrypted output of the first block as a IV to the second > block, etc. This gives us the same problem with finding a nonce per > page. However, since it is a block cipher, the bits don't output in the > same locations they have on input, so that is less of a problem. There > is also the problem that the encrypted output from one 16-byte block > could repeat, causing leakage. > > So, let's look how XTS is designed. First, it uses two keys. If you > are using AES128, you need _two_ 128-bit keys. If using AES256, you > need two 256-bit keys. The first of the two keys is used like normal, > to encrypt the data. The second key, which is also secret, is used to > encrypt the values used for the IV for the first 16-byte block (in our > case dboid, relfilenode, blocknum, maybe LSN). This is most clearly > explained here: > > https://www.kingston.com/unitedstates/en/solutions/data-security/xts-encryption > > That IV is XOR'ed against both the input value and the encryption output > value, as explained here as key tweaking: > > https://crossbowerbt.github.io/xts_mode_tweaking.html > > The purpose of using it before and after encryption is explained here: > > https://crypto.stackexchange.com/questions/24431/what-is-the-benefit-of-applying-the-tweak-a-second-time-using-xts > > The second 16-byte block gets an IV that is the multiplication of the > first IV and an alpha value raised to the second power but mapped to a > finite field (Galois field, modulus a prime). 
This effectively means an > attacker has _no_ idea what the IV is since it involves a secret key, > and each 16-byte block uses a different, unpredictable IV value. XTS > also supports ciphertext stealing by default so we can use the LSN if we > want, but we aren't sure we need to. Yeah, this all seems to be about where I got to too. > Finally, there is an interesting web page about when not to use XTS: > > https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/ This particular article always struck me as more of a reason for us, at least, to use XTS than to not- in particular the very first comment it makes, which seems to be pretty well supported, is: "XTS is the de-facto standard disk encryption mode." Much of the rest of it is the well trodden discussion we've had about how FDE (or TDE in our case) doesn't protect against all the attack vectors that sometimes people think it does. Another point is that XTS isn't authenticated- something else we know quite well around here and isn't news. > Basically, what XTS does is to make the IV unknown to attackers and > non-repeating except for multiple writes to a specific 16-byte block > (with no LSN change). What isn't clear is if repeated encryption of > different data in the same 16-byte block can leak data. Any time a subset of the data is changed but the rest of it isn't, there's a leak of information. This is a really good example of exactly what that looks like: https://github.com/robertdavidgraham/ecb-penguin In our case, if/when this happens (no LSN change, repeated encryption of the same block), someone might be able to deduce that hint bits were being updated/changed, and where some of those are in the block. That said, I don't think that's really a huge issue or something that's a show stopper or a reason to hold off on using XTS. Note that what those bits actually *are* isn't leaked, just that they changed in some fashion inside of that 16-byte cipher text block. That they're directly leaked with CTR is why there was concern raised about using that method, as discussed above and previously. > This probably needs more research and maybe we need to write something > up like the above and let security researchers review it since there > doesn't seem to be enough documentation for us to decide ourselves. The one issue identified here is hopefully answered above and given that what you've found matches what I found, I'd argue that moving forward with XTS makes sense. The other bit of research that I wanted to do, and thanks for sending this and prodding me to go do so, was to look at other implementations and see what they do for the IV when it comes to using XTS, and this is what I found: https://wiki.gentoo.org/wiki/Dm-crypt_full_disk_encryption Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 and then this: https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt where plain64 is defined as: plain64: the initial vector is the 64-bit little-endian version of the sector number, padded with zeros if necessary That is, the default for LUKS is AES, XTS, with a simple IV. That strikes me as a pretty ringing endorsement. Now, to address the concern around re-encrypting a block with the same key+IV but different data and leaking what parts of the page changed, I do think we should use the LSN and have it change regularly (including unlogged tables) but that's just because it's relatively easy for us to do and means an attacker wouldn't be able to tell what part of the page changed when the LSN was also changed. 
That was also recommended by NIST, which is a pretty strong endorsement as well. I'm all for getting security folks and whomever else to come and review this thread and chime in with their thoughts, but I don't think it's a reason to hold off on moving forward with the approach that we've been converging towards. Thanks! Stephen
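To make the plain64 convention above concrete, the tweak it describes is just the 64-bit sector (or, in our case, block) number in little-endian order, zero-padded to 16 bytes. A tiny sketch, with a made-up helper name:

#include <stdint.h>
#include <string.h>

static void
plain64_tweak(uint64_t blocknum, unsigned char tweak[16])
{
    /* 64-bit little-endian block number, zero-padded to 16 bytes. */
    memset(tweak, 0, 16);
    for (int i = 0; i < 8; i++)
        tweak[i] = (unsigned char) (blocknum >> (8 * i));
}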
On 10/15/21 21:22, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: >> As you might have seen from my email in another thread, thanks to >> Stephen and Cybertec staff, I am back working on cluster file >> encryption/TDE. >> >> Stephen was going to research if XTS cipher mode would be a good fit for >> this since it was determined that CTR cipher mode was too vulnerable to >> IV reuse, and the LSN provides insufficient uniqueness. Stephen >> reported having trouble finding a definitive answer, so I figured I >> would research it myself. >> >> Of course, I found the same lack of information that Stephen did. ;-) >> None of my classic cryptographic books cover XTS or the XEX encryption >> mode it is based on, since XTS was only standardized in 2007 and >> recommended in 2010. (Yeah, don't get me started on poor cryptographic >> documentation.) >> >> Therefore, I decide to go backward and look at CTR and CBC to see how >> the nonce is used there, and if XTS fixes problems with nonce reuse. >> >> First, I originally chose CTR mode since it was a streaming cipher, and >> we therefore could skip certain page fields like the LSN. However, CTR >> is very sensitive to LSN reuse since the input bits generate encrypted >> bits in exactly the same locations on the page. (It uses a simple XOR >> against a cipher). Since sometimes pages with different page contents >> are encrypted with the same LSN, especially on replicas, this method >> failed. >> >> Second is CBC mode. which is a block cipher. I thought that meant that >> you could only encrypt 16-byte chunks, meaning you couldn't skip >> encryption of certain page fields unless they are 16-byte chunks. >> However, there is something called ciphertext stealing >> (https://en.wikipedia.org/wiki/Ciphertext_stealing#CBC_ciphertext_stealing) >> which allows that. I am not sure if OpenSSL supports this, but looking >> at my OpenSSL 1.1.1d manual entry for EVP_aes, cipher stealing is only >> mentioned for XTS. >> >> Anyway, CBC mode still needs a nonce for the first 16-byte block, and >> then feeds the encrypted output of the first block as a IV to the second >> block, etc. This gives us the same problem with finding a nonce per >> page. However, since it is a block cipher, the bits don't output in the >> same locations they have on input, so that is less of a problem. There >> is also the problem that the encrypted output from one 16-byte block >> could repeat, causing leakage. >> >> So, let's look how XTS is designed. First, it uses two keys. If you >> are using AES128, you need _two_ 128-bit keys. If using AES256, you >> need two 256-bit keys. The first of the two keys is used like normal, >> to encrypt the data. The second key, which is also secret, is used to >> encrypt the values used for the IV for the first 16-byte block (in our >> case dboid, relfilenode, blocknum, maybe LSN). 
This is most clearly >> explained here: >> >> https://www.kingston.com/unitedstates/en/solutions/data-security/xts-encryption >> >> That IV is XOR'ed against both the input value and the encryption output >> value, as explained here as key tweaking: >> >> https://crossbowerbt.github.io/xts_mode_tweaking.html >> >> The purpose of using it before and after encryption is explained here: >> >> https://crypto.stackexchange.com/questions/24431/what-is-the-benefit-of-applying-the-tweak-a-second-time-using-xts >> >> The second 16-byte block gets an IV that is the multiplication of the >> first IV and an alpha value raised to the second power but mapped to a >> finite field (Galois field, modulus a prime). This effectively means an >> attacker has _no_ idea what the IV is since it involves a secret key, >> and each 16-byte block uses a different, unpredictable IV value. XTS >> also supports ciphertext stealing by default so we can use the LSN if we >> want, but we aren't sure we need to. > > Yeah, this all seems to be about where I got to too. > >> Finally, there is an interesting web page about when not to use XTS: >> >> https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/ > > This particular article always struck me as more of a reason for us, at > least, to use XTS than to not- in particular the very first comment it > makes, which seems to be pretty well supported, is: "XTS is the de-facto > standard disk encryption mode." Much of the rest of it is the well > trodden discussion we've had about how FDE (or TDE in our case) doesn't > protect against all the attack vectors that sometimes people think it > does. Another point is that XTS isn't authenticated- something else we > know quite well around here and isn't news. > >> Basically, what XTS does is to make the IV unknown to attackers and >> non-repeating except for multiple writes to a specific 16-byte block >> (with no LSN change). What isn't clear is if repeated encryption of >> different data in the same 16-byte block can leak data. > > Any time a subset of the data is changed but the rest of it isn't, > there's a leak of information. This is a really good example of exactly > what that looks like: > > https://github.com/robertdavidgraham/ecb-penguin > > In our case, if/when this happens (no LSN change, repeated encryption > of the same block), someone might be able to deduce that hint bits were > being updated/changed, and where some of those are in the block. > > That said, I don't think that's really a huge issue or something that's > a show stopper or a reason to hold off on using XTS. Note that what > those bits actually *are* isn't leaked, just that they changed in some > fashion inside of that 16-byte cipher text block. That they're directly > leaked with CTR is why there was concern raised about using that method, > as discussed above and previously. > Yeah. With CTR you pretty much learn where the hint bits are exactly, while with XTS the whole ciphertext changes. This also means CTR is much more malleable, i.e. you can tweak the ciphertext bits to flip the plaintext, while with XTS that's not really possible - it's pretty much guaranteed to break the block structure. Not sure if that's an issue for our use case, but if it is then neither of the two modes is a solution. >> This probably needs more research and maybe we need to write something >> up like the above and let security researchers review it since there >> doesn't seem to be enough documentation for us to decide ourselves.
> > The one issue identified here is hopefully answered above and given that > what you've found matches what I found, I'd argue that moving forward > with XTS makes sense. > +1 > The other bit of research that I wanted to do, and thanks for sending > this and prodding me to go do so, was to look at other implementations > and see what they do for the IV when it comes to using XTS, and this is > what I found: > > https://wiki.gentoo.org/wiki/Dm-crypt_full_disk_encryption > > Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 > > and then this: > > https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt > > where plain64 is defined as: > > plain64: the initial vector is the 64-bit little-endian version of the > sector number, padded with zeros if necessary > > That is, the default for LUKS is AES, XTS, with a simple IV. That > strikes me as a pretty ringing endorsement. > Seems reasonable, on the assumption the threat models are the same. > Now, to address the concern around re-encrypting a block with the same > key+IV but different data and leaking what parts of the page changed, I > do think we should use the LSN and have it change regularly (including > unlogged tables) but that's just because it's relatively easy for us to > do and means an attacker wouldn't be able to tell what part of the page > changed when the LSN was also changed. That was also recommended by > NIST and that's a pretty strong endorsement also. > Not sure - it seems a bit weird to force LSN change even in cases that don't generate any WAL. I was not following the encryption thread and maybe it was discussed/rejected there, but I've always imagined we'd have a global nonce generator (similar to a sequence) and we'd store it at the end of each block, or something like that. > I'm all for getting security folks and whomever else to come and review > this thread and chime in with their thoughts, but I don't think it's a > reason to hold off on moving forward with the approach that we've been > converging towards. > +1 regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
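To illustrate the malleability point Tomas makes above: with CTR the ciphertext is plaintext XOR keystream, so flipping one ciphertext bit flips exactly the corresponding plaintext bit after decryption, whereas with XTS the same change garbles the whole 16-byte block. A trivial sketch (the helper name is hypothetical):

static void
flip_bit_in_ctr_ciphertext(unsigned char *ciphertext, int byte_offset, int bit)
{
    /*
     * Because C = P XOR keystream, this flips the same bit of the
     * plaintext that the legitimate key holder will later decrypt.
     */
    ciphertext[byte_offset] ^= (unsigned char) (1 << bit);
}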
On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: > Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 > > and then this: > > https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt > > where plain64 is defined as: > > plain64: the initial vector is the 64-bit little-endian version of the > sector number, padded with zeros if necessary > > That is, the default for LUKS is AES, XTS, with a simple IV. That > strikes me as a pretty ringing endorsement. Yes, that sounds promising. It might not hurt to check for other precedents as well, but that seems like a pretty good one. I'm not very convinced that using the LSN for any of this is a good idea. Something that changes most of the time but not all the time seems more like it could hurt by masking fuzzy thinking more than it helps anything. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2021-10-15 15:22:48 -0400, Stephen Frost wrote: > * Bruce Momjian (bruce@momjian.us) wrote: > > Finally, there is an interesting web page about when not to use XTS: > > > > https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/ > > This particular article always struck me as more of a reason for us, at > least, to use XTS than to not- in particular the very first comment it > makes, which seems to be pretty well supported, is: "XTS is the de-facto > standard disk encryption mode." I don't find that line of argument *that* convincing. The reason XTS is the de-facto standard for generic block layer encryption is that you can't add additional data for each block without very significant overhead (basically needing journaling to ensure that the data doesn't get out of sync). But we don't really face the same situation - we *can* add additional data. With something like AES-GCM-SIV we can use the additional data to get IV reuse resistance *and* authentication. And while perhaps we are ok with the IV reuse guarantees XTS has, it seems pretty clear that we'll want guaranteed authenticity at some point. And then we'll need extra data anyway. Thus, to me, it doesn't seem worth going down the XTS route, just to temporarily save a bit of implementation effort. We'll have to endure that pain anyway. Greetings, Andres Freund
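As a point of reference for the "additional data" mentioned above: OpenSSL does not ship AES-GCM-SIV, so the sketch below uses ordinary AES-GCM purely to show where a stored per-page IV, additional authenticated data (relfilenode, block number, ...) and the 16-byte tag would fit. The helper name and layout are assumptions, and with plain GCM (unlike GCM-SIV) the IV must never repeat for a given key.

#include <openssl/evp.h>

#define PAGE_SIZE 8192

static int
encrypt_page_gcm(const unsigned char *key,           /* 32 bytes */
                 const unsigned char *iv,            /* 12 bytes, unique per key */
                 const unsigned char *aad, int aadlen,
                 const unsigned char *plain, unsigned char *cipher,
                 unsigned char tag[16])
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len, ok = 0;

    if (ctx == NULL)
        return 0;
    if (EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv) &&
        EVP_EncryptUpdate(ctx, NULL, &len, aad, aadlen) &&   /* authenticated only, not encrypted */
        EVP_EncryptUpdate(ctx, cipher, &len, plain, PAGE_SIZE) &&
        EVP_EncryptFinal_ex(ctx, cipher + len, &len) &&
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag))
        ok = 1;

    EVP_CIPHER_CTX_free(ctx);
    return ok;
}

Both the IV and the tag would then have to live somewhere on disk, which is exactly the extra per-page space being discussed.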
On 10/15/21 23:02, Robert Haas wrote: > On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: >> Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 >> >> and then this: >> >> https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt >> >> where plain64 is defined as: >> >> plain64: the initial vector is the 64-bit little-endian version of the >> sector number, padded with zeros if necessary >> >> That is, the default for LUKS is AES, XTS, with a simple IV. That >> strikes me as a pretty ringing endorsement. > > Yes, that sounds promising. It might not hurt to check for other > precedents as well, but that seems like a pretty good one. > TrueCrypt/VeraCrypt uses XTS too, I think. There's an overview of other FDE products at [1], and some of them use XTS, but I would take that with a grain of salt - some of the products are somewhat obscure, very old, or both. What is probably more interesting is that there's an IEEE standard [2] dealing with encrypted shared storage, and that uses XTS too. I'd bet there's a bunch of smart cryptographers involved. [1] https://en.wikipedia.org/wiki/Comparison_of_disk_encryption_software [2] https://en.wikipedia.org/wiki/IEEE_P1619 > I'm not very convinced that using the LSN for any of this is a good > idea. Something that changes most of the time but not all the time > seems more like it could hurt by masking fuzzy thinking more than it > helps anything. > I haven't been following the discussion about using LSN, but I agree that while using it seems convenient, the consequences of some changes not incrementing LSN seem potentially disastrous, depending on the encryption mode. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 15, 2021 at 10:57:03PM +0200, Tomas Vondra wrote: > > That said, I don't think that's really a huge issue or something that's > > a show stopper or a reason to hold off on using XTS. Note that what > > those bits actually *are* isn't leaked, just that they changed in some > > fashion inside of that 16-byte cipher text block. That they're directly > > leaked with CTR is why there was concern raised about using that method, > > as discussed above and previously. > > > > Yeah. With CTR you pretty learn where the hint bits are exactly, while with > XTS the whole ciphertext changes. > > This also means CTR is much more malleable, i.e. you can tweak the > ciphertext bits to flip the plaintext, while with XTS that's not really > possible - it's pretty much guaranteed to break the block structure. Not > sure if that's an issue for our use case, but if it is then neither of the > two modes is a solution. Yes, this is a very good point. Let's look at the impact of _not_ using the LSN. For CTR (already rejected) bit changes would be visible by comparing old/new page contents. For CBC (also not under consideration) the first 16-byte block would show a change, and all later 16-byte blocks would show a change. For CBC, you see the 16-byte blocks change, but you have no idea how many bits were changed, and in what locations in the 16-byte block (AES uses substitution and diffusion). For XTS, because earlier blocks don't change the IV used by later blocks like CBC, you would be able to see each 16-byte block that changed in the 8k page. Again, you would not know the number of bits changed or their locations. Do we think knowing which 16-byte blocks on an 8k page change would leak useful information? If so, we should use the LSN and just accept that some cases might leak as described above. If we don't care, then we can skip the use of the LSN and simplify the patch. > Not sure - it seems a bit weird to force LSN change even in cases that don't > generate any WAL. I was not following the encryption thread and maybe it was > discussed/rejected there, but I've always imagined we'd have a global nonce > generator (similar to a sequence) and we'd store it at the end of each > block, or something like that. Storing the nonce in the page means more code complexity, possible performance impact, and the inability to create standbys via binary replication that use cluster file encryption. As a final comment to Andres's email, adding a GCM has the problems above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which could also affect the integrity of the data. Someone could also restore an old copy of a page to revert a change, and that would not be detected even by GCM. I consider this a checkbox feature and making it too complex will cause it to be rightly rejected. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
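To make the "which 16-byte blocks changed" leak concrete: with XTS and an unchanged tweak, comparing two encrypted images of the same page tells an attacker only which 16-byte blocks differ, not how many bits changed or where inside each block. A minimal sketch of that comparison (hypothetical helper):

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 8192
#define AES_BLOCK 16

static void
report_changed_blocks(const unsigned char *old_cipher,
                      const unsigned char *new_cipher)
{
    for (int off = 0; off < PAGE_SIZE; off += AES_BLOCK)
        if (memcmp(old_cipher + off, new_cipher + off, AES_BLOCK) != 0)
            printf("16-byte block at offset %d changed\n", off);
}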
Hi, On 2021-10-16 10:16:25 -0400, Bruce Momjian wrote: > As a final comment to Andres's email, adding a GCM has the problems > above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which > could also affect the integrity of the data. Someone could also restore > and old copy of a patch to revert a change, and that would not be > detected even by GCM. > I consider this a checkbox feature and making it too complex will cause > it to be rightly rejected. You're just deferring / hiding the complexity. For one, we'll need integrity before long if we add encryption support. Then we'll deal with a more complex on-disk format because there will be two different ways of encrypting. For another, you're spreading out the security analysis to a lot of places in the code and more importantly to future changes affecting on-disk data. If it's really just a checkbox feature without a real use case, then we should just reject requests for it and use our energy for useful things. Greetings, Andres Freund
On Sat, Oct 16, 2021 at 09:15:05AM -0700, Andres Freund wrote: > Hi, > > On 2021-10-16 10:16:25 -0400, Bruce Momjian wrote: > > As a final comment to Andres's email, adding a GCM has the problems > > above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which > > could also affect the integrity of the data. Someone could also restore > > and old copy of a patch to revert a change, and that would not be > > detected even by GCM. > > > I consider this a checkbox feature and making it too complex will cause > > it to be rightly rejected. > > You're just deferring / hiding the complexity. For one, we'll need integrity > before long if we add encryption support. Then we'll deal with a more complex > on-disk format because there will be two different ways of encrypting. For > another, you're spreading out the security analysis to a lot of places in the > code and more importantly to future changes affecting on-disk data. > > If it's really just a checkbox feature without a real use case, then we should > just reject requests for it and use our energy for useful things. Agreed. That is the conclusion I came to in May: https://www.postgresql.org/message-id/20210526210201.GZ3048%40momjian.us https://www.postgresql.org/message-id/20210527160003.GF5646%40momjian.us -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On 10/16/21 16:16, Bruce Momjian wrote: > On Fri, Oct 15, 2021 at 10:57:03PM +0200, Tomas Vondra wrote: >>> That said, I don't think that's really a huge issue or something that's >>> a show stopper or a reason to hold off on using XTS. Note that what >>> those bits actually *are* isn't leaked, just that they changed in some >>> fashion inside of that 16-byte cipher text block. That they're directly >>> leaked with CTR is why there was concern raised about using that method, >>> as discussed above and previously. >>> >> >> Yeah. With CTR you pretty learn where the hint bits are exactly, while with >> XTS the whole ciphertext changes. >> >> This also means CTR is much more malleable, i.e. you can tweak the >> ciphertext bits to flip the plaintext, while with XTS that's not really >> possible - it's pretty much guaranteed to break the block structure. Not >> sure if that's an issue for our use case, but if it is then neither of the >> two modes is a solution. > > Yes, this is a vary good point. Let's look at the impact of _not_ using > the LSN. For CTR (already rejected) bit changes would be visible by > comparing old/new page contents. For CBC (also not under consideration) > the first 16-byte block would show a change, and all later 16-byte > blocks would show a change. For CBC, you see the 16-byte blocks change, > but you have no idea how many bits were changed, and in what locations > in the 16-byte block (AES uses substitution and diffusion). For XTS, > because earlier blocks don't change the IV used by later blocks like > CBC, you would be able to see each 16-byte block that changed in the 8k > page. Again, you would not know the number of bits changed or their > locations. > > Do we think knowing which 16-byte blocks on an 8k page change would leak > useful information? If so, we should use the LSN and just accept that > some cases might leak as described above. If we don't care, then we can > skip the use of the LSN and simplify the patch. > >> Not sure - it seems a bit weird to force LSN change even in cases that don't > generate any WAL. I was not following the encryption thread and maybe it was >> discussed/rejected there, but I've always imagined we'd have a global nonce >> generator (similar to a sequence) and we'd store it at the end of each >> block, or something like that. > > Storing the nonce in the page means more code complexity, possible > performance impact, and the inability to create standbys via binary > replication that use cluster file encryption. > Would it really be that complex? Reserving a bunch of bytes at the end of each encrypted page (a bit like the "special" space, but after encryption) seems fairly straightforward. And I don't quite see why this would have a measurable impact, given the nonce is 16B at most. The encryption is likely way more expensive. Moreover, it seems fairly reasonable to accept a bit of code complexity instead of something LSN-based, which seems simpler but apparently has various weak points and is much harder to reason about. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
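A sketch of what reserving a trailer at the end of each page for a stored nonce (and possibly a tag) might look like; the names and sizes are assumptions for illustration, not an actual on-disk format:

#define BLCKSZ               8192
#define ENCRYPTION_NONCE_LEN 16    /* per-page nonce, stored in the clear */
#define ENCRYPTION_TAG_LEN   16    /* optional authentication tag */

/* Space left for normal page contents once the trailer is reserved. */
#define USABLE_PAGE_SIZE     (BLCKSZ - ENCRYPTION_NONCE_LEN - ENCRYPTION_TAG_LEN)

/* Locations of the trailer fields within the on-disk block. */
#define PageEncryptionNonce(page) ((char *) (page) + USABLE_PAGE_SIZE)
#define PageEncryptionTag(page)   ((char *) (page) + USABLE_PAGE_SIZE + ENCRYPTION_NONCE_LEN)

The trade-off being discussed is then visible directly: every page loses some usable space and the on-disk format differs from an unencrypted cluster, but the nonce no longer depends on the LSN at all.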
On 10/16/21 18:28, Bruce Momjian wrote: > On Sat, Oct 16, 2021 at 09:15:05AM -0700, Andres Freund wrote: >> Hi, >> >> On 2021-10-16 10:16:25 -0400, Bruce Momjian wrote: >>> As a final comment to Andres's email, adding a GCM has the problems >>> above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which >>> could also affect the integrity of the data. Someone could also restore >>> and old copy of a patch to revert a change, and that would not be >>> detected even by GCM. >> >>> I consider this a checkbox feature and making it too complex will cause >>> it to be rightly rejected. >> >> You're just deferring / hiding the complexity. For one, we'll need integrity >> before long if we add encryption support. Then we'll deal with a more complex >> on-disk format because there will be two different ways of encrypting. For >> another, you're spreading out the security analysis to a lot of places in the >> code and more importantly to future changes affecting on-disk data. >> I've argued for storing the nonce, but I don't quite see why we would need integrity guarantees? AFAICS the threat model the patch aims to address is an attacker who can observe the data (e.g. a low-privileged OS user), but can't modify the files. Which seems like a reasonable model for shared environments. IMO extending this to cases where the attacker can modify the data moves the goalposts quite significantly. And it's quite possible authenticated encryption would not be enough to prevent that, because that still works only at block level, and you can probably do a lot of harm with replay attacks (e.g. replacing blocks with older versions). And if you can modify the data directory / config files, what are the chances you can't just get access to the database, trace the processes or whatever? We already have a way to check integrity by storing page checksums, but I'm not sure if that's good enough (there are a lot of subtle issues with building a proper AEAD scheme). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Just a mention: the HMAC (or AE/AD) can be disabled in AES-GCM. The HMAC in AES-GCM is an encrypt-then-hash MAC. CRC-32 is not a crypto-safe hash (technically, CRC-32 is not even a hash function). Cryptographers may be unhappy with CRC-32. I think whether it is CRC or SHA is not that important. If the IV can be stored, I believe there should be enough space to store an HMAC as well. On 2021/10/18 05:23, Tomas Vondra wrote: > > I've argued for storing the nonce, but I don't quite see why would we > need integrity guarantees?
On 2021/10/16 04:57, Tomas Vondra wrote: > > Seems reasonable, on the assumption the threat models are the same. On 2021/10/16 03:22, Stephen Frost wrote: > plain64: the initial vector is the 64-bit little-endian version of the > sector number, padded with zeros if necessary > > That is, the default for LUKS is AES, XTS, with a simple IV. That > strikes me as a pretty ringing endorsement On 2021/10/18 05:23, Tomas Vondra wrote: > > AFAICS the threat model the patch aims to address is an attacker who can > observe the data (e.g. a low-privileged OS user), but can't modify the > files. Which seems like a reasonable model for shared environments. I agree with this threat model. And if PostgreSQL uses XTS, there is no difference from dm-crypt; the user can just use dm-crypt directly.
Greetings, * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 10/15/21 21:22, Stephen Frost wrote: > >Now, to address the concern around re-encrypting a block with the same > >key+IV but different data and leaking what parts of the page changed, I > >do think we should use the LSN and have it change regularly (including > >unlogged tables) but that's just because it's relatively easy for us to > >do and means an attacker wouldn't be able to tell what part of the page > >changed when the LSN was also changed. That was also recommended by > >NIST and that's a pretty strong endorsement also. > > Not sure - it seems a bit weird to force LSN change even in cases that don't > generate any WAL. I was not following the encryption thread and maybe it was > discussed/rejected there, but I've always imagined we'd have a global nonce > generator (similar to a sequence) and we'd store it at the end of each > block, or something like that. The 'LSN' being referred to here isn't the regular LSN that is associated with the WAL but rather the separate FakeLSN counter which we already have. I wasn't suggesting having the regular LSN change in cases that don't generate WAL. * Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: > > Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 > > > > and then this: > > > > https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt > > > > where plain64 is defined as: > > > > plain64: the initial vector is the 64-bit little-endian version of the > > sector number, padded with zeros if necessary > > > > That is, the default for LUKS is AES, XTS, with a simple IV. That > > strikes me as a pretty ringing endorsement. > > Yes, that sounds promising. It might not hurt to check for other > precedents as well, but that seems like a pretty good one. > > I'm not very convinced that using the LSN for any of this is a good > idea. Something that changes most of the time but not all the time > seems more like it could hurt by masking fuzzy thinking more than it > helps anything. This argument doesn't come across as very strong at all to me, particularly when we have explicit recommendations from NIST that having the IV vary more is beneficial. While this would be using the LSN, the fact that the LSN changes most of the time but not all of the time isn't new and is something we already have to deal with. I'd think we'd address the concern about mis-thinking around how this works by providing a README and/or an appropriate set of comments around what's being done and why. * Andres Freund (andres@anarazel.de) wrote: > On 2021-10-15 15:22:48 -0400, Stephen Frost wrote: > > * Bruce Momjian (bruce@momjian.us) wrote: > > > Finally, there is an interesting web page about when not to use XTS: > > > > > > https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/ > > > > This particular article always struck me as more of a reason for us, at > > least, to use XTS than to not- in particular the very first comment it > > makes, which seems to be pretty well supported, is: "XTS is the de-facto > > standard disk encryption mode." > > I don't find that line of argument *that* convincing. The reason XTS is the > de-facto standard is that for generic block layer encryption is that you can't > add additional data for each block without very significant overhead > (basically needing journaling to ensure that the data doesn't get out of > sync). 
But we don't really face the same situation - we *can* add additional > data. No, we can't always add additional data, and that's part of the consideration for an XTS option- there are things we can do if we use XTS that we can't with GCM or another solution. Specifically, being able to perform physical replication from an unencrypted cluster to an encrypted one is a worthwhile use-case that we shouldn't be just tossing out. > With something like AES-GCM-SIV we can use the additional data to get IV reuse > resistance *and* authentication. And while perhaps we are ok with the IV reuse > guarantees XTS has, it seems pretty clear that we'll want want guaranteed > authenticity at some point. And then we'll need extra data anyway. I agree that it'd be useful to have an authenticated encryption option. Implementing XTS doesn't preclude us from adding that capability down the road and it's simpler with fewer dependencies. These all strike me as good reasons to add XTS first. > Thus, to me, it doesn't seem worth going down the XTS route, just to > temporarily save a bit of implementation effort. We'll have to endure that > pain anyway. This isn't a valid argument as it isn't just about implementation but about the capabilities we will have once it's done. * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 10/15/21 23:02, Robert Haas wrote: > >On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: > >>That is, the default for LUKS is AES, XTS, with a simple IV. That > >>strikes me as a pretty ringing endorsement. > > > >Yes, that sounds promising. It might not hurt to check for other > >precedents as well, but that seems like a pretty good one. > > TrueCrypt/VeraCrypt uses XTS too, I think. There's an overview of other FDE > products at [1], and some of them use XTS, but I would take that with a > grain of salt - some of the products are somewhat obscure, very old, or > both. > > What is probably more interesting is that there's an IEEE standard [2] > dealing with encrypted shared storage, and that uses XTS too. I'd bet > there's a bunch of smart cryptographers involved. Thanks for finding those and linking to them, that's helpful. > >I'm not very convinced that using the LSN for any of this is a good > >idea. Something that changes most of the time but not all the time > >seems more like it could hurt by masking fuzzy thinking more than it > >helps anything. > > I haven't been following the discussion about using LSN, but I agree that > while using it seems convenient, the consequences of some changes not > incrementing LSN seem potentially disastrous, depending on the encryption > mode. Yes, this depends on the encryption mode, and is why we are specifically talking about XTS here as it's an encryption mode that doesn't suffer from this risk and therefore it's perfectly fine to use the LSN/FakeLSN with XTS (and would also be alright for AES-GCM-SIV as it's specifically designed to be resistant to IV reuse). * Bruce Momjian (bruce@momjian.us) wrote: > On Fri, Oct 15, 2021 at 10:57:03PM +0200, Tomas Vondra wrote: > > > That said, I don't think that's really a huge issue or something that's > > > a show stopper or a reason to hold off on using XTS. Note that what > > > those bits actually *are* isn't leaked, just that they changed in some > > > fashion inside of that 16-byte cipher text block. That they're directly > > > leaked with CTR is why there was concern raised about using that method, > > > as discussed above and previously. > > > > Yeah. 
With CTR you pretty learn where the hint bits are exactly, while with > > XTS the whole ciphertext changes. > > > > This also means CTR is much more malleable, i.e. you can tweak the > > ciphertext bits to flip the plaintext, while with XTS that's not really > > possible - it's pretty much guaranteed to break the block structure. Not > > sure if that's an issue for our use case, but if it is then neither of the > > two modes is a solution. > > Yes, this is a vary good point. Let's look at the impact of _not_ using > the LSN. For CTR (already rejected) bit changes would be visible by > comparing old/new page contents. For CBC (also not under consideration) > the first 16-byte block would show a change, and all later 16-byte > blocks would show a change. For CBC, you see the 16-byte blocks change, > but you have no idea how many bits were changed, and in what locations > in the 16-byte block (AES uses substitution and diffusion). For XTS, > because earlier blocks don't change the IV used by later blocks like > CBC, you would be able to see each 16-byte block that changed in the 8k > page. Again, you would not know the number of bits changed or their > locations. > > Do we think knowing which 16-byte blocks on an 8k page change would leak > useful information? If so, we should use the LSN and just accept that > some cases might leak as described above. If we don't care, then we can > skip the use of the LSN and simplify the patch. While there may not be an active attack against PG that leverages such a leak, I have a hard time seeing why we would intentionally design this in when we have a great option that's directly available to us and doesn't cause such a leak with nearly such regularity as not using the LSN would, and also follows recommendations of using XTS from NIST. Further, not using the LSN wouldn't really be an option if we did eventually implement AES-GCM-SIV, so why not have the two cases be consistent? > > Not sure - it seems a bit weird to force LSN change even in cases that don't > > generate any WAL. I was not following the encryption thread and maybe it was > > discussed/rejected there, but I've always imagined we'd have a global nonce > > generator (similar to a sequence) and we'd store it at the end of each > > block, or something like that. > > Storing the nonce in the page means more code complexity, possible > performance impact, and the inability to create standbys via binary > replication that use cluster file encryption. Right- the point of XTS is that we can do things that we can't with GCM and that it's simpler. > As a final comment to Andres's email, adding a GCM has the problems > above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which > could also affect the integrity of the data. Someone could also restore > and old copy of a patch to revert a change, and that would not be > detected even by GCM. That's an argument based on how things stand today. I appreciate that it's no small thing to consider changes to those other systems but I would argue that having authentication of the heap is still better than not (but also agree that XTS is simpler to implement and therefore makes sense to do first and see how things stand after that's done). Surely, we would want progress made here to be done so incrementally as a patch that attempted to change all of those other systems to be encrypted and authenticated with AES-GCM-SIV would be far too large to consider in one shot anyway. 
> I consider this a checkbox feature and making it too complex will cause > it to be rightly rejected. Presuming that 'checkbox feature' here means "we need it to please $someone but no one will ever use it" or something along those lines, this is very clearly not the case and therefore we shouldn't be describing it or treating it as such. Even if the meaning here is "there's other ways people could get this capability" the reality is that those other methods are simply not always available and in those cases, people will choose to not use PostgreSQL. Nearly every other database system which we might compare ourselves to has a solution in this area and people actively use those solutions in a lot of deployments. * Andres Freund (andres@anarazel.de) wrote: > On 2021-10-16 10:16:25 -0400, Bruce Momjian wrote: > > As a final comment to Andres's email, adding a GCM has the problems > > above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which > > could also affect the integrity of the data. Someone could also restore > > and old copy of a patch to revert a change, and that would not be > > detected even by GCM. > > > I consider this a checkbox feature and making it too complex will cause > > it to be rightly rejected. > > You're just deferring / hiding the complexity. For one, we'll need integrity > before long if we add encryption support. Then we'll deal with a more complex > on-disk format because there will be two different ways of encrypting. For > another, you're spreading out the security analysis to a lot of places in the > code and more importantly to future changes affecting on-disk data. I don't follow this argument. The XTS approach is explicitly the same on-disk format as what we have unencrypted today, just encrypted, and that's the whole point of going with that approach. If we were to implement AES-GCM-SIV, then that would introduce a new on-disk format and then we'd have two- one which has space on each page for the authentication information, and one which doesn't. > If it's really just a checkbox feature without a real use case, then we should > just reject requests for it and use our energy for useful things. This capability certainly has a real use-case and it's one that a lot of organizations are looking for PG to provide a solution for. That we don't today is keeping those organizations from using PG in at least some cases, and for some organizations, it prevents them from using PG at all, as they understandably would rather not deal with a hybrid of using PG for some things and having to use another solution for other things. * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 10/16/21 16:16, Bruce Momjian wrote: > >Storing the nonce in the page means more code complexity, possible > >performance impact, and the inability to create standbys via binary > >replication that use cluster file encryption. > > Would it really be that complex? Reserving a bunch of bytes at the end of > each encrypted page (a bit like the "special" space, but after encryption) > seems fairly straightforward. And I don't quite see why would this have a > measurable impact, given the nonce is 16B at most. The encryption is likely > way more expensive. There's a patch to do exactly this- make space available towards the end of the page. If we go down the route of using a different page format then we lose the ability to do physical replication between an unencrypted cluster and an encrypted one. 
That's certainly a nice capability to have and it will help people migrate to an encrypted PG instance, plus it's overall simpler to work with, which is also an advantage. > Moreover, it seems fairly reasonable to trade a bit of code complexity for > something LSN-based which seems simpler but apparently has various weak > points and is much harder to reason about. This isn't just about code complexity but is also about the resulting capabilities from these different approaches. * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 10/16/21 18:28, Bruce Momjian wrote: > >On Sat, Oct 16, 2021 at 09:15:05AM -0700, Andres Freund wrote: > >>On 2021-10-16 10:16:25 -0400, Bruce Momjian wrote: > >>>As a final comment to Andres's email, adding a GCM has the problems > >>>above, plus it wouldn't detect changes to pg_xact, fsm, vm, etc, which > >>>could also affect the integrity of the data. Someone could also restore > >>>and old copy of a patch to revert a change, and that would not be > >>>detected even by GCM. > >> > >>>I consider this a checkbox feature and making it too complex will cause > >>>it to be rightly rejected. > >> > >>You're just deferring / hiding the complexity. For one, we'll need integrity > >>before long if we add encryption support. Then we'll deal with a more complex > >>on-disk format because there will be two different ways of encrypting. For > >>another, you're spreading out the security analysis to a lot of places in the > >>code and more importantly to future changes affecting on-disk data. > > I've argued for storing the nonce, but I don't quite see why would we need > integrity guarantees? > > AFAICS the threat model the patch aims to address is an attacker who can > observe the data (e.g. a low-privileged OS user), but can't modify the > files. Which seems like a reasonable model for shared environments. There are multiple threat models which we should be considering and that's why we may want to eventually add integrity. > IMO extending this to cases where the attacker can modify the data moves the > goalposts quite significantly. And it's quite possible authenticated > encryption would not be enough to prevent that, because that still works > only at block level, and you can probably do a lot of harm with replay > attacks (e.g. replacing blocks with older versions). And if you can modify > the data directory / config files, what are the chances you can't just get > access to the database, trace the processes or whatever? I agree that working towards an authenticated solution is a larger task. I don't agree that we should throw out the possibility that we may want to implement it eventually as there are certainly threat models where an attacker might have access to the storage but not to the database or the system on which the database is running. Implementing a system to address such an attack vector would take more consideration than just having authenticated encryption provided by PG, but it certainly couldn't be done without that either. > We already have a way to check integrity by storing page checksum, but I'm > not sure if that's good enough (there's a lot of subtle issues with building > proper AEAD scheme). No, it isn't good enough. * Sasasu (i@sasa.su) wrote: > Just a mention. the HMAC (or AE/AD) can be disabled in AES-GCM. HMAC in > AES-GCM is an encrypt-then-hash MAC. Not sure why you would though. > CRC-32 is not a crypto-safe hash (technically CRC-32 is not a hash > function). Cryptographers may unhappy with CRC-32.
Yes, that's correct (and it isn't even CRC-32 that we have, heh). > I think CRC or SHA is not such important. If IV can be stored, I believe > there should have enough space to store HMAC. This would be the case, yes. If we can find a way to make room for an IV then we could make room to store the tag too, and we certainly should (and we should include a variety of additional data in the AEAD- block number, relfileno, etc). * Sasasu (i@sasa.su) wrote: > On 2021/10/16 04:57, Tomas Vondra wrote: > > Seems reasonable, on the assumption the threat models are the same. > > On 2021/10/16 03:22, Stephen Frost wrote: > >plain64: the initial vector is the 64-bit little-endian version of the > >sector number, padded with zeros if necessary > > > >That is, the default for LUKS is AES, XTS, with a simple IV. That > >strikes me as a pretty ringing endorsement > On 2021/10/18 05:23, Tomas Vondra wrote: > > AFAICS the threat model the patch aims to address is an attacker who can > > observe the data (e.g. a low-privileged OS user), but can't modify the > > files. Which seems like a reasonable model for shared environments. > > I agree this threat model. > > And if PostgreSQL is using XTS, there is no different with dm-encrypt. > The user can use dm-encrypt directly. dm-crypt is not always an option and it doesn't actually address the threat-model that Tomas brought up here anyway, as it would be below the level that the low-privileged OS user would be looking at. That's not the only threat model to consider, but it is one which could potentially be addressed by either XTS or AES-GCM-SIV. There are threat models which dm-crypt would address, of course, such as data-at-rest (hard drive theft, improper disposal of storage media, backups which don't have their own encryption, etc), but, again, dm-crypt isn't always an option that is available and so I don't agree that we should throw this out just because dm-crypt exists and may be useable in some cases. Thanks, Stephen
On Fri, Oct 15, 2021 at 5:21 PM Andres Freund <andres@anarazel.de> wrote: > I don't find that line of argument *that* convincing. The reason XTS is the > de-facto standard is that for generic block layer encryption is that you can't > add additional data for each block without very significant overhead > (basically needing journaling to ensure that the data doesn't get out of > sync). But we don't really face the same situation - we *can* add additional > data. Yes. The downside is that there is some code complexity, and also some runtime overhead even for cases that don't use encryption, because some things that now are compile time constants might need to be computed at runtime. That can probably be made pretty small, though. > With something like AES-GCM-SIV we can use the additional data to get IV reuse > resistance *and* authentication. And while perhaps we are ok with the IV reuse > guarantees XTS has, it seems pretty clear that we'll want want guaranteed > authenticity at some point. And then we'll need extra data anyway. > > Thus, to me, it doesn't seem worth going down the XTS route, just to > temporarily save a bit of implementation effort. We'll have to endure that > pain anyway. I agree up to a point, but I do also kind of feel like we should be leaving it up to whoever is working on an implementation to decide what they want to implement. I don't particularly like this discussion where it feels like people are trying to tell other people what they have to do because "the community has decided X." It's pretty clear that there are multiple opinions here, and I don't really see any of them to be without merit, nor do I see why Bruce or Stephen or you or anyone else should get to say "what the community has decided" in the absence of a clear consensus. I do really like the idea of using AES-GCM-SIV not because I know anything about it, but because the integrity checking seems cool, and storing the nonce seems like it would improve security. However, based on what I know now, I would not vote to reject an XTS-based patch and, as Stephen and Bruce have said, maybe with the right design it permits upgrades from non-encrypted clusters to encrypted clusters. I'm actually kind of doubtful about that, because that seems to require some pretty specific and somewhat painful implementation decisions. For example, I think if your solution for rotating keys is to shut down the standby, re-encrypt it with a new key, start it up again, and fail over to it, then you probably can't ever do key rotation in any other way. The keys now have to be non-WAL-logged so that the standby can be different, which means you can't add a new key on the master and run around re-encrypting everything with it, WAL-logging those changes as you go. Now I realize that implementing that is really challenging anyway so maybe some people wouldn't like to go that way, but then maybe other people would. Another thing you probably can't do in this model is encrypt different parts of the database with different keys, because how would you keep track of that? Certainly not in the system catalogs, if none of that can show up in the WAL stream. But, you know, still: if somebody showed up with a fully-working XTS patch with everything in good working order, I don't see that we have enough evidence to reject it just because it's XTS. And I would hope that the people favoring XTS would not vote to reject a fully working GCM patch just because it's GCM.
I think what we ought to be doing at this point is combining our efforts to try to get some things committed which make future work in this area easier - like that patch to preserve relfilenode and database OID, or maybe some patches to drive all of our I/O through a smaller number of code paths instead of having every different type of temporary file we write reinvent the wheel. These discussions about what encryption type we ought to use are useful for ruling out options that we know are bad, but beyond that I'm not sure they have much value. AES in any mode could seem like a much less safe choice by the time we get a committed feature here than it does today - even if somehow that were to happen for v15. I expect there are people out there trying to break it even as I write these words, and it seems likely that they will eventually succeed, but as to when, who can say? -- Robert Haas EDB: http://www.enterprisedb.com
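To illustrate the earlier point about compile-time constants becoming run-time values once a cluster may or may not reserve per-page space for encryption metadata, a toy example (the names and the 24-byte header size are simplified stand-ins, not PostgreSQL's real definitions):

    #include <stdio.h>

    /*
     * Hypothetical illustration only, not from any posted patch: with an
     * optional per-page trailer for encryption metadata, a value that is a
     * compile-time constant today becomes a run-time one.
     */
    #define BLCKSZ                  8192
    #define SizeOfPageHeaderData    24      /* simplified stand-in */

    static int reserved_page_size = 0;      /* would be read from pg_control */

    static int
    page_usable_space(void)
    {
        return BLCKSZ - SizeOfPageHeaderData - reserved_page_size;
    }

    int
    main(void)
    {
        printf("usable: %d\n", page_usable_space());
        reserved_page_size = 32;            /* e.g. 16-byte IV + 16-byte tag */
        printf("usable with trailer: %d\n", page_usable_space());
        return 0;
    }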
On 10/18/21 04:19, Sasasu wrote: > Just a mention. the HMAC (or AE/AD) can be disabled in AES-GCM. HMAC in > AES-GCM is an encrypt-then-hash MAC. > > CRC-32 is not a crypto-safe hash (technically CRC-32 is not a hash > function). Cryptographers may unhappy with CRC-32. > True. If you can flip enough bits in the page, it probably is not very hard to generate a page with the desired checksum. It's probably harder with XTS, but likely not much more. > I think CRC or SHA is not such important. If IV can be stored, I believe > there should have enough space to store HMAC. > Right, I agree. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10/18/21 17:56, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: >> On 10/15/21 21:22, Stephen Frost wrote: >>> Now, to address the concern around re-encrypting a block with the same >>> key+IV but different data and leaking what parts of the page changed, I >>> do think we should use the LSN and have it change regularly (including >>> unlogged tables) but that's just because it's relatively easy for us to >>> do and means an attacker wouldn't be able to tell what part of the page >>> changed when the LSN was also changed. That was also recommended by >>> NIST and that's a pretty strong endorsement also. >> >> Not sure - it seems a bit weird to force LSN change even in cases that don't >> generate any WAL. I was not following the encryption thread and maybe it was >> discussed/rejected there, but I've always imagined we'd have a global nonce >> generator (similar to a sequence) and we'd store it at the end of each >> block, or something like that. > > The 'LSN' being referred to here isn't the regular LSN that is > associated with the WAL but rather the separate FakeLSN counter which we > already have. I wasn't suggesting having the regular LSN change in > cases that don't generate WAL. > I'm not very familiar with FakeLSN, but isn't that just about unlogged tables? How does that help cases like setting hint bits, which may not generate WAL? > * Robert Haas (robertmhaas@gmail.com) wrote: >> On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: >>> Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 >>> >>> and then this: >>> >>> https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt >>> >>> where plain64 is defined as: >>> >>> plain64: the initial vector is the 64-bit little-endian version of the >>> sector number, padded with zeros if necessary >>> >>> That is, the default for LUKS is AES, XTS, with a simple IV. That >>> strikes me as a pretty ringing endorsement. >> >> Yes, that sounds promising. It might not hurt to check for other >> precedents as well, but that seems like a pretty good one. >> >> I'm not very convinced that using the LSN for any of this is a good >> idea. Something that changes most of the time but not all the time >> seems more like it could hurt by masking fuzzy thinking more than it >> helps anything. > > This argument doesn't come across as very strong at all to me, > particularly when we have explicit recommendations from NIST that having > the IV vary more is beneficial. While this would be using the LSN, the > fact that the LSN changes most of the time but not all of the time isn't > new and is something we already have to deal with. I'd think we'd > address the concern about mis-thinking around how this works by > providing a README and/or an appropriate set of comments around what's > being done and why. > I don't think anyone objects to varying IV more, as recommended by NIST. AFAICS the issue at hand is exactly the opposite - maybe not varying it enough, in some cases. It might be enough for MVCC purposes yet it might result in fatal failure of the encryption scheme. That's my concern, at least, and I assume it's what Robert meant by "fuzzy thinking" too. FWIW I think we seem to be mixing nonces, IVs and tweak values. Although various encryption schemes place different requirements on those anyway. 
> * Andres Freund (andres@anarazel.de) wrote: >> On 2021-10-15 15:22:48 -0400, Stephen Frost wrote: >>> * Bruce Momjian (bruce@momjian.us) wrote: >>>> Finally, there is an interesting web page about when not to use XTS: >>>> >>>> https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/ >>> >>> This particular article always struck me as more of a reason for us, at >>> least, to use XTS than to not- in particular the very first comment it >>> makes, which seems to be pretty well supported, is: "XTS is the de-facto >>> standard disk encryption mode." >> >> I don't find that line of argument *that* convincing. The reason XTS is the >> de-facto standard is that for generic block layer encryption is that you can't >> add additional data for each block without very significant overhead >> (basically needing journaling to ensure that the data doesn't get out of >> sync). But we don't really face the same situation - we *can* add additional >> data. > > No, we can't always add additional data, and that's part of the > consideration for an XTS option- there are things we can do if we use > XTS that we can't with GCM or another solution. Specifically, being > able to perform physical replication from an unencrypted cluster to an > encrypted one is a worthwhile use-case that we shouldn't be just tossing > out. > Yeah, XTS seems like a reasonable first step, both because it doesn't require storing extra data and it's widespread use in FDE software (of course, there's a link between those). And I suspect replication between encrypted and unencrypted clusters is going to be a huge can of worms, even with XTS. It's probably a silly / ugly idea, but can't we simply store a special "page format" flag in controldat - when set to 'true' during initdb, each page would have a bit of space (at the end) reserved for additional encryption data. Say, ~64B should be enough. On the encrypted cluster this would store the nonce/IV/... and on the unencrypted cluster it'd be simply unused. 64B seems like a negligible amount of data. And when set to 'false' the cluster would not allow encryption. >> With something like AES-GCM-SIV we can use the additional data to get IV reuse >> resistance *and* authentication. And while perhaps we are ok with the IV reuse >> guarantees XTS has, it seems pretty clear that we'll want want guaranteed >> authenticity at some point. And then we'll need extra data anyway. > > I agree that it'd be useful to have an authenticated encryption option. > Implementing XTS doesn't preclude us from adding that capability down > the road and it's simpler with fewer dependencies. These all strike me > as good reasons to add XTS first. > True. If XTS addresses the threat model we aimed to solve ... >> Thus, to me, it doesn't seem worth going down the XTS route, just to >> temporarily save a bit of implementation effort. We'll have to endure that >> pain anyway. > > This isn't a valid argument as it isn't just about implementation but > about the capabilities we will have once it's done. > > * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: >> On 10/15/21 23:02, Robert Haas wrote: >>> On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: >>>> That is, the default for LUKS is AES, XTS, with a simple IV. That >>>> strikes me as a pretty ringing endorsement. >>> >>> Yes, that sounds promising. It might not hurt to check for other >>> precedents as well, but that seems like a pretty good one. >> >> TrueCrypt/VeraCrypt uses XTS too, I think. 
There's an overview of other FDE >> products at [1], and some of them use XTS, but I would take that with a >> grain of salt - some of the products are somewhat obscure, very old, or >> both. >> >> What is probably more interesting is that there's an IEEE standard [2] >> dealing with encrypted shared storage, and that uses XTS too. I'd bet >> there's a bunch of smart cryptographers involved. > > Thanks for finding those and linking to them, that's helpful. > >>> I'm not very convinced that using the LSN for any of this is a good >>> idea. Something that changes most of the time but not all the time >>> seems more like it could hurt by masking fuzzy thinking more than it >>> helps anything. >> >> I haven't been following the discussion about using LSN, but I agree that >> while using it seems convenient, the consequences of some changes not >> incrementing LSN seem potentially disastrous, depending on the encryption >> mode. > > Yes, this depends on the encryption mode, and is why we are specifically > talking about XTS here as it's an encryption mode that doesn't suffer > from this risk and therefore it's perfectly fine to use the LSN/FakeLSN > with XTS (and would also be alright for AES-GCM-SIV as it's specifically > designed to be resistant to IV reuse). > I'm not quite sure about the "perfectly fine" bit, as it's making XTS vulnerable to traffic analysis attacks (comparing multiple copies of an encrypted block). It may be a reasonable trade-off, of course. > * Bruce Momjian (bruce@momjian.us) wrote: >> On Fri, Oct 15, 2021 at 10:57:03PM +0200, Tomas Vondra wrote: >>>> That said, I don't think that's really a huge issue or something that's >>>> a show stopper or a reason to hold off on using XTS. Note that what >>>> those bits actually *are* isn't leaked, just that they changed in some >>>> fashion inside of that 16-byte cipher text block. That they're directly >>>> leaked with CTR is why there was concern raised about using that method, >>>> as discussed above and previously. >>> >>> Yeah. With CTR you pretty learn where the hint bits are exactly, while with >>> XTS the whole ciphertext changes. >>> >>> This also means CTR is much more malleable, i.e. you can tweak the >>> ciphertext bits to flip the plaintext, while with XTS that's not really >>> possible - it's pretty much guaranteed to break the block structure. Not >>> sure if that's an issue for our use case, but if it is then neither of the >>> two modes is a solution. >> >> Yes, this is a vary good point. Let's look at the impact of _not_ using >> the LSN. For CTR (already rejected) bit changes would be visible by >> comparing old/new page contents. For CBC (also not under consideration) >> the first 16-byte block would show a change, and all later 16-byte >> blocks would show a change. For CBC, you see the 16-byte blocks change, >> but you have no idea how many bits were changed, and in what locations >> in the 16-byte block (AES uses substitution and diffusion). For XTS, >> because earlier blocks don't change the IV used by later blocks like >> CBC, you would be able to see each 16-byte block that changed in the 8k >> page. Again, you would not know the number of bits changed or their >> locations. >> >> Do we think knowing which 16-byte blocks on an 8k page change would leak >> useful information? If so, we should use the LSN and just accept that >> some cases might leak as described above. If we don't care, then we can >> skip the use of the LSN and simplify the patch. 
> > While there may not be an active attack against PG that leverages such a > leak, I have a hard time seeing why we would intentionally design this > in when we have a great option that's directly available to us and > doesn't cause such a leak with nearly such regularity as not using the > LSN would, and also follows recommendations of using XTS from NIST. > Further, not using the LSN wouldn't really be an option if we did > eventually implement AES-GCM-SIV, so why not have the two cases be > consistent? > I'm a bit confused, because the question was what happens if we encrypt the page twice with the same LSN or any tweak value in general. It certainly does not matter when it comes to malleability or replay attacks, because in that case the attacker is the one who modifies the block (and obviously won't change the LSN). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
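Here is a small self-contained demonstration of the property being discussed: re-encrypting a page with the same key and tweak shows which 16-byte blocks changed, but not how many bits changed or where within the block. It uses OpenSSL's aes-256-xts with placeholder key and tweak values, purely as an illustration:

    #include <openssl/evp.h>
    #include <stdio.h>
    #include <string.h>

    #define BLCKSZ 8192

    /* Encrypt buf with AES-256-XTS; key is 64 bytes (two 256-bit keys),
     * tweak is 16 bytes.  Illustrative only. */
    static void
    xts_encrypt(const unsigned char *key, const unsigned char *tweak,
                const unsigned char *in, unsigned char *out)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int len;

        EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak);
        EVP_EncryptUpdate(ctx, out, &len, in, BLCKSZ);
        EVP_EncryptFinal_ex(ctx, out + len, &len);
        EVP_CIPHER_CTX_free(ctx);
    }

    int
    main(void)
    {
        unsigned char key[64] = {1}, tweak[16] = {2};
        unsigned char page[BLCKSZ] = {0}, c1[BLCKSZ], c2[BLCKSZ];
        int changed = 0;

        xts_encrypt(key, tweak, page, c1);
        page[5000] ^= 0x01;                 /* flip one "hint bit" */
        xts_encrypt(key, tweak, page, c2);

        for (int i = 0; i < BLCKSZ; i += 16)
            if (memcmp(c1 + i, c2 + i, 16) != 0)
                changed++;

        /* prints 1: exactly one 16-byte block differs, and only its position
         * (not the number or location of the flipped bits) is visible */
        printf("16-byte blocks that differ: %d\n", changed);
        return 0;
    }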
On 10/18/21 17:56, Stephen Frost wrote: >> ... >> I've argued for storing the nonce, but I don't quite see why would we need >> integrity guarantees? >> >> AFAICS the threat model the patch aims to address is an attacker who can >> observe the data (e.g. a low-privileged OS user), but can't modify the >> files. Which seems like a reasonable model for shared environments. > > There are multiple threat models which we should be considering and > that's why we may want to eventually add integrity. > So what are these threat models? If we should be considering them it'd be nice to have a description, explaining what capabilities must the attacker have ... My (perhaps naive) understanding is that the authentication / integrity provides (partial) protection against attackers that may modify instance data - modify files, etc. But I'd guess an attacker with such capability can do various other (simpler) things to extract data. Say, modify the config to load an extension that dumps keys from memory, or whatever. So what's a plausible / practical threat model that would be mitigated by the authenticated encryption? It'd be a bit silly to add complexity to allow AEAD, only to find out there are ways around it. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2021/10/19 00:37, Robert Haas wrote: > I think what we ought to be doing at > this point is combining our efforts to try to get some things > committed which make future work in this area easier - like that > patch to preserve relfilenode and database OID, or maybe some patches > to drive all of our I/O through a smaller number of code paths instead > of having every different type of temporary file we write reinvent the > wheel. A unified block-based I/O API sounds great. Has anyone tried to do this before? It would be nice if the front-end tools could also use this API. As there are so many threat models, I propose doing the TDE feature as a set of hooks. Those hooks would sit on the critical path of I/O operations and let an extension replace the I/O API; the extension would also be loaded during initdb, in single-user mode, and in the front-end tools. This sounds like using $LD_PRELOAD to replace pread(2) and pwrite(2), which is widely used for folder-based encryption, but the hook would pass more context (relfilenode, table OID, block size, and so on) down to the lower layer; the hook API would look like object_access_hook. Then implement the simplest AES-XTS and put it in contrib, and provide a tool to deactivate AES-XTS so that PostgreSQL remains upgradeable. I think this is the most peaceful method: the GCM people will not reject it just because it is XTS, and the XTS people will (maybe?) be satisfied with the complexity. For performance, it is just one more long jump compared with the current AES-XTS code.
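Purely to illustrate the shape of the hook being proposed here (hypothetical; no such hook exists in core and every name below is invented):

    #include <stdint.h>
    #include <sys/types.h>

    /* Hypothetical sketch of the proposed hook shape; nothing like this
     * exists in core, and the names are invented for illustration. */
    typedef struct PageIOContext
    {
        uint32_t    dboid;
        uint32_t    tableoid;
        uint32_t    relfilenode;
        uint32_t    blocknum;
        size_t      blocksize;
    } PageIOContext;

    /* Called instead of pread(2)/pwrite(2) on the data-file path. */
    typedef ssize_t (*page_read_hook_type) (const PageIOContext *ctx,
                                            int fd, void *buf, off_t offset);
    typedef ssize_t (*page_write_hook_type) (const PageIOContext *ctx,
                                             int fd, const void *buf, off_t offset);

    extern page_read_hook_type  page_read_hook;    /* NULL = plain pread() */
    extern page_write_hook_type page_write_hook;   /* NULL = plain pwrite() */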
On Tue, Oct 19, 2021 at 11:46 AM Sasasu <i@sasa.su> wrote: > As there are so many threat models, I propose doing the TDE feature as a > set of hooks. This is too invasive to do using hooks. We are inevitably going to need to make significant changes in core. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 10/18/21 17:56, Stephen Frost wrote: > >* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > >>On 10/15/21 21:22, Stephen Frost wrote: > >>>Now, to address the concern around re-encrypting a block with the same > >>>key+IV but different data and leaking what parts of the page changed, I > >>>do think we should use the LSN and have it change regularly (including > >>>unlogged tables) but that's just because it's relatively easy for us to > >>>do and means an attacker wouldn't be able to tell what part of the page > >>>changed when the LSN was also changed. That was also recommended by > >>>NIST and that's a pretty strong endorsement also. > >> > >>Not sure - it seems a bit weird to force LSN change even in cases that don't > >>generate any WAL. I was not following the encryption thread and maybe it was > >>discussed/rejected there, but I've always imagined we'd have a global nonce > >>generator (similar to a sequence) and we'd store it at the end of each > >>block, or something like that. > > > >The 'LSN' being referred to here isn't the regular LSN that is > >associated with the WAL but rather the separate FakeLSN counter which we > >already have. I wasn't suggesting having the regular LSN change in > >cases that don't generate WAL. > > I'm not very familiar with FakeLSN, but isn't that just about unlogged > tables? How does that help cases like setting hint bits, which may not > generate WAL? Errr, there seems to have been some confusion in this thread between the different ideas being tossed around. The point of using FakeLSN for unlogged tables consistently is to provide variability in the value used as the IV. I wasn't suggesting to use FakeLSN to provide a uniqueness guarantee- the reason we're talking about using XTS is specifically because, unlike CTR, unique IVs are not required. Further, we don't need to find any additional space on the page if we use XTS, meaning we can do things like go from an unencrypted to an encrypted system with a basebackup+physical replication. > >* Robert Haas (robertmhaas@gmail.com) wrote: > >>On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: > >>>Specifically: The default cipher for LUKS is nowadays aes-xts-plain64 > >>> > >>>and then this: > >>> > >>>https://gitlab.com/cryptsetup/cryptsetup/-/wikis/DMCrypt > >>> > >>>where plain64 is defined as: > >>> > >>>plain64: the initial vector is the 64-bit little-endian version of the > >>>sector number, padded with zeros if necessary > >>> > >>>That is, the default for LUKS is AES, XTS, with a simple IV. That > >>>strikes me as a pretty ringing endorsement. > >> > >>Yes, that sounds promising. It might not hurt to check for other > >>precedents as well, but that seems like a pretty good one. > >> > >>I'm not very convinced that using the LSN for any of this is a good > >>idea. Something that changes most of the time but not all the time > >>seems more like it could hurt by masking fuzzy thinking more than it > >>helps anything. > > > >This argument doesn't come across as very strong at all to me, > >particularly when we have explicit recommendations from NIST that having > >the IV vary more is beneficial. While this would be using the LSN, the > >fact that the LSN changes most of the time but not all of the time isn't > >new and is something we already have to deal with. 
I'd think we'd > >address the concern about mis-thinking around how this works by > >providing a README and/or an appropriate set of comments around what's > >being done and why. > > I don't think anyone objects to varying IV more, as recommended by NIST. Great. > AFAICS the issue at hand is exactly the opposite - maybe not varying it > enough, in some cases. It might be enough for MVCC purposes yet it might > result in fatal failure of the encryption scheme. That's my concern, at > least, and I assume it's what Robert meant by "fuzzy thinking" too. XTS does not require the IV to be unique for every invocation, and indeed other systems like dm-crypt don't use a unique IV for XTS. We really can't divorce the encryption methodology from the parameters that are being used. CTR and GCM are the methods that require a unique IV (or, at least, a very very very likely unique one if you can't actually implement a proper counter) but that's *not* what we're talking about here. The methods being discussed are XTS and GCM-SIV, the former explicitly doesn't require a unique IV and the latter is specifically designed to reduce the impact of an IV being re-used. Both are, as NIST points out, better off with a varying IV, but having the IV be reused from time to time in either XTS or GCM-SIV does not result in a fatal failure of the encryption scheme. > FWIW I think we seem to be mixing nonces, IVs and tweak values. Although > various encryption schemes place different requirements on those anyway. The differences between those are pretty subtle and it gets a bit confusing when things like OpenSSL accept an IV but then use it as the 'tweak', such as with XTS. In general though, most methods accept a key, a nonce/IV/tweak, and then produce ciphertext and possbily a tag, so I don't think we've been incorrect in usage but rather perhaps a bit sloppy by not using the right term for the specific methodology. > >* Andres Freund (andres@anarazel.de) wrote: > >>On 2021-10-15 15:22:48 -0400, Stephen Frost wrote: > >>>* Bruce Momjian (bruce@momjian.us) wrote: > >>>>Finally, there is an interesting web page about when not to use XTS: > >>>> > >>>> https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/ > >>> > >>>This particular article always struck me as more of a reason for us, at > >>>least, to use XTS than to not- in particular the very first comment it > >>>makes, which seems to be pretty well supported, is: "XTS is the de-facto > >>>standard disk encryption mode." > >> > >>I don't find that line of argument *that* convincing. The reason XTS is the > >>de-facto standard is that for generic block layer encryption is that you can't > >>add additional data for each block without very significant overhead > >>(basically needing journaling to ensure that the data doesn't get out of > >>sync). But we don't really face the same situation - we *can* add additional > >>data. > > > >No, we can't always add additional data, and that's part of the > >consideration for an XTS option- there are things we can do if we use > >XTS that we can't with GCM or another solution. Specifically, being > >able to perform physical replication from an unencrypted cluster to an > >encrypted one is a worthwhile use-case that we shouldn't be just tossing > >out. > > Yeah, XTS seems like a reasonable first step, both because it doesn't > require storing extra data and it's widespread use in FDE software (of > course, there's a link between those). 
And I suspect replication between > encrypted and unencrypted clusters is going to be a huge can of worms, even > with XTS. Glad you agree with XTS being a reasonable first step. XTS will at least make physical replication possible- other methods (such as below) simply won't work. > It's probably a silly / ugly idea, but can't we simply store a special "page > format" flag in controldat - when set to 'true' during initdb, each page > would have a bit of space (at the end) reserved for additional encryption > data. Say, ~64B should be enough. On the encrypted cluster this would store > the nonce/IV/... and on the unencrypted cluster it'd be simply unused. 64B > seems like a negligible amount of data. And when set to 'false' the cluster > would not allow encryption. This is essentially what the patch that was posted does, but the problem is that you can't do physical replication when 64B have been stolen off of a page because the page in the unencrypted database might be entirely full and not able to physically fit those extra 64B, and then what? > >>With something like AES-GCM-SIV we can use the additional data to get IV reuse > >>resistance *and* authentication. And while perhaps we are ok with the IV reuse > >>guarantees XTS has, it seems pretty clear that we'll want want guaranteed > >>authenticity at some point. And then we'll need extra data anyway. > > > >I agree that it'd be useful to have an authenticated encryption option. > >Implementing XTS doesn't preclude us from adding that capability down > >the road and it's simpler with fewer dependencies. These all strike me > >as good reasons to add XTS first. > > True. If XTS addresses the threat model we aimed to solve ... It addresses a valid threat model that people are interested in PG having a solution for, yes. > >>Thus, to me, it doesn't seem worth going down the XTS route, just to > >>temporarily save a bit of implementation effort. We'll have to endure that > >>pain anyway. > > > >This isn't a valid argument as it isn't just about implementation but > >about the capabilities we will have once it's done. > > > >* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > >>On 10/15/21 23:02, Robert Haas wrote: > >>>On Fri, Oct 15, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: > >>>>That is, the default for LUKS is AES, XTS, with a simple IV. That > >>>>strikes me as a pretty ringing endorsement. > >>> > >>>Yes, that sounds promising. It might not hurt to check for other > >>>precedents as well, but that seems like a pretty good one. > >> > >>TrueCrypt/VeraCrypt uses XTS too, I think. There's an overview of other FDE > >>products at [1], and some of them use XTS, but I would take that with a > >>grain of salt - some of the products are somewhat obscure, very old, or > >>both. > >> > >>What is probably more interesting is that there's an IEEE standard [2] > >>dealing with encrypted shared storage, and that uses XTS too. I'd bet > >>there's a bunch of smart cryptographers involved. > > > >Thanks for finding those and linking to them, that's helpful. > > > >>>I'm not very convinced that using the LSN for any of this is a good > >>>idea. Something that changes most of the time but not all the time > >>>seems more like it could hurt by masking fuzzy thinking more than it > >>>helps anything. 
> >> > >>I haven't been following the discussion about using LSN, but I agree that > >>while using it seems convenient, the consequences of some changes not > >>incrementing LSN seem potentially disastrous, depending on the encryption > >>mode. > > > >Yes, this depends on the encryption mode, and is why we are specifically > >talking about XTS here as it's an encryption mode that doesn't suffer > >from this risk and therefore it's perfectly fine to use the LSN/FakeLSN > >with XTS (and would also be alright for AES-GCM-SIV as it's specifically > >designed to be resistant to IV reuse). > > I'm not quite sure about the "perfectly fine" bit, as it's making XTS > vulnerable to traffic analysis attacks (comparing multiple copies of an > encrypted block). It may be a reasonable trade-off, of course. Considering the default usage in dmcrypt isn't even varying the IV as much as what we're talking about here, I'd say that, yes, it's quite reasonable for our use-case and allows us to vary the IV quite a bit which reduces the attack surface in a meaningful way. That dmcrypt has this risk and it isn't considered enough of an issue for them to use something other than plain64 as the default makes it certainly seem reasonable to me. > >* Bruce Momjian (bruce@momjian.us) wrote: > >>On Fri, Oct 15, 2021 at 10:57:03PM +0200, Tomas Vondra wrote: > >>>>That said, I don't think that's really a huge issue or something that's > >>>>a show stopper or a reason to hold off on using XTS. Note that what > >>>>those bits actually *are* isn't leaked, just that they changed in some > >>>>fashion inside of that 16-byte cipher text block. That they're directly > >>>>leaked with CTR is why there was concern raised about using that method, > >>>>as discussed above and previously. > >>> > >>>Yeah. With CTR you pretty learn where the hint bits are exactly, while with > >>>XTS the whole ciphertext changes. > >>> > >>>This also means CTR is much more malleable, i.e. you can tweak the > >>>ciphertext bits to flip the plaintext, while with XTS that's not really > >>>possible - it's pretty much guaranteed to break the block structure. Not > >>>sure if that's an issue for our use case, but if it is then neither of the > >>>two modes is a solution. > >> > >>Yes, this is a vary good point. Let's look at the impact of _not_ using > >>the LSN. For CTR (already rejected) bit changes would be visible by > >>comparing old/new page contents. For CBC (also not under consideration) > >>the first 16-byte block would show a change, and all later 16-byte > >>blocks would show a change. For CBC, you see the 16-byte blocks change, > >>but you have no idea how many bits were changed, and in what locations > >>in the 16-byte block (AES uses substitution and diffusion). For XTS, > >>because earlier blocks don't change the IV used by later blocks like > >>CBC, you would be able to see each 16-byte block that changed in the 8k > >>page. Again, you would not know the number of bits changed or their > >>locations. > >> > >>Do we think knowing which 16-byte blocks on an 8k page change would leak > >>useful information? If so, we should use the LSN and just accept that > >>some cases might leak as described above. If we don't care, then we can > >>skip the use of the LSN and simplify the patch. 
> > > >While there may not be an active attack against PG that leverages such a > >leak, I have a hard time seeing why we would intentionally design this > >in when we have a great option that's directly available to us and > >doesn't cause such a leak with nearly such regularity as not using the > >LSN would, and also follows recommendations of using XTS from NIST. > >Further, not using the LSN wouldn't really be an option if we did > >eventually implement AES-GCM-SIV, so why not have the two cases be > >consistent? > > I'm a bit confused, because the question was what happens if we encrypt the > page twice with the same LSN or any tweak value in general. It certainly > does not matter when it comes to malleability or replay attacks, because in > that case the attacker is the one who modifies the block (and obviously > won't change the LSN). The question that I was responding to above was specifically about if knowing which 16-byte blocks on an 8K page changes was an issue or not and that's what I was addressing. As for if we encrypt the page twice with the same LSN/tweak/IV or not- that depends on the specific encryption methodology being used as to how much of an issue that is, as discussed above. * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 10/18/21 17:56, Stephen Frost wrote: > >> ... > >>I've argued for storing the nonce, but I don't quite see why would we need > >>integrity guarantees? > >> > >>AFAICS the threat model the patch aims to address is an attacker who can > >>observe the data (e.g. a low-privileged OS user), but can't modify the > >>files. Which seems like a reasonable model for shared environments. > > > >There are multiple threat models which we should be considering and > >that's why we may want to eventually add integrity. > > So what are these threat models? If we should be considering them it'd be > nice to have a description, explaining what capabilities must the attacker > have ... The first and simplest to consider is the basic "data at rest" threat model, where a hard drive is not properly wiped or a backup isn't properly encrypted, or a laptop or other removable media is stolen, etc. These are well addressed through XTS as in those cases what is needed is confidentiality, not integrity, as the attacker isn't able to modify the existing system in such cases. Another threat model to consider is if the attacker has read-only access to the data directory through, say, unix group read privileges or maybe the ability to monitor the traffic on the SAN, or the ability to read-only mount the LUN on to another system. This might be obtained by attacking a backup process where the system was configured to run physical backups using an unprivileged OS user who only has group read access to the cluster (and the necessary but non-superuser privleges in the database system to start/stop the backup), or various potential attacks at the storage layer. This is similar to the "data at rest" case above in that XTS works well to address this, but because the attacker would have ongoing access (rather than just one-time, such as in the first case), information such as which blocks are being changed inside of a given 8k page might be able to be determined and that could be useful information, though a point here: they would already be able to see clearly which 8k pages are being changed and which aren't, and there's not really any way for us to prevent that reasonably. 
As such, I'd argue that using XTS is reasonable and we can mitigate some of this concern by using the LSN in the tweak instead of just the block number as the 'plain64' option in dmcrypt does. That doing so would mean that we'd also be able to reasonably use the same IV for both XTS and AES-GCM-SIV, should we choose to implement that at some point down the road, is a nice perk to this approach. > My (perhaps naive) understanding is that the authentication / integrity > provides (partial) protection against attackers that may modify instance > data - modify files, etc. But I'd guess an attacker with such capability can > do various other (simpler) things to extract data. Say, modify the config to > load an extension that dumps keys from memory, or whatever. Another attack vector to consider is an attacker who is able to actively modify files on the system in some way. The specific files they are able to modify matter in such a case. There's no doubt that it's more difficult to address such an attack vector and it's unlikely we'd be able to provide a 100% solution in all cases, but that doesn't mean we should throw out the idea entirely. > So what's a plausible / practical threat model that would be mitigated by > the authenticated encryption? In a similar vein to above- consider a storage-level attacker who is able to gain read/write access to a particular volume. If that volume is used only for a tablespace, a TDE implementation which uses AES-GCM-SIV would go a long way towards protecting the system. We may even go so far as to encourage users to ensure that their 'main' PG data directory, where configuration, transaction log, et al, are stored be kept on a more secure (perhaps local) volume. This isn't a 100% solution, of course, due to the visibility map and the free space map being stored with the tables in the tablespaces, but attacks on those would be much more coarse and difficult to mount effectively. We should also be thinking about ways to address those other subsystems and improve the situation around them (and not just for encryption but also for detection of corruption) but trying to get all of that done in one patch or even one major release would make this a massive change that would almost certainly be rejected as too destablizing. > It'd be a bit silly to add complexity to allow AEAD, only to find out there > are ways around it. There are ways around it. There likely always will be. We need to be clear about what it provides and what it doesn't. We need to stop telling ourselves that the only answer is a 100% solution and therefore it's impossible to do. Users who care about these capabilities will understand that it's not 100% and they will still happily use it because it's better than 0% which is where we are today and is why they are going with other solutions. Yes, if it's trivial to get around then perhaps it's not much better than 0% and if that's the case then it doesn't make sense to do it, but none of what has been discussed here thus far has made me feel like either the XTS or the GCM-SIV approaches would be trivial to to circumvent for the threat models they're intended to address, though it certainly takes more care and more thought when we're trying to address someone who has write access to part of the system and that we need to be clear what is addressed and what isn't in all of these cases. Thanks, Stephen
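As a concrete sketch of "block number plus LSN in the tweak", the call could look roughly like this with OpenSSL's aes-256-xts; the exact tweak layout below is an assumption for illustration, not a settled design:

    #include <openssl/evp.h>
    #include <stdint.h>
    #include <string.h>

    #define BLCKSZ 8192

    /*
     * Sketch only: one possible way to build the 16-byte XTS tweak from the
     * block address plus the page LSN, then encrypt the page.  The tweak
     * layout here is an illustrative assumption.
     */
    static int
    encrypt_page_xts(const unsigned char *key,   /* 64 bytes for AES-256-XTS */
                     uint64_t lsn, uint32_t relfilenode, uint32_t blocknum,
                     const unsigned char *plain, unsigned char *cipher)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        unsigned char   tweak[16];
        int             len, ok;

        memcpy(tweak, &lsn, 8);              /* varies on most page changes */
        memcpy(tweak + 8, &relfilenode, 4);  /* distinguishes relations */
        memcpy(tweak + 12, &blocknum, 4);    /* distinguishes blocks */

        ok = ctx != NULL &&
            EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak) &&
            EVP_EncryptUpdate(ctx, cipher, &len, plain, BLCKSZ) &&
            EVP_EncryptFinal_ex(ctx, cipher + len, &len);

        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }

Since the tweak is only 16 bytes, whatever combination of LSN, relfilenode and block number is chosen has to fit in (or be reduced to) that space; the same inputs could later feed the nonce for AES-GCM-SIV if that mode were added.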
Greetings, * Sasasu (i@sasa.su) wrote: > On 2021/10/19 00:37, Robert Haas wrote: > >I think what we ought to be doing at > >this point is combining our efforts to try to get some things > >committed which make future work in this area easier - like that > >patch to preserve relfilenode and database OID, or maybe some patches > >to drive all of our I/O through a smaller number of code paths instead > >of having every different type of temporary file we write reinvent the > >wheel. > > A unified block-based I/O API sounds great. Has anyone tried to do this > before? It would be nice if the front-end tools could also use this API. The TDE patch from Cybertec did go down this route, but the API ended up being rather different, which meant a lot of changes in other parts of the system. If we can get a block-based temporary file method that maintains more-or-less the same API, that'd be great, but I'm not sure that we can really do so and I am not entirely convinced that we should make the TDE effort depend on an otherwise quite independent effort of making all temp file usage block based. > As there are so many threat models, I propose doing the TDE feature as a set > of hooks. Those hooks would sit on the critical path of I/O operations and > let an extension replace the I/O API; the extension would also be loaded > during initdb, in single-user mode, and in the front-end tools. > This sounds like using $LD_PRELOAD to replace pread(2) and pwrite(2), which > is widely used for folder-based encryption, but the hook would pass more > context (relfilenode, table OID, block size, and so on) down to the lower > layer; the hook API would look like object_access_hook. > Then implement the simplest AES-XTS and put it in contrib, and provide a > tool to deactivate AES-XTS so that PostgreSQL remains upgradeable. > > I think this is the most peaceful method: the GCM people will not reject it > just because it is XTS, and the XTS people will (maybe?) be satisfied with > the complexity. For performance, it is just one more long jump compared with > the current AES-XTS code. I agree with Robert- using hooks for this really isn't realistic. Where would you store the tag for GCM without changes in core, for starters..? Certainly wouldn't make sense to provide GCM only to throw the tag away. Even between XTS and GCM, to say nothing of other possible methods, there are going to be some serious differences that a single hook-based API wouldn't be able to address. Thanks, Stephen
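For reference, this is the kind of per-page trailer that authenticated encryption would need somewhere to live, along the lines of the reserve-space-at-initdb idea upthread; the struct and macro are hypothetical, and none of these names exist in core:

    #include <stdint.h>

    /*
     * Hypothetical layout only: one way a per-page trailer for authenticated
     * encryption could look if space were reserved at initdb time.
     */
    typedef struct PageEncryptionTrailer
    {
        uint8_t     iv[12];         /* nonce fed to AES-GCM / GCM-SIV */
        uint8_t     tag[16];        /* authentication tag */
        uint32_t    reserved;       /* padding / future use */
    } PageEncryptionTrailer;

    #define PageGetEncryptionTrailer(page, blcksz) \
        ((PageEncryptionTrailer *) \
         ((char *) (page) + (blcksz) - sizeof(PageEncryptionTrailer)))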
On 2021/10/20 02:54, Stephen Frost wrote: > I agree with Robert- using hooks for this really isn't realistic. OK, I agree that using hooks is too invasive. Just a whim, never mind. But if PG had a clear block-based I/O API, TDE would be much easier to understand. Security people may lack database knowledge, but they can understand block I/O. This would allow more people to join the PG community. On 2021/10/20 02:54, Stephen Frost wrote: > Where would you store the tag for GCM without changes in core? Adding a 32-bit reserved field (for GCM it is 28 bits) would be best. The data file size would increase by 0.048% (if BLCKSZ = 8KiB), which I think is acceptable even for users who do not use TDE, but it needs an on-disk format change. To do GCM without modifying anything in core, the lower layer could write out a key fork and fsync(2) the key fork with the same strategy as the main fork; this is crash-safe, and the consistency is ensured by WAL (it means wal_log_hints needs to be set to on). Or the lower layer could restructure the I/O request and insert one IV block per 2730 (= BLCKSZ/IV_SIZE) data blocks, similar to how _mdfd_getseg() in md.c splits files at 1GiB, with no perception in the upper layers. Both approaches can use a cache to reduce the performance downgrade. For WAL encryption, the Cybertec implementation is correct: we cannot write any extra data without adding a reserved field in core, because consistency cannot be guaranteed. If GCM is used for WAL encryption, the HMAC verification must be disabled. (* This only shows the possibility; it does not mean anyone should implement TDE that way. *)
Greetings, * Sasasu (i@sasa.su) wrote: > But if PG had a clear block-based I/O API, TDE would be much easier to understand. PG does have a block-based IO API, it's just not exposed as hooks. In particular, take a look at md.c, though perhaps you'd be more interested in the higher level bufmgr.c routines. For the specific places where encryption may hook in, looking at the DataChecksumsEnabled() call sites may be informative (there aren't that many of them). > Security people may lack database knowledge, but they can understand block > I/O. This would allow more people to join the PG community. We'd certainly welcome them. I don't think we're going to try to redesign our entire IO subsystem in the hopes that they'll show up though. > On 2021/10/20 02:54, Stephen Frost wrote: > > Where would you store the tag for GCM without changes in core? > > Adding a 32-bit reserved field (for GCM it is 28 bits) would be best. That's the idea that's been discussed, but the approach put forward is to do it in a manner that allows the same binaries to work with a TDE-enabled cluster and a non-TDE cluster which means two different formats on disk. This is still a pretty big deal and would require logical replication or pg_dump/restore to go from unencrypted to encrypted. > The data file size would increase by 0.048% (if BLCKSZ = 8KiB), which I > think is acceptable even for users who do not use TDE, but it needs an > on-disk format change. Breaking our ondisk format explicitly means that pg_upgrade won't work any longer and folks won't be able to do in-place upgrades. That's a pretty huge deal and it's something we've not done in over a decade. I doubt that's going to fly. > To do GCM without modifying anything in core, the lower layer could write > out a key fork and fsync(2) the key fork with the same strategy as the main > fork; this is crash-safe, and the consistency is ensured by WAL (it means > wal_log_hints needs to be set to on). > Or the lower layer could restructure the I/O request and insert one IV block > per 2730 (= BLCKSZ/IV_SIZE) data blocks, similar to how _mdfd_getseg() in > md.c splits files at 1GiB, with no perception in the upper layers. > Both approaches can use a cache to reduce the performance downgrade. Yes, using another fork for this is something that's been considered but it's not without its own drawbacks, in particular having to do another write and later fsync when a page changes. Further, none of this is necessary for XTS, but only for GCM. This is why it was put forward that GCM involves a lot more changes to the system and means that we won't be able to do things like binary replication to switch from an unencrypted to encrypted cluster. Those are good reasons to consider an XTS implementation first and then later, perhaps, implement GCM. > For WAL encryption, the Cybertec implementation is correct: we cannot write > any extra data without adding a reserved field in core, because consistency > cannot be guaranteed. If GCM is used for WAL encryption, the HMAC > verification must be disabled. What's the point of using GCM if we aren't going to actually verify the tag? Also, the Cybertec patch didn't add an extra reserved field to the page format, and it used CTR anyway.. Thanks, Stephen
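To make the DataChecksumsEnabled()/checksum-call-site analogy concrete, here is a rough sketch of where an encryption call could slot into the block write path; it assumes the usual PostgreSQL headers, and data_encryption_enabled and encrypt_page() are stand-ins rather than real symbols:

    /*
     * Illustrative only: mirrors how PageSetChecksumCopy() is used on the
     * write path today (see bufmgr.c), with a hypothetical encryption step
     * added.  Assumes PostgreSQL's storage/bufpage.h et al.
     */
    static char *
    prepare_block_for_write(char *page, BlockNumber blocknum)
    {
        char   *towrite;

        /* checksum is computed over the plaintext copy, as today */
        towrite = PageSetChecksumCopy((Page) page, blocknum);

        if (data_encryption_enabled)    /* analogous to DataChecksumsEnabled() */
            towrite = encrypt_page(towrite, blocknum);  /* hypothetical */

        return towrite;
    }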
On 2021/10/20 20:24, Stephen Frost wrote: > PG does have a block-based IO API, it's just not exposed as hooks. In > particular, take a look at md.c, though perhaps you'd be more interested > in the higher level bufmgr.c routines. For the specific places where > encryption may hook in, looking at the DataChecksumsEnabled() call sites > may be informative (there aren't that many of them). md.c is great, easy to understand. but PG does not have a unified API. There has many unexpected pread()/pwrite() in many corners. md.c only for heap table, bufmgr.c only for a buffered heap table. eg: XLogWrite() looks like a block API, but is a range write. equivalent to the append(2) eg: ALTER DATABASE SET TABLESPACE , the movedb() call. use copy_file() on heap table. which is just pread() pwrite() with 8*BLCKSZ. eg: all front-end tools use pread() to read heap table. in particular, pg_rewind write heap table by offset. eg: in contrib, pg_standby use system("cp") to copy WAL. On 2021/10/20 20:24, Stephen Frost wrote: > Breaking our ondisk format explicitly means that pg_upgrade won't work > any longer and folks won't be able to do in-place upgrades. That's a > pretty huge deal and it's something we've not done in over a decade. > I doubt that's going to fly. I completely agree. On 2021/10/20 20:24, Stephen Frost wrote: > Yes, using another fork for this is something that's been considered but > it's not without its own drawbacks, in particular having to do another > write and later fsync when a page changes. > > Further, none of this is necessary for XTS, but only for GCM. This is > why it was put forward that GCM involves a lot more changes to the > system and means that we won't be able to do things like binary > replication to switch from an unencrypted to encrypted cluster. Those > are good reasons to consider an XTS implementation first and then later, > perhaps, implement GCM. same as Robert Haas. I wish PG can do some infrastructure first. add more abstract layers like md.c (maybe a block-based API with ondisk format version field). so people can dive in without understanding the things which isolated by the abstract layer. On 2021/10/20 20:24, Stephen Frost wrote: > What's the point of using GCM if we aren't going to actually verify the > tag? Also, the Cybertec patch didn't add an extra reserved field to the > page format, and it used CTR anyway.. Oh, I am wrong, Cybertec patch can not use XTS, because WAL may not be aligned to 16bytes. for WAL need a stream cipher. The CTR implement is still correct. CTR with hash(offset) as IV is basically equal to XTS. if use another AES key to encrypt the hash(offset), and block size is 16bytes it is XTS. The point is that can not save random IV for WAL without adding a reserved field, no matter use GCM or CTR. Because WAL only does append to the end, using CTR for WAL is safer than using XTS for heap table. If you want more security for WAL encryption, add HKDF[1]. [1]: https://en.wikipedia.org/wiki/HKDF
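For reference, here is a simplified sketch of what the XTS construction does per 16-byte block (following IEEE P1619), which is where it differs from a plain CTR construction with an encrypted offset as IV; no ciphertext stealing, no error handling, illustration only:

    #include <openssl/evp.h>
    #include <stddef.h>

    /*
     * Simplified sketch of the XTS block structure (IEEE P1619):
     *   T0 = AES_K2(tweak);  Cj = AES_K1(Pj XOR Tj) XOR Tj;  Tj+1 = Tj * alpha
     */
    static void
    aes_ecb_block(const unsigned char *key16, const unsigned char *in,
                  unsigned char *out)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int len;

        EVP_EncryptInit_ex(ctx, EVP_aes_128_ecb(), NULL, key16, NULL);
        EVP_CIPHER_CTX_set_padding(ctx, 0);
        EVP_EncryptUpdate(ctx, out, &len, in, 16);
        EVP_CIPHER_CTX_free(ctx);
    }

    /* multiply the running tweak by alpha in GF(2^128), P1619 convention */
    static void
    gf_mul_alpha(unsigned char t[16])
    {
        unsigned char carry = 0;

        for (int i = 0; i < 16; i++)
        {
            unsigned char msb = t[i] >> 7;

            t[i] = (unsigned char) ((t[i] << 1) | carry);
            carry = msb;
        }
        if (carry)
            t[0] ^= 0x87;
    }

    static void
    xts_encrypt_sketch(const unsigned char *key1, const unsigned char *key2,
                       const unsigned char tweak[16],
                       const unsigned char *plain, unsigned char *cipher,
                       size_t nblocks)
    {
        unsigned char t[16], x[16];

        aes_ecb_block(key2, tweak, t);            /* tweak encrypted with K2 */
        for (size_t j = 0; j < nblocks; j++)
        {
            for (int i = 0; i < 16; i++)
                x[i] = plain[j * 16 + i] ^ t[i];  /* XOR the tweak in before ... */
            aes_ecb_block(key1, x, x);
            for (int i = 0; i < 16; i++)
                cipher[j * 16 + i] = x[i] ^ t[i]; /* ... and after the block cipher */
            gf_mul_alpha(t);                      /* each block gets a different tweak */
        }
    }

The per-block tweak progression and the XOR applied before and after the block cipher are the parts a CTR construction does not have.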
Greetings, * Sasasu (i@sasa.su) wrote: > On 2021/10/20 20:24, Stephen Frost wrote: > > PG does have a block-based IO API, it's just not exposed as hooks. In > > particular, take a look at md.c, though perhaps you'd be more interested > > in the higher level bufmgr.c routines. For the specific places where > > encryption may hook in, looking at the DataChecksumsEnabled() call sites > > may be informative (there aren't that many of them). > > md.c is great, easy to understand. but PG does not have a unified API. There > has many unexpected pread()/pwrite() in many corners. md.c only for heap > table, bufmgr.c only for a buffered heap table. There's certainly other calls out there, yes. > eg: XLogWrite() looks like a block API, but is a range write. equivalent to > the append(2) XLog is certainly another thing that has to be dealt with, of course, but I don't see us trying to shoehorn that into using md.c somehow. > eg: ALTER DATABASE SET TABLESPACE , the movedb() call. use copy_file() on > heap table. which is just pread() pwrite() with 8*BLCKSZ. > eg: all front-end tools use pread() to read heap table. in particular, > pg_rewind write heap table by offset. > eg: in contrib, pg_standby use system("cp") to copy WAL. None of these are actually working with or changing the data though, they're just copying it. I don't think we'd actually want these to decrypt and reencrypt the data. > On 2021/10/20 20:24, Stephen Frost wrote: > > Breaking our ondisk format explicitly means that pg_upgrade won't work > > any longer and folks won't be able to do in-place upgrades. That's a > > pretty huge deal and it's something we've not done in over a decade. > > I doubt that's going to fly. > > I completely agree. Great. > On 2021/10/20 20:24, Stephen Frost wrote: > > Yes, using another fork for this is something that's been considered but > > it's not without its own drawbacks, in particular having to do another > > write and later fsync when a page changes. > > > > Further, none of this is necessary for XTS, but only for GCM. This is > > why it was put forward that GCM involves a lot more changes to the > > system and means that we won't be able to do things like binary > > replication to switch from an unencrypted to encrypted cluster. Those > > are good reasons to consider an XTS implementation first and then later, > > perhaps, implement GCM. > > same as Robert Haas. I wish PG can do some infrastructure first. add more > abstract layers like md.c (maybe a block-based API with ondisk format > version field). so people can dive in without understanding the things which > isolated by the abstract layer. I really don't think this is necessary. Similar to PageSetChecksumCopy and PageSetChecksumInplace, I'm sure we would have functions which are called in the appropriate spots to do encryption (such as 'encrypt_page' and 'encrypt_block' in the Cybertec patch) and folks could review those in relative isolation to the rest. Dealing with blocks in PG is already pretty well handled, the infrastructure that needs to be added is around handling temporary files and is being actively worked on ... if we could move past this debate around if we should be adding support for XTS or if only GCM-SIV would be accepted. > On 2021/10/20 20:24, Stephen Frost wrote: > > What's the point of using GCM if we aren't going to actually verify the > > tag? Also, the Cybertec patch didn't add an extra reserved field to the > > page format, and it used CTR anyway.. 
> > Oh, I am wrong, Cybertec patch can not use XTS, because WAL may not be > aligned to 16bytes. for WAL need a stream cipher. The CTR implement is still > correct. No, the CTR approach isn't great because, as has been discussed quite a bit already, using the LSN as the IV means that different data might be re-encrypted with the same LSN and that's not an acceptable thing to have happen with CTR. > CTR with hash(offset) as IV is basically equal to XTS. if use another AES > key to encrypt the hash(offset), and block size is 16bytes it is XTS. I don't understand why we're talking about CTR+other-stuff. Maybe if you use CTR and then do other things then it's equivilant in some fashion to XTS ... but then it's not CTR anymore and we shouldn't be calling it that. Saying that we should do "CTR+other stuff" (which happens to make it equivilant to XTS) instead of just saying we should use "XTS" is very confusing and further means that we're starting down the path of trying to come up with our own hack on existing encryption schemes, and regardless of who shows up on this list to claim that doing so makes sense and is "right", I'm going to be extremely skeptical. > The point is that can not save random IV for WAL without adding a reserved > field, no matter use GCM or CTR. Yes, it's correct that we can't use a random IV for the WAL without figuring out how to keep track of that random IV. Thankfully, for WAL (unlike heap and index blocks) we don't really have that issue- we hopefully aren't going to write different WAL records at the same LSN and so using the LSN there should be alright. There's some odd edge cases around this too (split-brain situations in particular where one of the systems does crash recovery and doesn't actually promote and therefore stays on the same timeline), but that kind of thing ends up breaking other things too (WAL archiving, as an example) and isn't really something we can support. Further, I figure we'll tell people to use different keys for different systems anyway to avoid this risk. > Because WAL only does append to the end, using CTR for WAL is safer than > using XTS for heap table. If you want more security for WAL encryption, add > HKDF[1]. I don't follow how you're making these comparisons as to what is "safer" for which when talking about two different encryption methodologies and two very different systems (the heap vs. WAL) or if you're actually suggesting something here. We've discussed at length how using CTR for heap isn't a good idea even if we're using the LSN for the IV, while if we use XTS then we don't have the issues that CTR has with IV re-use and using the LSN (plus block number and perhaps other things). Nothing in what has been discussed here has really changed anything there that I can see and so it's unclear to me why we continue to go round and round with it. Thanks, Stephen
On 2021/10/22 01:28, Stephen Frost wrote: > None of these are actually working with or changing the data though, > they're just copying it. I don't think we'd actually want these to > decrypt and reencrypt the data. OK, but why ALTER TABLE SET TABLESPACE use smgrread() and smgrextend() instead of copy_file(). TDE needs to modify these code paths, and make the patch bigger. On 2021/10/22 01:28, Stephen Frost wrote: > No, the CTR approach isn't great because, as has been discussed quite a > bit already, using the LSN as the IV means that different data might be > re-encrypted with the same LSN and that's not an acceptable thing to > have happen with CTR. On 2021/10/22 01:28, Stephen Frost wrote: > Thankfully, for WAL > (unlike heap and index blocks) we don't really have that issue- we > hopefully aren't going to write different WAL records at the same LSN > and so using the LSN there should be alright. On 2021/10/22 01:28, Stephen Frost wrote: > We've discussed at length how using CTR for heap isn't a good idea even > if we're using the LSN for the IV, while if we use XTS then we don't > have the issues that CTR has with IV re-use and using the LSN (plus > block number and perhaps other things). Nothing in what has been > discussed here has really changed anything there that I can see and so > it's unclear to me why we continue to go round and round with it. I am not re-discuss using CTR for heap table. I mean use some CTR-like algorithm *only* for WAL encryption. My idea is exactly the same when you are typing "we hopefully aren't going to write different WAL records at the same LSN and so using the LSN there should be alright." The point of disagreement between you and me is only on the block-based API. On 2021/10/22 01:28, Stephen Frost wrote: > it's unclear to me why we continue to go round and round with it. same to me. I am monitoring this thread about 9 months, watching yours discuss key management/CBC/CRT/GCM round and round.
Greetings, * Sasasu (i@sasa.su) wrote: > On 2021/10/22 01:28, Stephen Frost wrote: > >None of these are actually working with or changing the data though, > >they're just copying it. I don't think we'd actually want these to > >decrypt and reencrypt the data. > > OK, but why ALTER TABLE SET TABLESPACE use smgrread() and smgrextend() > instead of copy_file(). > TDE needs to modify these code paths, and make the patch bigger. Tables and databases are handled differently, yes. With ALTER TABLE SET TABLESPACE, we're allocating a new refilenode and WAL'ing the table as FPIs. What happens with databases is fundamentally different- no one is allowed to be connected to the database being moved and we write a single 'database changed tablespace' record in the WAL for this case. When it comes to TDE, this probably is actually helpful as we're going to likely want the relfilenode to be included as part of the IV. > On 2021/10/22 01:28, Stephen Frost wrote: > > No, the CTR approach isn't great because, as has been discussed quite a > > bit already, using the LSN as the IV means that different data might be > > re-encrypted with the same LSN and that's not an acceptable thing to > > have happen with CTR. > On 2021/10/22 01:28, Stephen Frost wrote: > > Thankfully, for WAL > > (unlike heap and index blocks) we don't really have that issue- we > > hopefully aren't going to write different WAL records at the same LSN > > and so using the LSN there should be alright. > On 2021/10/22 01:28, Stephen Frost wrote: > > We've discussed at length how using CTR for heap isn't a good idea even > > if we're using the LSN for the IV, while if we use XTS then we don't > > have the issues that CTR has with IV re-use and using the LSN (plus > > block number and perhaps other things). Nothing in what has been > > discussed here has really changed anything there that I can see and so > > it's unclear to me why we continue to go round and round with it. > > I am not re-discuss using CTR for heap table. I mean use some CTR-like > algorithm *only* for WAL encryption. My idea is exactly the same when you > are typing "we hopefully aren't going to write different WAL records at the > same LSN and so using the LSN there should be alright." I don't like the idea of "CTR-like". What's wrong with using CTR for WAL encryption? Based on the available information, that seems like the exact use-case for CTR. > The point of disagreement between you and me is only on the block-based API. I'm glad to hear that, at least. Thanks, Stephen
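A sketch of what CTR for WAL with an LSN-derived IV might look like; the IV layout (timeline plus starting LSN, with the low bytes left free for the CTR block counter) and the chunk-at-a-time interface are assumptions for illustration only:

    #include <openssl/evp.h>
    #include <stdint.h>
    #include <string.h>

    /*
     * Sketch of CTR for WAL, assuming the start LSN of the chunk never
     * repeats under a given key.  Illustration only.
     */
    static int
    encrypt_wal_chunk(const unsigned char *key,          /* 32 bytes */
                      uint32_t timeline, uint64_t start_lsn,
                      const unsigned char *plain, unsigned char *cipher,
                      size_t len)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        unsigned char   iv[16] = {0};
        int             outlen, ok;

        memcpy(iv, &timeline, 4);
        memcpy(iv + 4, &start_lsn, 8);
        /* low 4 bytes left zero: consumed as the CTR block counter */

        ok = ctx != NULL &&
            EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv) &&
            EVP_EncryptUpdate(ctx, cipher, &outlen, plain, (int) len) &&
            EVP_EncryptFinal_ex(ctx, cipher + outlen, &outlen);

        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }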
On Mon, Oct 18, 2021 at 11:56:03AM -0400, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > > On 10/15/21 21:22, Stephen Frost wrote: > > >Now, to address the concern around re-encrypting a block with the same > > >key+IV but different data and leaking what parts of the page changed, I > > >do think we should use the LSN and have it change regularly (including > > >unlogged tables) but that's just because it's relatively easy for us to > > >do and means an attacker wouldn't be able to tell what part of the page > > >changed when the LSN was also changed. That was also recommended by > > >NIST and that's a pretty strong endorsement also. > > > > Not sure - it seems a bit weird to force LSN change even in cases that don't > > generate any WAL. I was not following the encryption thread and maybe it was > > discussed/rejected there, but I've always imagined we'd have a global nonce > > generator (similar to a sequence) and we'd store it at the end of each > > block, or something like that. > > The 'LSN' being referred to here isn't the regular LSN that is > associated with the WAL but rather the separate FakeLSN counter which we > already have. I wasn't suggesting having the regular LSN change in > cases that don't generate WAL. Yes, my original patch created dummy WAL records for dummy LSNs but that is no longer needed with XTS. > > I'm not very convinced that using the LSN for any of this is a good > > idea. Something that changes most of the time but not all the time > > seems more like it could hurt by masking fuzzy thinking more than it > > helps anything. > > This argument doesn't come across as very strong at all to me, > particularly when we have explicit recommendations from NIST that having > the IV vary more is beneficial. While this would be using the LSN, the > fact that the LSN changes most of the time but not all of the time isn't > new and is something we already have to deal with. I'd think we'd > address the concern about mis-thinking around how this works by > providing a README and/or an appropriate set of comments around what's > being done and why. Agreed. I think we would need to document when we reencrypt a page with the same LSN, and of course write-based attacks. > > Do we think knowing which 16-byte blocks on an 8k page change would leak > > useful information? If so, we should use the LSN and just accept that > > some cases might leak as described above. If we don't care, then we can > > skip the use of the LSN and simplify the patch. > > While there may not be an active attack against PG that leverages such a > leak, I have a hard time seeing why we would intentionally design this > in when we have a great option that's directly available to us and > doesn't cause such a leak with nearly such regularity as not using the > LSN would, and also follows recommendations of using XTS from NIST. Agreed. > > I consider this a checkbox feature and making it too complex will cause > > it to be rightly rejected. > > Presuming that 'checkbox feature' here means "we need it to please > $someone but no one will ever use it" or something along those lines, > this is very clearly not the case and therefore we shouldn't be > describing it or treating it as such. Even if the meaning here is > "there's other ways people could get this capability" the reality is > that those other methods are simply not always available and in those > cases, people will choose to not use PostgreSQL. 
Nearly every other > database system which we might compare ourselves to has a solution in > this area and people actively use those solutions in a lot of > deployments. I think people will use this feature, but I called it a 'checkbox feature' because they usually are not looking for a complex or flexible feature, but rather something that is simple to set up and effective. > > And if PostgreSQL is using XTS, there is no different with dm-encrypt. > > The user can use dm-encrypt directly. > > dm-encrypt is not always an option and it doesn't actually address the > threat-model that Tomas brought up here anyway, as it would be below the > level that the low-privileged OS user would be looking at. That's not > the only threat model to consider, but it is one which could potentially > be addressed by either XTS or AES-GCM-SIV. There are threat models > which dm-crypt would address, of course, such as data-at-rest (hard > drive theft, improper disposal of storage media, backups which don't > have their own encryption, etc), but, again, dm-crypt isn't always an > option that is available and so I don't agree that we should throw this > out just because dm-crypt exists and may be useable in some cases. I actually think a Postgres integrity-check feature would need to create an abstraction layer on top of all writes to PGDATA and tablespaces so the filesystem would look unencrypted to Postgres. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, Oct 19, 2021 at 02:44:26PM -0400, Stephen Frost wrote: > There are ways around it. There likely always will be. We need to be > clear about what it provides and what it doesn't. We need to stop > telling ourselves that the only answer is a 100% solution and therefore > it's impossible to do. Users who care about these capabilities will > understand that it's not 100% and they will still happily use it because > it's better than 0% which is where we are today and is why they are > going with other solutions. Yes, if it's trivial to get around then > perhaps it's not much better than 0% and if that's the case then it > doesn't make sense to do it, but none of what has been discussed here > thus far has made me feel like either the XTS or the GCM-SIV approaches > would be trivial to circumvent for the threat models they're intended > to address, though it certainly takes more care and more thought when > we're trying to address someone who has write access to part of the > system and that we need to be clear what is addressed and what isn't in > all of these cases. Stephen, your emails on this thread have been very helpful and on-topic. I think the distinction above is that it is useful to fully protect against some attack types, even if we don't protect against all attack types. For example, if we protect 100% against read attacks, it doesn't mean that gets reduced to 50% because we don't protect against write attacks --- we are still 100% read-protected and 0% write-protected. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Mon, Oct 18, 2021 at 12:37:39PM -0400, Robert Haas wrote: > > Thus, to me, it doesn't seem worth going down the XTS route, just to > > temporarily save a bit of implementation effort. We'll have to endure that > > pain anyway. > > I agree up to a point, but I do also kind of feel like we should be > leaving it up to whoever is working on an implementation to decide > what they want to implement. I don't particularly like this discussion Uh, our TODO has this list: Desirability -> Design -> Implement -> Test -> Review -> Commit I think we have to agree on Desirability and Design before anyone starts work since it is more likely a patch will be rejected without this. > where it feels like people are trying to tell other people what they > have to do because "the community has decided X." It's pretty clear > that there are multiple opinions here, and I don't really see any of > them to be without merit, nor do I see why Bruce or Stephen or you or > anyone else should get to say "what the community has decided" in the > absence of a clear consensus. I don't see anyone saying we have agreed on anything, but I do hear people say they are willing to work on some things, and not others. > I do really like the idea of using AES-GCM-SIV not because I know > anything about it, but because the integrity checking seems cool, and > storing the nonce seems like it would improve security. However, based > on what I know now, I would not vote to reject an XTS-based patch and, > as Stephen and Bruce have said, maybe with the right design it permits > upgrades from non-encrypted clusters to encrypted clusters. I'm > actually kind of doubtful about that, because that seems to require > some pretty specific and somewhat painful implementation decisions. > For example, I think if your solution for rotating keys is to shut > down the standby, re-encrypt it with a new key, start it up again, and > fail over to it, then you probably can't ever do key rotation in any > other way. The keys now have to be non-WAL-logged so that the standby > can be different, which means you can't add a new key on the master > and run around re-encrypting everything with it, WAL-logging those > changes as you go. Now I realize that implementing that is really > challenging anyway so maybe some people wouldn't like to go that way, > but then maybe other people would. Another thing you probably can't do > in this model is encrypt different parts of the database with > different keys, because how would you keep track of that? Certainly > not in the system catalogs, if none of that can show up in the WAL > stream. The design is to have a heap/index key and a WAL key. You create a binary replica that uses a different heap/index key but the same WAL key, switch over to it, and then change the WAL key. > But, you know, still: if somebody showed up with a fully-working XTS > patch with everything in good working order, I don't see that we have > enough evidence to reject it just because it's XTS. And I would hope > that the people favoring XTS would not vote to reject a fully working > GCM patch just because it's GCM. I think what we ought to be doing at I don't think that would happen, but I do think patch size, code maintenance, and upgradability would be reasonable considerations.
> this point is combining our efforts to try to get some things > committed which make future work in this area easier - like that > patch to preserve relfilenode and database OID, or maybe some patches > to drive all of our I/O through a smaller number of code paths instead > of having every different type of temporary file we write reinvent the > wheel. These discussions about what encryption type we ought to use > are useful for ruling out options that we know are bad, but beyond > that I'm not sure they have much value. AES in any mode could seem > like a much less safe choice by the time we get a committed feature > here than it does today - even if somehow that were to happen for v15. > I expect there are people out there trying to break it even as I write > these words, and it seems likely that they will eventually succeed, > but as to when, who can say? Yes, we should start on things we know we need, but we will have to have these discussions until we have a desirability and design that most people agree on. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Fri, Oct 22, 2021 at 11:36:37AM -0400, Stephen Frost wrote: > > I am not re-discuss using CTR for heap table. I mean use some CTR-like > > algorithm *only* for WAL encryption. My idea is exactly the same when you > > are typing "we hopefully aren't going to write different WAL records at the > > same LSN and so using the LSN there should be alright." > > I don't like the idea of "CTR-like". What's wrong with using CTR for > WAL encryption? Based on the available information, that seems like the > exact use-case for CTR. Agreed. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
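As a sketch of why WAL looks like the natural CTR case, assuming OpenSSL's EVP interface and an IV derived from the WAL location as discussed above (the IV layout and function names below are illustrative only; a real design would also have to decide how to handle things like timelines):

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

/*
 * Illustrative IV layout: the chunk's starting LSN in the high-order bytes,
 * the low-order bytes left zero for the per-16-byte counter that CTR mode
 * increments internally.  On the premise discussed above - a given WAL
 * location is only ever written once - the same (key, IV) pair is never
 * reused for different data.
 */
static void
build_wal_iv(unsigned char iv[16], uint64_t start_lsn)
{
    memset(iv, 0, 16);
    memcpy(iv, &start_lsn, sizeof(start_lsn));
}

static int
encrypt_wal_chunk(unsigned char *out, const unsigned char *in, int len,
                  const unsigned char key[32], uint64_t start_lsn)
{
    unsigned char iv[16];
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int         outlen = 0, ok = 0;

    build_wal_iv(iv, start_lsn);
    if (ctx != NULL &&
        EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv) == 1 &&
        EVP_EncryptUpdate(ctx, out, &outlen, in, len) == 1)
        ok = 1;

    EVP_CIPHER_CTX_free(ctx);
    return ok;
}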
On Tue, Oct 19, 2021 at 02:44:26PM -0400, Stephen Frost wrote: > Another threat model to consider is if the attacker has read-only access > to the data directory through, say, unix group read privileges or maybe > the ability to monitor the traffic on the SAN, or the ability to > read-only mount the LUN on to another system. This might be obtained by > attacking a backup process where the system was configured to run > physical backups using an unprivileged OS user who only has group read > access to the cluster (and the necessary but non-superuser privileges in > the database system to start/stop the backup), or various potential > attacks at the storage layer. This is similar to the "data at rest" > case above in that XTS works well to address this, but because the > attacker would have ongoing access (rather than just one-time, such as > in the first case), information such as which blocks are being changed > inside of a given 8k page might be able to be determined and that could > be useful information, though a point here: they would already be able > to see clearly which 8k pages are being changed and which aren't, and > there's not really any way for us to prevent that reasonably. As such, > I'd argue that using XTS is reasonable and we can mitigate some of this > concern by using the LSN in the tweak instead of just the block number > as the 'plain64' option in dmcrypt does. That doing so would mean that That is an excellent point, and something we should mention in our documentation --- the fact that changes will be visible at 8k granularity, and in certain specified cases, at 16-byte granularity as well. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Mon, Oct 18, 2021 at 12:37:39PM -0400, Robert Haas wrote: > I do really like the idea of using AES-GCM-SIV not because I know > anything about it, but because the integrity checking seems cool, and ---------- > storing the nonce seems like it would improve security. However, based Frankly, I think we need to be cautious about doing anything related to security for "cool" motivations. (This might be how OpenSSL became such a mess.) For non-security features, you can often add a few lines of code to enable some cool use-case. For security features, you have to block its targeted attack methods fully or it is useless. (It doesn't need to block all attack methods.) To fully block attack methods, security features must be thoroughly designed and all potential interactions must be researched. When adding non-security Postgres features, cool features can be more easily implemented because they are built on the solid foundation of Postgres. For security features, you have to assume that attacks can come from anywhere, so the foundation is unclear and caution is wise. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, Oct 19, 2021 at 02:54:56PM -0400, Stephen Frost wrote: > * Sasasu (i@sasa.su) wrote: > > A unified block-based I/O API sounds great. Has anyone tried to do this > > before? It would be nice if the front-end tools could also use these API. > > The TDE patch from Cybertec did go down this route, but the API ended up > being rather different which menat a lot of changes in other parts of > the system. If we can get a block-based temporary file method that > maintains more-or-less the same API, that'd be great, but I'm not sure > that we can really do so and I am not entirely convinced that we should > make the TDE effort depend on an otherwise quite independent effort of > making all temp files usage be block based. Uh, I thought people felt the Cybertec patch was too large and that a unified API for temporary file I/O-encryption was a requirement. Would a CTR-streaming-encryption API for temporary tables be easier to implement? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Thu, 21/10/2021 at 13:28 -0400, Stephen Frost wrote: > Greetings, > > I really don't think this is necessary. Similar to PageSetChecksumCopy > and PageSetChecksumInplace, I'm sure we would have functions which are > called in the appropriate spots to do encryption (such as 'encrypt_page' > and 'encrypt_block' in the Cybertec patch) and folks could review those > in relative isolation to the rest. Dealing with blocks in PG is already > pretty well handled, the infrastructure that needs to be added is around > handling temporary files and is being actively worked on ... if we could > move past this debate around if we should be adding support for XTS or > if only GCM-SIV would be accepted. > > ..... > > No, the CTR approach isn't great because, as has been discussed quite a > bit already, using the LSN as the IV means that different data might be > re-encrypted with the same LSN and that's not an acceptable thing to > have happen with CTR. > > ..... > > We've discussed at length how using CTR for heap isn't a good idea even > if we're using the LSN for the IV, while if we use XTS then we don't > have the issues that CTR has with IV re-use and using the LSN (plus > block number and perhaps other things). Nothing in what has been > discussed here has really changed anything there that I can see and so > it's unclear to me why we continue to go round and round with it. > Instead of debating XTS vs GCM-SIV I'd suggest Google's Adiantum [1][2][3][4]. It was explicitly created to solve the large-block encryption issue - disk encryption. It is used to encrypt 4kb as a whole, but in fact has no (practical) limit on block size: it needs near-zero modification to encrypt 1kb, 8kb, or 32kb. It has the benefits of both XTS and GCM-SIV: - like GCM-SIV, every bit of cipher text depends on every bit of plain text - therefore, like GCM-SIV, it is resistant to IV reuse: it is safe to reuse the LSN+reloid+blocknumber tuple as the IV even for hint-bit changes, since every bit of the block will change. - like XTS, it doesn't need to change the plain text format and doesn't need an additional Nonce/Auth Code. Adiantum stands on "giants' shoulders": AES, Chacha and Poly1305. It has been included in the Linux kernel since 5.0 [5]. The Adiantum/HPolyC approach (hash + cipher + stream-cipher + hash) could be used with other primitives as well. For example, Chacha12 could be replaced with AES-GCM or AES-XTS, with the IV derived from the hash + cipher. [1] https://security.googleblog.com/2019/02/introducing-adiantum-encryption-for.html [2] https://en.wikipedia.org/wiki/Adiantum_(cipher) [3] https://tosc.iacr.org/index.php/ToSC/article/view/7360 [4] https://github.com/google/adiantum [5] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=059c2a4d8e164dccc3078e49e7f286023b019a98 ------- regards Yura Sokolov y.sokolov@postgrespro.ru funny.falcon@gmail.com
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, Oct 19, 2021 at 02:54:56PM -0400, Stephen Frost wrote: > > * Sasasu (i@sasa.su) wrote: > > > A unified block-based I/O API sounds great. Has anyone tried to do this > > > before? It would be nice if the front-end tools could also use these API. > > > > The TDE patch from Cybertec did go down this route, but the API ended up > > being rather different which menat a lot of changes in other parts of > > the system. If we can get a block-based temporary file method that > > maintains more-or-less the same API, that'd be great, but I'm not sure > > that we can really do so and I am not entirely convinced that we should > > make the TDE effort depend on an otherwise quite independent effort of > > making all temp files usage be block based. > > Uh, I thought people felt the Cybertec patch was too large and that a > unified API for temporary file I/O-encryption was a requirement. Would > a CTR-steaming-encryption API for temporary tables be easier to > implement? Having a unified API for temporary file I/O (that could then be extended to provide encryption) would definitely help with reducing the size of a TDE patch. The approach used in the Cybertec patch was to make temporary file access block based, but the way that was implemented was with an API different from pread/pwrite and that meant changing pretty much all of the call sites for temporary file access, which naturally resulted in changes in a lot of otherwise unrelated code. There was an argument put forth that a block-based API for temporary file access would generally be good as it would mean fewer syscalls. If we can get behind that and make it happen in (relatively) short order then we'd certainly be better off when it comes to implementing TDE which also deals with temporary files. I'm a bit concerned about that approach due to the changes needed but I'm also not against it. I do think that an API which was more-or-less the same as what's used today would be a smaller change and therefore might be easier to get in and that it'd also make a TDE patch smaller. Perhaps both could be accomplished (an API that's similar to today, but the actual file access being block-based). Either way, we should get that unification done in the core code first as an independent effort. That's what I hope the Cybertec folks have had a chance to work on. As for the specific encryption method to use, using CTR would be simpler as it doesn't require access to be block-based, though we would need to make sure to not re-use the IV across any of the temporary files being created (potentially concurrently). Probably not that hard to do but just something to make sure we do. Of course, if we arrange for block-based access then we could use XTS or perhaps GCM/GCM-SIV if we wanted to. Thanks, Stephen
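One simple way to get the "don't reuse the IV across (possibly concurrent) temporary files" property mentioned above, sketched with made-up names; a real implementation would draw the nonce from a shared atomic counter rather than a static variable:

#include <stdint.h>
#include <string.h>

/* Hypothetical per-temporary-file crypto state. */
typedef struct TempFileCrypto
{
    uint64_t    file_nonce;     /* unique per temporary file */
} TempFileCrypto;

/* In a real backend this would be a counter in shared memory. */
static uint64_t next_temp_file_nonce = 1;

static void
temp_file_crypto_init(TempFileCrypto *tfc)
{
    tfc->file_nonce = next_temp_file_nonce++;
}

/*
 * CTR IV layout: bytes 0..7 hold the per-file nonce, bytes 8..11 the block
 * number within the file, and bytes 12..15 stay zero for the counter CTR
 * increments per 16 bytes of keystream.  Two different files, or two
 * different blocks of the same file, therefore never share an IV.
 */
static void
build_temp_file_iv(unsigned char iv[16], const TempFileCrypto *tfc,
                   uint32_t blockno)
{
    memset(iv, 0, 16);
    memcpy(iv + 0, &tfc->file_nonce, sizeof(tfc->file_nonce));
    memcpy(iv + 8, &blockno, sizeof(blockno));
}

If temporary file access does end up block-based, the same per-file-nonce-plus-block-number value could just as easily serve as an XTS tweak instead of a CTR IV.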
Greetings, * Yura Sokolov (y.sokolov@postgrespro.ru) wrote: > В Чт, 21/10/2021 в 13:28 -0400, Stephen Frost пишет: > > I really don't think this is necessary. Similar to PageSetChecksumCopy > > and PageSetChecksumInplace, I'm sure we would have functions which are > > called in the appropriate spots to do encryption (such as 'encrypt_page' > > and 'encrypt_block' in the Cybertec patch) and folks could review those > > in relative isolation to the rest. Dealing with blocks in PG is already > > pretty well handled, the infrastructure that needs to be added is around > > handling temporary files and is being actively worked on ... if we could > > move past this debate around if we should be adding support for XTS or > > if only GCM-SIV would be accepted. > > > > ..... > > > > No, the CTR approach isn't great because, as has been discussed quite a > > bit already, using the LSN as the IV means that different data might be > > re-encrypted with the same LSN and that's not an acceptable thing to > > have happen with CTR. > > > > ..... > > > > We've discussed at length how using CTR for heap isn't a good idea even > > if we're using the LSN for the IV, while if we use XTS then we don't > > have the issues that CTR has with IV re-use and using the LSN (plus > > block number and perhaps other things). Nothing in what has been > > discussed here has really changed anything there that I can see and so > > it's unclear to me why we continue to go round and round with it. > > > > Instead of debatting XTS vs GCM-SIV I'd suggest Google's Adiantum [1][2] > [3][4]. That sounds like a great thing to think about adding ... after we get something in that's based on XTS. > It is explicitely created to solve large block encryption issue - disk > encryption. It is used to encrypt 4kb at whole, but in fact has no > (practical) limit on block size: it is near-zero modified to encrypt 1kb > or 8kb or 32kb. > > It has benefits of both XTS and GCM-SIV: > - like GCM-SIV every bit of cipher text depends on every bit of plain text > - therefore like GCM-SIV it is resistant to IV reuse: it is safe to reuse > LSN+reloid+blocknumber tuple as IV even for hint-bit changes since every > block's bit will change. The advantage of GCM-SIV is that it provides integrity as well as confidentiality. > - like XTS it doesn't need to change plain text format and doesn't need in > additional Nonce/Auth Code. Sure, in which case it's something that could potentially be added later as another option in the future. I don't think we'll always have just one encryption method and it's good to generally think about what it might look like to have others but I don't think it makes sense to try and get everything in all at once. Thanks, Stephen
On Mon, Oct 25, 2021 at 11:58:14AM -0400, Stephen Frost wrote: > As for the specific encryption method to use, using CTR would be simpler > as it doesn't require access to be block-based, though we would need to > make sure to not re-use the IV across any of the temporary files being > created (potentially concurrently). Probably not that hard to do but > just something to make sure we do. Of course, if we arrange for > block-based access then we could use XTS or perhaps GCM/GCM-SIV if we > wanted to. Agreed on all points. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Mon, 25/10/2021 at 12:12 -0400, Stephen Frost wrote: > Greetings, > > * Yura Sokolov (y.sokolov@postgrespro.ru) wrote: > > В Чт, 21/10/2021 в 13:28 -0400, Stephen Frost пишет: > > > I really don't think this is necessary. Similar to PageSetChecksumCopy > > > and PageSetChecksumInplace, I'm sure we would have functions which are > > > called in the appropriate spots to do encryption (such as 'encrypt_page' > > > and 'encrypt_block' in the Cybertec patch) and folks could review those > > > in relative isolation to the rest. Dealing with blocks in PG is already > > > pretty well handled, the infrastructure that needs to be added is around > > > handling temporary files and is being actively worked on ... if we could > > > move past this debate around if we should be adding support for XTS or > > > if only GCM-SIV would be accepted. > > > > > > ..... > > > > > > No, the CTR approach isn't great because, as has been discussed quite a > > > bit already, using the LSN as the IV means that different data might be > > > re-encrypted with the same LSN and that's not an acceptable thing to > > > have happen with CTR. > > > > > > ..... > > > > > > We've discussed at length how using CTR for heap isn't a good idea even > > > if we're using the LSN for the IV, while if we use XTS then we don't > > > have the issues that CTR has with IV re-use and using the LSN (plus > > > block number and perhaps other things). Nothing in what has been > > > discussed here has really changed anything there that I can see and so > > > it's unclear to me why we continue to go round and round with it. > > > > > > > Instead of debatting XTS vs GCM-SIV I'd suggest Google's Adiantum [1][2] > > [3][4]. > > That sounds like a great thing to think about adding ... after we get > something in that's based on XTS. Why? I see no point in doing it after. Why not XTS after Adiantum? OK, I see one reason: XTS is standardized. > > It is explicitely created to solve large block encryption issue - disk > > encryption. It is used to encrypt 4kb at whole, but in fact has no > > (practical) limit on block size: it is near-zero modified to encrypt 1kb > > or 8kb or 32kb. > > > > It has benefits of both XTS and GCM-SIV: > > - like GCM-SIV every bit of cipher text depends on every bit of plain text > > - therefore like GCM-SIV it is resistant to IV reuse: it is safe to reuse > > LSN+reloid+blocknumber tuple as IV even for hint-bit changes since every > > block's bit will change. > > The advantage of GCM-SIV is that it provides integrity as well as > confidentiality. Integrity could be based on a simple non-cryptographic checksum, and it could be checked after decryption. It would be impossible to intentionally change an encrypted page in a way that will pass the checksum after decryption. Currently we have a 16-bit checksum, and it is very small. But having a larger checksum is orthogonal to having encryption (one doesn't constrain the other). In fact, Adiantum is easily made close to an SIV construction: just leave the last 8/16 bytes zero. If they are zero after decryption, the integrity check passes. That is because SIV and Adiantum are very similar in their structure:
- SIV:
-- hash
-- then stream cipher
- Adiantum:
-- hash (except the last 16 bytes)
-- then encrypt the last 16 bytes with the hash,
-- then stream cipher
-- then hash.
If the last N (N>16) bytes are a nonce + zero bytes, then "hash, then encrypt the last 16 bytes with the hash" becomes equivalent to just "hash", and Adiantum becomes a logical equivalent of SIV. > > - like XTS it doesn't need to change plain text format and doesn't need in > > additional Nonce/Auth Code. > > Sure, in which case it's something that could potentially be added later > as another option in the future. I don't think we'll always have just > one encryption method and it's good to generally think about what it > might look like to have others but I don't think it makes sense to try > and get everything in all at once. And among the others Adiantum looks best: it is fast even without hardware acceleration, it provides whole-block encryption (i.e. every bit depends on every bit) and it isn't bound to the plain-text format. > Thanks, > > Stephen regards, Yura PS. The construction behind SIV and Adiantum could be used with XTS as well. I.e. instead of AES-GCM-SIV it could be AES-XTS-SIV. And in the same way, AES-XTS could be used instead of Chacha12 in Adiantum.
On 2021/10/26 04:32, Yura Sokolov wrote: > And among others Adiantum looks best: it is fast even without hardware > acceleration, No, AES is fast on modern high-end hardware.

On x86 (AMD 3700X):
type               1024 bytes    8192 bytes    16384 bytes
aes-128-ctr        8963982.50k   11124613.88k  11509149.42k
aes-128-gcm        3978860.44k   4669417.10k   4732070.64k
aes-128-xts        7776628.39k   9073664.63k   9264617.74k
chacha20-poly1305  2043729.73k   2131296.36k   2141002.10k

On ARM (RK3399, a middle-end Cortex-A53 with hardware AES):
type               1024 bytes    8192 bytes    16384 bytes
aes-128-ctr        1663857.66k   1860930.22k   1872991.57k
aes-128-xts        685086.38k    712906.07k    716073.64k
aes-128-gcm        985578.84k    1054818.30k   1056768.00k
chacha20-poly1305  309012.82k    318889.98k    319711.91k

I think the baseline is the speed of the read(2) syscall on /dev/zero (which is 3.6GiB/s on x86, and 980MiB/s on ARM). ChaCha is fast on low-end ARM, but I haven't seen any HTTPS sites using ChaCha, including Cloudflare and Google. On 2021/10/26 04:32, Yura Sokolov wrote: >> That sounds like a great thing to think about adding ... after we get >> something in that's based on XTS. > Why? I see no points to do it after. Why not XTS after Adiantum? > > Ok, I see one: XTS is standartized. :> PostgreSQL hasn't even discussed single-table key rotation or a remote KMS yet. I think it's too hard to use an encryption algorithm which OpenSSL doesn't implement.
On Tue, 26/10/2021 at 11:08 +0800, Sasasu wrote: > On 2021/10/26 04:32, Yura Sokolov wrote: > > And among others Adiantum looks best: it is fast even without hardware > > acceleration, > > No, AES is fast on modern high-end hardware. > > on X86 AMD 3700X > type 1024 bytes 8192 bytes 16384 bytes > aes-128-ctr 8963982.50k 11124613.88k 11509149.42k > aes-128-gcm 3978860.44k 4669417.10k 4732070.64k > aes-128-xts 7776628.39k 9073664.63k 9264617.74k > chacha20-poly1305 2043729.73k 2131296.36k 2141002.10k > > on ARM RK3399, A53 middle-end with AES-NI > type 1024 bytes 8192 bytes 16384 bytes > aes-128-ctr 1663857.66k 1860930.22k 1872991.57k > aes-128-xts 685086.38k 712906.07k 716073.64k > aes-128-gcm 985578.84k 1054818.30k 1056768.00k > chacha20-poly1305 309012.82k 318889.98k 319711.91k > > I think the baseline is the speed when using read(2) syscall on > /dev/zero (which is 3.6GiB/s, on ARM is 980MiB/s) > chacha is fast on the low-end arm, but I haven't seen any HTTPS sites > using chacha, including Cloudflare and Google. 1. Chacha20-poly1305 includes an authentication code (Poly1305), and aes-gcm includes one as well (GCM). But aes-128-(ctr,xts) doesn't. Therefore, plain ChaCha should be compared with ctr/xts, not ChaCha-Poly1305. 2. ChaCha20 has a security margin of x2.8: only 7 rounds out of 20 are broken. AES-128 has a security margin of x1.4: 7 rounds out of 10 are broken. That is why Adiantum uses Chacha12: it is still "more secure" than AES-128. Yes, AES with AES-NI is fastest, but not by that much. And AES-CTR could easily be used instead of ChaCha12 in Adiantum. Adiantum uses ChaCha12 as a stream cipher, and any other stream cipher will be ok as well with minor modifications to the algorithm. > > On 2021/10/26 04:32, Yura Sokolov wrote: > >> That sounds like a great thing to think about adding ... after we get > >> something in that's based on XTS. > > Why? I see no points to do it after. Why not XTS after Adiantum? > > > > Ok, I see one: XTS is standartized. > :> > PostgreSQL even not discuss single-table key rotation or remote KMS. > I think it's too hard to use an encryption algorithm which openssl > doesn't implement. That is a fair argument. But, again, OpenSSL could be used for the primitives: AES + AES-CTR + Poly/GCM. And an Adiantum-like construction could be composed from them quite easily.
On 2021/10/26 17:33, Yura Sokolov wrote: > Yes, AES with AES-NI is fastest. But not so much. > > And, AES-CTR could be easily used instead of ChaCha12 in Adiantum. > Adiantum uses ChaCha12 as a stream cipher, and any other stream cipher will > be ok as well with minor modifications to algorithm. "Not so much" is roughly half the speed. I prefer to use AES on all devices because AES is faster and more power efficient; ChaCha only helps on low-end ARM devices, and most people running a complex program like this do not have such a device. I reserve my opinion on this point, but I agree with you on the rest. And I also agree that we should be able to add more algorithms. The current implementation does not have any reserved fields, which makes any upgrade, like adding a new algorithm, unfeasible. On 2021/10/26 17:33, Yura Sokolov wrote: > That is argument. But, again, openssl could be used for primitives: > AES + AES-CTR + Poly/GCM. And Adiantum like construction could be > composed from them quite easily. Implementing Adiantum is a small problem (although it doesn't look good that it lacks security audits). The real problem is that there are too many code paths that can trigger disk IO. TDE needs to modify them, and each code path has different behavior (aligned or unaligned, written once or more than once). The front-end tools don't even use the VFD layer; they use pread with an offset. TDE needs to fix them all, while at the same time keeping the patch small enough. I still think there needs to be a unified IO API, not just modifications to BufFileDumpBuffer() and BufFileLoadBuffer(), and it would be great if the front-end tools also used this API. With that API, TDE can focus on block IO and produce a small patch. Other work can also benefit from this API, like PostgreSQL with AIO, PostgreSQL on S3 (BLKSZ=4M), PostgreSQL on PMEM (no FS), and many others.
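To illustrate the kind of unified block-oriented I/O API being argued for here, a rough sketch follows. Every name in it is hypothetical; the only point is that the signatures stay pread/pwrite-like while the encryption hooks sit underneath, so backend code and front-end tools could in principle share it:

#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define IO_BLCKSZ 8192

/* Per-block transform hooks; left NULL when encryption is disabled. */
typedef void (*block_transform_fn) (unsigned char *block, uint64_t blockno);

typedef struct BlockFile
{
    int         fd;
    block_transform_fn encrypt;     /* applied just before pwrite() */
    block_transform_fn decrypt;     /* applied just after pread() */
} BlockFile;

static ssize_t
block_file_read(BlockFile *bf, unsigned char *buf, uint64_t blockno)
{
    ssize_t     n = pread(bf->fd, buf, IO_BLCKSZ, (off_t) (blockno * IO_BLCKSZ));

    if (n == IO_BLCKSZ && bf->decrypt)
        bf->decrypt(buf, blockno);
    return n;
}

static ssize_t
block_file_write(BlockFile *bf, const unsigned char *buf, uint64_t blockno)
{
    unsigned char staging[IO_BLCKSZ];

    memcpy(staging, buf, IO_BLCKSZ);
    if (bf->encrypt)
        bf->encrypt(staging, blockno);
    return pwrite(bf->fd, staging, IO_BLCKSZ, (off_t) (blockno * IO_BLCKSZ));
}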
Greetings, * Yura Sokolov (y.sokolov@postgrespro.ru) wrote: > В Пн, 25/10/2021 в 12:12 -0400, Stephen Frost пишет: > > * Yura Sokolov (y.sokolov@postgrespro.ru) wrote: > > > В Чт, 21/10/2021 в 13:28 -0400, Stephen Frost пишет: > > > > I really don't think this is necessary. Similar to PageSetChecksumCopy > > > > and PageSetChecksumInplace, I'm sure we would have functions which are > > > > called in the appropriate spots to do encryption (such as 'encrypt_page' > > > > and 'encrypt_block' in the Cybertec patch) and folks could review those > > > > in relative isolation to the rest. Dealing with blocks in PG is already > > > > pretty well handled, the infrastructure that needs to be added is around > > > > handling temporary files and is being actively worked on ... if we could > > > > move past this debate around if we should be adding support for XTS or > > > > if only GCM-SIV would be accepted. > > > > > > > > ..... > > > > > > > > No, the CTR approach isn't great because, as has been discussed quite a > > > > bit already, using the LSN as the IV means that different data might be > > > > re-encrypted with the same LSN and that's not an acceptable thing to > > > > have happen with CTR. > > > > > > > > ..... > > > > > > > > We've discussed at length how using CTR for heap isn't a good idea even > > > > if we're using the LSN for the IV, while if we use XTS then we don't > > > > have the issues that CTR has with IV re-use and using the LSN (plus > > > > block number and perhaps other things). Nothing in what has been > > > > discussed here has really changed anything there that I can see and so > > > > it's unclear to me why we continue to go round and round with it. > > > > > > > > > > Instead of debatting XTS vs GCM-SIV I'd suggest Google's Adiantum [1][2] > > > [3][4]. > > > > That sounds like a great thing to think about adding ... after we get > > something in that's based on XTS. > > Why? I see no points to do it after. Why not XTS after Adiantum? > > Ok, I see one: XTS is standartized. That's certainly one aspect of it. It's also more easily available to us, and frankly the people working on this and writing the patches have a better understanding of XTS from having looked into it. That it's also more widely used is another reason. > > > It is explicitely created to solve large block encryption issue - disk > > > encryption. It is used to encrypt 4kb at whole, but in fact has no > > > (practical) limit on block size: it is near-zero modified to encrypt 1kb > > > or 8kb or 32kb. > > > > > > It has benefits of both XTS and GCM-SIV: > > > - like GCM-SIV every bit of cipher text depends on every bit of plain text > > > - therefore like GCM-SIV it is resistant to IV reuse: it is safe to reuse > > > LSN+reloid+blocknumber tuple as IV even for hint-bit changes since every > > > block's bit will change. > > > > The advantage of GCM-SIV is that it provides integrity as well as > > confidentiality. > > Integrity could be based on simple non-cryptographic checksum, and it could > be checked after decryption. It would be imposible to intentionally change > encrypted page in a way it will pass checksum after decription. No, it wouldn't be impossible when we're talking about non-cryptographic checksums. That is, in fact, why you'd call them that. If it were impossible (or at least utterly impractical) then you'd be able to claim that it's cryptographic-level integrity validation. > Currently we have 16bit checksum, and it is very small. 
But having larger > checksum is orthogonal (ie doesn't bound) to having encryption. Sure, but that would also require a page-format change. We've pointed out the downsides of that and what it would prevent in terms of use-cases. That's still something that might happen but it would be a different effort from this. > In fact, Adiantum is easily made close to SIV construction: > - just leave last 8/16 bytes zero. If after decription they are zero, > then integrity check passed. > That is because SIV and Adiantum are very similar in its structure: > - SIV: > -- hash > -- then stream cipher > - Adiantum: > -- hash (except last 16bytes) > -- then encrypt last 16bytes with hash, > -- then stream cipher > -- then hash. > If last N (N>16) bytes is nonce + zero bytes, then "hash, then encrypt last > 16bytes with hash" become equivalent to just "hash", and Adiantum became > logical equivalent to SIV. While I appreciate your interest in this, I don't think it makes sense for us to try and implement something of our own- we're not cryptographers. Best is to look at published guidance and what other projects have had success doing, and that's what this thread has been about. > > > - like XTS it doesn't need to change plain text format and doesn't need in > > > additional Nonce/Auth Code. > > > > Sure, in which case it's something that could potentially be added later > > as another option in the future. I don't think we'll always have just > > one encryption method and it's good to generally think about what it > > might look like to have others but I don't think it makes sense to try > > and get everything in all at once. > > And among others Adiantum looks best: it is fast even without hardware > acceleration, it provides whole block encryption (ie every bit depends > on every bit) and it doesn't bound to plain-text format. And it could still be added later as another option if folks really want it to be. I've outlined why it makes sense to go with XTS first but I don't mean that to imply that we'll only ever have that. Indeed, once we've actually got something, adding other methods will almost certainly be simpler. Trying to do everything from the start will make this very difficult to accomplish though. Thanks, Stephen
On 10/26/21 21:43, Stephen Frost wrote: > Greetings, > > * Yura Sokolov (y.sokolov@postgrespro.ru) wrote: >> ... >> >> Integrity could be based on simple non-cryptographic checksum, and it could >> be checked after decryption. It would be imposible to intentionally change >> encrypted page in a way it will pass checksum after decription. > > No, it wouldn't be impossible when we're talking about non-cryptographic > checksums. That is, in fact, why you'd call them that. If it were > impossible (or at least utterly impractical) then you'd be able to claim > that it's cryptographic-level integrity validation. > Yeah, our checksums are probabilistic protection against rare and random bitflips cause by hardware, not against an attacker in the crypto sense. To explain why it's not enough, consider our checksum is uint16, i.e. there are only 64k possible values. In other words, you can try flipping bits in the encrypted page, and after generating 64k you're guaranteed to have at least one collision. Yes, it's harder to get collision with the existing checksum, and compression methods that diffuse bits better makes it harder to get a valid page after decryption, but it's simply not the same thing as a crypto integrity. Let's not try inventing something custom, there's been enough crypto failures due to smart custom stuff in the past already. BTW I'm not sure what the existing patches do, but I wonder if we should calculate the checksum before or after encryption. I'd say it should be after encryption, because checksums were meant as a protection against issues at the storage level, so the checksum should be on what's written to storage, and it'd also allow offline verification of checksums etc. (Of course, that'd make the whole idea of relying on our checksums even more futile.) Note: Maybe there are reasons why the checksum needs to be calculated before encryption, not sure. >> Currently we have 16bit checksum, and it is very small. But having larger >> checksum is orthogonal (ie doesn't bound) to having encryption. > > Sure, but that would also require a page-format change. We've pointed > out the downsides of that and what it would prevent in terms of > use-cases. That's still something that might happen but it would be a > different effort from this. > ... and if such page format ends up happening, it'd be fairly easy to just add some extra crypto data into the page header and not rely on the data checksums at all. >> In fact, Adiantum is easily made close to SIV construction: >> - just leave last 8/16 bytes zero. If after decription they are zero, >> then integrity check passed. >> That is because SIV and Adiantum are very similar in its structure: >> - SIV: >> -- hash >> -- then stream cipher >> - Adiantum: >> -- hash (except last 16bytes) >> -- then encrypt last 16bytes with hash, >> -- then stream cipher >> -- then hash. >> If last N (N>16) bytes is nonce + zero bytes, then "hash, then encrypt last >> 16bytes with hash" become equivalent to just "hash", and Adiantum became >> logical equivalent to SIV. > > While I appreciate your interest in this, I don't think it makes sense > for us to try and implement something of our own- we're not > cryptographers. Best is to look at published guideance and what other > projects have had success doing, and that's what this thread has been > about. > Yeah, I personally don't see much difference between XTS and Adiantum. There are a bunch of benefits, but the main reason why Google developed it seems to be performance on low-end ARM machines (i.e. 
phones). Which is nice, but it's probably not hugely important - very few people run Pg on such machines, especially in performance-sensitive context. It's true Adiantum is probably more resilient to IV reuse etc. but it's not like XTS is suddenly obsolete, and it certainly doesn't solve the integrity issue etc. >>>> - like XTS it doesn't need to change plain text format and doesn't need in >>>> additional Nonce/Auth Code. >>> >>> Sure, in which case it's something that could potentially be added later >>> as another option in the future. I don't think we'll always have just >>> one encryption method and it's good to generally think about what it >>> might look like to have others but I don't think it makes sense to try >>> and get everything in all at once. >> >> And among others Adiantum looks best: it is fast even without hardware >> acceleration, it provides whole block encryption (ie every bit depends >> on every bit) and it doesn't bound to plain-text format. > > And it could still be added later as another option if folks really want > it to be. I've outlined why it makes sense to go with XTS first but I > don't mean that to imply that we'll only ever have that. Indeed, once > we've actually got something, adding other methods will almost certainly > be simpler. Trying to do everything from the start will make this very > difficult to accomplish though. > Yeah. So maybe the best thing is simply to roll with both - design the whole feature in a way that allows selecting the encryption scheme, with two options. That's generally a good engineering practice, as it ensures things are not coupled too much. And it's not like the encryption methods are expected to be super difficult. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
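To put a number on the "only 64k possible values" argument above, here is a toy simulation. The checksum below is a stand-in, not pg_checksum_page(), but the counting argument does not depend on the particular 16-bit function:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 8192

/* Any reasonably mixing 16-bit checksum will do for the illustration. */
static uint16_t
toy_checksum(const unsigned char *page)
{
    uint32_t    h = 2166136261u;        /* FNV-1a, folded to 16 bits */

    for (int i = 0; i < PAGE_SIZE; i++)
        h = (h ^ page[i]) * 16777619u;
    return (uint16_t) (h ^ (h >> 16));
}

int
main(void)
{
    unsigned char page[PAGE_SIZE];
    unsigned char forged[PAGE_SIZE];
    uint16_t    target;
    long        attempts = 0;

    srand(42);
    for (int i = 0; i < PAGE_SIZE; i++)
        page[i] = (unsigned char) rand();
    target = toy_checksum(page);

    /* Keep flipping random bits until the modified page collides. */
    memcpy(forged, page, PAGE_SIZE);
    do
    {
        forged[rand() % PAGE_SIZE] ^= (unsigned char) (1 << (rand() % 8));
        attempts++;
    } while (toy_checksum(forged) != target);

    /* Expect on the order of 2^16 = 65536 attempts. */
    printf("collision after %ld random bit flips\n", attempts);
    return 0;
}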
On Tue, Oct 26, 2021 at 11:11:39PM +0200, Tomas Vondra wrote: > BTW I'm not sure what the existing patches do, but I wonder if we should > calculate the checksum before or after encryption. I'd say it should be > after encryption, because checksums were meant as a protection against > issues at the storage level, so the checksum should be on what's written to > storage, and it'd also allow offline verification of checksums etc. (Of > course, that'd make the whole idea of relying on our checksums even more > futile.) > > Note: Maybe there are reasons why the checksum needs to be calculated before > encryption, not sure. Yes, these are the tradeoffs --- allowing offline checksum checking without requiring the key vs. giving _some_ integrity checking and requiring the key. > Yeah, I personally don't see much difference between XTS and Adiantum. > > There are a bunch of benefits, but the main reason why Google developed it > seems to be performance on low-end ARM machines (i.e. phones). Which is > nice, but it's probably not hugely important - very few people run Pg on > such machines, especially in performance-sensitive context. > > It's true Adiantum is probably more resilient to IV reuse etc. but it's not > like XTS is suddenly obsolete, and it certainly doesn't solve the integrity > issue etc. > > > > > > - like XTS it doesn't need to change plain text format and doesn't need in > > > > > additional Nonce/Auth Code. > > > > > > > > Sure, in which case it's something that could potentially be added later > > > > as another option in the future. I don't think we'll always have just > > > > one encryption method and it's good to generally think about what it > > > > might look like to have others but I don't think it makes sense to try > > > > and get everything in all at once. > > > > > > And among others Adiantum looks best: it is fast even without hardware > > > acceleration, it provides whole block encryption (ie every bit depends > > > on every bit) and it doesn't bound to plain-text format. > > > > And it could still be added later as another option if folks really want > > it to be. I've outlined why it makes sense to go with XTS first but I > > don't mean that to imply that we'll only ever have that. Indeed, once > > we've actually got something, adding other methods will almost certainly > > be simpler. Trying to do everything from the start will make this very > > difficult to accomplish though. > > > ... > So maybe the best thing is simply to roll with both - design the whole > feature in a way that allows selecting the encryption scheme, with two > options. That's generally a good engineering practice, as it ensures things > are not coupled too much. And it's not like the encryption methods are > expected to be super difficult. I am not in favor of adding additional options to this feature unless we can explain why users should choose one over the other. There is also the problem of OpenSSL not supporting Adiantum. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, Oct 26, 2021 at 11:11:39PM +0200, Tomas Vondra wrote: > > BTW I'm not sure what the existing patches do, but I wonder if we should > > calculate the checksum before or after encryption. I'd say it should be > > after encryption, because checksums were meant as a protection against > > issues at the storage level, so the checksum should be on what's written to > > storage, and it'd also allow offline verification of checksums etc. (Of > > course, that'd make the whole idea of relying on our checksums even more > > futile.) > > > > Note: Maybe there are reasons why the checksum needs to be calculated before > > encryption, not sure. > > Yes, these are the tradeoffs --- allowing offline checksum checking > without requiring the key vs. giving _some_ integrity checking and > requiring the key. I'm in favor of calculating the checksum before encrypting as that will still catch the storage level bit-flips that it was implemented to address in the first place and will also make it so that we're very likely to realize we have an incorrect key before trying to do anything with the page. That it might also serve as a deterrent against attackers trying to randomly flip bits in a page has perhaps some value but without a cryptographic-level hash it isn't really enough to prevent against an attacker who has write access to a page. Any tools which include checking the checksum on pages already have to deal with clusters where checksums aren't enabled anyway and I wouldn't expect it to generally be an issue for tools which want to validate checksums on an encrypted cluster to be able to have the appropriate key(s) necessary for doing so and to be able to perform the decryption to do the check. We can certainly make pg_checksums do this and I don't see it as an issue for pgbackrest, as two examples that I've specifically thought about. > > > > > > - like XTS it doesn't need to change plain text format and doesn't need in > > > > > > additional Nonce/Auth Code. > > > > > > > > > > Sure, in which case it's something that could potentially be added later > > > > > as another option in the future. I don't think we'll always have just > > > > > one encryption method and it's good to generally think about what it > > > > > might look like to have others but I don't think it makes sense to try > > > > > and get everything in all at once. > > > > > > > > And among others Adiantum looks best: it is fast even without hardware > > > > acceleration, it provides whole block encryption (ie every bit depends > > > > on every bit) and it doesn't bound to plain-text format. > > > > > > And it could still be added later as another option if folks really want > > > it to be. I've outlined why it makes sense to go with XTS first but I > > > don't mean that to imply that we'll only ever have that. Indeed, once > > > we've actually got something, adding other methods will almost certainly > > > be simpler. Trying to do everything from the start will make this very > > > difficult to accomplish though. > ... > > So maybe the best thing is simply to roll with both - design the whole > > feature in a way that allows selecting the encryption scheme, with two > > options. That's generally a good engineering practice, as it ensures things > > are not coupled too much. And it's not like the encryption methods are > > expected to be super difficult. 
> > I am not in favor of adding additional options to this feature unless we > can explain why users should choose one over the other. There is also > the problem of OpenSSL not supporting Adiantum. I can understand the general idea that we should be sure to engineer this in a way that multiple methods can be used, as surely one day folks will say that AES128 isn't acceptable any more. In terms of what we'll do from the start, I would think providing the options of AES128 and AES256 would be good to ensure that we have the bits covered to support multiple methods and I don't think that would put us into a situation of having to really explain which to use to users (we don't for pgcrypto anyway, as an example). I agree that we shouldn't be looking at adding in a whole new crypto library for this though, that's a large and independent effort (see the work on NSS happening nearby). Thanks, Stephen
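Returning to the ordering question, a minimal sketch of the checksum-before-encryption sequence argued for above. The checksum routine and page layout here are stand-ins, and the encryption call is only a placeholder for whichever mode ends up being used; the point is purely the ordering:

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE     8192
#define CHECKSUM_OFF  8         /* illustrative location of the checksum field */

/* Stand-in for pg_checksum_page(); skips the checksum field itself. */
static uint16_t
page_checksum(const unsigned char *page, uint32_t blkno)
{
    uint32_t    sum = blkno;

    for (int i = 0; i < PAGE_SIZE; i++)
        if (i < CHECKSUM_OFF || i >= CHECKSUM_OFF + 2)
            sum = sum * 31 + page[i];
    return (uint16_t) (sum ^ (sum >> 16));
}

/* Placeholder for the real cipher (e.g. XTS); NOT real encryption. */
static void
encrypt_page_in_place(unsigned char *page, uint32_t blkno)
{
    for (int i = 0; i < PAGE_SIZE; i++)
        page[i] ^= (unsigned char) (blkno + 0x5a);
}

/*
 * Checksum first, then encrypt: after decryption the checksum still catches
 * storage-level bit flips, and decrypting with the wrong key shows up as a
 * checksum failure instead of silently producing garbage.
 */
static void
prepare_page_for_write(unsigned char *page, uint32_t blkno)
{
    uint16_t    sum = page_checksum(page, blkno);

    memcpy(page + CHECKSUM_OFF, &sum, sizeof(sum));
    encrypt_page_in_place(page, blkno);
    /* ... then hand the buffer to the actual write path ... */
}

The trade-off named above follows directly from this ordering: an offline tool can only verify the checksum after it has decrypted the page, so it needs access to the key.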
On 2021/11/2 02:24, Stephen Frost wrote: > I can understand the general idea that we should be sure to engineer > this in a way that multiple methods can be used, as surely one day folks > will say that AES128 isn't acceptable any more. Cheers!
On Mon, Nov 1, 2021 at 02:24:36PM -0400, Stephen Frost wrote: > I can understand the general idea that we should be sure to engineer > this in a way that multiple methods can be used, as surely one day folks > will say that AES128 isn't acceptable any more. In terms of what we'll > do from the start, I would think providing the options of AES128 and > AES256 would be good to ensure that we have the bits covered to support > multiple methods and I don't think that would put us into a situation of My patch supports AES128, AES192, and AES256. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Mon, Nov 1, 2021 at 02:24:36PM -0400, Stephen Frost wrote: > > I can understand the general idea that we should be sure to engineer > > this in a way that multiple methods can be used, as surely one day folks > > will say that AES128 isn't acceptable any more. In terms of what we'll > > do from the start, I would think providing the options of AES128 and > > AES256 would be good to ensure that we have the bits covered to support > > multiple methods and I don't think that would put us into a situation of > > My patch supports AES128, AES192, and AES256. Right, so we're already showing that it's flexible to allow for multiple encryption methods. If folks want more then it's on them to research how they'd work exactly and explain why they'd be useful to add and how users might make an informed choice (though, again, I don't think we need to go *too* deep into that as we don't for, eg, pgcrypto, and I don't believe we've ever heard people complain about that). Thanks, Stephen
On Mon, Nov 1, 2021 at 02:24:36PM -0400, Stephen Frost wrote: > I can understand the general idea that we should be sure to engineer > this in a way that multiple methods can be used, as surely one day folks > will say that AES128 isn't acceptable any more. In terms of what we'll > do from the start, I would think providing the options of AES128 and > AES256 would be good to ensure that we have the bits covered to support > multiple methods and I don't think that would put us into a situation of > having to really explain which to use to users (we don't for pgcrypto > anyway, as an example). I agree that we shouldn't be looking at adding > in a whole new crypto library for this though, that's a large and > independent effort (see the work on NSS happening nearby). Since it has been two weeks since the last activity on this thread, I have updated the TDE wiki to reflect the conclusions and discussions: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Stephen Frost <sfrost@snowman.net> wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Tue, Oct 19, 2021 at 02:54:56PM -0400, Stephen Frost wrote: > > > * Sasasu (i@sasa.su) wrote: > > > > A unified block-based I/O API sounds great. Has anyone tried to do this > > > > before? It would be nice if the front-end tools could also use these API. > > > > > > The TDE patch from Cybertec did go down this route, but the API ended up > > > being rather different which menat a lot of changes in other parts of > > > the system. If we can get a block-based temporary file method that > > > maintains more-or-less the same API, that'd be great, but I'm not sure > > > that we can really do so and I am not entirely convinced that we should > > > make the TDE effort depend on an otherwise quite independent effort of > > > making all temp files usage be block based. > > > > Uh, I thought people felt the Cybertec patch was too large and that a > > unified API for temporary file I/O-encryption was a requirement. Would > > a CTR-steaming-encryption API for temporary tables be easier to > > implement? > > Having a unified API for temporary file I/O (that could then be extended > to provide encryption) would definitely help with reducing the size of a > TDE patch. The approach used in the Cybertec patch was to make > temporary file access block based, but the way that was implemented was > with an API different from pread/pwrite and that meant changing pretty > much all of the call sites for temporary file access, which naturally > resulted in changes in a lot of otherwise unrelated code. The changes to buffile.c are not trivial, but we haven't really changed the API, as long as you mean BufFileCreateTemp(), BufFileWrite(), BufFileRead(). What our patch affects on the caller side is that BufFileOpenTransient(), BufFileCloseTransient(), BufFileWriteTransient() and BufFileReadTransient() replace OpenTransientFile(), CloseTransientFile(), write()/fwrite() and read()/fread() respectively in reorderbuffer.c and in pgstat.c. These changes become a little bit less invasive in TDE 1.1 than they were in 1.0, see [1], see the diffs attached. (I expect that [2] will get committed someday so that the TDE feature won't affect pgstat.c in the future at all.) -- Antonin Houska Web: https://www.cybertec-postgresql.com [1] https://github.com/cybertec-postgresql/postgres/tree/PG_14_TDE_1_1 [2] https://commitfest.postgresql.org/34/1708/ diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index d474ea1d0a..9cb0708e41 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -99,6 +99,7 @@ #include "replication/slot.h" #include "replication/snapbuild.h" /* just for SnapBuildSnapDecRefcount */ #include "storage/bufmgr.h" +#include "storage/buffile.h" #include "storage/fd.h" #include "storage/sinval.h" #include "utils/builtins.h" @@ -131,21 +132,13 @@ typedef struct ReorderBufferTupleCidEnt CommandId combocid; /* just for debugging */ } ReorderBufferTupleCidEnt; -/* Virtual file descriptor with file offset tracking */ -typedef struct TXNEntryFile -{ - File vfd; /* -1 when the file is closed */ - off_t curOffset; /* offset for next write or read. Reset to 0 - * when vfd is opened. 
*/ -} TXNEntryFile; - /* k-way in-order change iteration support structures */ typedef struct ReorderBufferIterTXNEntry { XLogRecPtr lsn; ReorderBufferChange *change; ReorderBufferTXN *txn; - TXNEntryFile file; + TransientBufFile *file; XLogSegNo segno; } ReorderBufferIterTXNEntry; @@ -243,14 +236,22 @@ static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMe * Disk serialization support functions * --------------------------------------- */ +static void ReorderBufferTweakBase(ReorderBufferTXN *txn, + char tweak_base[TWEAK_BASE_SIZE]); static void ReorderBufferCheckMemoryLimit(ReorderBuffer *rb); static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn); static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, - int fd, ReorderBufferChange *change); + TransientBufFile *file, ReorderBufferChange *change); +static void ReorderBufferWriteData(TransientBufFile *file, void *ptr, size_t size, + ReorderBufferTXN *txn); static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, - TXNEntryFile *file, XLogSegNo *segno); + TransientBufFile **file, XLogSegNo *segno); static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, - char *change); + TransientBufFile **file); +static ReorderBufferTupleBuf *ReorderBufferRestoreTuple(ReorderBuffer *rb, + TransientBufFile *file); +static void ReorderBufferReadData(TransientBufFile *file, void *ptr, size_t size, + bool *no_data_p); static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn); static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared); @@ -342,8 +343,6 @@ ReorderBufferAllocate(void) buffer->by_txn_last_xid = InvalidTransactionId; buffer->by_txn_last_txn = NULL; - buffer->outbuf = NULL; - buffer->outbufsize = 0; buffer->size = 0; buffer->spillTxns = 0; @@ -1241,7 +1240,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn, for (off = 0; off < state->nr_txns; off++) { - state->entries[off].file.vfd = -1; + state->entries[off].file = NULL; state->entries[off].segno = 0; } @@ -1423,8 +1422,8 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb, for (off = 0; off < state->nr_txns; off++) { - if (state->entries[off].file.vfd != -1) - FileClose(state->entries[off].file.vfd); + if (state->entries[off].file) + BufFileCloseTransient(state->entries[off].file); } /* free memory we might have "leaked" in the last *Next call */ @@ -3332,21 +3331,39 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid) */ /* - * Ensure the IO buffer is >= sz. + * Initialize the common part of the encryption tweak. */ static void -ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz) +ReorderBufferTweakBase(ReorderBufferTXN *txn, + char tweak_base[TWEAK_BASE_SIZE]) { - if (!rb->outbufsize) - { - rb->outbuf = MemoryContextAlloc(rb->context, sz); - rb->outbufsize = sz; - } - else if (rb->outbufsize < sz) - { - rb->outbuf = repalloc(rb->outbuf, sz); - rb->outbufsize = sz; - } + char *c = tweak_base; + pid_t pid = MyProcPid; + int pid_bytes; + +/* Only this part of the PID fits into the tweak. 
*/ +#define PID_BYTES_USABLE 3 + + StaticAssertStmt(1 + sizeof(TransactionId) + PID_BYTES_USABLE + <= TWEAK_BASE_SIZE, + "tweak components do not fit into TWEAK_BASE_SIZE"); + + memset(tweak_base, 0, TWEAK_BASE_SIZE); + *c = TRANS_BUF_FILE_REORDERBUFFER; + c++; + memcpy(c, &txn->xid, sizeof(TransactionId)); + c += sizeof(TransactionId); + + /* + * There's only room for PID_BYTES_USABLE bytes of the PID. Use the less + * significant part so that PID increment always causes tweak change. + */ + pid_bytes = Min(sizeof(pid), PID_BYTES_USABLE); +#ifdef WORDS_BIGENDIAN + memcpy(c, ((char *) &pid) + sizeof(pid) - pid_bytes, pid_bytes); +#else + memcpy(c, &pid, pid_bytes); +#endif } /* @@ -3517,9 +3534,10 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) { dlist_iter subtxn_i; dlist_mutable_iter change_i; - int fd = -1; + TransientBufFile *file = NULL; XLogSegNo curOpenSegNo = 0; Size spilled = 0; + Size size = txn->size; elog(DEBUG2, "spill %u changes in XID %u to disk", @@ -3545,13 +3563,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) * store in segment in which it belongs by start lsn, don't split over * multiple segments tho */ - if (fd == -1 || + if (file == NULL || !XLByteInSeg(change->lsn, curOpenSegNo, wal_segment_size)) { char path[MAXPGPATH]; + char tweak_base[TWEAK_BASE_SIZE]; - if (fd != -1) - CloseTransientFile(fd); + if (file) + BufFileCloseTransient(file); XLByteToSeg(change->lsn, curOpenSegNo, wal_segment_size); @@ -3562,17 +3581,15 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, curOpenSegNo); + if (data_encrypted) + ReorderBufferTweakBase(txn, tweak_base); /* open segment, create it if necessary */ - fd = OpenTransientFile(path, - O_CREAT | O_WRONLY | O_APPEND | PG_BINARY); - - if (fd < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not open file \"%s\": %m", path))); + file = BufFileOpenTransient(path, + O_CREAT | O_WRONLY | O_APPEND | PG_BINARY, + tweak_base); } - ReorderBufferSerializeChange(rb, txn, fd, change); + ReorderBufferSerializeChange(rb, txn, file, change); dlist_delete(&change->node); ReorderBufferReturnChange(rb, change, true); @@ -3597,8 +3614,8 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) txn->nentries_mem = 0; txn->txn_flags |= RBTXN_IS_SERIALIZED; - if (fd != -1) - CloseTransientFile(fd); + if (file) + BufFileCloseTransient(file); } /* @@ -3606,15 +3623,13 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) */ static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, - int fd, ReorderBufferChange *change) + TransientBufFile *file, ReorderBufferChange *change) { - ReorderBufferDiskChange *ondisk; + ReorderBufferDiskChange hdr; Size sz = sizeof(ReorderBufferDiskChange); - ReorderBufferSerializeReserve(rb, sz); - - ondisk = (ReorderBufferDiskChange *) rb->outbuf; - memcpy(&ondisk->change, change, sizeof(ReorderBufferChange)); + memcpy((char *) &hdr + offsetof(ReorderBufferDiskChange, change), + change, sizeof(ReorderBufferChange)); switch (change->action) { @@ -3624,7 +3639,6 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, case REORDER_BUFFER_CHANGE_DELETE: case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT: { - char *data; ReorderBufferTupleBuf *oldtup, *newtup; Size oldlen = 0; @@ -3647,84 +3661,71 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, sz += newlen; } - /* make sure we have enough space 
*/ - ReorderBufferSerializeReserve(rb, sz); - - data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange); - /* might have been reallocated above */ - ondisk = (ReorderBufferDiskChange *) rb->outbuf; + hdr.size = sz; + ReorderBufferWriteData(file, &hdr, sizeof(ReorderBufferDiskChange), + txn); if (oldlen) { - memcpy(data, &oldtup->tuple, sizeof(HeapTupleData)); - data += sizeof(HeapTupleData); - - memcpy(data, oldtup->tuple.t_data, oldlen); - data += oldlen; + ReorderBufferWriteData(file, &oldtup->tuple, + sizeof(HeapTupleData), txn); + ReorderBufferWriteData(file, oldtup->tuple.t_data, oldlen, + txn); } if (newlen) { - memcpy(data, &newtup->tuple, sizeof(HeapTupleData)); - data += sizeof(HeapTupleData); - - memcpy(data, newtup->tuple.t_data, newlen); - data += newlen; + ReorderBufferWriteData(file, &newtup->tuple, + sizeof(HeapTupleData), txn); + ReorderBufferWriteData(file, newtup->tuple.t_data, newlen, + txn); } break; } case REORDER_BUFFER_CHANGE_MESSAGE: { - char *data; Size prefix_size = strlen(change->data.msg.prefix) + 1; sz += prefix_size + change->data.msg.message_size + sizeof(Size) + sizeof(Size); - ReorderBufferSerializeReserve(rb, sz); - data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange); - - /* might have been reallocated above */ - ondisk = (ReorderBufferDiskChange *) rb->outbuf; + hdr.size = sz; + ReorderBufferWriteData(file, &hdr, + sizeof(ReorderBufferDiskChange), + txn); /* write the prefix including the size */ - memcpy(data, &prefix_size, sizeof(Size)); - data += sizeof(Size); - memcpy(data, change->data.msg.prefix, - prefix_size); - data += prefix_size; + ReorderBufferWriteData(file, &prefix_size, sizeof(Size), txn); + ReorderBufferWriteData(file, change->data.msg.prefix, + prefix_size, txn); /* write the message including the size */ - memcpy(data, &change->data.msg.message_size, sizeof(Size)); - data += sizeof(Size); - memcpy(data, change->data.msg.message, - change->data.msg.message_size); - data += change->data.msg.message_size; + ReorderBufferWriteData(file, &change->data.msg.message_size, + sizeof(Size), txn); + ReorderBufferWriteData(file, change->data.msg.message, + change->data.msg.message_size, txn); break; } case REORDER_BUFFER_CHANGE_INVALIDATION: { - char *data; Size inval_size = sizeof(SharedInvalidationMessage) * change->data.inval.ninvalidations; sz += inval_size; - ReorderBufferSerializeReserve(rb, sz); - data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange); - - /* might have been reallocated above */ - ondisk = (ReorderBufferDiskChange *) rb->outbuf; - memcpy(data, change->data.inval.invalidations, inval_size); - data += inval_size; - + hdr.size = sz; + ReorderBufferWriteData(file, &hdr, + sizeof(ReorderBufferDiskChange), + txn); + ReorderBufferWriteData(file, change->data.inval.invalidations, + inval_size, + txn); break; } case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT: { Snapshot snap; - char *data; snap = change->data.snapshot; @@ -3732,78 +3733,50 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, sizeof(TransactionId) * snap->xcnt + sizeof(TransactionId) * snap->subxcnt; - /* make sure we have enough space */ - ReorderBufferSerializeReserve(rb, sz); - data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange); - /* might have been reallocated above */ - ondisk = (ReorderBufferDiskChange *) rb->outbuf; + hdr.size = sz; + ReorderBufferWriteData(file, &hdr, + sizeof(ReorderBufferDiskChange), txn); - memcpy(data, snap, sizeof(SnapshotData)); - data += sizeof(SnapshotData); + 
ReorderBufferWriteData(file, snap, sizeof(SnapshotData), txn); if (snap->xcnt) - { - memcpy(data, snap->xip, - sizeof(TransactionId) * snap->xcnt); - data += sizeof(TransactionId) * snap->xcnt; - } + ReorderBufferWriteData(file, snap->xip, + sizeof(TransactionId) * snap->xcnt, + txn); if (snap->subxcnt) - { - memcpy(data, snap->subxip, - sizeof(TransactionId) * snap->subxcnt); - data += sizeof(TransactionId) * snap->subxcnt; - } + ReorderBufferWriteData(file, snap->subxip, + sizeof(TransactionId) * snap->subxcnt, + txn); break; } case REORDER_BUFFER_CHANGE_TRUNCATE: { Size size; - char *data; /* account for the OIDs of truncated relations */ size = sizeof(Oid) * change->data.truncate.nrelids; sz += size; - /* make sure we have enough space */ - ReorderBufferSerializeReserve(rb, sz); - - data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange); - /* might have been reallocated above */ - ondisk = (ReorderBufferDiskChange *) rb->outbuf; - - memcpy(data, change->data.truncate.relids, size); - data += size; + hdr.size = sz; + ReorderBufferWriteData(file, &hdr, sizeof(ReorderBufferDiskChange), + txn); + ReorderBufferWriteData(file, change->data.truncate.relids, size, + txn); break; } case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM: case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT: case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID: case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID: + hdr.size = sz; + ReorderBufferWriteData(file, &hdr, sizeof(ReorderBufferDiskChange), + txn); /* ReorderBufferChange contains everything important */ break; } - ondisk->size = sz; - - errno = 0; - pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_WRITE); - if (write(fd, rb->outbuf, ondisk->size) != ondisk->size) - { - int save_errno = errno; - - CloseTransientFile(fd); - - /* if write didn't set errno, assume problem is no disk space */ - errno = save_errno ? save_errno : ENOSPC; - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not write to data file for XID %u: %m", - txn->xid))); - } - pgstat_report_wait_end(); - /* * Keep the transaction's final_lsn up to date with each change we send to * disk, so that ReorderBufferRestoreCleanup works correctly. (We used to @@ -3814,8 +3787,21 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, */ if (txn->final_lsn < change->lsn) txn->final_lsn = change->lsn; +} - Assert(ondisk->change.action == change->action); +/* + * Wrapper for BufFileWriteTransient() that raises ERROR if the whole chunk + * was not written. XXX Should this be a macro? + */ +static void +ReorderBufferWriteData(TransientBufFile *file, void *ptr, size_t size, + ReorderBufferTXN *txn) +{ + if (BufFileWriteTransient(file, ptr, size) != size) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to data file for XID %u: %m", + txn->xid))); } /* Returns true, if the output plugin supports streaming, false, otherwise. 
*/ @@ -4058,12 +4044,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change) */ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, - TXNEntryFile *file, XLogSegNo *segno) + TransientBufFile **file, XLogSegNo *segno) { Size restored = 0; XLogSegNo last_segno; dlist_mutable_iter cleanup_iter; - File *fd = &file->vfd; Assert(txn->first_lsn != InvalidXLogRecPtr); Assert(txn->final_lsn != InvalidXLogRecPtr); @@ -4084,12 +4069,10 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, while (restored < max_changes_in_memory && *segno <= last_segno) { - int readBytes; - ReorderBufferDiskChange *ondisk; - - if (*fd == -1) + if (*file == NULL) { char path[MAXPGPATH]; + char tweak_base[TWEAK_BASE_SIZE]; /* first time in */ if (*segno == 0) @@ -4104,86 +4087,27 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, *segno); - *fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY); - - /* No harm in resetting the offset even in case of failure */ - file->curOffset = 0; - - if (*fd < 0 && errno == ENOENT) + if (data_encrypted) + ReorderBufferTweakBase(txn, tweak_base); + *file = BufFileOpenTransient(path, O_RDONLY | PG_BINARY, + tweak_base); + if (*file == NULL) { - *fd = -1; + Assert(errno == ENOENT); (*segno)++; continue; } - else if (*fd < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not open file \"%s\": %m", - path))); } - /* - * Read the statically sized part of a change which has information - * about the total size. If we couldn't read a record, we're at the - * end of this file. - */ - ReorderBufferSerializeReserve(rb, sizeof(ReorderBufferDiskChange)); - readBytes = FileRead(file->vfd, rb->outbuf, - sizeof(ReorderBufferDiskChange), - file->curOffset, WAIT_EVENT_REORDER_BUFFER_READ); - - /* eof */ - if (readBytes == 0) + ReorderBufferRestoreChange(rb, txn, file); + if (*file) + restored++; + else { - FileClose(*fd); - *fd = -1; + /* No data could be restored. 
*/ (*segno)++; continue; } - else if (readBytes < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not read from reorderbuffer spill file: %m"))); - else if (readBytes != sizeof(ReorderBufferDiskChange)) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes", - readBytes, - (uint32) sizeof(ReorderBufferDiskChange)))); - - file->curOffset += readBytes; - - ondisk = (ReorderBufferDiskChange *) rb->outbuf; - - ReorderBufferSerializeReserve(rb, - sizeof(ReorderBufferDiskChange) + ondisk->size); - ondisk = (ReorderBufferDiskChange *) rb->outbuf; - - readBytes = FileRead(file->vfd, - rb->outbuf + sizeof(ReorderBufferDiskChange), - ondisk->size - sizeof(ReorderBufferDiskChange), - file->curOffset, - WAIT_EVENT_REORDER_BUFFER_READ); - - if (readBytes < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not read from reorderbuffer spill file: %m"))); - else if (readBytes != ondisk->size - sizeof(ReorderBufferDiskChange)) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes", - readBytes, - (uint32) (ondisk->size - sizeof(ReorderBufferDiskChange))))); - - file->curOffset += readBytes; - - /* - * ok, read a full change from disk, now restore it into proper - * in-memory format - */ - ReorderBufferRestoreChange(rb, txn, rb->outbuf); - restored++; } return restored; @@ -4193,25 +4117,36 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, * Convert change from its on-disk format to in-memory format and queue it onto * the TXN's ->changes list. * - * Note: although "data" is declared char*, at entry it points to a - * maxalign'd buffer, making it safe in most of this function to assume - * that the pointed-to data is suitably aligned for direct access. + * If no data was found in the file, close it and set *file to NULL. */ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, - char *data) + TransientBufFile **file) { - ReorderBufferDiskChange *ondisk; + ReorderBufferDiskChange ondisk; + bool no_data; ReorderBufferChange *change; - ondisk = (ReorderBufferDiskChange *) data; + /* + * Read the statically sized part of a change which has information about + * the total size. If we couldn't read a record, we're at the end of this + * file. 
+ */ + ReorderBufferReadData(*file, &ondisk, sizeof(ReorderBufferDiskChange), + &no_data); + + /* eof */ + if (no_data) + { + BufFileCloseTransient(*file); + *file = NULL; + return; + } change = ReorderBufferGetChange(rb); /* copy static part */ - memcpy(change, &ondisk->change, sizeof(ReorderBufferChange)); - - data += sizeof(ReorderBufferDiskChange); + memcpy(change, &ondisk.change, sizeof(ReorderBufferChange)); /* restore individual stuff */ switch (change->action) @@ -4222,50 +4157,10 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, case REORDER_BUFFER_CHANGE_DELETE: case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT: if (change->data.tp.oldtuple) - { - uint32 tuplelen = ((HeapTuple) data)->t_len; - - change->data.tp.oldtuple = - ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader); - - /* restore ->tuple */ - memcpy(&change->data.tp.oldtuple->tuple, data, - sizeof(HeapTupleData)); - data += sizeof(HeapTupleData); - - /* reset t_data pointer into the new tuplebuf */ - change->data.tp.oldtuple->tuple.t_data = - ReorderBufferTupleBufData(change->data.tp.oldtuple); - - /* restore tuple data itself */ - memcpy(change->data.tp.oldtuple->tuple.t_data, data, tuplelen); - data += tuplelen; - } + change->data.tp.oldtuple = ReorderBufferRestoreTuple(rb, *file); if (change->data.tp.newtuple) - { - /* here, data might not be suitably aligned! */ - uint32 tuplelen; - - memcpy(&tuplelen, data + offsetof(HeapTupleData, t_len), - sizeof(uint32)); - - change->data.tp.newtuple = - ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader); - - /* restore ->tuple */ - memcpy(&change->data.tp.newtuple->tuple, data, - sizeof(HeapTupleData)); - data += sizeof(HeapTupleData); - - /* reset t_data pointer into the new tuplebuf */ - change->data.tp.newtuple->tuple.t_data = - ReorderBufferTupleBufData(change->data.tp.newtuple); - - /* restore tuple data itself */ - memcpy(change->data.tp.newtuple->tuple.t_data, data, tuplelen); - data += tuplelen; - } + change->data.tp.newtuple = ReorderBufferRestoreTuple(rb, *file); break; case REORDER_BUFFER_CHANGE_MESSAGE: @@ -4273,22 +4168,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, Size prefix_size; /* read prefix */ - memcpy(&prefix_size, data, sizeof(Size)); - data += sizeof(Size); + ReorderBufferReadData(*file, &prefix_size, sizeof(Size), NULL); change->data.msg.prefix = MemoryContextAlloc(rb->context, prefix_size); - memcpy(change->data.msg.prefix, data, prefix_size); + ReorderBufferReadData(*file, change->data.msg.prefix, + prefix_size, NULL); Assert(change->data.msg.prefix[prefix_size - 1] == '\0'); - data += prefix_size; /* read the message */ - memcpy(&change->data.msg.message_size, data, sizeof(Size)); - data += sizeof(Size); + ReorderBufferReadData(*file, &change->data.msg.message_size, + sizeof(Size), NULL); change->data.msg.message = MemoryContextAlloc(rb->context, change->data.msg.message_size); - memcpy(change->data.msg.message, data, - change->data.msg.message_size); - data += change->data.msg.message_size; + ReorderBufferReadData(*file, change->data.msg.message, + change->data.msg.message_size, NULL); break; } @@ -4301,29 +4194,32 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, MemoryContextAlloc(rb->context, inval_size); /* read the message */ - memcpy(change->data.inval.invalidations, data, inval_size); + ReorderBufferReadData(*file, change->data.inval.invalidations, + inval_size, NULL); break; } case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT: { - Snapshot oldsnap; + 
SnapshotData oldsnap; Snapshot newsnap; Size size; - oldsnap = (Snapshot) data; + ReorderBufferReadData(*file, &oldsnap, sizeof(SnapshotData), NULL); size = sizeof(SnapshotData) + - sizeof(TransactionId) * oldsnap->xcnt + - sizeof(TransactionId) * (oldsnap->subxcnt + 0); + sizeof(TransactionId) * oldsnap.xcnt + + sizeof(TransactionId) * (oldsnap.subxcnt + 0); change->data.snapshot = MemoryContextAllocZero(rb->context, size); newsnap = change->data.snapshot; - memcpy(newsnap, data, size); + memcpy(newsnap, &oldsnap, sizeof(SnapshotData)); newsnap->xip = (TransactionId *) (((char *) newsnap) + sizeof(SnapshotData)); + ReorderBufferReadData(*file, newsnap->xip, + size - sizeof(SnapshotData), NULL); newsnap->subxip = newsnap->xip + newsnap->xcnt; newsnap->copied = true; break; @@ -4335,7 +4231,9 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, relids = ReorderBufferGetRelids(rb, change->data.truncate.nrelids); - memcpy(relids, data, change->data.truncate.nrelids * sizeof(Oid)); + ReorderBufferReadData(*file, relids, + change->data.truncate.nrelids * sizeof(Oid), + NULL); change->data.truncate.relids = relids; break; @@ -4362,6 +4260,77 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, ReorderBufferChangeSize(change)); } +/* + * Convert heap tuple from its on-disk format to in-memory format. + */ +static ReorderBufferTupleBuf * +ReorderBufferRestoreTuple(ReorderBuffer *rb, TransientBufFile *file) +{ + HeapTupleData tupdata; + uint32 tuplelen; + ReorderBufferTupleBuf *result; + + ReorderBufferReadData(file, &tupdata, sizeof(HeapTupleData), NULL); + tuplelen = tupdata.t_len; + + result = ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader); + + /* restore ->tuple */ + memcpy(&result->tuple, &tupdata, sizeof(HeapTupleData)); + + /* reset t_data pointer into the new tuplebuf */ + result->tuple.t_data = ReorderBufferTupleBufData(result); + + /* restore tuple data itself */ + ReorderBufferReadData(file, result->tuple.t_data, tuplelen, NULL); + + return result; +} + +/* + * Wrapper for BufFileReadTransient() that raises ERROR if the expected amount + * of bytes was not read. + * + * If valid pointer is passed for no_data_p, set *no_data_p to indicate + * whether zero bytes was read. If NULL is passed, do not tolerate missing + * data. + */ +static void +ReorderBufferReadData(TransientBufFile *file, void *ptr, size_t size, + bool *no_data_p) +{ + int readBytes; + + /* + * Caller should not request zero bytes. This assumption simplifies + * setting of *no_data_p below. + */ + Assert(size > 0); + + if ((readBytes = BufFileReadTransient(file, ptr, size)) != size) + { + if (no_data_p) + *no_data_p = readBytes == 0; + + /* + * It is o.k. to receive exactly zero bytes if caller passed valid + * no_data_p. + */ + if (no_data_p && *no_data_p) + return; + + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes", + readBytes, (uint32) size))); + } + else if (no_data_p) + { + /* Given that size is non-zero, readBytes must be non-zero too. */ + *no_data_p = false; + } +} + /* * Remove all on-disk stored for the passed in transaction. */ diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index a2f75b23b8..af5b329f8a 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -11,6 +11,7 @@ * - Add a pgstat config column to pg_database, so this * entire thing can be enabled/disabled on a per db basis. 
* + * Portions Copyright (c) 2019-2021, CYBERTEC PostgreSQL International GmbH * Copyright (c) 2001-2021, PostgreSQL Global Development Group * * src/backend/postmaster/pgstat.c @@ -54,6 +55,7 @@ #include "replication/slot.h" #include "replication/walsender.h" #include "storage/backendid.h" +#include "storage/buffile.h" #include "storage/dsm.h" #include "storage/fd.h" #include "storage/ipc.h" @@ -308,10 +310,16 @@ NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_no static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create); static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create); +static void pgstat_tweak_base(bool permanent, bool global, Oid database, + char tweak_base[TWEAK_BASE_SIZE]); static void pgstat_write_statsfiles(bool permanent, bool allDbs); static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent); +static void pgstat_write_bytes(TransientBufFile *file, void *ptr, size_t size, + bool *failed); static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep); static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent); +static void pgstat_read_bytes(TransientBufFile *file, void *ptr, size_t size, + bool *failed); static void backend_read_statsfile(void); static bool pgstat_write_statsfile_needed(void); @@ -3202,6 +3210,9 @@ PgstatCollectorMain(int argc, char *argv[]) MyBackendType = B_STATS_COLLECTOR; init_ps_display(NULL); + /* BufFileOpenTransient() and friends do use VFD. */ + InitFileAccess(); + /* * Read in existing stats files or initialize the stats to zero. */ @@ -3595,6 +3606,26 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create) return result; } +/* + * Initialize the common part of the encryption tweak. + */ +static void +pgstat_tweak_base(bool permanent, bool global, Oid database, + char tweak_base[TWEAK_BASE_SIZE]) +{ + char *c = tweak_base; + + StaticAssertStmt(3 + sizeof(Oid) <= TWEAK_BASE_SIZE, + "tweak components do not fit into TWEAK_BASE_SIZE"); + memset(tweak_base, 0, TWEAK_BASE_SIZE); + *c = TRANS_BUF_FILE_PGSTATS; + c++; + *c = permanent; + c++; + *c = global; + c++; + memcpy(c, &database, sizeof(Oid)); +} /* ---------- * pgstat_write_statsfiles() - @@ -3615,18 +3646,24 @@ pgstat_write_statsfiles(bool permanent, bool allDbs) { HASH_SEQ_STATUS hstat; PgStat_StatDBEntry *dbentry; - FILE *fpout; + TransientBufFile *fpout; + File vfd; int32 format_id; const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname; const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename; - int rc; + bool failed = false; + char tweak_base[TWEAK_BASE_SIZE]; elog(DEBUG2, "writing stats file \"%s\"", statfile); /* * Open the statistics temp file to write out the current values. */ - fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (data_encrypted) + pgstat_tweak_base(permanent, true, InvalidOid, tweak_base); + fpout = BufFileOpenTransient(tmpfile, + O_CREAT | O_WRONLY | O_APPEND | PG_BINARY, + tweak_base); if (fpout == NULL) { ereport(LOG, @@ -3645,32 +3682,27 @@ pgstat_write_statsfiles(bool permanent, bool allDbs) * Write the file header --- currently just a format ID. 
*/ format_id = PGSTAT_FILE_FORMAT_ID; - rc = fwrite(&format_id, sizeof(format_id), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, &format_id, sizeof(format_id), &failed); /* * Write global stats struct */ - rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, &globalStats, sizeof(globalStats), &failed); /* * Write archiver stats struct */ - rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, &archiverStats, sizeof(archiverStats), &failed); /* * Write WAL stats struct */ - rc = fwrite(&walStats, sizeof(walStats), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, &walStats, sizeof(walStats), &failed); /* * Write SLRU stats struct */ - rc = fwrite(slruStats, sizeof(slruStats), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, slruStats, sizeof(slruStats), &failed); /* * Walk through the database table. @@ -3694,9 +3726,11 @@ pgstat_write_statsfiles(bool permanent, bool allDbs) * Write out the DB entry. We don't write the tables or functions * pointers, since they're of no use to any other process. */ - fputc('D', fpout); - rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, "D", 1, &failed); + pgstat_write_bytes(fpout, + dbentry, + offsetof(PgStat_StatDBEntry, tables), + &failed); } /* @@ -3709,29 +3743,36 @@ pgstat_write_statsfiles(bool permanent, bool allDbs) hash_seq_init(&hstat, replSlotStatHash); while ((slotent = (PgStat_StatReplSlotEntry *) hash_seq_search(&hstat)) != NULL) { - fputc('R', fpout); - rc = fwrite(slotent, sizeof(PgStat_StatReplSlotEntry), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, "R", 1, &failed); + pgstat_write_bytes(fpout, slotent, sizeof(PgStat_StatReplSlotEntry), + &failed); } } /* * No more output to be done. Close the temp file and replace the old - * pgstat.stat with it. The ferror() check replaces testing for error - * after each individual fputc or fwrite above. + * pgstat.stat with it. */ - fputc('E', fpout); + pgstat_write_bytes(fpout, "E", 1, &failed); - if (ferror(fpout)) + if (failed) { ereport(LOG, (errcode_for_file_access(), errmsg("could not write temporary statistics file \"%s\": %m", tmpfile))); - FreeFile(fpout); + BufFileCloseTransient(fpout); unlink(tmpfile); + return; } - else if (FreeFile(fpout) < 0) + + /* + * XXX This might PANIC, see FileClose(). Don't we need special behaviour + * for statistics? 
+ */ + vfd = BufFileTransientGetVfd(fpout); + BufFileCloseTransient(fpout); + if (!FileIsClosed(vfd)) { ereport(LOG, (errcode_for_file_access(), @@ -3796,12 +3837,14 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) HASH_SEQ_STATUS fstat; PgStat_StatTabEntry *tabentry; PgStat_StatFuncEntry *funcentry; - FILE *fpout; + TransientBufFile *fpout; + File vfd; int32 format_id; Oid dbid = dbentry->databaseid; - int rc; char tmpfile[MAXPGPATH]; char statfile[MAXPGPATH]; + bool failed = false; + char tweak_base[TWEAK_BASE_SIZE]; get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH); get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH); @@ -3811,7 +3854,11 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) /* * Open the statistics temp file to write out the current values. */ - fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (data_encrypted) + pgstat_tweak_base(permanent, false, dbid, tweak_base); + fpout = BufFileOpenTransient(tmpfile, + O_CREAT | O_WRONLY | O_APPEND | PG_BINARY, + tweak_base); if (fpout == NULL) { ereport(LOG, @@ -3825,8 +3872,7 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) * Write the file header --- currently just a format ID. */ format_id = PGSTAT_FILE_FORMAT_ID; - rc = fwrite(&format_id, sizeof(format_id), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, &format_id, sizeof(format_id), &failed); /* * Walk through the database's access stats per table. @@ -3834,9 +3880,14 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) hash_seq_init(&tstat, dbentry->tables); while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL) { - fputc('T', fpout); - rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, "T", 1, &failed); + if (failed) + break; + + pgstat_write_bytes(fpout, tabentry, sizeof(PgStat_StatTabEntry), + &failed); + if (failed) + break; } /* @@ -3845,28 +3896,42 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) hash_seq_init(&fstat, dbentry->functions); while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL) { - fputc('F', fpout); - rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout); - (void) rc; /* we'll check for error with ferror */ + pgstat_write_bytes(fpout, "F", 1, &failed); + if (failed) + break; + + pgstat_write_bytes(fpout, + funcentry, + sizeof(PgStat_StatFuncEntry), + &failed); + if (failed) + break; } /* * No more output to be done. Close the temp file and replace the old - * pgstat.stat with it. The ferror() check replaces testing for error - * after each individual fputc or fwrite above. + * pgstat.stat with it. */ - fputc('E', fpout); + pgstat_write_bytes(fpout, "E", 1, &failed); - if (ferror(fpout)) + if (failed) { ereport(LOG, (errcode_for_file_access(), errmsg("could not write temporary statistics file \"%s\": %m", tmpfile))); - FreeFile(fpout); + BufFileCloseTransient(fpout); unlink(tmpfile); + return; } - else if (FreeFile(fpout) < 0) + + /* + * XXX This might PANIC, see FileClose(). Don't we need special behaviour + * for statistics? 
+ */ + vfd = BufFileTransientGetVfd(fpout); + BufFileCloseTransient(fpout); + if (!FileIsClosed(vfd)) { ereport(LOG, (errcode_for_file_access(), @@ -3892,6 +3957,25 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) } } +/* + * Convenience routine to write data to file and check for errors. + */ +static void +pgstat_write_bytes(TransientBufFile *file, void *ptr, size_t size, + bool *failed) +{ + /* Do nothing if any previous write failed. */ + if (*failed) + return; + + /* + * Use BufFileWriteTransient() because it handles encryption + * transparently. + */ + if (BufFileWriteTransient(file, ptr, size) != size) + *failed = true; +} + /* ---------- * pgstat_read_statsfiles() - * @@ -3919,10 +4003,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) PgStat_StatDBEntry dbbuf; HASHCTL hash_ctl; HTAB *dbhash; - FILE *fpin; + TransientBufFile *fpin; int32 format_id; bool found; const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename; + bool failed = false; + char tweak_base[TWEAK_BASE_SIZE]; int i; /* @@ -3971,7 +4057,11 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) * not yet written the stats file the first time. Any other failure * condition is suspicious. */ - if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL) + if (data_encrypted) + pgstat_tweak_base(permanent, true, InvalidOid, tweak_base); + if ((fpin = BufFileOpenTransient(statfile, + O_RDONLY | PG_BINARY, + tweak_base)) == NULL) { if (errno != ENOENT) ereport(pgStatRunningInCollector ? LOG : WARNING, @@ -3984,8 +4074,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) /* * Verify it's of the expected format. */ - if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) || - format_id != PGSTAT_FILE_FORMAT_ID) + pgstat_read_bytes(fpin, &format_id, sizeof(format_id), &failed); + if (failed || format_id != PGSTAT_FILE_FORMAT_ID) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); @@ -3995,7 +4085,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) /* * Read global stats struct */ - if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats)) + pgstat_read_bytes(fpin, &globalStats, sizeof(globalStats), &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); @@ -4016,7 +4107,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) /* * Read archiver stats struct */ - if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats)) + pgstat_read_bytes(fpin, &archiverStats, sizeof(archiverStats), &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); @@ -4027,7 +4119,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) /* * Read WAL stats struct */ - if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats)) + pgstat_read_bytes(fpin, &walStats, sizeof(walStats), &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); @@ -4038,7 +4131,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) /* * Read SLRU stats struct */ - if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats)) + pgstat_read_bytes(fpin, &slruStats, sizeof(slruStats), &failed); + if (failed) { ereport(pgStatRunningInCollector ? 
LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); @@ -4052,15 +4146,22 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) */ for (;;) { - switch (fgetc(fpin)) + char c; + + pgstat_read_bytes(fpin, &c, 1, &failed); + + switch (c) { /* * 'D' A PgStat_StatDBEntry struct describing a database * follows. */ case 'D': - if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables), - fpin) != offsetof(PgStat_StatDBEntry, tables)) + pgstat_read_bytes(fpin, + &dbbuf, + offsetof(PgStat_StatDBEntry, tables), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", @@ -4144,8 +4245,10 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) PgStat_StatReplSlotEntry slotbuf; PgStat_StatReplSlotEntry *slotent; - if (fread(&slotbuf, 1, sizeof(PgStat_StatReplSlotEntry), fpin) - != sizeof(PgStat_StatReplSlotEntry)) + pgstat_read_bytes(fpin, &slotbuf, + sizeof(PgStat_StatReplSlotEntry), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", @@ -4186,7 +4289,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) } done: - FreeFile(fpin); + BufFileCloseTransient(fpin); /* If requested to read the permanent file, also get rid of it. */ if (permanent) @@ -4221,10 +4324,12 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, PgStat_StatTabEntry tabbuf; PgStat_StatFuncEntry funcbuf; PgStat_StatFuncEntry *funcentry; - FILE *fpin; + TransientBufFile *fpin; int32 format_id; bool found; char statfile[MAXPGPATH]; + bool failed = false; + char tweak_base[TWEAK_BASE_SIZE]; get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH); @@ -4237,7 +4342,11 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, * not yet written the stats file the first time. Any other failure * condition is suspicious. */ - if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL) + if (data_encrypted) + pgstat_tweak_base(permanent, false, databaseid, tweak_base); + if ((fpin = BufFileOpenTransient(statfile, + O_RDONLY | PG_BINARY, + tweak_base)) == NULL) { if (errno != ENOENT) ereport(pgStatRunningInCollector ? LOG : WARNING, @@ -4250,8 +4359,8 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, /* * Verify it's of the expected format. */ - if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) || - format_id != PGSTAT_FILE_FORMAT_ID) + pgstat_read_bytes(fpin, &format_id, sizeof(format_id), &failed); + if (failed || format_id != PGSTAT_FILE_FORMAT_ID) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); @@ -4264,14 +4373,19 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, */ for (;;) { - switch (fgetc(fpin)) + char c; + + pgstat_read_bytes(fpin, &c, 1, &failed); + + switch (c) { /* * 'T' A PgStat_StatTabEntry follows. */ case 'T': - if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry), - fpin) != sizeof(PgStat_StatTabEntry)) + pgstat_read_bytes(fpin, &tabbuf, sizeof(PgStat_StatTabEntry), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", @@ -4304,8 +4418,11 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, * 'F' A PgStat_StatFuncEntry follows. 
*/ case 'F': - if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry), - fpin) != sizeof(PgStat_StatFuncEntry)) + pgstat_read_bytes(fpin, + &funcbuf, + sizeof(PgStat_StatFuncEntry), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", @@ -4349,7 +4466,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, } done: - FreeFile(fpin); + BufFileCloseTransient(fpin); if (permanent) { @@ -4386,15 +4503,21 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, PgStat_WalStats myWalStats; PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS]; PgStat_StatReplSlotEntry myReplSlotStats; - FILE *fpin; + TransientBufFile *fpin; int32 format_id; const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename; + bool failed = false; + char tweak_base[TWEAK_BASE_SIZE]; /* * Try to open the stats file. As above, anything but ENOENT is worthy of * complaining about. */ - if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL) + if (data_encrypted) + pgstat_tweak_base(permanent, true, InvalidOid, tweak_base); + if ((fpin = BufFileOpenTransient(statfile, + O_RDONLY | PG_BINARY, + tweak_base)) == NULL) { if (errno != ENOENT) ereport(pgStatRunningInCollector ? LOG : WARNING, @@ -4407,58 +4530,62 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, /* * Verify it's of the expected format. */ - if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) || - format_id != PGSTAT_FILE_FORMAT_ID) + pgstat_read_bytes(fpin, &format_id, sizeof(format_id), &failed); + if (failed || format_id != PGSTAT_FILE_FORMAT_ID) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } /* * Read global stats struct */ - if (fread(&myGlobalStats, 1, sizeof(myGlobalStats), - fpin) != sizeof(myGlobalStats)) + pgstat_read_bytes(fpin, &myGlobalStats, sizeof(myGlobalStats), &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } /* * Read archiver stats struct */ - if (fread(&myArchiverStats, 1, sizeof(myArchiverStats), - fpin) != sizeof(myArchiverStats)) + pgstat_read_bytes(fpin, &myArchiverStats, sizeof(myArchiverStats), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } /* * Read WAL stats struct */ - if (fread(&myWalStats, 1, sizeof(myWalStats), fpin) != sizeof(myWalStats)) + pgstat_read_bytes(fpin, &myWalStats, sizeof(myWalStats), &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } /* * Read SLRU stats struct */ - if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats)) + pgstat_read_bytes(fpin, mySLRUStats, sizeof(mySLRUStats), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? 
LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } @@ -4471,20 +4598,27 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, */ for (;;) { - switch (fgetc(fpin)) + char c; + + pgstat_read_bytes(fpin, &c, 1, &failed); + + switch (c) { /* * 'D' A PgStat_StatDBEntry struct describing a database * follows. */ case 'D': - if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables), - fpin) != offsetof(PgStat_StatDBEntry, tables)) + pgstat_read_bytes(fpin, + &dbentry, + offsetof(PgStat_StatDBEntry, tables), + &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } @@ -4505,13 +4639,14 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, * replication slot follows. */ case 'R': - if (fread(&myReplSlotStats, 1, sizeof(PgStat_StatReplSlotEntry), fpin) - != sizeof(PgStat_StatReplSlotEntry)) + pgstat_read_bytes(fpin, &myReplSlotStats, + sizeof(PgStat_StatReplSlotEntry), &failed); + if (failed) { ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } break; @@ -4524,17 +4659,35 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, ereport(pgStatRunningInCollector ? LOG : WARNING, (errmsg("corrupted statistics file \"%s\"", statfile))); - FreeFile(fpin); + BufFileCloseTransient(fpin); return false; } } } done: - FreeFile(fpin); + BufFileCloseTransient(fpin); return true; } +/* + * Convenience routine to read data from file and check for errors. + */ +static void +pgstat_read_bytes(TransientBufFile *file, void *ptr, size_t size, + bool *failed) +{ + /* Do nothing if any previous read failed. */ + if (*failed) + return; + + /* + * Use BufFileReadTransient() because it handles encryption transparently. + */ + if (BufFileReadTransient(file, ptr, size) != size) + *failed = true; +} + /* * If not already done, read the statistics collector stats file into * some hash tables. The results will be kept until pgstat_clear_snapshot()
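For readers who do not want to wade through the whole diff, the caller-side pattern described above boils down to something like the following sketch (includes and error-message wording are made up, and write_spill_chunk() is a hypothetical caller); TransientBufFile, TWEAK_BASE_SIZE, TRANS_BUF_FILE_REORDERBUFFER and the BufFile*Transient() calls are the API from the attached patch, not from core PostgreSQL.

/*
 * Sketch of the caller-side change: instead of OpenTransientFile() +
 * write() + CloseTransientFile(), a caller opens a TransientBufFile with a
 * tweak base and lets the buffile layer buffer and (if data_encrypted)
 * encrypt the data transparently.
 */
static void
write_spill_chunk(const char *path, void *data, size_t len)
{
	TransientBufFile *file;
	char		tweak_base[TWEAK_BASE_SIZE];

	memset(tweak_base, 0, TWEAK_BASE_SIZE);
	if (data_encrypted)
		tweak_base[0] = TRANS_BUF_FILE_REORDERBUFFER;	/* plus per-file fields */

	file = BufFileOpenTransient(path,
								O_CREAT | O_WRONLY | O_APPEND | PG_BINARY,
								tweak_base);
	if (BufFileWriteTransient(file, data, len) != len)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not write to file \"%s\": %m", path)));

	BufFileCloseTransient(file);
}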
On Mon, Nov 29, 2021 at 08:37:31AM +0100, Antonin Houska wrote: > The changes to buffile.c are not trivial, but we haven't really changed the > API, as long as you mean BufFileCreateTemp(), BufFileWrite(), BufFileRead(). > > What our patch affects on the caller side is that BufFileOpenTransient(), > BufFileCloseTransient(), BufFileWriteTransient() and BufFileReadTransient() > replace OpenTransientFile(), CloseTransientFile(), write()/fwrite() and > read()/fread() respectively in reorderbuffer.c and in pgstat.c. These changes > become a little bit less invasive in TDE 1.1 than they were in 1.0, see [1], > see the diffs attached. With pg_upgrade modified to preserve the relfilenode, tablespace oid, and database oid, we are now closer to implementing cluster file encryption using XTS. I think we have a few steps left: 1. modify temporary file I/O to use a more centralized API 2. modify the existing cluster file encryption patch to use XTS with an IV that uses more than the LSN 3. add XTS regression test code like CTR 4. create WAL encryption code using CTR If we can do #1 in PG 15 I think I can have #2 ready for PG 16 in July. The feature wiki page is: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption Do people want to advance this feature forward? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
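As a rough illustration of what step 2 could mean in practice, the identifiers that pg_upgrade now preserves happen to fill a 16-byte XTS tweak exactly. The sketch below shows one possible layout; XTS_TWEAK_SIZE and make_page_tweak() are hypothetical names, and the field order, padding, and whether anything else (e.g. the LSN) should be mixed in are all still open design questions.

/*
 * Sketch only: one possible layout of a 16-byte XTS tweak for a relation
 * page, built from the identifiers that pg_upgrade now preserves.
 */
#define XTS_TWEAK_SIZE 16

static void
make_page_tweak(unsigned char tweak[XTS_TWEAK_SIZE],
				Oid dbOid, Oid spcOid, Oid relfilenode, BlockNumber blkno)
{
	unsigned char *p = tweak;

	memset(tweak, 0, XTS_TWEAK_SIZE);
	memcpy(p, &dbOid, sizeof(Oid));
	p += sizeof(Oid);
	memcpy(p, &spcOid, sizeof(Oid));
	p += sizeof(Oid);
	memcpy(p, &relfilenode, sizeof(Oid));
	p += sizeof(Oid);
	memcpy(p, &blkno, sizeof(BlockNumber));		/* 4 + 4 + 4 + 4 = 16 bytes */
}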
Bruce Momjian <bruce@momjian.us> wrote: > On Mon, Nov 29, 2021 at 08:37:31AM +0100, Antonin Houska wrote: > > The changes to buffile.c are not trivial, but we haven't really changed the > > API, as long as you mean BufFileCreateTemp(), BufFileWrite(), BufFileRead(). > > > > What our patch affects on the caller side is that BufFileOpenTransient(), > > BufFileCloseTransient(), BufFileWriteTransient() and BufFileReadTransient() > > replace OpenTransientFile(), CloseTransientFile(), write()/fwrite() and > > read()/fread() respectively in reorderbuffer.c and in pgstat.c. These changes > > become a little bit less invasive in TDE 1.1 than they were in 1.0, see [1], > > see the diffs attached. > > With pg_upgrade modified to preserve the relfilenode, tablespace oid, and > database oid, we are now closer to implementing cluster file encryption > using XTS. I think we have a few steps left: > > 1. modify temporary file I/O to use a more centralized API > 2. modify the existing cluster file encryption patch to use XTS with a > IV that uses more than the LSN > 3. add XTS regression test code like CTR > 4. create WAL encryption code using CTR > > If we can do #1 in PG 15 I think I can have #2 ready for PG 16 in July. > The feature wiki page is: > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption > > Do people want to advance this feature forward? I confirm that we (Cybertec) do and that we're ready to spend more time on the community implementation. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Tue, Feb 1, 2022 at 07:45:06AM +0100, Antonin Houska wrote: > > With pg_upgrade modified to preserve the relfilenode, tablespace oid, and > > database oid, we are now closer to implementing cluster file encryption > > using XTS. I think we have a few steps left: > > > > 1. modify temporary file I/O to use a more centralized API > > 2. modify the existing cluster file encryption patch to use XTS with a > > IV that uses more than the LSN > > 3. add XTS regression test code like CTR > > 4. create WAL encryption code using CTR > > > > If we can do #1 in PG 15 I think I can have #2 ready for PG 16 in July. > > The feature wiki page is: > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption > > > > Do people want to advance this feature forward? > > I confirm that we (Cybertec) do and that we're ready to spend more time on the > community implementation. Well, I sent an email a week ago asking if people want to advance this feature forward, and so far you are the only person to reply, which I think means there isn't enough interest in this feature to advance it. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings,
On Tue, Feb 1, 2022 at 12:50 Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Feb 1, 2022 at 07:45:06AM +0100, Antonin Houska wrote:
> > With pg_upgrade modified to preserve the relfilenode, tablespace oid, and
> > database oid, we are now closer to implementing cluster file encryption
> > using XTS. I think we have a few steps left:
> >
> > 1. modify temporary file I/O to use a more centralized API
> > 2. modify the existing cluster file encryption patch to use XTS with a
> > IV that uses more than the LSN
> > 3. add XTS regression test code like CTR
> > 4. create WAL encryption code using CTR
> >
> > If we can do #1 in PG 15 I think I can have #2 ready for PG 16 in July.
> > The feature wiki page is:
> >
> > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption
> >
> > Do people want to advance this feature forward?
>
> I confirm that we (Cybertec) do and that we're ready to spend more time on the
> community implementation.
Well, I sent an email a week ago asking if people want to advance this
feature forward, and so far you are the only person to reply, which I
think means there isn't enough interest in this feature to advance it.
This confuses me. Clearly there’s plenty of interest, but asking on hackers in a deep old sub thread isn’t a terribly good way to judge that. Yet even when there is an active positive response, you argue that there isn’t enough.
In general, I agree that the items you laid out are what the next steps are. There are patches for some of those items already too, and some of them, such as consolidating the temporary file access, are beneficial even without the potential to use them for encryption.
Instead of again asking if people want this feature (many, many, many do), I’d encourage Antonin to start a new thread with the patch to do the temporary file access consolidation, which provides buffered access and reduces the number of syscalls, and to work towards getting that committed, ideally as part of this release.
Thanks,
Stephen
On Tue, Feb 1, 2022 at 01:07:36PM -0500, Stephen Frost wrote: > Well, I sent an email a week ago asking if people want to advance this > feature forward, and so far you are the only person to reply, which I > think means there isn't enough interest in this feature to advance it. > > This confuses me. Clearly there’s plenty of interest, but asking on hackers in > a deep old sub thread isn’t a terribly good way to judge that. Yet even when > there is an active positive response, you argue that there isn’t enough. Uh, I have been led down the path of disinterest/confusion on this feature enough that I am looking for positive feedback on every new step so I don't get stuck out in front with insufficient support. Yes, only one person replying is enough for me to say there isn't interest. I guess I now have two. My email was short and ended with a question, so I thought the people interested in the steps I suggested would give some kind of feedback --- I certainly try to reply to all emails on this topic. > In general, I agree that the items you laid out are what the next steps are. > There are patches for some of those items already too and some of them, such as > consolidating the temporary file access, are beneficial even without the > potential to use them for encryption. Great. I can update my patch for July consideration. > Instead of again asking if people want this feature (many, many, many do), I’d > encourage Antonin to start a new thread with the patch to do the temporary file > access consolidation which then provides a buffered access and reduces the > number of syscalls and work towards getting that committed, ideally as part of > this release. Yes, agreed. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Hi, On 2022-02-01 13:27:03 -0500, Bruce Momjian wrote: > On Tue, Feb 1, 2022 at 01:07:36PM -0500, Stephen Frost wrote: > > Well, I sent an email a week ago asking if people want to advance this > > feature forward, and so far you are the only person to reply, which I > > think means there isn't enough interest in this feature to advance it. > > > > This confuses me. Clearly there’s plenty of interest, but asking on hackers in > > a deep old sub thread isn’t a terribly good way to judge that. Yet even when > > there is an active positive response, you argue that there isn’t enough. > > Uh, I have been lead down the path of disinterest/confusion on this > feature enough that I am looking for positive feedback on every new step > so I don't get stuck out in front with insufficient support. Yes, only > one person replying is enough for me to say there isn't interest. I > guess I now have two. My email was short and ended with a question so I > thought the people interested in the steps I suggested would give some > kind of feedback --- I certainly try to reply to all emails on this > topic. Personally I can't keep up with all threads on -hackers all the time, especially not long and at times very busy threads. So I agree with Stephen that it's not saying much whether or not people react to an email deep in a thread. > > Instead of again asking if people want this feature (many, many, many do), I’d > > encourage Antonin to start a new thread with the patch to do the temporary file > > access consolidation which then provides a buffered access and reduces the > > number of syscalls and work towards getting that committed, ideally as part of > > this release. I think it is quite unlikely that patches of that invasiveness can be merged this release. Greetings, Andres Freund
Hi, On Tue, Feb 01, 2022 at 01:07:36PM -0500, Stephen Frost wrote: > On Tue, Feb 1, 2022 at 12:50 Bruce Momjian <bruce@momjian.us> wrote: > > On Tue, Feb 1, 2022 at 07:45:06AM +0100, Antonin Houska wrote: > > > > With pg_upgrade modified to preserve the relfilenode, tablespace > > > > oid, and database oid, we are now closer to implementing cluster > > > > file encryption using XTS. I think we have a few steps left: > > > > > > > > 1. modify temporary file I/O to use a more centralized API > > > > 2. modify the existing cluster file encryption patch to use XTS with a > > > > IV that uses more than the LSN > > > > 3. add XTS regression test code like CTR > > > > 4. create WAL encryption code using CTR > > > > > > > > If we can do #1 in PG 15 I think I can have #2 ready for PG 16 in July. > > > > The feature wiki page is: > > > > > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption > > > > > > > > Do people want to advance this feature forward? > > > > > > I confirm that we (Cybertec) do and that we're ready to spend more > > > time on the community implementation. > > > > Well, I sent an email a week ago asking if people want to advance this > > feature forward, and so far you are the only person to reply, which I > > think means there isn't enough interest in this feature to advance it. > > This confuses me. Clearly there’s plenty of interest, but asking on hackers > in a deep old sub thread isn’t a terribly good way to judge that. Yet even > when there is an active positive response, you argue that there isn’t > enough. Even more so because Antonin not only replied as an individual, but also in the name of a whole company developing Postgres in general and TDE in particular. Michael -- Michael Banck Team Lead, PostgreSQL Team Project Lead Tel.: +49 2166 9901-171 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 VAT ID number: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Management: Dr. Michael Meskes, Geoff Richardson, Peter Lilley Our handling of personal data is subject to the following provisions: https://www.credativ.de/datenschutz
On 2022/2/2 01:50, Bruce Momjian wrote: > Well, I sent an email a week ago asking if people want to advance this > feature forward, and so far you are the only person to reply, which I > think means there isn't enough interest in this feature to advance it. I am still focused on this thread, and I have a small patch to solve the current buffile problem: https://www.postgresql.org/message-id/a859a753-70f2-bb17-6830-19dbcad11c17%40sasa.su