Thread: Key management with tests
I have completed the key management patch with tests created by Stephen Frost. Original patch by Masahiko Sawada. It requires the hex reorganization patch first. The key patch is now 2.1MB because of the tests, so attaching it here seems unwise: https://github.com/postgres/postgres/compare/master...bmomjian:hex.diff https://github.com/postgres/postgres/compare/master...bmomjian:key.diff I will add it to the commitfest. I think we need to figure out how much of the tests we want to add. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Thu, Dec 31, 2020 at 11:50:47PM -0500, Bruce Momjian wrote: > I have completed the key management patch with tests created by Stephen > Frost. Original patch by Masahiko Sawada. It requires the hex > reorganization patch first. The key patch is now 2.1MB because of the > tests, so attaching it here seems unwise: > > https://github.com/postgres/postgres/compare/master...bmomjian:hex.diff > https://github.com/postgres/postgres/compare/master...bmomjian:key.diff > > I will add it to the commitfest. I think we need to figure out how much > of the tests we want to add. I am getting regression test errors using OpenSSL 1.1.1d 10 Sep 2019 with zero-length input data (no -p), while Stephen is able to get those tests to pass. This needs more research, plus I think higher-level tests. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Fri, Jan 1, 2021 at 01:07:50AM -0500, Bruce Momjian wrote:
> On Thu, Dec 31, 2020 at 11:50:47PM -0500, Bruce Momjian wrote:
> > I have completed the key management patch with tests created by Stephen
> > Frost. Original patch by Masahiko Sawada. It requires the hex
> > reorganization patch first. The key patch is now 2.1MB because of the
> > tests, so attaching it here seems unwise:
> >
> > https://github.com/postgres/postgres/compare/master...bmomjian:hex.diff
> > https://github.com/postgres/postgres/compare/master...bmomjian:key.diff
> >
> > I will add it to the commitfest. I think we need to figure out how much
> > of the tests we want to add.
>
> I am getting regression test errors using OpenSSL 1.1.1d 10 Sep 2019
> with zero-length input data (no -p), while Stephen is able to get those
> tests to pass. This needs more research, plus I think higher-level
> tests.

I have found the cause of the failure, which I added as a C comment:

    /*
     * OpenSSL 1.1.1d and earlier crashes on some zero-length plaintext
     * and ciphertext strings. It crashes on an encryption call to
     * EVP_EncryptFinal_ex() in GCM mode of zero-length strings if
     * plaintext is NULL, even though plaintext_len is zero. Setting
     * plaintext to non-NULL allows it to work. In KW/KWP mode,
     * zero-length strings fail if plaintext_len = 0 and plaintext is
     * non-NULL (the opposite). OpenSSL 1.1.1e+ is fine with all options.
     */
    else if (cipher == PG_CIPHER_AES_GCM)
    {
        plaintext_len = 0;
        plaintext = pg_malloc0(1);
    }

All the tests pass now. The current src/test directory is 19MB, and
adding these tests takes it to 23MB, or a 20% increase. That seems like
a lot. It is testing 128-bit and 256-bit keys --- should we do fewer
tests, or just test 256, or use gzip to compress the tests by 50%?
(Does every platform have gzip?)

My next step is to add the high-level tests.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee
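For context, a minimal sketch, assuming OpenSSL's EVP API, of where a workaround like the one above sits in a typical GCM encryption call; this is illustrative only, not the patch's actual code, and error handling is omitted:

    #include <openssl/evp.h>

    static int
    encrypt_gcm(const unsigned char *key, const unsigned char *iv,
                const unsigned char *plaintext, int plaintext_len,
                unsigned char *ciphertext, unsigned char *tag)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        static const unsigned char dummy[1] = {0};
        int         len,
                    total = 0;

        /*
         * Workaround for OpenSSL 1.1.1d and earlier: GCM crashes in
         * EVP_EncryptFinal_ex() if plaintext is NULL, even when
         * plaintext_len is zero, so point plaintext at a harmless byte.
         */
        if (plaintext == NULL && plaintext_len == 0)
            plaintext = dummy;

        /* assumes a 12-byte IV, the EVP default for GCM */
        EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);
        EVP_EncryptUpdate(ctx, ciphertext, &len, plaintext, plaintext_len);
        total = len;
        EVP_EncryptFinal_ex(ctx, ciphertext + total, &len);
        total += len;

        /* retrieve the 16-byte GCM authentication tag */
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
        EVP_CIPHER_CTX_free(ctx);
        return total;
    }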
On 2021-Jan-07, Bruce Momjian wrote: > All the tests pass now. The current src/test directory is 19MB, and > adding these tests takes it to 23MB, or a 20% increase. That seems like > a lot. It is testing 128-bit and 256-bit keys --- should we do fewer > tests, or just test 256, or use gzip to compress the tests by 50%? > (Does every platform have gzip?) So the tests are about 95% of the patch ... do we really need that many tests? -- Álvaro Herrera
On Thu, Jan 7, 2021 at 04:08:49PM -0300, Álvaro Herrera wrote: > On 2021-Jan-07, Bruce Momjian wrote: > > > All the tests pass now. The current src/test directory is 19MB, and > > adding these tests takes it to 23MB, or a 20% increase. That seems like > > a lot. It is testing 128-bit and 256-bit keys --- should we do fewer > > tests, or just test 256, or use gzip to compress the tests by 50%? > > (Does every platform have gzip?) > > So the tests are about 95% of the patch ... do we really need that many > tests? No, I don't think so. Stephen imported the entire NIST test suite. It was so comprehensive, it detected several OpenSSL bugs for zero-length strings, which I already reported, but we would never be encrypting zero-length strings, so there wasn't a lot of value to it. Anyway, I think we need to figure out how to trim. The first part would be to figure out whether we need 128 _and_ 256-bit tests, and then see what items are really useful. Stephen, do you have any ideas on that? We currently have 10296 tests, and I think we could get away with 100. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Thu, Jan 7, 2021 at 10:02:14AM -0500, Bruce Momjian wrote: > My next step is to add the high-level tests. Here is the high-level script, and the log output. I used the pg_upgrade test.sh as a model. It uses "CFE DEBUG" lines that are already in the code to compare the initdb encryption with the other initdb decryption and pg_ctl decryption. It was easier than I thought. What it does not do is to test the file descriptor passing from /dev/tty, or the sample scripts. This seems acceptable to me since I test them and they rarely change. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Attachment
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Thu, Jan 7, 2021 at 04:08:49PM -0300, Álvaro Herrera wrote:
> > On 2021-Jan-07, Bruce Momjian wrote:
> > > All the tests pass now. The current src/test directory is 19MB, and
> > > adding these tests takes it to 23MB, or a 20% increase. That seems like
> > > a lot. It is testing 128-bit and 256-bit keys --- should we do fewer
> > > tests, or just test 256, or use gzip to compress the tests by 50%?
> > > (Does every platform have gzip?)
> >
> > So the tests are about 95% of the patch ... do we really need that many
> > tests?
>
> No, I don't think so. Stephen imported the entire NIST test suite. It
> was so comprehensive, it detected several OpenSSL bugs for zero-length
> strings, which I already reported, but we would never be encrypting
> zero-length strings, so there wasn't a lot of value to it.

I ran the entire test suite locally to ensure everything worked, but I
didn't actually include all of it in the PR which you merged- I had
already reduced it quite a bit by removing all 'additional
authenticated data' test cases (which the tests will automatically skip
and which we haven't implemented support for in the common library
wrappers) and by removing the 192-bit cases. This reduced the overall
test set by about 2/3rd's or so, as I recall.

> Anyway, I think we need to figure out how to trim. The first part would
> be to figure out whether we need 128 _and_ 256-bit tests, and then see
> what items are really useful. Stephen, do you have any ideas on that?
> We currently have 10296 tests, and I think we could get away with 100.

Yeah, it's probably still too much, but I don't have any particularly
justifiable suggestions as to exactly what we should remove or what we
should keep.

Perhaps it'd make sense to try and cover the cases that are more likely
to be issues between our wrapper functions and OpenSSL, and not stress
too much about constantly testing cases that should really be up to
OpenSSL. As such, I'd propose:

- Add back in some 192-bit tests, so we cover all three bit lengths.
- Add back in some additional authenticated test cases, just to make
  sure that, until/unless we implement support, the test code properly
  skips over those.
- Keep tests for various length plaintext/ciphertext (including 0-byte
  cases, so we make sure those work, since they really should).
- Keep at least one test for each length of tag that's included in the
  test suite.

I'm not sure how many tests we'd end up with from that, but my swag /
gut feeling is that it'd probably be on the order of 100ish and a small
enough set that it won't dwarf the rest of the patch.

Would be nice if we had a way for some buildfarm animal or something to
pull in the entire suite and test it, imv.. If anyone wants to
volunteer, I'd be happy to explain how to make that happen (it's not
hard though- download/unzip the files, drop them in the directory,
update the test script to add all the files into the array).

Thanks,

Stephen
Attachment
Greetings Bruce,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Fri, Jan 1, 2021 at 01:07:50AM -0500, Bruce Momjian wrote:
> > On Thu, Dec 31, 2020 at 11:50:47PM -0500, Bruce Momjian wrote:
> > > I have completed the key management patch with tests created by Stephen
> > > Frost. Original patch by Masahiko Sawada. It requires the hex
> > > reorganization patch first. The key patch is now 2.1MB because of the
> > > tests, so attaching it here seems unwise:
> > >
> > > https://github.com/postgres/postgres/compare/master...bmomjian:hex.diff
> > > https://github.com/postgres/postgres/compare/master...bmomjian:key.diff
> > >
> > > I will add it to the commitfest. I think we need to figure out how much
> > > of the tests we want to add.
> >
> > I am getting regression test errors using OpenSSL 1.1.1d 10 Sep 2019
> > with zero-length input data (no -p), while Stephen is able to get those
> > tests to pass. This needs more research, plus I think higher-level
> > tests.
>
> I have found the cause of the failure, which I added as a C comment:
>
>     /*
>      * OpenSSL 1.1.1d and earlier crashes on some zero-length plaintext
>      * and ciphertext strings. It crashes on an encryption call to
>      * EVP_EncryptFinal_ex() in GCM mode of zero-length strings if
>      * plaintext is NULL, even though plaintext_len is zero. Setting
>      * plaintext to non-NULL allows it to work. In KW/KWP mode,
>      * zero-length strings fail if plaintext_len = 0 and plaintext is
>      * non-NULL (the opposite). OpenSSL 1.1.1e+ is fine with all options.
>      */
>     else if (cipher == PG_CIPHER_AES_GCM)
>     {
>         plaintext_len = 0;
>         plaintext = pg_malloc0(1);
>     }
>
> All the tests pass now. The current src/test directory is 19MB, and
> adding these tests takes it to 23MB, or a 20% increase. That seems like
> a lot. It is testing 128-bit and 256-bit keys --- should we do fewer
> tests, or just test 256, or use gzip to compress the tests by 50%?
> (Does every platform have gzip?)

Thanks a lot for working on this and figuring out what the issue was and
fixing it! That's great that we got all those cases passing for you
too.

Thanks again,

Stephen
Attachment
On Fri, Jan 8, 2021 at 03:33:44PM -0500, Stephen Frost wrote:
> > No, I don't think so. Stephen imported the entire NIST test suite. It
> > was so comprehensive, it detected several OpenSSL bugs for zero-length
> > strings, which I already reported, but we would never be encrypting
> > zero-length strings, so there wasn't a lot of value to it.
>
> I ran the entire test suite locally to ensure everything worked, but I
> didn't actually include all of it in the PR which you merged- I had
> already reduced it quite a bit by removing all 'additional
> authenticated data' test cases (which the tests will automatically skip
> and which we haven't implemented support for in the common library
> wrappers) and by removing the 192-bit cases. This reduced the overall
> test set by about 2/3rd's or so, as I recall.

Wow, so that was reduced!

> > Anyway, I think we need to figure out how to trim. The first part would
> > be to figure out whether we need 128 _and_ 256-bit tests, and then see
> > what items are really useful. Stephen, do you have any ideas on that?
> > We currently have 10296 tests, and I think we could get away with 100.
>
> Yeah, it's probably still too much, but I don't have any particularly
> justifiable suggestions as to exactly what we should remove or what we
> should keep.
>
> Perhaps it'd make sense to try and cover the cases that are more likely
> to be issues between our wrapper functions and OpenSSL, and not stress
> too much about constantly testing cases that should really be up to
> OpenSSL. As such, I'd propose:
>
> - Add back in some 192-bit tests, so we cover all three bit lengths.
> - Add back in some additional authenticated test cases, just to make
>   sure that, until/unless we implement support, the test code properly
>   skips over those.
> - Keep tests for various length plaintext/ciphertext (including 0-byte
>   cases, so we make sure those work, since they really should).
> - Keep at least one test for each length of tag that's included in the
>   test suite.

Makes sense. I did a simplistic trim-down to 90 tests but it still was
40% of the patch; attached. The hex strings are very long.

> I'm not sure how many tests we'd end up with from that, but my swag /
> gut feeling is that it'd probably be on the order of 100ish and a small
> enough set that it won't dwarf the rest of the patch.
>
> Would be nice if we had a way for some buildfarm animal or something to
> pull in the entire suite and test it, imv.. If anyone wants to
> volunteer, I'd be happy to explain how to make that happen (it's not
> hard though- download/unzip the files, drop them in the directory,
> update the test script to add all the files into the array).

Yes, do we have a place to store more comprehensive tests outside of our
git tree? Has this been done before?

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee
Attachment
On Fri, Jan 8, 2021 at 03:34:23PM -0500, Stephen Frost wrote: > > All the tests pass now. The current src/test directory is 19MB, and > > adding these tests takes it to 23MB, or a 20% increase. That seems like > > a lot. It is testing 128-bit and 256-bit keys --- should we do fewer > > tests, or just test 256, or use gzip to compress the tests by 50%? > > (Does every platform have gzip?) > > Thanks a lot for working on this and figuring out what the issue was and > fixing it! That's great that we got all those cases passing for you > too. Yes, I was relieved. The pattern of which modes fail on zero-length strings is still very odd, but at least it reports an error, so it isn't returning incorrect data. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Fri, Jan 8, 2021 at 03:33:44PM -0500, Stephen Frost wrote:
> > > Anyway, I think we need to figure out how to trim. The first part would
> > > be to figure out whether we need 128 _and_ 256-bit tests, and then see
> > > what items are really useful. Stephen, do you have any ideas on that?
> > > We currently have 10296 tests, and I think we could get away with 100.
> >
> > Yeah, it's probably still too much, but I don't have any particularly
> > justifiable suggestions as to exactly what we should remove or what we
> > should keep.
> >
> > Perhaps it'd make sense to try and cover the cases that are more likely
> > to be issues between our wrapper functions and OpenSSL, and not stress
> > too much about constantly testing cases that should really be up to
> > OpenSSL. As such, I'd propose:
> >
> > - Add back in some 192-bit tests, so we cover all three bit lengths.
> > - Add back in some additional authenticated test cases, just to make
> >   sure that, until/unless we implement support, the test code properly
> >   skips over those.
> > - Keep tests for various length plaintext/ciphertext (including 0-byte
> >   cases, so we make sure those work, since they really should).
> > - Keep at least one test for each length of tag that's included in the
> >   test suite.
>
> Makes sense. I did a simplistic trim-down to 90 tests but it still was
> 40% of the patch; attached. The hex strings are very long.

I don't think we actually need to stress over the size of the test data
relative to the size of the patch- it's not like it's all that much perl
code. I can appreciate that we don't want to add megabytes worth of
test data to the git repo though.

> > I'm not sure how many tests we'd end up with from that, but my swag /
> > gut feeling is that it'd probably be on the order of 100ish and a small
> > enough set that it won't dwarf the rest of the patch.
> >
> > Would be nice if we had a way for some buildfarm animal or something to
> > pull in the entire suite and test it, imv.. If anyone wants to
> > volunteer, I'd be happy to explain how to make that happen (it's not
> > hard though- download/unzip the files, drop them in the directory,
> > update the test script to add all the files into the array).
>
> Yes, do we have a place to store more comprehensive tests outside of our
> git tree? Has this been done before?

Not that I'm aware of.

Thanks,

Stephen
Attachment
On Fri, Jan 1, 2021 at 01:07:50AM -0500, Bruce Momjian wrote: > On Thu, Dec 31, 2020 at 11:50:47PM -0500, Bruce Momjian wrote: > > I have completed the key management patch with tests created by Stephen > > Frost. Original patch by Masahiko Sawada. It requires the hex > > reorganization patch first. The key patch is now 2.1MB because of the > > tests, so attaching it here seems unwise: > > > > https://github.com/postgres/postgres/compare/master...bmomjian:hex.diff > > https://github.com/postgres/postgres/compare/master...bmomjian:key.diff > > > > I will add it to the commitfest. I think we need to figure out how much > > of the tests we want to add. > > I am getting regression test errors using OpenSSL 1.1.1d 10 Sep 2019 > with zero-length input data (no -p), while Stephen is able to get those > tests to pass. This needs more research, plus I think higher-level > tests. I know we are still working on the hex patch (dest-len) and the crypto tests, but I wanted to post this so people can see where we are, and we can get some current cfbot testing. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Attachment
On Sat, Jan 9, 2021 at 01:17:36PM -0500, Bruce Momjian wrote: > On Fri, Jan 1, 2021 at 01:07:50AM -0500, Bruce Momjian wrote: > > On Thu, Dec 31, 2020 at 11:50:47PM -0500, Bruce Momjian wrote: > > > I have completed the key management patch with tests created by Stephen > > > Frost. Original patch by Masahiko Sawada. It requires the hex > > > reorganization patch first. The key patch is now 2.1MB because of the > > > tests, so attaching it here seems unwise: > > > > > > https://github.com/postgres/postgres/compare/master...bmomjian:hex.diff > > > https://github.com/postgres/postgres/compare/master...bmomjian:key.diff > > > > > > I will add it to the commitfest. I think we need to figure out how much > > > of the tests we want to add. > > > > I am getting regression test errors using OpenSSL 1.1.1d 10 Sep 2019 > > with zero-length input data (no -p), while Stephen is able to get those > > tests to pass. This needs more research, plus I think higher-level > > tests. > > I know we are still working on the hex patch (dest-len) and the crypto > tests, but I wanted to post this so people can see where we are, and we > can get some current cfbot testing. Here is an updated version that covers all the possible testing/configuration options. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Attachment
On Sat, Jan 9, 2021 at 08:08:16PM -0500, Bruce Momjian wrote: > On Sat, Jan 9, 2021 at 01:17:36PM -0500, Bruce Momjian wrote: > > I know we are still working on the hex patch (dest-len) and the crypto > > tests, but I wanted to post this so people can see where we are, and we > > can get some current cfbot testing. > > Here is an updated version that covers all the possible > testing/configuration options. Does anyone know why the cfbot applied the patch listed second first here? http://cfbot.cputube.org/patch_31_2925.log Specifically, it applied hex..key.diff.gz before hex.diff.gz. I assumed it would apply attachments in the order they appear in the email. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Sun, Jan 10, 2021 at 3:45 PM Bruce Momjian <bruce@momjian.us> wrote: > Does anyone know why the cfbot applied the patch listed second first > here? > > http://cfbot.cputube.org/patch_31_2925.log > > Specifically, it applied hex..key.diff.gz before hex.diff.gz. I assumed > it would apply attachments in the order they appear in the email. It sorts the filenames (in this case after the decompression step removes the .gz endings). That works pretty well for the patches that "git format-patch" spits out, but it's a bit hit and miss with cases like yours.
On Sun, Jan 10, 2021 at 06:04:12PM +1300, Thomas Munro wrote: > On Sun, Jan 10, 2021 at 3:45 PM Bruce Momjian <bruce@momjian.us> wrote: > > Does anyone know why the cfbot applied the patch listed second first > > here? > > > > http://cfbot.cputube.org/patch_31_2925.log > > > > Specifically, it applied hex..key.diff.gz before hex.diff.gz. I assumed > > it would apply attachments in the order they appear in the email. > > It sorts the filenames (in this case after the decompression step removes > the .gz endings). That works pretty well for the patches that "git > format-patch" spits out, but it's a bit hit and miss with cases like > yours. OK, here they are with numeric prefixes. It was actually tricky to figure out how to create a squashed format-patch based on another branch. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Attachment
On Sun, Jan 10, 2021 at 11:51 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Sun, Jan 10, 2021 at 06:04:12PM +1300, Thomas Munro wrote:
> > On Sun, Jan 10, 2021 at 3:45 PM Bruce Momjian <bruce@momjian.us> wrote:
> > > Does anyone know why the cfbot applied the patch listed second first
> > > here?
> > >
> > > http://cfbot.cputube.org/patch_31_2925.log
> > >
> > > Specifically, it applied hex..key.diff.gz before hex.diff.gz. I assumed
> > > it would apply attachments in the order they appear in the email.
> >
> > It sorts the filenames (in this case after the decompression step removes
> > the .gz endings). That works pretty well for the patches that "git
> > format-patch" spits out, but it's a bit hit and miss with cases like
> > yours.
>
> OK, here they are with numeric prefixes. It was actually tricky to
> figure out how to create a squashed format-patch based on another branch.
>

Thank you for attaching the patches. It passes all cfbot tests, great.

Looking at the patch, it supports three algorithms but only
PG_CIPHER_AES_KWP is used in the core for now:

+/*
+ * Supported symmetric encryption algorithm. These identifiers are passed
+ * to pg_cipher_ctx_create() function, and then actual encryption
+ * implementations need to initialize their context of the given encryption
+ * algorithm.
+ */
+#define PG_CIPHER_AES_GCM 0
+#define PG_CIPHER_AES_KW 1
+#define PG_CIPHER_AES_KWP 2
+#define PG_MAX_CIPHER_ID 3

Are we in the process of experimenting with which algorithms are better?
If we support only the one algorithm that is actually used in the core,
we could reduce the tests as well.

FWIW, I've written a PoC patch for buffer encryption to make sure the
kms patch would be workable with other components using the encryption
key managed by kmgr.

Overall it's good. While the buffer encryption patch is still PoC
quality and there are some problems regarding nonce generation we need
to deal with, it can easily use the relation key managed by the kmgr to
encrypt/decrypt buffers.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
Attachment
On Mon, Jan 11, 2021 at 08:12:00PM +0900, Masahiko Sawada wrote:
> On Sun, Jan 10, 2021 at 11:51 PM Bruce Momjian <bruce@momjian.us> wrote:
> > OK, here they are with numeric prefixes. It was actually tricky to
> > figure out how to create a squashed format-patch based on another branch.
>
> Thank you for attaching the patches. It passes all cfbot tests, great.

Yeah, I saw that. :-) I had to learn a lot about how to create squashed
format-patches on non-master branches. I have now automated it so it
will be easy going forward.

> Looking at the patch, it supports three algorithms but only
> PG_CIPHER_AES_KWP is used in the core for now:
>
> +/*
> + * Supported symmetric encryption algorithm. These identifiers are passed
> + * to pg_cipher_ctx_create() function, and then actual encryption
> + * implementations need to initialize their context of the given encryption
> + * algorithm.
> + */
> +#define PG_CIPHER_AES_GCM 0
> +#define PG_CIPHER_AES_KW 1
> +#define PG_CIPHER_AES_KWP 2
> +#define PG_MAX_CIPHER_ID 3
>
> Are we in the process of experimenting with which algorithms are better?
> If we support only the one algorithm that is actually used in the core,
> we could reduce the tests as well.

I think we are only using KWP (Key Wrap with Padding) because that is
for wrapping keys:

    https://csrc.nist.gov/CSRC/media/Projects/Cryptographic-Algorithm-Validation-Program/documents/mac/KWVS.pdf

I am not sure about KW. I think we are using GCM for the WAL/heap/index
pages. Stephen would know more.

> FWIW, I've written a PoC patch for buffer encryption to make sure the
> kms patch would be workable with other components using the encryption
> key managed by kmgr.

Wow, it is a small patch --- nice.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Mon, Jan 11, 2021 at 08:12:00PM +0900, Masahiko Sawada wrote:
> > Looking at the patch, it supports three algorithms but only
> > PG_CIPHER_AES_KWP is used in the core for now:
> >
> > +/*
> > + * Supported symmetric encryption algorithm. These identifiers are passed
> > + * to pg_cipher_ctx_create() function, and then actual encryption
> > + * implementations need to initialize their context of the given encryption
> > + * algorithm.
> > + */
> > +#define PG_CIPHER_AES_GCM 0
> > +#define PG_CIPHER_AES_KW 1
> > +#define PG_CIPHER_AES_KWP 2
> > +#define PG_MAX_CIPHER_ID 3
> >
> > Are we in the process of experimenting with which algorithms are better?
> > If we support only the one algorithm that is actually used in the core,
> > we could reduce the tests as well.
>
> I think we are only using KWP (Key Wrap with Padding) because that is
> for wrapping keys:
>
>     https://csrc.nist.gov/CSRC/media/Projects/Cryptographic-Algorithm-Validation-Program/documents/mac/KWVS.pdf

Yes.

> I am not sure about KW. I think we are using GCM for the WAL/heap/index
> pages. Stephen would know more.

KW was more-or-less 'for free' and there were tests for it, which is why
it was included. Yes, GCM would be for WAL/heap/index pages, it
wouldn't be appropriate to use KW or KWP for that. Using KW/KWP for the
key wrapping also makes the API simpler- and therefore easier for other
implementations to be written which provide the same API.

> > FWIW, I've written a PoC patch for buffer encryption to make sure the
> > kms patch would be workable with other components using the encryption
> > key managed by kmgr.
>
> Wow, it is a small patch --- nice.

I agree that the actual encryption patch, for just the main heap/index,
won't be too bad. The larger part will be dealing with all of the
temporary files we create that have user data in them... I've been
contemplating a way to try and make that part of the patch smaller
though and hopefully that will bear fruit and we can avoid having to
change a lot of, eg, reorderbuffer.c and pgstat.c.

There's a few places where we need to be sure to be updating the LSN for
both logged and unlogged relations properly, including dealing with
things like the magic GIST "GistBuildLSN" fake-LSN too, and we will
absolutely need to have a bit used in the IV to distinguish if it's a
real LSN or an unlogged LSN.

Although, another approach and one that I've discussed a bit with Bruce,
is to have more keys- such as a key for temporary files, and perhaps
even a key for logged relations and a different for unlogged.. Or
perhaps sets of keys for each which automatically are rotating every X
number of GB based on the LSN... Which is a big part of why key
management is such an important part of this effort.

Thanks,

Stephen
Attachment
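A minimal sketch, using OpenSSL's EVP interface, of what wrapping a data key under a key-encryption key (KEK) with AES-256-KWP looks like; this is illustrative only, not the patch's wrapper API, and error handling is omitted:

    #include <openssl/evp.h>

    /* Wrap "key" (keylen bytes) under "kek"; returns the wrapped length. */
    static int
    wrap_key_kwp(const unsigned char *kek,
                 const unsigned char *key, int keylen,
                 unsigned char *wrapped)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int         len,
                    total = 0;

        /* OpenSSL requires this flag before the wrap modes can be used. */
        EVP_CIPHER_CTX_set_flags(ctx, EVP_CIPHER_CTX_FLAG_WRAP_ALLOW);

        /* RFC 5649 key wrap with padding; plain KW would be EVP_aes_256_wrap(). */
        EVP_EncryptInit_ex(ctx, EVP_aes_256_wrap_pad(), NULL, kek, NULL);
        EVP_EncryptUpdate(ctx, wrapped, &len, key, keylen);
        total = len;
        EVP_EncryptFinal_ex(ctx, wrapped + total, &len);
        total += len;

        EVP_CIPHER_CTX_free(ctx);
        return total;   /* roughly keylen rounded up to 8 bytes, plus 8 */
    }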
On Mon, Jan 11, 2021 at 12:54:49PM -0500, Stephen Frost wrote: > Although, another approach and one that I've discussed a bit with Bruce, > is to have more keys- such as a key for temporary files, and perhaps > even a key for logged relations and a different for unlogged.. Or Yes, we have to make sure the nonce (computed as LSN/pageno) is never reused, so if we have several LSN usage "spaces", they need different data keys. > perhaps sets of keys for each which automatically are rotating every X > number of GB based on the LSN... Which is a big part of why key > management is such an important part of this effort. Yes, this would avoid the need to failover to a standby for data key rotation. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Mon, Jan 11, 2021 at 12:54:49PM -0500, Stephen Frost wrote: > > Although, another approach and one that I've discussed a bit with Bruce, > > is to have more keys- such as a key for temporary files, and perhaps > > even a key for logged relations and a different for unlogged.. Or > > Yes, we have to make sure the nonce (computed as LSN/pageno) is never > reused, so if we have several LSN usage "spaces", they need different > data keys. Right, or ensure that the actual IV used is distinct (such as by using another bit in the IV to distinguish logged-vs-unlogged), but it seems saner to just use a different key, ultimately. > > perhaps sets of keys for each which automatically are rotating every X > > number of GB based on the LSN... Which is a big part of why key > > management is such an important part of this effort. > > Yes, this would avoid the need to failover to a standby for data key > rotation. Yes, and it avoids the issue of using a single key for too much, which is also a concern. The remaining larger issues are to figure out a place to put the tag for each page, and the relatively simple matter of programming a mechanism to cache the keys we're commonly using (current key for encryption, recently used keys for decryption) since we'll eventually get to a point of having written out more data than we are going to keep keys in memory for. Thanks, Stephen
Attachment
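A minimal sketch of the sort of key cache being described, with a small recently-used set evicted by LRU; the structure, sizes, and the load_and_unwrap_key() helper are hypothetical illustrations, not the patch's design:

    #include <stdint.h>

    #define KEY_LEN         32      /* AES-256 key bytes */
    #define KEY_CACHE_SIZE  8       /* keys kept unwrapped in memory */

    typedef struct CachedKey
    {
        uint64_t    range_start;    /* first LSN this key covers */
        unsigned char key[KEY_LEN];
        uint64_t    last_used;      /* for LRU eviction; 0 = empty slot */
    } CachedKey;

    /* hypothetical helper: read the wrapped key for this LSN range from
     * disk and unwrap it with the key-encryption key */
    extern void load_and_unwrap_key(uint64_t range_start,
                                    unsigned char *key_out);

    static CachedKey key_cache[KEY_CACHE_SIZE];
    static uint64_t use_counter = 0;

    static const unsigned char *
    get_key_for_range(uint64_t range_start)
    {
        int         i,
                    victim = 0;

        for (i = 0; i < KEY_CACHE_SIZE; i++)
        {
            if (key_cache[i].last_used != 0 &&
                key_cache[i].range_start == range_start)
            {
                key_cache[i].last_used = ++use_counter;
                return key_cache[i].key;
            }
            if (key_cache[i].last_used < key_cache[victim].last_used)
                victim = i;
        }

        /* cache miss: evict the least-recently-used slot */
        key_cache[victim].range_start = range_start;
        key_cache[victim].last_used = ++use_counter;
        load_and_unwrap_key(range_start, key_cache[victim].key);
        return key_cache[victim].key;
    }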
On Mon, Jan 11, 2021 at 01:23:27PM -0500, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Mon, Jan 11, 2021 at 12:54:49PM -0500, Stephen Frost wrote: > > > Although, another approach and one that I've discussed a bit with Bruce, > > > is to have more keys- such as a key for temporary files, and perhaps > > > even a key for logged relations and a different for unlogged.. Or > > > > Yes, we have to make sure the nonce (computed as LSN/pageno) is never > > reused, so if we have several LSN usage "spaces", they need different > > data keys. > > Right, or ensure that the actual IV used is distinct (such as by using > another bit in the IV to distinguish logged-vs-unlogged), but it seems > saner to just use a different key, ultimately. Yes, we have eight unused bits in the Nonce right now. > > > perhaps sets of keys for each which automatically are rotating every X > > > number of GB based on the LSN... Which is a big part of why key > > > management is such an important part of this effort. > > > > Yes, this would avoid the need to failover to a standby for data key > > rotation. > > Yes, and it avoids the issue of using a single key for too much, which > is also a concern. The remaining larger issues are to figure out a > place to put the tag for each page, and the relatively simple matter of > programming a mechanism to cache the keys we're commonly using (current > key for encryption, recently used keys for decryption) since we'll > eventually get to a point of having written out more data than we are > going to keep keys in memory for. I thought the LSN range would be stored with the keys, so there is no need to tag the LSN on each page. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Mon, Jan 11, 2021 at 01:23:27PM -0500, Stephen Frost wrote: > > Yes, and it avoids the issue of using a single key for too much, which > > is also a concern. The remaining larger issues are to figure out a > > place to put the tag for each page, and the relatively simple matter of > > programming a mechanism to cache the keys we're commonly using (current > > key for encryption, recently used keys for decryption) since we'll > > eventually get to a point of having written out more data than we are > > going to keep keys in memory for. > > I thought the LSN range would be stored with the keys, so there is no > need to tag the LSN on each page. Yes, LSN range would be stored with the keys in some fashion (maybe just the start of a particular LSN range would be in the filename of the key for that range...). The 'tag' that I'm referring to there is one of the outputs from the GCM encryption and is what provides the integrity / authentication of the encrypted data to be able to detect if it's been modified. Unfortunately, while the page checksum will continue to be used and available for checking against disk corruption, it's not sufficient. Hence, ideally, we'd find a spot to stick the 128-bit tag on each page. Given that, clearly, it's not possible to go from an unencrypted cluster to an encrypted cluster without rewriting the entire cluster, we aren't bound to maintain the on-disk page format, we should be able to accommodate including the tag somewhere. Unfortunately, it doesn't seem quite as trivial as I'd hoped since there are parts of the code which make assumptions about the page beyond perhaps what they should be, but I'm still hopeful that it won't be *too* hard to do. Thanks, Stephen
Attachment
On Mon, Jan 11, 2021 at 02:19:22PM -0500, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Mon, Jan 11, 2021 at 01:23:27PM -0500, Stephen Frost wrote: > > > Yes, and it avoids the issue of using a single key for too much, which > > > is also a concern. The remaining larger issues are to figure out a > > > place to put the tag for each page, and the relatively simple matter of > > > programming a mechanism to cache the keys we're commonly using (current > > > key for encryption, recently used keys for decryption) since we'll > > > eventually get to a point of having written out more data than we are > > > going to keep keys in memory for. > > > > I thought the LSN range would be stored with the keys, so there is no > > need to tag the LSN on each page. > > Yes, LSN range would be stored with the keys in some fashion (maybe just > the start of a particular LSN range would be in the filename of the key > for that range...). The 'tag' that I'm referring to there is one of the Oh, that tag, yes, we need to add that to each page. I thought you meant an LSN-range-key tag. > outputs from the GCM encryption and is what provides the integrity / > authentication of the encrypted data to be able to detect if it's been > modified. Unfortunately, while the page checksum will continue to be > used and available for checking against disk corruption, it's not > sufficient. Hence, ideally, we'd find a spot to stick the 128-bit tag > on each page. Agreed. Would checksums be of any value with GCM? > Given that, clearly, it's not possible to go from an unencrypted cluster > to an encrypted cluster without rewriting the entire cluster, we aren't > bound to maintain the on-disk page format, we should be able to > accommodate including the tag somewhere. Unfortunately, it doesn't seem > quite as trivial as I'd hoped since there are parts of the code which > make assumptions about the page beyond perhaps what they should be, but > I'm still hopeful that it won't be *too* hard to do. OK, thanks. Are there other page improvements we should make when we are requiring a page rewrite? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Mon, Jan 11, 2021 at 02:19:22PM -0500, Stephen Frost wrote: > > outputs from the GCM encryption and is what provides the integrity / > > authentication of the encrypted data to be able to detect if it's been > > modified. Unfortunately, while the page checksum will continue to be > > used and available for checking against disk corruption, it's not > > sufficient. Hence, ideally, we'd find a spot to stick the 128-bit tag > > on each page. > > Agreed. Would checksums be of any value with GCM? The value would be to allow testing of the database integrity, to the amount allowed by the checksum, to be done without having access to the encryption keys, and because there's not much else we'd be using those bits for if we didn't. > > Given that, clearly, it's not possible to go from an unencrypted cluster > > to an encrypted cluster without rewriting the entire cluster, we aren't > > bound to maintain the on-disk page format, we should be able to > > accommodate including the tag somewhere. Unfortunately, it doesn't seem > > quite as trivial as I'd hoped since there are parts of the code which > > make assumptions about the page beyond perhaps what they should be, but > > I'm still hopeful that it won't be *too* hard to do. > > OK, thanks. Are there other page improvements we should make when we > are requiring a page rewrite? This is an interesting question but ultimately I don't think we should be looking at this from the perspective of allowing arbitrary changes to the page format. The challenge is that much of the page format, today, is defined by a C struct and changing the way that works would require a great deal of code to be modified and turn this into a massive effort, assuming we wish to have the same compiled binary able to work with both unencrypted and encrypted clusters, which I do believe is a requirement. The thought that I had was to, instead, try to figure out if we could fudge some space by, say, putting a 128-bit 'hole' at the end of the page and just move pd_special back, effectively making the page seem 'smaller' to all of the code that uses it, except for the code that knows how to do the decryption. I ran into some trouble with that but haven't quite sorted out what happened yet. Other ideas would be to put it before pd_special, or maybe somewhere else, but a lot depends on the code's expectations. Thanks, Stephen
Attachment
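A rough sketch of the page-layout arithmetic being described, assuming a 16-byte hole at the very end of the block; this is a hypothetical illustration of the approach, not code from the patch:

    #include "postgres.h"
    #include "storage/bufpage.h"

    #define PAGE_GCM_TAG_SIZE 16    /* 128-bit GCM authentication tag */

    /* only the encryption/decryption code would address the hole */
    #define PageGetGCMTag(page) \
        ((unsigned char *) (page) + BLCKSZ - PAGE_GCM_TAG_SIZE)

    /*
     * Like PageInit(), but pulls pd_special and pd_upper back by
     * PAGE_GCM_TAG_SIZE so ordinary page code never touches the last
     * 16 bytes of the block.  The trouble alluded to above is that any
     * code deriving the special-area size as "page size - pd_special"
     * now sees 16 extra bytes, so those assumptions need adjusting.
     */
    static void
    PageInitWithTagHole(Page page, Size specialSize)
    {
        PageHeader  p = (PageHeader) page;

        specialSize = MAXALIGN(specialSize);
        MemSet(page, 0, BLCKSZ);

        p->pd_lower = SizeOfPageHeaderData;
        p->pd_upper = BLCKSZ - specialSize - PAGE_GCM_TAG_SIZE;
        p->pd_special = BLCKSZ - specialSize - PAGE_GCM_TAG_SIZE;
        PageSetPageSizeAndVersion(page, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
    }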
On Tue, Jan 12, 2021 at 3:23 AM Stephen Frost <sfrost@snowman.net> wrote: > > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Mon, Jan 11, 2021 at 12:54:49PM -0500, Stephen Frost wrote: > > > Although, another approach and one that I've discussed a bit with Bruce, > > > is to have more keys- such as a key for temporary files, and perhaps > > > even a key for logged relations and a different for unlogged.. Or > > > > Yes, we have to make sure the nonce (computed as LSN/pageno) is never > > reused, so if we have several LSN usage "spaces", they need different > > data keys. > > Right, or ensure that the actual IV used is distinct (such as by using > another bit in the IV to distinguish logged-vs-unlogged), but it seems > saner to just use a different key, ultimately. Agreed. I think we also need to consider how to make sure the nonce is unique when making a page dirty by updating hint bits. A hint bit update changes the page contents but doesn't change the page lsn if we have already written a full-page image. In the PoC patch, I logged a dummy WAL record (XLOG_NOOP) just to move the page lsn forward, but since this is required even when the change is not the first one to the page since the last checkpoint, we might end up logging too many dummy WAL records. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
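A minimal sketch of the dummy-record approach described above, using PostgreSQL's WAL-insert API; a hypothetical illustration, not the PoC's actual code:

    #include "postgres.h"
    #include "access/xloginsert.h"
    #include "access/xlog_internal.h"
    #include "storage/bufpage.h"

    /*
     * Advance the page LSN without logging any real change, so the
     * LSN-based IV for this page is not reused after a hint-bit-only
     * update.  XLOG_NOOP records carry no meaningful payload.
     */
    static void
    bump_page_lsn(Page page)
    {
        char        dummy = 0;
        XLogRecPtr  lsn;

        XLogBeginInsert();
        XLogRegisterData(&dummy, sizeof(dummy));
        lsn = XLogInsert(RM_XLOG_ID, XLOG_NOOP);

        PageSetLSN(page, lsn);
    }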
On Tue, Jan 12, 2021 at 09:32:54AM +0900, Masahiko Sawada wrote: > On Tue, Jan 12, 2021 at 3:23 AM Stephen Frost <sfrost@snowman.net> wrote: > > Right, or ensure that the actual IV used is distinct (such as by using > > another bit in the IV to distinguish logged-vs-unlogged), but it seems > > saner to just use a different key, ultimately. > > Agreed. > > I think we also need to consider how to make sure the nonce is unique when > making a page dirty by updating hint bits. A hint bit update changes the > page contents but doesn't change the page lsn if we have already written > a full-page image. In the PoC patch, I logged a dummy WAL record > (XLOG_NOOP) just to move the page lsn forward, but since this is > required even when the change is not the first one to the page since the > last checkpoint, we might end up logging too many dummy WAL records. This says: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements wal_log_hints will be enabled automatically in encryption mode. Does that help? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Hi Stephen,
On Tue, Jan 12, 2021 at 10:47 AM Stephen Frost <sfrost@snowman.net> wrote:
> This is an interesting question but ultimately I don't think we should
> be looking at this from the perspective of allowing arbitrary changes to
> the page format. The challenge is that much of the page format, today,
> is defined by a C struct and changing the way that works would require a
> great deal of code to be modified and turn this into a massive effort,
> assuming we wish to have the same compiled binary able to work with both
> unencrypted and encrypted clusters, which I do believe is a requirement.
>
> The thought that I had was to, instead, try to figure out if we could
> fudge some space by, say, putting a 128-bit 'hole' at the end of the
> page and just move pd_special back, effectively making the page seem
> 'smaller' to all of the code that uses it, except for the code that
> knows how to do the decryption. I ran into some trouble with that but
> haven't quite sorted out what happened yet. Other ideas would be to put
> it before pd_special, or maybe somewhere else, but a lot depends on the
> code's expectations.
I agree that we should not make too many changes to affect the use of unencrypted clusters. But as a personal opinion only, I don't think it's a good idea to add some "implicit" tricks. To provide an inspiration, can we add a flag to mark whether the page format has been changed:
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -181,8 +185,9 @@ typedef PageHeaderData *PageHeader;
 #define PD_PAGE_FULL        0x0002  /* not enough free space for new tuple? */
 #define PD_ALL_VISIBLE      0x0004  /* all tuples on page are visible to
                                      * everyone */
+#define PD_PAGE_ENCRYPTED   0x0008  /* Is page encrypted? */
 
-#define PD_VALID_FLAG_BITS  0x0007  /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS  0x000F  /* OR of all valid pd_flags bits */
 
 /*
  * Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -389,6 +394,13 @@ PageValidateSpecialPointer(Page page)
 #define PageClearAllVisible(page) \
     (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
 
+#define PageIsEncrypted(page) \
+    (((PageHeader) (page))->pd_flags & PD_PAGE_ENCRYPTED)
+#define PageSetEncrypted(page) \
+    (((PageHeader) (page))->pd_flags |= PD_PAGE_ENCRYPTED)
+#define PageClearEncrypted(page) \
+    (((PageHeader) (page))->pd_flags &= ~PD_PAGE_ENCRYPTED)
+
 #define PageIsPrunable(page, oldestxmin) \
 ( \
     AssertMacro(TransactionIdIsNormal(oldestxmin)), \
In this way, I think it has little effect on the unencrypted cluster, and we can also modify the page format as we wish. Of course, it's also possible that I didn't understand your design correctly, or there's something wrong with my idea. :D
There is no royal road to learning.
HighGo Software Co.
On Tue, Jan 12, 2021 at 11:09 AM Bruce Momjian <bruce@momjian.us> wrote: > > On Tue, Jan 12, 2021 at 09:32:54AM +0900, Masahiko Sawada wrote: > > On Tue, Jan 12, 2021 at 3:23 AM Stephen Frost <sfrost@snowman.net> wrote: > > > Right, or ensure that the actual IV used is distinct (such as by using > > > another bit in the IV to distinguish logged-vs-unlogged), but it seems > > > saner to just use a different key, ultimately. > > > > Agreed. > > > > I think we also need to consider how to make sure the nonce is unique when > > making a page dirty by updating hint bits. A hint bit update changes the > > page contents but doesn't change the page lsn if we have already written > > a full-page image. In the PoC patch, I logged a dummy WAL record > > (XLOG_NOOP) just to move the page lsn forward, but since this is > > required even when the change is not the first one to the page since the > > last checkpoint, we might end up logging too many dummy WAL records. > > This says: > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements > > wal_log_hints will be enabled automatically in encryption mode. > > Does that help? IIUC it helps but not enough. When wal_log_hints is enabled, we write a full-page image when updating hint bits if it's the first change to the page since the last checkpoint. But I'm concerned about what happens if we change hint bits again after the page is flushed. We would mark the page as dirtied but not write any WAL, leaving the page lsn as it is. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
Greetings, * Neil Chen (carpenter.nail.cz@gmail.com) wrote: > On Tue, Jan 12, 2021 at 10:47 AM Stephen Frost <sfrost@snowman.net> wrote: > > This is an interesting question but ultimately I don't think we should > > be looking at this from the perspective of allowing arbitrary changes to > > the page format. The challenge is that much of the page format, today, > > is defined by a C struct and changing the way that works would require a > > great deal of code to be modified and turn this into a massive effort, > > assuming we wish to have the same compiled binary able to work with both > > unencrypted and encrypted clusters, which I do believe is a requirement. > > > > The thought that I had was to, instead, try to figure out if we could > > fudge some space by, say, putting a 128-bit 'hole' at the end of the > > page and just move pd_special back, effectively making the page seem > > 'smaller' to all of the code that uses it, except for the code that > > knows how to do the decryption. I ran into some trouble with that but > > haven't quite sorted out what happened yet. Other ideas would be to put > > it before pd_special, or maybe somewhere else, but a lot depends on the > > code's expectations. > > I agree that we should not make too many changes to affect the use of > unencrypted clusters. But as a personal opinion only, I don't think it's a > good idea to add some "implicit" tricks. To provide an inspiration, can we > add a flag to mark whether the page format has been changed: Sure, of course we could add such a flag, but I don't see how that would actually help with the issue? > In this way, I think it has little effect on the unencrypted cluster, and > we can also modify the page format as we wish. Of course, it's also > possible that I didn't understand your design correctly, or there's > something wrong with my idea. :D No, we can't 'modify the page format as we wish'- if we change away from using a C structure then we're going to be modifying quite a bit of code which otherwise doesn't need to be changed. The proposed flag doesn't actually make a different page format work, the only thing it would do would be to allow some parts of the cluster to be encrypted and other parts not be, but I don't know that that's actually a useful capability or a good reason to use one of those bits. Having it handled on a cluster level, at initdb time through pg_control, seems like it'd work just fine. Thanks, Stephen
Attachment
On Sun, Jan 10, 2021 at 09:51:16AM -0500, Bruce Momjian wrote: > On Sun, Jan 10, 2021 at 06:04:12PM +1300, Thomas Munro wrote: > > On Sun, Jan 10, 2021 at 3:45 PM Bruce Momjian <bruce@momjian.us> wrote: > > > Does anyone know why the cfbot applied the patch listed second first > > > here? > > > > > > http://cfbot.cputube.org/patch_31_2925.log > > > > > > Specifically, it applied hex..key.diff.gz before hex.diff.gz. I assumed > > > it would apply attachments in the order they appear in the email. > > > > It sorts the filenames (in this case after the decompression step removes > > the .gz endings). That works pretty well for the patches that "git > > format-patch" spits out, but it's a bit hit and miss with cases like > > yours. > > OK, here they are with numeric prefixes. It was actually tricky to > figure out how to create a squashed format-patch based on another branch. Here is an updated version built on top of Michael Paquier's patch posted here: https://www.postgresql.org/message-id/X/0IChOPHd+aYC1w@paquier.xyz and included as my first attachment. This will give Michael's patch cfbot testing too since the second attachment calls many of the first attachment's functions. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Attachment
On Tue, Jan 12, 2021 at 09:40:53PM +0900, Masahiko Sawada wrote: > > This says: > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements > > > > wal_log_hints will be enabled automatically in encryption mode. > > > > Does that help? > > IIUC it helps but not enough. When wal_log_hints is enabled, we write > a full-page image when updating hint bits if it's the first change to > the page since the last checkpoint. But I'm concerned about what happens > if we change hint bits again after the page is flushed. We would > mark the page as dirtied but not write any WAL, leaving the page lsn > as it is. I updated the wiki to be: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements wal_log_hints will be enabled automatically in encryption mode. However, more than one hint change between checkpoints does not cause WAL activity, which would cause the same LSN to be used for different page images. I think one big question is that, since we are using a streaming cipher, do we care about hint bit changes showing to users? I actually don't know. If we do, some kind of dummy LSN record might be required, as you suggested. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On 2021-01-12 13:03:14 -0500, Bruce Momjian wrote: > I think one big question is that, since we are using a streaming cipher, > do we care about hint bit changes showing to users? I actually don't > know. If we do, some kind of dummy LSN record might be required, as you > suggested. That'd lead to a *massive* increase of WAL record volume. It's one thing to WAL log hint bit writes once per page per checkpoint. It's another to do so on every single hint bit write.
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, Jan 12, 2021 at 09:40:53PM +0900, Masahiko Sawada wrote: > > > This says: > > > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements > > > > > > wal_log_hints will be enabled automatically in encryption mode. > > > > > > Does that help? > > > > IIUC it helps but not enough. When wal_log_hints is enabled, we write > > a full-page image when updating hint bits if it's the first change to > > the page since the last checkpoint. But I'm concerned about what happens > > if we change hint bits again after the page is flushed. We would > > mark the page as dirtied but not write any WAL, leaving the page lsn > > as it is. > > I updated the wiki to be: > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements > > wal_log_hints will be enabled automatically in encryption mode. However, > more than one hint change between checkpoints does not cause WAL > activity, which would cause the same LSN to be used for different page > images. > > I think one big question is that, since we are using a streaming cipher, > do we care about hint bit changes showing to users? I actually don't > know. If we do, some kind of dummy LSN record might be required, as you > suggested. I don't think there's any doubt that we need to make sure that the IV is distinct and advancing the LSN to get a new one when needed for this case seems like it's probably the way to do that. Hint bit change visibility to users isn't really at issue here- we can't use the same IV multiple times. The two options that we have are to either not actually update the hint bit in such a case, or to make sure to change the LSN/IV. Another option would be to, if we're able to make a hole to put the GCM tag on to the page somewhere, further widen that hole to include an additional space for a counter that would be mixed into the IV, to avoid having to do an XLOG NOOP. Thanks, Stephen
Attachment
On Tue, Jan 12, 2021 at 01:11:29PM -0500, Stephen Frost wrote: > > I think one big question is that, since we are using a streaming cipher, > > do we care about hint bit changes showing to users? I actually don't > > know. If we do, some kind of dummy LSN record might be required, as you > > suggested. > > I don't think there's any doubt that we need to make sure that the IV is > distinct and advancing the LSN to get a new one when needed for this > case seems like it's probably the way to do that. Hint bit change > visibility to users isn't really at issue here- we can't use the same IV > multiple times. The two options that we have are to either not actually > update the hint bit in such a case, or to make sure to change the > LSN/IV. Another option would be to, if we're able to make a hole to put > the GCM tag on to the page somewhere, further widen that hole to include > an additional space for a counter that would be mixed into the IV, to > avoid having to do an XLOG NOOP. Well, we have eight unused bits in the IV, so we could just increment that for every hint bit change that uses the same LSN, and then force a dummy WAL record when that 8-bit counter overflows --- that seems simpler than logging hint bits. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, Jan 12, 2021 at 01:11:29PM -0500, Stephen Frost wrote: > > > I think one big question is that, since we are using a streaming cipher, > > > do we care about hint bit changes showing to users? I actually don't > > > know. If we do, some kind of dummy LSN record might be required, as you > > > suggested. > > > > I don't think there's any doubt that we need to make sure that the IV is > > distinct and advancing the LSN to get a new one when needed for this > > case seems like it's probably the way to do that. Hint bit change > > visibility to users isn't really at issue here- we can't use the same IV > > multiple times. The two options that we have are to either not actually > > update the hint bit in such a case, or to make sure to change the > > LSN/IV. Another option would be to, if we're able to make a hole to put > > the GCM tag on to the page somewhere, further widen that hole to include > > an additional space for a counter that would be mixed into the IV, to > > avoid having to do an XLOG NOOP. > > Well, we have eight unused bits in the IV, so we could just increment > that for every hint bit change that uses the same LSN, and then force a > dummy WAL record when that 8-bit counter overflows --- that seems > simpler than logging hint bits. Sure, as long as we have a place to store that information.. We need to have the full IV available when we go to decrypt the page. Thanks, Stephen
On Tue, Jan 12, 2021 at 01:15:44PM -0500, Bruce Momjian wrote: > On Tue, Jan 12, 2021 at 01:11:29PM -0500, Stephen Frost wrote: > > I don't think there's any doubt that we need to make sure that the IV is > > distinct and advancing the LSN to get a new one when needed for this > > case seems like it's probably the way to do that. Hint bit change > > visibility to users isn't really at issue here- we can't use the same IV > > multiple times. The two options that we have are to either not actually > > update the hint bit in such a case, or to make sure to change the > > LSN/IV. Another option would be to, if we're able to make a hole to put > > the GCM tag on to the page somewhere, further widen that hole to include > > an additional space for a counter that would be mixed into the IV, to > > avoid having to do an XLOG NOOP. > > Well, we have eight unused bits in the IV, so we could just increment > that for every hint bit change that uses the same LSN, and then force a > dummy WAL record when that 8-bit counter overflows --- that seems > simpler than logging hint bits. Sorry, I was incorrect. The IV is 16 bytes, made up of the LSN (8 bytes), and the page number (4 bytes). That leaves 4 bytes unused or 2^32 values for hint bit changes before we have to generate a dummy LSN record. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
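To make that layout concrete, here is a minimal sketch of how such a 16-byte IV could be assembled; the field order and the counter argument are illustrative assumptions, not the committed design:

    #include <stdint.h>
    #include <string.h>

    /*
     * Illustrative sketch only: build a 16-byte AES IV from the page LSN
     * (8 bytes), the block number (4 bytes), and the 4 spare bytes used
     * as a counter.  Every (LSN, block, counter) combination must be
     * unique, because reusing an IV under the same key breaks CTR/GCM
     * security.
     */
    static void
    build_page_iv(uint8_t iv[16], uint64_t page_lsn,
                  uint32_t block_num, uint32_t counter)
    {
        memcpy(iv, &page_lsn, sizeof(page_lsn));        /* bytes 0-7 */
        memcpy(iv + 8, &block_num, sizeof(block_num));  /* bytes 8-11 */
        memcpy(iv + 12, &counter, sizeof(counter));     /* bytes 12-15 */
    }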
On Tue, Jan 12, 2021 at 01:44:05PM -0500, Stephen Frost wrote: > * Bruce Momjian (bruce@momjian.us) wrote: > > Well, we have eight unused bits in the IV, so we could just increment > > that for every hint bit change that uses the same LSN, and then force a > > dummy WAL record when that 8-bit counter overflows --- that seems > > simpler than logging hint bits. > > Sure, as long as we have a place to store that information.. We need to > have the full IV available when we go to decrypt the page. Oh, yeah, we would need that counter recorded since previously the IV was made up of already-recorded information, i.e., the page LSN and page number. However, the reason we don't always WAL-log hint bits is that we can afford to lose them, but in this case, any counter we need to store will need to be WAL-logged since we can't afford to lose that counter value for decryption --- that gets us back to WAL-logging something during hint bit changes. :-( -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, Jan 12, 2021 at 01:44:05PM -0500, Stephen Frost wrote: > > * Bruce Momjian (bruce@momjian.us) wrote: > > > Well, we have eight unused bits in the IV, so we could just increment > > > that for every hint bit change that uses the same LSN, and then force a > > > dummy WAL record when that 8-bit counter overflows --- that seems > > > simpler than logging hint bits. > > > > Sure, as long as we have a place to store that information.. We need to > > have the full IV available when we go to decrypt the page. > > Oh, yeah, we would need that counter recorded since previously the IV > was made up of already-recorded information, i.e., the page LSN and page > number. However, the reason we don't always WAL-log hint bits is that > we can afford to lose them, but in this case, any counter we need to > store will need to be WAL-logged since we can't afford to lose that > counter value for decryption --- that gets us back to WAL-logging > something during hint bit changes. :-( I don't think that's actually the case..? The hole I'm talking about is there exclusively for post-encryption storage of the tag and maybe this part of the IV and would be zero'd out in the FPIs that actually go into the WAL (which would be encrypted with the WAL key, not the data key). All we would need to be confident of is that if the page with the hint bit update gets encrypted and written out, the IV counter gets incremented and also written out as part of that write. Thanks, Stephen
On Tue, Jan 12, 2021 at 01:57:11PM -0500, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Tue, Jan 12, 2021 at 01:44:05PM -0500, Stephen Frost wrote: > > > * Bruce Momjian (bruce@momjian.us) wrote: > > > > Well, we have eight unused bits in the IV, so we could just increment > > > > that for every hint bit change that uses the same LSN, and then force a > > > > dummy WAL record when that 8-bit counter overflows --- that seems > > > > simpler than logging hint bits. > > > > > > Sure, as long as we have a place to store that information.. We need to > > > have the full IV available when we go to decrypt the page. > > > > Oh, yeah, we would need that counter recorded since previously the IV > > was made up of already-recorded information, i.e., the page LSN and page > > number. However, the reason we don't always WAL-log hint bits is that > > we can afford to lose them, but in this case, any counter we need to > > store will need to be WAL-logged since we can't afford to lose that > > counter value for decryption --- that gets us back to WAL-logging > > something during hint bit changes. :-( > > I don't think that's actually the case..? The hole I'm talking about is > there exclusively for post-encryption storage of the tag and maybe this > part of the IV and would be zero'd out in the FPIs that actually go into > the WAL (which would be encrypted with the WAL key, not the data key). > All we would need to be confident of is that if the page with the hint > bit update gets encrypted and written out, the IV counter gets > incremented and also written out as part of that write. OK, got it. I have added this to the wiki: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements wal_log_hints will be enabled automatically in encryption mode. However, more than one hint bit change between checkpoints does not cause WAL activity, which would cause the same LSN to be used for different page images. This means we need a page-stored counter, to be used in the four unused bytes of the IV. This prevents multiple page writes during the same checkpoint interval from using the same IV. Counter changes do not need to be WAL-logged since we either get the page from the WAL (which is encrypted with the WAL key, not the data key), or from disk, which is durable. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
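A rough sketch of the rule being described, assuming hypothetical PageGetIVCounter()/PageSetIVCounter() accessors for the four spare IV bytes stored on the page (these helpers are not in the patch):

    /*
     * Hypothetical sketch, not the patch's code (assumes the usual
     * postgres bufpage.h/xlogdefs.h definitions): before a page is
     * encrypted for write-out, refresh its stored IV counter.  If the
     * LSN did not change since the last write (e.g. only hint bits
     * changed), a new counter value gives a fresh IV; the counter is
     * stored on the page itself, so it survives without WAL-logging.
     */
    static void
    refresh_page_iv_counter(Page page, XLogRecPtr last_written_lsn)
    {
        if (PageGetLSN(page) == last_written_lsn)
            PageSetIVCounter(page, PageGetIVCounter(page) + 1);
        else
            PageSetIVCounter(page, 0);  /* new LSN restarts the counter */
    }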
Hi, On 2021-01-11 20:12:00 +0900, Masahiko Sawada wrote: > diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c > index 32b5d62e1f..d474af753c 100644 > --- a/contrib/bloom/blinsert.c > +++ b/contrib/bloom/blinsert.c > @@ -177,6 +177,7 @@ blbuildempty(Relation index) > * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need > * this even when wal_level=minimal. > */ > + PageEncryptInplace(metapage, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO); > PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO); > smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO, > (char *) metapage, true); There's quite a few places doing encryption + checksum + smgrwrite now. I strongly suggest splitting that off into a helper routine in a preparatory patch. > @@ -528,6 +529,8 @@ BootstrapModeMain(void) > > InitPostgres(NULL, InvalidOid, NULL, InvalidOid, NULL, false); > > + InitializeBufferEncryption(); > > /* Initialize stuff for bootstrap-file processing */ > for (i = 0; i < MAXATTR; i++) > { Why are we initializing this here instead of in the postmaster? As far as I can tell that just leads to redundant work instead of doing it once? > +/*------------------------------------------------------------------------- > + * We use both page LSN and page number to create a nonce for each page. Page > + * LSN is 8 byte, page number is 4 byte, and the maximum required counter for > + * AES-CTR is 2048, which fits in 3 byte. Since the length of IV is 16 byte > + * it's fine. Using the LSN and page number as part of the nonce has > + * three benefits: > + * > + * 1. We don't need to decrypt/re-encrypt during CREATE DATABASE since the page > + * contents are the same in both places, and once one database changes its pages, > + * it gets a new LSN, and hence a new nonce. > + * 2. For each change of an 8k page, we get a new nonce, so we are not encrypting > + * different data with the same nonce/IV. > + * 3. We avoid requiring pg_upgrade to preserve database oids, tablespace oids, > + * relfilenodes. I think 3) also has a few minor downsides: by not including information identifying a relation, a potential attacker with access to the data directory has more chances to get the database to decrypt data by e.g. switching relation files around. > @@ -2792,12 +2793,15 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) > */ > bufBlock = BufHdrGetBlock(buf); > > + bufToWrite = PageEncryptCopy((Page) bufBlock, buf->tag.forkNum, > + buf->tag.blockNum); > + > /* > * Update page checksum if desired. Since we have only shared lock on the > * buffer, other processes might be updating hint bits in it, so we must > * copy the page to private storage if we do checksumming. > */ > - bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum); > + bufToWrite = PageSetChecksumCopy((Page) bufToWrite, buf->tag.blockNum); > > if (track_io_timing) > INSTR_TIME_SET_CURRENT(io_start); So now we copy the page twice, not just once, if both checksums and encryption are enabled? That doesn't seem right. > @@ -3677,6 +3683,21 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std) > { > dirtied = true; /* Means "will be dirtied by this action" */ > > + /* > + * We will dirty the page but the page lsn is not changed if we > + * doesn't write a backup block. We don't want to encrypt the > + * different bits stream with the same combination of nonce and key > + * since in buffer encryption the page lsn is a part of nonce.
> + * Therefore we WAL-log no-op record just to move page lsn forward if > + * we doesn't write a backup block, even when this is not the first > + * modification in this checkpoint round. > + */ > + if (XLogRecPtrIsInvalid(lsn) && DataEncryptionEnabled()) > + { > + lsn = log_noop(); > + Assert(!XLogRecPtrIsInvalid(lsn)); > + } > + Aren't you doing a WAL record while holding the buffer header lock here? You can't do things like WAL insertions while holding a spinlock. I don't see how it is safe / correct to use a noop record here. A noop record isn't associated with the page, so WAL replay isn't going to perform the same LSN modification. Also, why is it OK to modify the LSN without, if necessary, logging an FPI? > +char * > +PageEncryptCopy(Page page, ForkNumber forknum, BlockNumber blkno) > +{ > + static char *pageCopy = NULL; > + > + /* If we don't need a checksum, just return the passed-in data */ > + if (PageIsNew(page) || !PageNeedsToBeEncrypted(forknum)) > + return (char *) page; Why is it OK to not encrypt new pages? > +#define PageEncryptOffset offsetof(PageHeaderData, pd_special) > +#define SizeOfPageEncryption (BLCKSZ - PageEncryptOffset) I think you need a detailed explanation somewhere about what you're doing here, and why it's a good idea. Greetings, Andres Freund
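As a sketch of the helper routine Andres suggests, the repeated encrypt + checksum + write sequence could collapse into something like the following; the name and signature are mine, not from the patch:

    /*
     * Hypothetical helper bundling the three calls repeated in the
     * quoted hunks; callers such as blbuildempty() would then make a
     * single call.  Sketch only, not the patch's actual code.
     */
    static void
    PageEncryptChecksumWrite(SMgrRelation reln, ForkNumber forknum,
                             BlockNumber blkno, Page page, bool skipFsync)
    {
        PageEncryptInplace(page, forknum, blkno);
        PageSetChecksumInplace(page, blkno);
        smgrwrite(reln, forknum, blkno, (char *) page, skipFsync);
    }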
Thank you for your reply,
On Wed, Jan 13, 2021 at 12:08 AM Stephen Frost <sfrost@snowman.net> wrote:
> No, we can't 'modify the page format as we wish' - if we change away from
> using a C structure then we're going to be modifying quite a bit of
> code which otherwise doesn't need to be changed. The proposed flag
> doesn't actually make a different page format work, the only thing it
> would do would be to allow some parts of the cluster to be encrypted and
> other parts not be, but I don't know that that's actually a useful
> capability or a good reason to use one of those bits. Having it handled
> on a cluster level, at initdb time through pg_control, seems like it'd
> work just fine.
Yes, I realized that for cluster-level encryption it would be unwise to flag a single page (unless we want to do it at the relation level). Forgive me for not describing it clearly: by 'modify the page' I meant the method you mentioned, not modifying the C structure. My original motivation was to avoid storing data in an unconventional format that has no description in the C structure. However, as I just said, it seems that we should not set the flag for a single page. Maybe it's enough to just add a comment description?
On Tue, Jan 12, 2021 at 01:46:53PM -0500, Bruce Momjian wrote: > On Tue, Jan 12, 2021 at 01:15:44PM -0500, Bruce Momjian wrote: > > Well, we have eight unused bits in the IV, so we could just increment > > that for every hint bit change that uses the same LSN, and then force a > > dummy WAL record when that 8-bit counter overflows --- that seems > > simpler than logging hint bits. > > Sorry, I was incorrect. The IV is 16 bytes, made up of the LSN (8 > bytes), and the page number (4 bytes). That leaves 4 bytes unused or > 2^32 values for hint bit changes before we have to generate a dummy LSN > record. I just did a massive update to the Transparent Data Encryption wiki page to make it more readable and updated it with current decisions: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Tue, Jan 12, 2021 at 12:04:09PM -0500, Bruce Momjian wrote: > On Sun, Jan 10, 2021 at 09:51:16AM -0500, Bruce Momjian wrote: > > OK, here they are with numeric prefixes. It was actually tricky to > > figure out how to create a squashed format-patch based on another branch. > > Here is an updated version built on top of Michael Paquier's patch > posted here: > > https://www.postgresql.org/message-id/X/0IChOPHd+aYC1w@paquier.xyz > > and included as my first attachment. This will give Michael's patch > cfbot testing too since the second attachment calls many of the first > attachment's functions. Now that Michael's hex encoding patch is committed, I am reposting my key management patch without Michael's patch. It is improved since the mid-December version: * TAP tests for encryption/decryption, wrapped key creation and decryption, and KEK rotation * built on top of new hex encoding functions in /common * passes cfbot testing * handles a disabled OpenSSL library properly * handles Windows builds properly I also learned a lot about format-patch, cfbot testing, and TAP tests. :-) It still can't test everything, like prompting from /dev/tty. Also, if we don't get data encryption into PG 14, we are going to need to hide the user interface for some of this until it is useful. Prompting from /dev/tty for the TLS private key passphrase already works and will be a useful PG 14 feature, so that part of the API will be visible in PG 14. I am planning to apply this next week. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
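For readers unfamiliar with the wrapped-key terminology above, the underlying operation is standard AES key wrap (RFC 3394); a generic OpenSSL sketch, deliberately not the patch's code and with error handling trimmed, looks like this:

    #include <openssl/evp.h>

    /*
     * Generic sketch of wrapping a data encryption key (DEK) with a
     * 32-byte key encryption key (KEK).  dek_len must be a multiple of
     * 8 bytes and at least 16; the wrapped output is dek_len + 8 bytes.
     * Returns 1 on success.  Not the patch's actual code.
     */
    static int
    wrap_key(const unsigned char *kek,
             const unsigned char *dek, int dek_len,
             unsigned char *wrapped, int *wrapped_len)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int len, total;

        /* OpenSSL requires explicitly allowing key-wrap modes */
        EVP_CIPHER_CTX_set_flags(ctx, EVP_CIPHER_CTX_FLAG_WRAP_ALLOW);
        if (EVP_EncryptInit_ex(ctx, EVP_aes_256_wrap(), NULL, kek, NULL) != 1 ||
            EVP_EncryptUpdate(ctx, wrapped, &len, dek, dek_len) != 1)
            return 0;
        total = len;
        if (EVP_EncryptFinal_ex(ctx, wrapped + total, &len) != 1)
            return 0;
        total += len;
        *wrapped_len = total;
        EVP_CIPHER_CTX_free(ctx);
        return 1;
    }

The extra eight bytes carry an integrity block, so unwrapping with the wrong KEK fails cleanly; presumably that is the sort of check a server can use to validate the KEK at startup without storing the KEK itself.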
On Fri, Jan 15, 2021 at 3:49 PM Bruce Momjian <bruce@momjian.us> wrote: > I am planning to apply this next week. I don't think that's appropriate. Several prominent community members have told you that the patch, as committed the first time, needed a lot more work. There hasn't been enough time between then and now for you, or anyone, to do that amount of work. This patch needs detailed and substantial review from senior community members, and multiple rounds of feedback and improvement, before it should be considered for commit. I am not even sure there is a consensus on the design, without which any commit is always premature. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 15, 2021 at 04:23:22PM -0500, Robert Haas wrote: > On Fri, Jan 15, 2021 at 3:49 PM Bruce Momjian <bruce@momjian.us> wrote: > > I am planning to apply this next week. > > I don't think that's appropriate. Several prominent community members > have told you that the patch, as committed the first time, needed a > lot more work. There hasn't been enough time between then and now for > you, or anyone, to do that amount of work. This patch needs detailed > and substantial review from senior community members, and multiple > rounds of feedback and improvement, before it should be considered for > commit. > > I am not even sure there is a consensus on the design, without which > any commit is always premature. If people want changes, I need to hear about it here. I have addressed everything people have mentioned in these threads so far. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Fri, Jan 15, 2021 at 4:47 PM Bruce Momjian <bruce@momjian.us> wrote: > If people want changes, I need to hear about it here. I have address > everything people have mentioned in these threads so far. That does not match my perception of the situation. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 15, 2021 at 2:59 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 15, 2021 at 4:47 PM Bruce Momjian <bruce@momjian.us> wrote:
> > If people want changes, I need to hear about it here. I have addressed
> > everything people have mentioned in these threads so far.
> That does not match my perception of the situation.
Looking at the Commitfest, there are three authors and no reviewers. Given the previous incident, at minimum each of the people listed in the Commitfest should add their approval of committing this patch to this thread. And while committers get some leeway, in this case having a non-author review and sign-off on it being ready-to-commit seems like it should be required.
David J.
On Fri, Jan 15, 2021 at 04:59:17PM -0500, Robert Haas wrote: > On Fri, Jan 15, 2021 at 4:47 PM Bruce Momjian <bruce@momjian.us> wrote: > > If people want changes, I need to hear about it here. I have addressed > > everything people have mentioned in these threads so far. > > That does not match my perception of the situation. Well, that's not very specific, is it? You might be confusing the POC data encryption patch that was posted in this thread with the key management patch that I am working on. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Hi, On 2021-01-15 16:47:19 -0500, Bruce Momjian wrote: > On Fri, Jan 15, 2021 at 04:23:22PM -0500, Robert Haas wrote: > > On Fri, Jan 15, 2021 at 3:49 PM Bruce Momjian <bruce@momjian.us> wrote: > > I don't think that's appropriate. Several prominent community members > > have told you that the patch, as committed the first time, needed a > > lot more work. There hasn't been enough time between then and now for > > you, or anyone, to do that amount of work. This patch needs detailed > > and substantial review from senior community members, and multiple > > rounds of feedback and improvement, before it should be considered for > > commit. > > > > I am not even sure there is a consensus on the design, without which > > any commit is always premature. > > If people want changes, I need to hear about it here. I have addressed > everything people have mentioned in these threads so far. I don't even know how anybody is supposed to realistically review the design or the patch: This thread started at https://postgr.es/m/20210101045047.GB30966%40momjian.us - there's no reference to any discussion of the design at all and the supposed links to code are dead. The last version of the code that I see posted ([1]) has the useless commit message of "key squash commit" - nothing else. There's no design documentation included in the patch either, as far as I can tell. Manually searching for the topic brings me to https://www.postgresql.org/message-id/20201202213814.GG20285%40momjian.us , a thread of 52 messages, which provides a bit more context, but largely just references another thread and a wiki article. The link to the other thread is into the middle of a 112-message thread. The wiki page doesn't really describe a design either. It has a very long todo, a bunch of implementation details, but no design. Nor did 978f869b99 include much in the way of design description. You cannot expect anybody to review a patch if developing some basic understanding of the intended design requires reading hundreds of messages in which the design evolved. And I don't think it's acceptable to push it due to lack of further feedback, given this situation - the lack of design description is a blocker in itself. There's a few things that stand out on a very very brief scan: - the patch badly needs to be split up into independently reviewable pieces - tests: - wait, a .sh test script? No, we shouldn't add any more of those, they're a nightmare across platforms - Do the tests actually do anything useful? It's not clear to me what they are trying to achieve. En/Decrypting test vectors doesn't seem to buy that much? - the new pg_alterckey is completely untested - the pg_upgrade path is untested - .. - Without further comment BootStrapKmgr() does "copy cluster file encryption keys from an old cluster?", but there's no explanation as to why / when that's the case. Presumably pg_upgrade, but, uh, explain that. - pg_alterckey.c - appears to create its own cluster lock file, using its own routine for doing so. How does that lock file interact with the running server? - retrieve_cluster_keys() is missing (void). I think this is at the very least a month away from being committable, even if the design were completely correct (which I do not know, see above). Greetings, Andres Freund [1] https://www.postgresql.org/message-id/20210115204926.GD8740%40momjian.us
On Fri, Jan 15, 2021 at 02:37:56PM -0800, Andres Freund wrote: > On 2021-01-15 16:47:19 -0500, Bruce Momjian wrote: > > > I am not even sure there is a consensus on the design, without which > > > any commit is always premature. > > > > If people want changes, I need to hear about it here. I have addressed > > everything people have mentioned in these threads so far. > > I don't even know how anybody is supposed to realistically review the > design or the patch: > > This thread started at > https://postgr.es/m/20210101045047.GB30966%40momjian.us - there's no > reference to any discussion of the design at all and the supposed links > to code are dead. You have to understand cryptography and Postgres internals to understand the design, and I don't think it is realistic to explain that all to the community. We did much of this in voice calls over months because it was too much of a burden to explain all the cryptographic details so everyone could follow along. > The last version of the code that I see posted ([1]) has the useless > commit message of "key squash commit" - nothing else. There's no design > documentation included in the patch either, as far as I can tell. > > Manually searching for the topic brings me to > https://www.postgresql.org/message-id/20201202213814.GG20285%40momjian.us > , a thread of 52 messages, which provides a bit more context, but > largely just references another thread and a wiki article. The link to > the other thread is into the middle of a 112-message thread. > > The wiki page doesn't really describe a design either. It has a very > long todo, a bunch of implementation details, but no design. I am not sure what design document you are requesting. I thought the TODO was that. > Nor did 978f869b99 include much in the way of design description. > > You cannot expect anybody to review a patch if developing some basic > understanding of the intended design requires reading hundreds of > messages in which the design evolved. And I don't think it's acceptable > to push it due to lack of further feedback, given this situation - the > lack of design description is a blocker in itself. OK, I will just move on to something else then. It is not worth the feature to go into that kind of discussion again. I am willing to have voice calls with individuals to explain the logic, but repeatedly explaining it to the entire group I find unproductive. I don't think another 400-email thread would help anyone. > There's a few things that stand out on a very very brief scan: > - the patch badly needs to be split up into independently reviewable > pieces I can do that, but there are enough complaints above that I feel it would not be worth it. > - tests: > - wait, a .sh test script? No, we shouldn't add any more of those, > they're a nightmare across platforms The script originated from pg_upgrade. I don't know how to do things like initdb and stuff another way, at least in our code. > - Do the tests actually do anything useful? It's not clear to me what > they are trying to achieve. En/Decrypting test vectors doesn't seem to > buy that much? Uh, that's because the key manager doesn't do anything useful yet. > - the new pg_alterckey is completely untested Wow, I was so excited about testing the data keys that I forgot to add the pg_alterckey tests. My tests had that already. I have added it to the attached patch. > - the pg_upgrade path is untested Uh, I was waiting until we were actually encrypting some data to test that. > - ..
> - Without further comment BootStrapKmgr() does "copy cluster file > encryption keys from an old cluster?", but there's no explanation as > to why / when that's the case. Presumably pg_upgrade, but, uh, explain > that. Uh, the heap/index files are, in the future, encrypted with the keys of the old cluster, so we just copy them to the new cluster and they keep working. Potentially we could replace the WAL key at that point since we don't move WAL from the old cluster to the new one, but we also need a command-line tool to do that, so I figured I would just wait for that to be done. > - pg_alterckey.c > - appears to create its own cluster lock file, using its > own routine for doing so. How does that lock file interact with the > running server? pg_alterckey runs fine while the cluster is running, which is why I used a new lock file. The keys are only read at db boot time. > - retrieve_cluster_keys() is missing (void). Oops, fixed. > I think this is at the very least a month away from being committable, > even if the design were completely correct (which I do not know, see > above). Those comments were very helpful, and I could certainly use more feedback on the patch. Updated patch attached. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Hi, On 2021-01-15 19:21:32 -0500, Bruce Momjian wrote: > On Fri, Jan 15, 2021 at 02:37:56PM -0800, Andres Freund wrote: > > On 2021-01-15 16:47:19 -0500, Bruce Momjian wrote: > > > > I am not even sure there is a consensus on the design, without which > > > > any commit is always premature. > > > > > > If people want changes, I need to hear about it here. I have addressed > > > everything people have mentioned in these threads so far. > > > > I don't even know how anybody is supposed to realistically review the > > design or the patch: > > > > This thread started at > > https://postgr.es/m/20210101045047.GB30966%40momjian.us - there's no > > reference to any discussion of the design at all and the supposed links > > to code are dead. > > You have to understand cryptography and Postgres internals to understand > the design, and I don't think it is realistic to explain that all to the > community. We did much of this in voice calls over months because it > was too much of a burden to explain all the cryptographic details so > everyone could follow along. I think that's not at all acceptable. I don't mind hashing out details on calls / off-list, but the design needs to be public, documented, and reviewable. And if it's something the community can't understand, then it can't get in. We're going to have to maintain this going forward. I don't mean to say that we need to re-hash all design details from scratch - but that there needs to be an explanation somewhere that describes what's being done on a medium-high level, and what drove those design decisions. > > The last version of the code that I see posted ([1]) has the useless > > commit message of "key squash commit" - nothing else. There's no design > > documentation included in the patch either, as far as I can tell. > > > > Manually searching for the topic brings me to > > https://www.postgresql.org/message-id/20201202213814.GG20285%40momjian.us > > , a thread of 52 messages, which provides a bit more context, but > > largely just references another thread and a wiki article. The link to > > the other thread is into the middle of a 112-message thread. > > > > The wiki page doesn't really describe a design either. It has a very > > long todo, a bunch of implementation details, but no design. > > I am not sure what design document you are requesting. I thought the > TODO was that. The TODO in https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements is a design document? > > Nor did 978f869b99 include much in the way of design description. > > > > You cannot expect anybody to review a patch if developing some basic > > understanding of the intended design requires reading hundreds of > > messages in which the design evolved. And I don't think it's acceptable > > to push it due to lack of further feedback, given this situation - the > > lack of design description is a blocker in itself. > > OK, I will just move on to something else then. It is not worth the > feature to go into that kind of discussion again. I am willing to have > voice calls with individuals to explain the logic, but repeatedly > explaining it to the entire group I find unproductive. I don't think > another 400-email thread would help anyone. Explaining something over voice doesn't help with people in a year or five trying to understand the code and the design, so they can adapt it when making half-related changes. Nor do I see why another 400-email thread would be a necessary consequence of you explaining the design that you came up with.
This isn't specific to this topic? I don't really understand why this specific feature gets to avoid normal community development processes? > > - tests: > > - wait, a .sh test script? No, we shouldn't add any more of those, > > they're a nightmare across platforms > > The script originated from pg_upgrade. I don't know how to do things > like initdb and stuff another way, at least in our code. We have had perl tap tests for quite a while now? And all new tests that aren't regression / isolation tests are expected to be written in it. Greetings, Andres Freund
On Fri, Jan 15, 2021 at 04:56:24PM -0800, Andres Freund wrote: > On 2021-01-15 19:21:32 -0500, Bruce Momjian wrote: > > You have to understand cryptography and Postgres internals to understand > > the design, and I don't think it is realistic to explain that all to the > > community. We did much of this in voice calls over months because it > > was too much of a burden to explain all the cryptographic details so > > everyone could follow along. > > I think that's not at all acceptable. I don't mind hashing out details > on calls / off-list, but the design needs to be public, documented, and > reviewable. And if it's something the community can't understand, then > it can't get in. We're going to have to maintain this going forward. OK, so we don't want it. That's fine with me. > I don't mean to say that we need to re-hash all design details from > scratch - but that there needs to be an explanation somewhere that > describes what's being done on a medium-high level, and what drove those > design decisions. I thought the TODO list was that, and the email threads. > > > The wiki page doesn't really describe a design either. It has a very > > > long todo, a bunch of implementation details, but no design. > > > > I am not sure what design document you are requesting. I thought the > > TODO was that. > > The TODO in https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements > is a design document? Yes. > > > Nor did 978f869b99 include much in the way of design description. > > > > > > You cannot expect anybody to review a patch if developing some basic > > > understanding of the intended design requires reading hundreds of > > > messages in which the design evolved. And I don't think it's acceptable > > > to push it due to lack of further feedback, given this situation - the > > > lack of design description is a blocker in itself. > > > > OK, I will just move on to something else then. It is not worth the > > feature to go into that kind of discussion again. I am willing to have > > voice calls with individuals to explain the logic, but repeatedly > > explaining it to the entire group I find unproductive. I don't think > > another 400-email thread would help anyone. > > Explaining something over voice doesn't help with people in a year or > five trying to understand the code and the design, so they can adapt it > when making half-related changes. Nor do I see why another 400 email > thread would be a necessary consequence of you explaining the design > that you came up with. I have underestimated the amount of discussion this has required repeatedly, and I don't want to make that mistake again. > This isn't specific to this topic? I don't really understand why this > specific feature gets to avoid normal community development processes? What is being avoided? > > > - tests: > > > - wait, a .sh test script? No, we shouldn't add any more of those, > > > they're a nightmare across platforms > > > > The script originated from pg_upgrade. I don't know how to do things > > like initdb and stuff another way, at least in our code. > > We have had perl tap tests for quite a while now? And all new tests that > aren't regression / isolation tests are expected to be written in it. What Perl tap tests run initdb and manage the cluster? I didn't find any. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Hi, On 2021-01-15 20:49:10 -0500, Bruce Momjian wrote: > On Fri, Jan 15, 2021 at 04:56:24PM -0800, Andres Freund wrote: > > On 2021-01-15 19:21:32 -0500, Bruce Momjian wrote: > > > You have to understand cryptography and Postgres internals to understand > > > the design, and I don't think it is realistic to explain that all to the > > > community. We did much of this in voice calls over months because it > > > was too much of a burden to explain all the cryptographic details so > > > everyone could follow along. > > > > I think that's not at all acceptable. I don't mind hashing out details > > on calls / off-list, but the design needs to be public, documented, and > > reviewable. And if it's something the community can't understand, then > > it can't get in. We're going to have to maintain this going forward. > > OK, so we don't want it. That's fine with me. That's not what I said... > > This isn't specific to this topic? I don't really understand why this > > specific feature gets to avoid normal community development processes? > > What is being avoided? You previously pushed a patch without tests; now you want to push a patch that was barely reviewed and also doesn't contain an explanation of the design. I mean: > > > You have to understand cryptography and Postgres internals to understand > > > the design, and I don't think it is realistic to explain that all to the > > > community. We did much of this in voice calls over months because it > > > was too much of a burden to explain all the cryptographic details so > > > everyone could follow along. really is very far from the normal community process. Again, how is this supposed to be maintained in the future, if it's based on a design that's only understandable to the people on those phone calls? > > We have had perl tap tests for quite a while now? And all new tests that > > aren't regression / isolation tests are expected to be written in it. > > What Perl tap tests run initdb and manage the cluster? I didn't find > any. find . -name '*.pl'|xargs grep 'use PostgresNode;' should give you a nearly complete list. Greetings, Andres Freund
On Fri, Jan 15, 2021 at 08:20:36PM -0800, Andres Freund wrote: > On 2021-01-15 20:49:10 -0500, Bruce Momjian wrote: >> What Perl tap tests run initdb and manage the cluster? I didn't find >> any. > > find . -name '*.pl'|xargs grep 'use PostgresNode;' > > should give you a nearly complete list. Just to add that all the perl modules we use for the tests are within src/test/perl/. The coolest tests are within src/bin/ and src/test/. -- Michael
> > > I think that's not at all acceptable. I don't mind hashing out details > > > on calls / off-list, but the design needs to be public, documented, and > > > reviewable. And if it's something the community can't understand, then > > > it can't get in. We're going to have to maintain this going forward. > > > > OK, so we don't want it. That's fine with me. > > That's not what I said... > I think the majority of us believe that it is important we take this first step towards a solid TDE implementation in PostgreSQL that is built around the community processes, which involve general consensus. Before this feature falls into the "we will never do it because we will never build consensus" category and community PostgreSQL potentially gets locked out of more deployment scenarios that require this feature, I would like to see if I can help with this current attempt at it. I will share that I am concerned that if the people who have been involved in this to date can't get this in, it will never happen. Admittedly I am a novice on this topic, and the majority of the PostgreSQL source code, however I am hopeful enough (those of you who know me understand that I suffer from eternal optimism) that I am going to attempt to help. Is there a design document for a Postgres feature of this size and scope that people feel would serve as a good example? Alternatively, is there a design document template that has been successfully used in the past? I could guess based on things I have observed reading this list for many years. However, if there is something that those who are deeply involved in the development effort feel would suffice as an example of a "good design document" or a "good design template", sharing it would be greatly appreciated.
On Sun, Jan 17, 2021 at 5:38 AM Tom Kincaid <tomjohnkincaid@gmail.com> wrote: > > > > > I think that's not at all acceptable. I don't mind hashing out details > > > > on calls / off-list, but the design needs to be public, documented, and > > > > reviewable. And if it's something the community can't understand, then > > > > it can't get in. We're going to have to maintain this going forward. > > > > > > OK, so we don't want it. That's fine with me. > > > > That's not what I said... > > > > > I think the majority of us believe that it is important we take this > first step towards a solid TDE implementation in PostgreSQL that is > built around the community processes, which involve general consensus. > > Before this feature falls into the "we will never do it because we > will never build consensus" category and community PostgreSQL > potentially gets locked out of more deployment scenarios that require > this feature, I would like to see if I can help with this current > attempt at it. I will share that I am concerned that if the people who > have been involved in this to date can't get this in, it will never > happen. > > Admittedly I am a novice on this topic, and the majority of the > PostgreSQL source code, however I am hopeful enough (those of you who > know me understand that I suffer from eternal optimism) that I am > going to attempt to help. > > Is there a design document for a Postgres feature of this size and > scope that people feel would serve as a good example? Alternatively, > is there a design document template that has been successfully used in > the past? > We normally write the design considerations and choices we made with the reasons in README and code comments. Personally, I am not sure if there is a need for any specific document per se, but a README and detailed comments in the code should suffice for what people are worried about here. It is mostly from the perspective that other developers reading the code, wanting to fix bugs, or later enhancing that code should be able to understand it. One recent example I can give is Peter's work on bottom-up deletion [1] which I have read today where I find that the design is captured via README, appropriate comments in the code, and documentation. This feature is quite different and probably a lot more new concepts are being introduced, but I hope that will give you some clue. [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf -- With Regards, Amit Kapila.
Hi, On 2021-01-17 11:54:57 +0530, Amit Kapila wrote: > On Sun, Jan 17, 2021 at 5:38 AM Tom Kincaid <tomjohnkincaid@gmail.com> wrote: > > Admittedly I am a novice on this topic, and the majority of the > > PostgreSQL source code, however I am hopeful enough (those of you who > > know me understand that I suffer from eternal optimism) that I am > > going to attempt to help. > > > > Is there a design document for a Postgres feature of this size and > > scope that people feel would serve as a good example? Alternatively, > > is there a design document template that has been successfully used in > > the past? > > > > We normally write the design considerations and choices we made with > the reasons in README and code comments. Personally, I am not sure if > there is a need for any specific document per se, but a README and > detailed comments in the code should suffice for what people are worried > about here. Right. It could be a README file, or a long comment at the start of one of the files. It doesn't matter too much. What matters is that people who haven't been on those phone calls can understand the design and the implications it has. > It is mostly from the perspective that other developers > reading the code, wanting to fix bugs, or later enhancing that code > should be able to understand it. I'd add the perspective of code reviewers as well. > One recent example I can give is > Peter's work on bottom-up deletion [1] which I have read today where I > find that the design is captured via README, appropriate comments in > the code, and documentation. This feature is quite different and > probably a lot more new concepts are being introduced, but I hope that > will give you some clue. > > [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf This is a great example. Greetings, Andres Freund
On Fri, Jan 15, 2021 at 7:56 PM Andres Freund <andres@anarazel.de> wrote: > I think that's not at all acceptable. I don't mind hashing out details > on calls / off-list, but the design needs to be public, documented, and > reviewable. And if it's something the community can't understand, then > it can't get in. We're going to have to maintain this going forward. I agree. If the community is unable to clearly understand what something is, and why we should have it, then we shouldn't have it -- even if the reason is that we're too dumb to understand, as Bruce seems to be alleging. I don't really think I believe the theory that community members by and large are too dumb to understand encryption. Many features have provoked long and painful discussions about the design and yet got into the tree in the end with documentation of that design, and I don't see why that couldn't be done for this one, too. I think it can and should, and the fact that the work hasn't been done is one of several blockers for this patch. But even if I'm wrong, and the real problem is that everyone except the select group of people on these off-list phone calls are too stupid to understand this, then that's still a reason not to accept the patch. The code that's in our source tree is maintained by communal effort, and that means communal understanding is important. Frankly, it's more important in this particular case than in some others. TDE is in great demand, so if it gets into the tree, it's likely to get a lot of use. The preparatory patches, such as this one, would at that point be getting a lot of use, too. That means many people, not just hackers, will have to understand them and answer questions about them. They are also likely to get a lot of scrutiny from a security point of view, so we should have a way that we can be confident that we know why we believe them to be secure. If a security researcher shows up and says "your stuff is broken," we are not going to get away with "no it isn't, because we discussed it on a Friday call with a closed group of people and decided it was OK." Our reasoning is going to have to be documented. That doesn't guarantee that it will be correct, but makes it possible to distinguish between defects in design, defects in particular parts of the code, and non-defects, which is otherwise impossible. Meanwhile, even if security researchers are as happy with our TDE implementation as they could possibly be, a feature that changes the on-disk format can't be allowed to erase our ability to solve other problems with the database. Databases using TDE are still going to have corruption, for example, but now a corrupted page has a good chance of being completely unreadable rather than just garbled. You certainly aren't going to be able to just run pg_filedump on it. I think even if we do a great job explaining to everybody what impact TDE and its preparatory patches are likely to have on the system, there's likely to be a lot of cases where users blame the database for eating their data when the real culprit is the OS or the hardware, just because such cases are bound to get harder to investigate, which could have a very negative effect on the perceptions of PostgreSQL's quality. But if the TDE itself is magic that only designated smart people on special calls can understand, then it's going to be far worse, because that'll mean when any kind of TDE problem comes up, nobody else can help debug anything. While I would like to have TDE in PostgreSQL, I would not like to have it on those terms.
-- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Jan 17, 2021 at 07:50:13PM -0500, Robert Haas wrote: > On Fri, Jan 15, 2021 at 7:56 PM Andres Freund <andres@anarazel.de> wrote: > > I think that's not at all acceptable. I don't mind hashing out details > > on calls / off-list, but the design needs to be public, documented, and > > reviewable. And if it's something the community can't understand, then > > it can't get in. We're going to have to maintain this going forward. > > I agree. If the community is unable to clearly understand what > something is, and why we should have it, then we shouldn't have it -- > even if the reason is that we're too dumb to understand, as Bruce I am not sure why you are bringing intelligence into this discussion. You have to understand Postgres internals and cryptography tradeoffs to understand why some of the design decisions were made. It is a knowledge issue, not an intelligence issue. The wiki page is the result of those phone discussions. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Sun, Jan 17, 2021 at 11:54:57AM +0530, Amit Kapila wrote: > > Is there a design document for a Postgres feature of this size and > > scope that people feel would serve as a good example? Alternatively, > > is there a design document template that has been successfully used in > > the past? > > We normally write the design considerations and choices we made with > the reasons in README and code comments. Personally, I am not sure if > there is a need for any specific document per se, but a README and > detailed comments in the code should suffice for what people are worried > about here. It is mostly from the perspective that other developers > reading the code, wanting to fix bugs, or later enhancing that code > should be able to understand it. One recent example I can give is > Peter's work on bottom-up deletion [1] which I have read today where I > find that the design is captured via README, appropriate comments in > the code, and documentation. This feature is quite different and > probably a lot more new concepts are being introduced, but I hope that > will give you some clue. > > [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf OK, I looked at that and it is good, and I see my patch is missing that. Are people looking for me to take the wiki content, expand on it and tie it to the code that will be applied, or something else like all the various crypto options and why we chose what we did beyond what is already on the wiki? I can easily go from what we have on the wiki to implementation code steps, but the other part is harder to explain and that is why I offered to talk to people via voice. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Sat, Jan 16, 2021 at 10:58:47PM -0800, Andres Freund wrote: > Hi, > > On 2021-01-17 11:54:57 +0530, Amit Kapila wrote: > > On Sun, Jan 17, 2021 at 5:38 AM Tom Kincaid <tomjohnkincaid@gmail.com> wrote: > > > Admittedly I am a novice on this topic, and the majority of the > > > PostgreSQL source code, however I am hopeful enough (those of you who > > > know me understand that I suffer from eternal optimism) that I am > > > going to attempt to help. > > > > > > Is there a design document for a Postgres feature of this size and > > > scope that people feel would serve as a good example? Alternatively, > > > is there a design document template that has been successfully used in > > > the past? > > > > > > > We normally write the design considerations and choices we made with > > the reasons in README and code comments. Personally, I am not sure if > > there is a need for any specific document per se, but a README and > > detailed comments in the code should suffice for what people are worried > > about here. > > Right. It could be a README file, or a long comment at the start of one of > the files. It doesn't matter too much. What matters is that people who > haven't been on those phone calls can understand the design and the > implications it has. OK, so does the wiki page contain most of what you want, but is missing the connection between the design and the code? https://wiki.postgresql.org/wiki/Transparent_Data_Encryption -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Mon, Jan 18, 2021 at 10:50:37AM -0500, Bruce Momjian wrote: > OK, I looked at that and it is good, and I see my patch is missing that. > Are people looking for me to take the wiki content, expand on it and tie > it to the code that will be applied, or something else like all the > various crypto options and why we chose what we did beyond what is > already on the wiki? I can easily go from what we have on the wiki to > implementation code steps, but the other part is harder to explain and > that is why I offered to talk to people via voice. Just to clarify why voice calls can be helpful --- if you have to get into "you have to understand X to understand Y", that's where a voice call works best, because understanding X will require understanding A/B/C, and everyone's missing pieces are different, so you have to customize it for the individual. You can explain some of this in a README, but trying to cover all of it leads to a combinatorial problem of trying to explain everything. Ideally the wiki page can be expanded so people can ask and answer all posted issues, perhaps in a Q&A format. Someone could go through the archives and post why certain decisions were made, and link to the original emails. I have to admit I was kind of baffled that the wiki page wasn't sufficient, because it is one of the longest Postgres feature explanations I have seen, but I now think the missing part is tying the wiki contents to the code implementation. If that is it, please confirm. If it is something else, also explain. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Hi, On 2021-01-18 12:06:35 -0500, Bruce Momjian wrote: > On Mon, Jan 18, 2021 at 10:50:37AM -0500, Bruce Momjian wrote: > > OK, I looked at that and it is good, and I see my patch is missing that. > > Are people looking for me to take the wiki content, expand on it and tie > > it to the code that will be applied, or something else like all the > > various crypto options and why we chose what we did beyond what is > > already on the wiki? I can easily go from what we have on the wiki to > > implementation code steps, but the other part is harder to explain and > > that is why I offered to talk to people via voice. > > Just to clarify why voice calls can be helpful --- if you have to get > into "you have to understand X to understand Y", that's where a voice > call works best, because understanding X will require understanding > A/B/C, and everyone's missing pieces are different, so you have to > customize it for the individual. I don't think anybody argued against having voice calls. > You can explain some of this in a README, but trying to cover all of it > leads to a combinatorial problem of trying to explain everything. > Ideally the wiki page can be expanded so people can ask and answer all > posted issues, perhaps in a Q&A format. Someone could go through the > archives and post why certain decisions were made, and link to the > original emails. > > I have to admit I was kind of baffled that the wiki page wasn't > sufficient, because it is one of the longest Postgres feature > explanations I have seen, but I now think the missing part is tying > the wiki contents to the code implementation. If that is it, please > confirm. If it is something else, also explain. I don't think the wiki right now covers what's needed. The "Overview", "Threat model" and "Scope of TDE" are a start, but beyond that it's missing a bunch of things. And it's not in the source tree (we'll soon have multiple versions of postgres with increasing levels of TDE features, the wiki doesn't help with that) Missing: - talks about cluster-wide encryption being simpler, without mentioning what it's being compared to, and what makes it simpler - no differentiation from file system / block level encryption - there's no explanation of which/why specific crypto primitives were chosen, what the tradeoffs are - no explanation of which keys exist, stored where - the key management patch introduces new files, not documented - there's new types of lock files, possibility of interrupted operations, ... - no documentation of what that means - there's no documentation of what "key wrapping" actually precisely is, what the danger of the two-tier model is, ... - are there dangers in not encrypting zero pages etc? - ... Personally, but I admit that there's legitimate reasons to differ on that note, I don't think it's reasonable for a feature this invasive to commit preliminary patches without the major subsequent patches being in a shape that allows reviewing the whole picture. Greetings, Andres Freund
On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote: > Personally, but I admit that there's legitimate reasons to differ on > that note, I don't think it's reasonable for a feature this invasive to > commit preliminary patches without the major subsequent patches being in > a shape that allows reviewing the whole picture. OK, if that is a requirement, I can't help anymore since there are already complaints that the patch is too large to review, even if broken into pieces. Please let me know what the community decides. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
> > I have to admit I was kind of baffled that the wiki page wasn't > > sufficient, because it is one of the longest Postgres feature > > explanations I have seen, but I now think the missing part is tying > > the wiki contents to the code implementation. If that is it, please > > confirm. If it is something else, also explain. > > I don't think the wiki right now covers what's needed. The "Overview", > "Threat model" and "Scope of TDE" are a start, but beyond that it's > missing a bunch of things. And it's not in the source tree (we'll soon > have multiple versions of postgres with increasing levels of TDE > features, the wiki doesn't help with that) > Thanks, the versioning issue makes sense for the design document needing to be part of the source tree. As I was reading the README for the patch Amit referenced and as I am going through this patch, I feel the desire to incorporate diagrams. Are design diagrams ever incorporated in the source tree as a part of the design description of a feature? If not, any concerns about doing that? I think that is likely where I can contribute the most. > Missing: > - talks about cluster-wide encryption being simpler, without mentioning > what it's being compared to, and what makes it simpler > - no differentiation from file system / block level encryption > - there's no explanation of which/why specific crypto primitives were > chosen, what the tradeoffs are > - no explanation of which keys exist, stored where > - the key management patch introduces new files, not documented > - there's new types of lock files, possibility of interrupted > operations, ... - no documentation of what that means > - there's no documentation of what "key wrapping" actually precisely is, > what the danger of the two-tier model is, ... > - are there dangers in not encrypting zero pages etc? > - ... > Some of the missing things you mention above are about the design of the TDE feature in general. However, this patch is about Key Management, which is going to be part of the larger TDE feature. So it feels as though there is a need for a general design document about the overall vision / approach for TDE and a specific design doc for Key Management. Is it appropriate to include both of those in the same patch? Something along the lines of: here is the overall design of TDE, and here is how the Key Management portion is designed and implemented. I guess in that case, follow-on patches for TDE could refer to the overall design described in this patch. > > > Personally, but I admit that there's legitimate reasons to differ on > that note, I don't think it's reasonable for a feature this invasive to > commit preliminary patches without the major subsequent patches being in > a shape that allows reviewing the whole picture. > > Greetings, > > Andres Freund -- Thomas John Kincaid
On 2021-01-18 13:58:20 -0500, Bruce Momjian wrote: > On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote: > > Personally, but I admit that there are legitimate reasons to differ on > > that note, I don't think it's reasonable for a feature this invasive to > > commit preliminary patches without the major subsequent patches being in > > a shape that allows reviewing the whole picture. > > OK, if that is a requirement, I can't help anymore since there are > already complaints that the patch is too large to review, even if broken > into pieces. Please let me know what the community decides. Those aren't conflicting demands. Having later patches around to validate the design of earlier patches doesn't necessitate that the later patches be reviewed at the same time.
On Mon, Jan 18, 2021 at 2:00 PM Tom Kincaid <tomjohnkincaid@gmail.com> wrote: > Some of the missing things you mention above are about the design of > the TDE feature in general. However, this patch is about Key Management, > which is going to be part of the larger TDE feature. So it feels as though > there is the need for a general design document about the overall > vision / approach for TDE and a specific design doc for Key > Management. Is it appropriate to include both of those in the same > patch? To me, it wouldn't make sense to commit a full README for a TDE feature that we don't have yet with a key management patch, but the way that they'll interact with each other has to be clear. The doc/database-encryption.sgml file that Bruce included in the patch is a decent start on explaining the design, though I think it needs more work and more details, perhaps including some of the things Andres mentioned. To be honest, after reading over that SGML documentation a bit, I'm somewhat skeptical about whether it really makes sense to think about committing the key management part separately. It seems to have no use independent of the main feature, and it in fact embeds very specific details of how the main feature is expected to work. For example, the documentation says that key #0 will be used for data files, and key #1 for WAL. There seems to be no suggestion that the key management portion of this can be used to manage encryption keys generally for whatever purposes someone might have; it's all about the needs of a particular TDE implementation. Typically, we would not commit something like that separately, or only once the main patch was done, with the two commits occurring in a relatively short time period. Otherwise, as Bruce already noted, we can end up with something that is documented and visible to users but doesn't actually work yet.

Some more specific comments on data-encryption.sgml:

* The documentation explains that the purpose of having a WAL key separate from the data file key is so that the data file keys can "eventually" be rotated. It's not clear whether this means that we might eventually have that feature or that we might eventually be able to rotate, after failing over. If this kind of thing is possible, we'll eventually need documentation on how to do it.

* The reasons for using a DEK and a KEK are not explained. I realize it's not an uncommon practice and that other systems do it, but I think a few sentences of explanation wouldn't be a bad idea. Even if we are supposing that hackers who want to have input into this feature have to be knowledgeable about cryptography, I don't think we can reasonably suppose that for users.

* "For example" is at one point followed by a period rather than a colon or comma.

* In the "Internals" subsection, the last sentence doesn't seem to be grammatical. I wonder if it's missing the word "or".

* The part about integrity-checking keys on startup isn't clear. It makes it sound like we still have a copy of the KEK lying around someplace against which we can compare, which I assume is not the case since it would be really insecure.

* I think it's going to be pretty important that we can easily switch to other cryptographic algorithms as they are discovered, so I don't like the fact that this is tied specifically to AES. (In fact, kmgr_utils.h makes it sound like we're specifically locked into AES256, but that contradicts the documentation, so I guess there's some clarification needed here about what exactly KMGR_CLUSTER_KEY_LEN is doing.)
As far as possible we should try to make this generic, like supporting any cipher that SSL has which has property X. It seems relatively inevitable that every currently popular cryptographic algorithm will at some point in the future be judged weak and worthless, just as has already happened with MD5 and some variants of SHA, both of which used to be considered state of the art. It seems equally inevitable that new and stronger algorithms will continue to be devised, and we'll want to adopt those easily. I'm not sure to what extent this is a serious flaw in the patch and to what extent it's just a matter of tweaking the wording of some things, but I think this is actually an extremely critical design point where we had better be certain we've got it right. Few things would be sadder than to get a TDE feature and then have to rip it out again because it couldn't be upgraded to work with newer crypto algorithms with reasonable effort.

Notes on other parts of the documentation:

* The documentation for initdb -K doesn't list the valid values of the parameter, only the default. Probably we should be specifying an algorithm here and not just a bit count. Otherwise, like I say above, what happens when AES gives way to something else? It'd be easy to say -K BFT256 instead of -K AES256, but if AES is assumed and it's no longer what we want, then we have problems. This kind of thing probably needs to be cleaned up in a bunch of places.

* I don't see the point of saying "a passphrase or PIN." We don't need to document that your passphrase might happen to only contain digits.

* pg_alterckey's description of "repair" is hard to understand. It doesn't really explain why or how this would be necessary, and it begs the question of why we'd ever leave things in a state that requires repair. This is sketched out in code comments elsewhere, but I think at least some of it needs to be explained in the documentation as well. (Incidentally, I don't think the comments at the top of recover_failure will survive a visit from pgindent, though I might be wrong about that.)

* The changes to config.sgml say "Sample script" instead of "Sample scripts".

* I don't think that the documentation of %R is very clear, or adequate for someone to make effective use of it. If I wanted to use %R, how would I ensure that a value is available?

* The changes to allfiles.sgml add pg_alterckey.sgml in the wrong place and include an incorrect whitespace change.

* It's odd that "pg_alterckey" describes itself as "technically" changing the KEK. Isn't that just what it does, not a technicality? I imagine we'll ultimately need a way to change a DEK as well, because otherwise the use of a separate key for the WAL wouldn't accomplish the intended goal.

-- Robert Haas EDB: http://www.enterprisedb.com
I met with Bruce and Stephen this afternoon to discuss the feedback we received so far (prior to Robert's note, which I haven't fully digested yet) on this patch. Here is what we plan to do:

1) Bruce is going to gather all the details from the Wiki and build a README for the TDE Key Management patch. In addition, it will include details about the implementation, the data structures involved and the locks that are taken, and the general technical implementation approach.

2) Stephen is going to write up the overall design of TDE. Between these two patches, we hope to cover what Andres is asking for and what Robert is asking for in his reply on this thread. Stephen's documentation patch will also make reference to Neil Chen's TDE prototype for making use of this Key Management patch to encrypt and decrypt heap pages as well as index pages. https://www.postgresql.org/message-id/CAA3qoJ=qtO5JcSBjqFDBT9iKUX9XKmC5bXCrd7rysE+XSMEuTg@mail.gmail.com

3) Tom will work to find somebody who will sign up as a reviewer upon the next submission of this patch (somebody who is not an author).

Could we get feedback if this feels like enough to get this patch (which will include just the Key Management portion of TDE) to a state where it can be reviewed and, assuming the review issues are resolved with consensus, be committed? On Mon, Jan 18, 2021 at 2:00 PM Andres Freund <andres@anarazel.de> wrote: > > On 2021-01-18 13:58:20 -0500, Bruce Momjian wrote: > > On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote: > > > Personally, but I admit that there are legitimate reasons to differ on > > > that note, I don't think it's reasonable for a feature this invasive to > > > commit preliminary patches without the major subsequent patches being in > > > a shape that allows reviewing the whole picture. > > > > OK, if that is a requirement, I can't help anymore since there are > > already complaints that the patch is too large to review, even if broken > > into pieces. Please let me know what the community decides. > > Those aren't conflicting demands. Having later patches around to > validate the design of earlier patches doesn't necessitate that the > later patches be reviewed at the same time. -- Thomas John Kincaid
On Mon, Jan 18, 2021 at 04:38:47PM -0500, Robert Haas wrote: > To me, it wouldn't make sense to commit a full README for a TDE > feature that we don't have yet with a key management patch, but the > way that they'll interact with each other has to be clear. The > doc/database-encryption.sgml file that Bruce included in the patch is > a decent start on explaining the design, though I think it needs more > work and more details, perhaps including some of the things Andres > mentioned. Sure. > To be honest, after reading over that SGML documentation a bit, I'm > somewhat skeptical about whether it really makes sense to think about > committing the key management part separately. It seems to have no use > independent of the main feature, and it in fact embeds very specific For usefulness, it does enable passphrase prompting for the TLS private key. > details of how the main feature is expected to work. For example, the > documentation says that key #0 will be used for data files, and key #1 > for WAL. There seems to be no suggestion that the key management > portion of this can be used to manage encryption keys generally for > whatever purposes someone might have; it's all about the needs of a > particular TDE implementation. Typically, we would not commit We originally were going to have SQL-level keys, but many felt they weren't useful. > something like that separately, or only once the main patch was done, > with the two commits occurring in a relatively short time period. > Otherwise, as Bruce already noted, we can end up with something that > is documented and visible to users but doesn't actually work yet. Yep, that is the risk. > Some more specific comments on data-encryption.sgml: > > * The documentation explains that the purpose of having a WAL key > separate from the data file key is so that the data file keys can > "eventually" be rotated. It's not clear whether this means that we > might eventually have that feature or that we might eventually be able > to rotate, after failing over. If this kind of thing is possible, > we'll eventually need documentation on how to do it. I have clarified that by saying "future release". > * The reasons for using a DEK and a KEK are not explained. I realize > it's not an uncommon practice and that other systems do it, but I > think a few sentences of explanation wouldn't be a bad idea. Even if > we are supposing that hackers who want to have input into this feature > have to be knowledgeable about cryptography, I don't think we can > reasonably suppose that for users. I added a little about that in the docs. > * "For example" is at one point followed by a period rather than a > colon or comma. Fixed. > * In the "Internals" subsection, the last sentence doesn't seem to be > grammatical. I wonder if it's missing the word "or". Fixed. > * The part about integrity-checking keys on startup isn't clear. It > makes it sound like we still have a copy of the KEK lying around > someplace against which we can compare, which I assume is not the case > since it would be really insecure. I reworded that entire section. See if it is better now. > * I think it's going to be pretty important that we can easily switch > to other cryptographic algorithms as they are discovered, so I don't > like the fact that this is tied specifically to AES. (In fact, > kmgr_utils.h makes it sound like we're specifically locked into > AES256, but that contradicts the documentation, so I guess there's > some clarification needed here about what exactly KMGR_CLUSTER_KEY_LEN > is doing.)
> As far as possible we should try to make this generic, like > supporting any cipher that SSL has which has property X. It seems > relatively inevitable that every currently popular cryptographic > algorithm will at some point in the future be judged weak and > worthless, just as has already happened with MD5 and some variants of > SHA, both of which used to be considered state of the art. It seems > equally inevitable that new and stronger algorithms will continue to > be devised, and we'll want to adopt those easily. That is a nifty idea. Right now I just pass the integer length around, and store it in pg_control, but if we define macros, we can easily abstract this and easily allow for new methods. If others like that, I will start on it now. > I'm not sure to what extent this is a serious flaw in the patch and to > what extent it's just a matter of tweaking the wording of some things, > but I think this is actually an extremely critical design point where > we had better be certain we've got it right. Few things would be > sadder than to get a TDE feature and then have to rip it out again > because it couldn't be upgraded to work with newer crypto algorithms > with reasonable effort. Yep. > Notes on other parts of the documentation: > > * The documentation for initdb -K doesn't list the valid values of the > parameter, only the default. Probably we should be specifying an Fixed. > algorithm here and not just a bit count. Otherwise, like I say above, > what happens when AES gives way to something else? It'd be easy to say > -K BFT256 instead of -K AES256, but if AES is assumed and it's no > longer what we want, then we have problems. This kind of thing probably > needs to be cleaned up in a bunch of places. Again, I can do that if people like it. > * I don't see the point of saying "a passphrase or PIN." We don't need > to document that your passphrase might happen to only contain digits. Well, PIN is what the Yubikey and PIV devices call it, so I thought I should give specific examples of inputs. > * pg_alterckey's description of "repair" is hard to understand. It > doesn't really explain why or how this would be necessary, and it begs > the question of why we'd ever leave things in a state that requires > repair. This is sketched out in code comments elsewhere, but I think > at least some of it needs to be explained in the documentation as > well. (Incidentally, I don't think the comments at the top of > recover_failure will survive a visit from pgindent, though I might be > wrong about that.) Fixed with rewording. Better? > * The changes to config.sgml say "Sample script" instead of "Sample scripts". Fixed. > * I don't think that the documentation of %R is very clear, or > adequate for someone to make effective use of it. If I wanted to use > %R, how would I ensure that a value is available? Fixed, use -R on server start. > * The changes to allfiles.sgml add pg_alterckey.sgml in the wrong > place and include an incorrect whitespace change. Uh, the whitespace change was to align the column. I will review and push that separately. > * It's odd that "pg_alterckey" describes itself as "technically" > changing the KEK. Isn't that just what it does, not a technicality? I > imagine we'll ultimately need a way to change a DEK as well, because > otherwise the use of a separate key for the WAL wouldn't accomplish > the intended goal. "technically" removed. I kind of wanted to say "in detail" or something like that, but removing the word is fine.
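To sketch what that abstraction might look like (hypothetical layout; the committed form may differ), the supported methods could live in a single table, with pg_control storing the array index rather than a raw bit count:

    /* one entry per supported cluster file encryption method */
    typedef struct encryption_method
    {
        const char *name;           /* what initdb -K would accept */
        int         key_bytes;      /* data key length in bytes */
        int         block_bytes;    /* cipher block size in bytes */
    } encryption_method;

    static const encryption_method encryption_methods[] = {
        {"AES128", 16, 16},
        {"AES256", 32, 16},
        /* future, non-AES methods are added here */
    };

initdb -K AES256 would then just look up the name in the table, and a new algorithm becomes a one-line addition rather than a change scattered across the code.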
Change-only patch attached so you can see the changes more easily. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Attachment
On Mon, Jan 18, 2021 at 05:47:34PM -0500, Tom Kincaid wrote: > I met with Bruce and Stephen this afternoon to discuss the feedback > we received so far (prior to Robert's note, which I haven't fully > digested yet) > on this patch. > > Here is what we plan to do: > > 1) Bruce is going to gather all the details from the Wiki and build a > README for the TDE Key Management patch. In addition, it will include > details about the implementation, the data structures involved and the > locks that are taken, and the general technical implementation approach. ... > Could we get feedback if this feels like enough to get this patch > (which will include just the Key Management portion of TDE) to a state > where it can be reviewed and, assuming the review issues are resolved > with consensus, be committed? Attached is an updated patch that has the requested changes:

* broken into seven parts
* test script converted from shell to Perl
* added README for every new directory
* moved text from wiki to READMEs where appropriate
* included Robert's suggestions, including the ability to add future non-AES crypto methods
* fixes for pg_alterckey PGDATA arg processing

The patch is attached, and is also here: https://github.com/postgres/postgres/compare/master...bmomjian:key.patch

Questions:

* What changes do people want to this patch set?
* Do we want it applied, even though it might need to be hidden for PG 14?
* If not, how do people build on this patch? Using the commitfest links or github URL?

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Attachment
In patch 1,

* The docs are not clear on what happens if --auth-prompt is not given but an auth prompt is required for the program to work. Should it exit with a status other than 0?

* BootStrapKmgr claims it is called by initdb, but that doesn't seem to be the case.

* Also, BootStrapKmgr is the only one that checks USE_OPENSSL; what if a with-openssl build inits the datadir, and then a non-openssl runs it? What if it's the other way around? I think you'd get a failure in stat() ...

* ... oh, KMGR_DIR_PID is used but not defined anywhere. Is it defined in some later commit? If so, then I think you've chosen to split the patch series wrong.

May I suggest to use "git format-patch" to produce the patch files? When working with a series like this, trying to do patch handling manually like you seem to be doing, is much more time-consuming and error prone. For example, with a branch containing individual commits, you could use

    git rebase -i origin/master -x "make install check-world"

or similar, so that each commit is built and tested individually. -- Álvaro Herrera Valdivia, Chile In the beginning there was UNIX, and UNIX spoke and said: "Hello world\n". It did not say "Hello New Jersey\n", nor "Hello USA\n".
On Mon, Jan 25, 2021 at 08:12:01PM -0300, Álvaro Herrera wrote: > In patch 1, > > * The docs are not clear on what happens if --auth-prompt is not given > but an auth prompt is required for the program to work. Should it exit > with a status other than 0? Uh, I think the docs talk about this:

    It can prompt from the terminal if <option>--authprompt</option> is
    used. In the parameter value, <literal>%R</literal> is replaced by a
    file descriptor number opened to the terminal that started the
    server. A file descriptor is only available if enabled at server
    start via <option>-R</option>. If <literal>%R</literal> is specified
    and no file descriptor is available, the server will not start.

The code is:

    case 'R':
        {
            char        fd_str[20];

            if (terminal_fd == -1)
            {
                ereport(ERROR,
                        (errcode(ERRCODE_INTERNAL_ERROR),
                         errmsg("cluster key command referenced %%R, but --authprompt not specified")));
            }

Does that help? > * BootStrapKmgr claims it is called by initdb, but that doesn't seem to > be the case. Well, initdb starts the postmaster in --boot mode, and that calls BootStrapKmgr(). Does that help? > * Also, BootStrapKmgr is the only one that checks USE_OPENSSL; what if a > with-openssl build inits the datadir, and then a non-openssl runs it? > What if it's the other way around? I think you'd get a failure in > stat() ... Wow, I never considered that. I have added a check to InitializeKmgr(). Thanks. > * ... oh, KMGR_DIR_PID is used but not defined anywhere. Is it defined > in some later commit? If so, then I think you've chosen to split the > patch series wrong. OK, fixed. It is in include/common/kmgr_utils.h, which was in #3. > May I suggest to use "git format-patch" to produce the patch files? When > working with a series like this, trying to do patch handling manually > like you seem to be doing, is much more time-consuming and error prone. > For example, with a branch containing individual commits, you could use > git rebase -i origin/master -x "make install check-world" > or similar, so that each commit is built and tested individually. I used "git format-patch". Are you asking for seven commits that then generate seven files via one format-patch run? Or is the primary issue that you want compile testing for each patch? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Mon, Jan 25, 2021 at 07:09:44PM -0500, Bruce Momjian wrote: > > May I suggest to use "git format-patch" to produce the patch files? When > > working with a series like this, trying to do patch handling manually > > like you seem to be doing, is much more time-consuming and error prone. > > For example, with a branch containing individual commits, you could use > > git rebase -i origin/master -x "make install check-world" > > or similar, so that each commit is built and tested individually. > > I used "git format-patch". Are you asking for seven commits that then > generate seven files via one format-patch run? Or is the primary issue > that you want compile testing for each patch? The attached patch meets both criteria. I also clarified the README on how initdb calls those functions. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Attachment
On Mon, Jan 25, 2021 at 10:27:18PM -0500, Bruce Momjian wrote: > On Mon, Jan 25, 2021 at 07:09:44PM -0500, Bruce Momjian wrote: > > > May I suggest to use "git format-patch" to produce the patch files? When > > > working with a series like this, trying to do patch handling manually > > > like you seem to be doing, is much more time-consuming and error prone. > > > For example, with a branch containing individual commits, you could use > > > git rebase -i origin/master -x "make install check-world" > > > or similar, so that each commit is built and tested individually. > > > > I used "git format-patch". Are you asking for seven commits that then > > generate seven files via one format-patch run? Or is the primary issue > > that you want compile testing for each patch? > > The attached patch meets both criteria. I also clarified the README on > how initdb calls those functions. This version fixes OpenSSL detection and improves docs for initdb interactions. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Attachment
On Tue, Jan 26, 2021 at 11:15 AM Bruce Momjian <bruce@momjian.us> wrote: > This version fixes OpenSSL detection and improves docs for initdb > interactions. Hi, I'm wondering whether you've considered storing all the keys in one file instead of a file per key. The reason I ask is that it seems to me that the key rotation procedure would be a lot simpler if it were all in one file. You could just create a temporary file and atomically rename it over the existing file. If you see a temporary file you're always free to remove it. This is a lot simpler than what you have right now. The "repair" concept pretty much goes away completely, which seems nice. Granted I don't know exactly how to store multiple keys in one file, but I bet there's some way to do it. The way in which you are posting these patches is quite unlike what most people do when posting patches to this list. You seem to have generated a bunch of patches using 'git format-patch' but then concatenated them all together in a single file. It would be helpful if you could do this more like the way that is now standard on this list. Not only that, but the patches don't have meaningful commit messages in them, and don't seem to be meaningfully split for easy review. They just say things like 'crypto squash commit'. Compare this to for example what I did on the "cleaning up a few CLOG-related things" thread where the commits appear in a logical sequence, and each one has a meaningful commit message. Or here's an example from someone else -- http://postgr.es/m/be72abfa-e62e-eb81-4e70-1b57fe6dc9e2@amazon.com -- and note the inclusion of authorship information in the commit messages, so that the source of the code can be easily understood. The README in src/backend/crypto does not explain how the scripts in that directory are intended to be used. If I want to use AWS Secrets Manager with this feature, I can see that I should use ckey_aws.sh.sample as a basis for that integration, but I don't know what I should do with the script because the README says nothing about it. I am frankly pretty doubtful about the idea of shipping a bunch of /bin/sh scripts as a best practice; for one thing, that's totally unusable on Windows, and it also means that this is dependent on /bin/sh existing and having the behavior we expect and on all the other executables in these scripts as well. But, at the very least, there needs to be a clearer explanation of how the scripts are intended to be used, which parts people are supposed to modify, what arguments they're going to get called with, and things like that. The comments in cipher.c and cipher_openssl.c could be improved to explain that they are alternatives to each other. Perhaps the former could be renamed to something like cipher_failure.c or cipher_noimpl.c for clarity. I believe that a StaticAssertStmt could be used to check the length of the encryption_methods[] array, so that if someone changes NUM_ENCRYPTION_METHODS without updating the array, compilation fails. See UserAuthName[] for an example of how to do this. You seem to have omitted to update the documentation with the names of the new wait events that you added. In process_postgres_switches(), when there's a multi-line comment followed by a single line of actual code, I prefer to include braces around the whole thing. There might be some disagreement on what is best here, though. What are the consequences of the placement of the code in PostgresMain() for processes other than user backends and walsenders? 
I think that the way you have it, background workers would not have access to keys, nor auxiliary processes like the checkpointer ... at least in the EXEC_BACKEND case. In the non-EXEC_BACKEND case you have the postmaster doing it, so then I'm not sure why it has to be redone for every backend. Won't they just inherit the data from the postmaster? Has this code been meaningfully tested on Windows? How do we know that it works? Maybe we need to think about adding some asserts that guarantee that any process that attempts to access a buffer has the key manager initialized; I bet such assertions would fail at least on Windows as the code looks now. I don't think it makes sense to think about committing this to v14. I believe it only makes sense if we have a TDE patch that is relatively close to committable that can be used with it. I also don't think that this patch is in good enough shape to commit yet in terms of where it's at in terms of quality; I think it needs more review first, hopefully including review from people who can comment intelligently specifically on the cryptography aspects of it. However, the challenges don't seem insurmountable. There's also still some question in my mind about whether the design choices here (one KEK, 2 DEKs, one for data and one for WAL) have enough consensus. I don't have a considered opinion on that, partly because I'm not quite sure what the reasonable alternatives are, but it seems that other people had some questions about it, IIUC. -- Robert Haas EDB: http://www.enterprisedb.com
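For reference, the atomic replacement Robert describes at the top of this message is the standard write-temp-fsync-rename idiom; a minimal sketch, with hypothetical names and abbreviated error handling:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Atomically replace a key file: write the new contents under a
     * temporary name, flush to disk, then rename() over the old file.
     * rename() is atomic on POSIX filesystems, so a crash leaves either
     * the complete old file or the complete new file, never a mix, and
     * any leftover *.tmp file can simply be removed.
     */
    static int
    replace_key_file(const char *path, const void *buf, size_t len)
    {
        char        tmppath[1024];
        int         fd;

        snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

        fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t) len ||
            fsync(fd) != 0 ||
            close(fd) != 0)
            return -1;

        /* readers see either the old file or the new one */
        return rename(tmppath, path);
    }

With all keys in one file, this single rename replaces the multi-step copy-and-repair dance the current per-key layout requires.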
On Tue, Jan 26, 2021 at 03:24:30PM -0500, Robert Haas wrote: > On Tue, Jan 26, 2021 at 11:15 AM Bruce Momjian <bruce@momjian.us> wrote: > > This version fixes OpenSSL detection and improves docs for initdb > > interactions. > > Hi, > > I'm wondering whether you've considered storing all the keys in one > file instead of a file per key. The reason I ask is that it seems to > me that the key rotation procedure would be a lot simpler if it were > all in one file. You could just create a temporary file and atomically > rename it over the existing file. If you see a temporary file you're > always free to remove it. This is a lot simpler than what you have > right now. The "repair" concept pretty much goes away completely, > which seems nice. Granted I don't know exactly how to store multiple > keys in one file, but I bet there's some way to do it. We envisioned allowing heap/index key rotation by having a standby with the same WAL key as the primary but different heap/index keys so that we can failover to the standby to change the heap/index key and then change the WAL key. This separation allows that. We also might need some additional keys later and this allows that. I do like simplicity, but the complexity here seems to serve a need. > The way in which you are posting these patches is quite unlike what > most people do when posting patches to this list. You seem to have > generated a bunch of patches using 'git format-patch' but then > concatenated them all together in a single file. It would be helpful > if you could do this more like the way that is now standard on this > list. Not only that, but the patches don't have meaningful commit What is the standard? You want seven separate files? I can do that. > messages in them, and don't seem to be meaningfully split for easy > review. They just say things like 'crypto squash commit'. Compare this Yes, the feature is at the backend, common, /bin, and test levels. I was able to separate out the bin, pg_alterckey and test stuff, but the backend interactions were hard to split. > to for example what I did on the "cleaning up a few CLOG-related > things" thread where the commits appear in a logical sequence, and > each one has a meaningful commit message. Or here's an example from > someone else -- > http://postgr.es/m/be72abfa-e62e-eb81-4e70-1b57fe6dc9e2@amazon.com -- > and note the inclusion of authorship information in the commit > messages, so that the source of the code can be easily understood. I see. I am not sure how to do that easily for all the pieces. > The README in src/backend/crypto does not explain how the scripts in > that directory are intended to be used. If I want to use AWS Secrets > Manager with this feature, I can see that I should use > ckey_aws.sh.sample as a basis for that integration, but I don't know > what I should do with the script because the README says nothing about > it. I am frankly pretty doubtful about the idea of shipping a bunch of > /bin/sh scripts as a best practice; for one thing, that's totally > unusable on Windows, and it also means that this is dependent on > /bin/sh existing and having the behavior we expect and on all the > other executables in these scripts as well. But, at the very least, > there needs to be a clearer explanation of how the scripts are > intended to be used, which parts people are supposed to modify, what > arguments they're going to get called with, and things like that. I added comments to most of the scripts. 
I don't know what more I can do, or what other language would be appropriate. > The comments in cipher.c and cipher_openssl.c could be improved to > explain that they are alternatives to each other. Perhaps the former > could be renamed to something like cipher_failure.c or cipher_noimpl.c > for clarity. This follows the way cryptohash.c and cryptohash_openssl.c are done. I did just add comments to the top of cipher.c and cipher_openssl.c to be just like the cryptohash versions. > I believe that a StaticAssertStmt could be used to check the length of > the encryption_methods[] array, so that if someone changes > NUM_ENCRYPTION_METHODS without updating the array, compilation fails. > See UserAuthName[] for an example of how to do this. Sure, good idea, done. > You seem to have omitted to update the documentation with the names of > the new wait events that you added. OK, added. > In process_postgres_switches(), when there's a multi-line comment > followed by a single line of actual code, I prefer to include braces > around the whole thing. There might be some disagreement on what is > best here, though. OK, done. > What are the consequences of the placement of the code in > PostgresMain() for processes other than user backends and walsenders? > I think that the way you have it, background workers would not have > access to keys, nor auxiliary processes like the checkpointer ... at Well, there are three cases: --boot mode, postmaster mode, and postgres single-user mode. I tried to have all those cases only unwrap the keys once and store them in shared memory, or in boot mode, in local memory. As far as I know, the startup process does it once and everyone else uses shared memory to access it. > least in the EXEC_BACKEND case. In the non-EXEC_BACKEND case you have > the postmaster doing it, so then I'm not sure why it has to be redone > for every backend. Won't they just inherit the data from the For postgres --single. > postmaster? Has this code been meaningfully tested on Windows? How do No, just by the cfbot Windows machine. > we know that it works? Maybe we need to think about adding some > asserts that guarantee that any process that attempts to access a > buffer has the key manager initialized; I bet such assertions would > fail at least on Windows as the code looks now. Are you saying we should set a global variable and throw an error if it is accessed without the array being initialized? > I don't think it makes sense to think about committing this to v14. I > believe it only makes sense if we have a TDE patch that is relatively > close to committable that can be used with it. I also don't think that > this patch is in good enough shape to commit yet in terms of where > it's at in terms of quality; I think it needs more review first, > hopefully including review from people who can comment intelligently > specifically on the cryptography aspects of it. However, the > challenges don't seem insurmountable. There's also still some question > in my mind about whether the design choices here (one KEK, 2 DEKs, one > for data and one for WAL) have enough consensus. I don't have a > considered opinion on that, partly because I'm not quite sure what the > reasonable alternatives are, but it seems that other people had some > questions about it, IIUC. While I am willing to make requested adjustments to the patch, I don't plan to work on this feature any further, assuming your analysis above is correct.
If after years we are still not sure this is the right direction, I don't see any point in moving forward with the later pieces, which are even more complicated. I will join the group of people that feel there will never be consensus on implementing this feature in the community, so it is not worth trying. I would also like to add a "not wanted" entry for this feature on the TODO list, based on the feature's limited usefulness, but I already asked about that and no one seems to feel we don't want it. I now better understand why the OpenSSL project has had such serious problems in the past. Updated patch attached as seven attachments. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Attachment
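For reference, the compile-time check Robert suggested and Bruce adopted, modeled on UserAuthName[], would look roughly like this at file scope (StaticAssertDecl is the declaration-level form of the StaticAssertStmt Robert mentions, and lengthof() is the array-length macro from c.h; the exact names here are a sketch and may differ from the patch):

    /* keep the array in sync with the enum of supported methods */
    static const char *const encryption_method_names[] = {
        "",                         /* no encryption */
        "AES128",
        "AES256"
    };

    StaticAssertDecl(lengthof(encryption_method_names) == NUM_ENCRYPTION_METHODS,
                     "encryption_method_names[] must match NUM_ENCRYPTION_METHODS");

If someone adds a method to the enum without extending the array, the build fails immediately instead of misbehaving at runtime.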
On Tue, Jan 26, 2021 at 05:53:01PM -0500, Bruce Momjian wrote: > On Tue, Jan 26, 2021 at 03:24:30PM -0500, Robert Haas wrote: > > I'm wondering whether you've considered storing all the keys in one > > file instead of a file per key. The reason I ask is that it seems to > > me that the key rotation procedure would be a lot simpler if it were > > all in one file. You could just create a temporary file and atomically > > rename it over the existing file. If you see a temporary file you're > > always free to remove it. This is a lot simpler than what you have > > right now. The "repair" concept pretty much goes away completely, > > which seems nice. Granted I don't know exactly how to store multiple > > keys in one file, but I bet there's some way to do it. > > We envisioned allowing heap/index key rotation by having a standby with > the same WAL key as the primary but different heap/index keys so that we > can failover to the standby to change the heap/index key and then change > the WAL key. This separation allows that. We also might need some > additional keys later and this allows that. I do like simplicity, but > the complexity here seems to serve a need. Just to close this issue, several scripts, e.g., PIV, AWS, need to store data to indicate the cluster encryption key used, and those need to be kept synchronized with the wrapped data keys. Having separate directories for each cluster key version allows that to work cleanly. > > The README in src/backend/crypto does not explain how the scripts in > > that directory are intended to be used. If I want to use AWS Secrets > > Manager with this feature, I can see that I should use > > ckey_aws.sh.sample as a basis for that integration, but I don't know > > what I should do with the script because the README says nothing about > > it. I am frankly pretty doubtful about the idea of shipping a bunch of > > /bin/sh scripts as a best practice; for one thing, that's totally > > unusable on Windows, and it also means that this is dependent on > > /bin/sh existing and having the behavior we expect and on all the > > other executables in these scripts as well. But, at the very least, > > there needs to be a clearer explanation of how the scripts are > > intended to be used, which parts people are supposed to modify, what > > arguments they're going to get called with, and things like that. > > I added comments to most of the scripts. I think someone would need to write Windows versions of these scripts. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Hello,
> I don't think it makes sense to think about committing this to v14. I
> believe it only makes sense if we have a TDE patch that is relatively
> close to committable that can be used with it. I also don't think that
> this patch is in good enough shape to commit yet in terms of where
> it's at in terms of quality; I think it needs more review first,
> hopefully including review from people who can comment intelligently
> specifically on the cryptography aspects of it. However, the
> challenges don't seem insurmountable. There's also still some question
> in my mind about whether the design choices here (one KEK, 2 DEKs, one
> for data and one for WAL) have enough consensus. I don't have a
> considered opinion on that, partly because I'm not quite sure what the
> reasonable alternatives are, but it seems that other people had some
> questions about it, IIUC.
While I am willing to make requested adjustments to the patch, I don't
plan to work on this feature any further, assuming your analysis above is
correct. If after years we are still not sure this is the right
direction, I don't see any point in moving forward with the later
pieces, which are even more complicated. I will join the group of
people that feel there will never be consensus on implementing this
feature in the community, so it is not worth trying.
I would also like to add a "not wanted" entry for this feature on the
TODO list, based on the feature's limited usefulness, but I already
asked about that and no one seems to feel we don't want it.
I want to avoid seeing this happen. As a result of a lot of customer and user discussions around their criteria for choosing a database, I believe TDE is an important feature, and having it appear with a "not-wanted" tag will keep the version of PostgreSQL released by the community out of a certain (and possibly growing) number of deployment scenarios, which I don't think anybody wants to see.
I think the current situation is as follows (if I missed something please let me know):
1) We need to get the current patch for Key Management reviewed and tested further.
I spoke to Bruce just now; he will see if he can get somebody to do this.
2) We need to start working on the actual TDE implementation and get it pretty close to final before we start committing smaller portions of the feature.
Unfortunately, on this front, the only things I think I can offer are:
a) Ask for volunteers to work on the TDE implementation.
b) Facilitate the work between volunteers.
c) Prod folks along and cheer as we go.
So I will start with (a): do we have any volunteers who feel they can contribute regularly for a while and would like to be part of a team that moves this forward?
I now better understand why the OpenSSL project has had such serious
problems in the past.
Updated patch attached as seven attachments.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
Thomas John Kincaid
On Thu, Jan 28, 2021 at 02:41:09PM -0500, Tom Kincaid wrote: > I would also like to add a "not wanted" entry for this feature on the > TODO list, based on the feature's limited usefulness, but I already > asked about that and no one seems to feel we don't want it. > > > I want to avoid seeing this happen. As a result of a lot of customer and user > discussions around their criteria for choosing a database, I believe TDE is an > important feature, and having it appear with a "not-wanted" tag will keep the > version of PostgreSQL released by the community out of a certain (and possibly > growing) number of deployment scenarios, which I don't think anybody wants to > see. With pg_upgrade, I could work on it out of the tree until it became popular, with a small non-user-visible part in the backend. With the Windows port, the port wasn't really visible to users until it was ready. For the key management part of TDE, it can't be done outside the tree, and it is user-visible before it is useful, so that restricts how much incremental work can be committed to the tree for TDE. I highlighted that concern in emails months ago, but never got any feedback --- now it seems people are realizing the ramifications of that. > I think the current situation is as follows (if I missed something please > let me know): > > 1) We need to get the current patch for Key Management reviewed and tested > further. > > I spoke to Bruce just now; he will see if he can get somebody to do this. Well, if we don't get anyone committed to working on the data encryption part of TDE, the key management part is useless, so why review/test it further? Although Sawada-san and Stephen Frost worked on the patch, they have not commented much on my additions, and only a few others have commented on the code, and there has been no discussion on who is working on the next steps. This indicates to me that there is little interest in moving this feature forward, which is why I started asking if it could be labeled as "not wanted". -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Fri, Jan 29, 2021 at 5:22 AM Bruce Momjian <bruce@momjian.us> wrote: > > On Thu, Jan 28, 2021 at 02:41:09PM -0500, Tom Kincaid wrote: > > I would also like to add a "not wanted" entry for this feature on the > > TODO list, based on the feature's limited usefulness, but I already > > asked about that and no one seems to feel we don't want it. > > > > > > I want to avoid seeing this happen. As a result of a lot of customer and user > > discussions around their criteria for choosing a database, I believe TDE is an > > important feature, and having it appear with a "not-wanted" tag will keep the > > version of PostgreSQL released by the community out of a certain (and possibly > > growing) number of deployment scenarios, which I don't think anybody wants to > > see. > > With pg_upgrade, I could work on it out of the tree until it became > popular, with a small non-user-visible part in the backend. With the > Windows port, the port wasn't really visible to users until it was ready. > > For the key management part of TDE, it can't be done outside the tree, > and it is user-visible before it is useful, so that restricts how much > incremental work can be committed to the tree for TDE. I highlighted > that concern in emails months ago, but never got any feedback --- now it > seems people are realizing the ramifications of that. > > > I think the current situation is as follows (if I missed something please > > > let me know): > > > > > > 1) We need to get the current patch for Key Management reviewed and tested > > > further. > > > > > > I spoke to Bruce just now; he will see if he can get somebody to do this. > > Well, if we don't get anyone committed to working on the data encryption > part of TDE, the key management part is useless, so why review/test it > further? > > Although Sawada-san and Stephen Frost worked on the patch, they have not > commented much on my additions, and only a few others have commented on > the code, and there has been no discussion on who is working on the next > steps. This indicates to me that there is little interest in moving > this feature forward, TBH I'm confused a bit about the recent situation of this patch, but I can contribute to KMS work by discussing, writing, reviewing, and testing the patch. Also, I can work on the data encryption part of TDE (we need more discussion on that though). If the community has concerns about the high-level design and thinks the design reviews by cryptography experts are still needed, we would need to do that first since the data encryption part of TDE depends on KMS. As far as I know, we have done that many times on pgsql-hackers and off-line, including the discussion on the past proposal, etc., but given that the community still has a concern, it seems that we haven't been able to share enough of the details of the discussion that led to the design decision, or the design is still not good. Honestly, I'm not sure how this feature can get consensus. But maybe we would need to have a break from refining the patch now and we need to marshal the discussions so far and the point behind the design so that everyone can understand why this feature is designed in that way. To do that, it might be a good start to sort the wiki page since it has data encryption part, KMS, and ToDo mixed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Greetings, * Masahiko Sawada (sawada.mshk@gmail.com) wrote: > On Fri, Jan 29, 2021 at 5:22 AM Bruce Momjian <bruce@momjian.us> wrote: > > On Thu, Jan 28, 2021 at 02:41:09PM -0500, Tom Kincaid wrote: > > > I would also like to add a "not wanted" entry for this feature on the > > > TODO list, based on the feature's limited usefulness, but I already > > > asked about that and no one seems to feel we don't want it. > > > > > > > > > I want to avoid seeing this happen. As a result of a lot of customer and user > > > discussions around their criteria for choosing a database, I believe TDE is an > > > important feature, and having it appear with a "not-wanted" tag will keep the > > > version of PostgreSQL released by the community out of a certain (and possibly > > > growing) number of deployment scenarios, which I don't think anybody wants to > > > see. > > > > With pg_upgrade, I could work on it out of the tree until it became > > popular, with a small non-user-visible part in the backend. With the > > Windows port, the port wasn't really visible to users until it was ready. > > > > For the key management part of TDE, it can't be done outside the tree, > > and it is user-visible before it is useful, so that restricts how much > > incremental work can be committed to the tree for TDE. I highlighted > > that concern in emails months ago, but never got any feedback --- now it > > seems people are realizing the ramifications of that. > > > > > I think the current situation is as follows (if I missed something please > > > let me know): > > > > > > 1) We need to get the current patch for Key Management reviewed and tested > > > further. > > > > > > I spoke to Bruce just now; he will see if he can get somebody to do this. > > > > Well, if we don't get anyone committed to working on the data encryption > > part of TDE, the key management part is useless, so why review/test it > > further? > > > > Although Sawada-san and Stephen Frost worked on the patch, they have not > > commented much on my additions, and only a few others have commented on > > the code, and there has been no discussion on who is working on the next > > steps. This indicates to me that there is little interest in moving > > this feature forward, > > TBH I'm confused a bit about the recent situation of this patch, but I > can contribute to KMS work by discussing, writing, reviewing, and > testing the patch. Also, I can work on the data encryption part of TDE > (we need more discussion on that though). If the community has concerns > about the high-level design and thinks the design reviews by > cryptography experts are still needed, we would need to do that first > since the data encryption part of TDE depends on KMS. As far as I > know, we have done that many times on pgsql-hackers and off-line, > including the discussion on the past proposal, etc., but given that the > community still has a concern, it seems that we haven't been able to > share enough of the details of the discussion that led to the design > decision, or the design is still not good. Honestly, I'm not sure how > this feature can get consensus. But maybe we would need to have a > break from refining the patch now and we need to marshal the > discussions so far and the point behind the design so that everyone > can understand why this feature is designed in that way. To do that, > it might be a good start to sort the wiki page since it has data > encryption part, KMS, and ToDo mixed.
I hope it's pretty clear that I'm also very much in support of both this effort with the KMS and of TDE in general- TDE is specifically, repeatedly, called out as a capability whose lack is blocking PG from being able to be used for certain use-cases that it would otherwise be well suited for, and that's really unfortunate. I appreciate the recent discussion and reviews of the KMS in particular, and of the patches which have been sent enabling TDE based on the KMS patches. Having them be relatively independent seems to be an ongoing concern and perhaps we should figure out a way to more clearly put them together. That is- the KMS patches have been posted on one thread, and TDE PoC patches which use the KMS patches have been on another thread, leading some to not realize that there's already been TDE PoC work done based on the KMS patches. Seems like it might make sense to get one patch set which goes all the way from the KMS and includes the TDE PoC, even if they don't all go in at once. I'm happy to go look over the KMS patches again if that'd be helpful and to comment on the TDE PoC. I can also spend some time trying to improve on each, as I've already done. A few of the larger concerns that I have revolve around how to store integrity information (I've tried to find a way to make room for such information in our existing page layout and, perhaps unsurprisingly, it's far from trivial to do so in a way that will avoid breaking the existing page layout, or where the same set of binaries could work on both unencrypted pages and encrypted pages with integrity validation information, and that's a problem that we really should consider trying to solve...), and how to automate key rotation (one of the nice things about Bruce's approach to storing the keys is that we're leveraging the filesystem as an index- it's easy to see how we might extend the key-per-file approach to allow us to, say, have a different key for every 32GB of LSN, but if we tried to put all of the keys into a single file then we'd have to figure out an indexing solution for it which would allow us to find the key we need to decrypt a given page...). I tend to agree with Bruce that we need to take these things in steps, getting each piece implemented as we go. Maybe we can do that in a separate repo for a time and then bring it all together, as a few on this thread have voiced, but there's no doubt that this is a large project and it's hard to see how we could possibly commit all of it at once. Thanks! Stephen
Attachment
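To make Stephen's "different key for every 32GB of LSN" example concrete: with the key-per-file layout, the file name itself can serve as the index, since 32GB is 2^35 bytes and the key number is just the LSN shifted right. A purely hypothetical sketch:

    #include <stdint.h>
    #include <stdio.h>

    /* 32GB of WAL per key: the key number is the LSN shifted by 35 bits */
    #define WAL_KEY_SPAN_BITS 35

    /* build the path of the key file covering a given WAL LSN */
    static void
    wal_key_path(char *buf, size_t buflen, uint64_t lsn)
    {
        snprintf(buf, buflen, "pg_cryptokeys/wal/%08llX",
                 (unsigned long long) (lsn >> WAL_KEY_SPAN_BITS));
    }

With a single key file, the equivalent lookup would need an explicit LSN-range index inside the file, which is the indexing problem Stephen describes.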
Thanks Stephen, Bruce and Masahiko,
> discussions so far and the point behind the design so that everyone
> can understand why this feature is designed in that way. To do that,
> it might be a good start to sort the wiki page since it has data
> encryption part, KMS, and ToDo mixed.
I hope it's pretty clear that I'm also very much in support of both this
effort with the KMS and of TDE in general- TDE is specifically,
repeatedly, called out as a capability whose lack is blocking PG from
being able to be used for certain use-cases that it would otherwise be
well suited for, and that's really unfortunate.
It is clear you are supportive.
As you know, I share your point of view that PG adoption is suffering for certain use cases because it does not have TDE.
I appreciate the recent discussion and reviews of the KMS in particular,
and of the patches which have been sent enabling TDE based on the KMS
patches. Having them be relatively independent seems to be an ongoing
concern and perhaps we should figure out a way to more clearly put them
together. That is- the KMS patches have been posted on one thread, and
TDE PoC patches which use the KMS patches have been on another thread,
leading some to not realize that there's already been TDE PoC work done
based on the KMS patches. Seems like it might make sense to get one
patch set which goes all the way from the KMS and includes the TDE PoC,
even if they don't all go in at once.
Sounds good, thanks Masahiko, let's see if we can get consensus on the approach for moving this forward; see below.
together, as a few on this thread have voiced, but there's no doubt that
this is a large project and it's hard to see how we could possibly
commit all of it at once.
I propose that we meet to discuss what approach we want to use to move TDE forward. We then start a new thread with a proposal on the approach and finalize it via community consensus. I will invite Bruce, Stephen and Masahiko to this meeting. If anybody else would like to participate in this discussion and subsequently in the effort to get TDE in PG1x, please let me know. Assuming Bruce, Stephen and Masahiko are down for this, I (or a volunteer from this meeting) will post the proposal for how we move this patch forward in another thread. Hopefully, we can get consensus on that and subsequently restart the execution of delivering this feature.
Thanks!
Stephen
Thomas John Kincaid
Dear All. Thank you for all the opinions and discussions regarding the KMS/TDE work. To get to the point of this email: I want to participate in whatever I can (review or development) once TDE-related development is in progress. I didn't fully understand KMS and didn't take part in the development directly, so I haven't commented on anything so far, but when TDE development starts I would like to join the discussions and meetings wherever I can help. Since my English is limited, I may rarely say anything in voice and video meetings, but I would still like to attend, even if only to listen. Also, once the wiki page and other TDE-related mail threads start, I'll join those discussions where I can. Best regards. Moon. On Sat, Jan 30, 2021 at 10:23 PM Tom Kincaid <tomjohnkincaid@gmail.com> wrote: > > > > > > Thanks Stephen, Bruce and Masahiko, > >> >> > discussions so far and the point behind the design so that everyone >> > can understand why this feature is designed in that way. To do that, >> > it might be a good start to sort the wiki page since it has data >> > encryption part, KMS, and ToDo mixed. >> >> I hope it's pretty clear that I'm also very much in support of both this >> effort with the KMS and of TDE in general- TDE is specifically, >> repeatedly, called out as a capability whose lack is blocking PG from >> being able to be used for certain use-cases that it would otherwise be >> well suited for, and that's really unfortunate. > > > It is clear you are supportive. > > As you know, I share your point of view that PG adoption is suffering for certain use cases because it does not have TDE. > >> I appreciate the recent discussion and reviews of the KMS in particular, >> and of the patches which have been sent enabling TDE based on the KMS >> patches. Having them be relatively independent seems to be an ongoing >> concern and perhaps we should figure out a way to more clearly put them >> together. That is- the KMS patches have been posted on one thread, and >> TDE PoC patches which use the KMS patches have been on another thread, >> leading some to not realize that there's already been TDE PoC work done >> based on the KMS patches. Seems like it might make sense to get one >> patch set which goes all the way from the KMS and includes the TDE PoC, >> even if they don't all go in at once. > > > Sounds good, thanks Masahiko, let's see if we can get consensus on the approach for moving this forward; see below. > >> >> >> together, as a few on this thread have voiced, but there's no doubt that >> this is a large project and it's hard to see how we could possibly >> commit all of it at once. > > > I propose that we meet to discuss what approach we want to use to move TDE forward. We then start a new thread with a proposal on the approach and finalize it via community consensus. I will invite Bruce, Stephen and Masahiko to this meeting. If anybody else would like to participate in this discussion and subsequently in the effort to get TDE in PG1x, please let me know. Assuming Bruce, Stephen and Masahiko are down for this, I (or a volunteer from this meeting) will post the proposal for how we move this patch forward in another thread. 
Hopefully, we can get consensus on that and subsequently restart the execution of delivering this feature. > > > > >> >> Thanks! >> >> Stephen > > > > -- > Thomas John Kincaid >
On Fri, Jan 29, 2021 at 05:05:06PM +0900, Masahiko Sawada wrote: > TBH I’m confused a bit about the recent situation of this patch, but > I Yes, it is easy to get confused. > can contribute to KMS work by discussing, writing, reviewing, and > testing the patch. Also, I can work on the data encryption part of TDE Great. > (we need more discussion on that though). If the community is concerned > about the high-level design and thinks the design reviews by > cryptography experts are still needed, we would need to do that first > since the data encryption part of TDE depends on KMS. As far as I I totally agree. While we don't need to commit the key management patch to the tree before moving forward, we should have agreement on the key management patch before doing more work on this. If we can't agree on the key management part, there is no value in working further, as I stated in an earlier email. > know, we have done that many times on pgsql-hackers, off-line, and > including the discussion on the past proposal, etc., but given that the > community still has a concern, it seems that we haven’t been able > to share enough of the details of the discussion that led to the design > decision, or the design is still not good. Honestly, I’m not sure how > this feature can get consensus. But maybe we would need to have a Yes, I am also confused. > break from refining the patch now and we need to marshal the > discussions so far and the point behind the design so that everyone > can understand why this feature is designed in that way. To do that, > it might be a good start to sort the wiki page since it has data > encryption part, KMS, and ToDo mixed. What I ended up doing is moving the majority of the non-data-encryption part of the wiki into the patch, either in docs or README files, since people asked for more of this in the patch, and having the information in two places is confusing. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Fri, Jan 29, 2021 at 05:40:37PM -0500, Stephen Frost wrote: > I hope it's pretty clear that I'm also very much in support of both this > effort with the KMS and of TDE in general- TDE is specifically, Yes, thanks. I know we have privately talked about this recently, but it is nice to have it in public like this. > repeatedly, called out as a capability whose lack is blocking PG from > being able to be used for certain use-cases that it would otherwise be > well suited for, and that's really unfortunate. So, below, I am going to copy two doc paragraphs from the patch: The purpose of cluster file encryption is to prevent users with read access to the directories used to store database files and write-ahead log files from being able to access the data stored in those files. For example, when using cluster file encryption, users who have read access to the cluster directories for backup purposes will not be able to decrypt the data stored in these files. It also protects against decrypted data access after media theft. File system write access can allow for unauthorized file system data decryption if the writes can be used to weaken the system's security and this weakened system is later supplied with externally-stored keys. This also does not protect from users who have read access to system memory. This also does not detect or protect against users with write access from removing or modifying database files. Given what I said above, is the value of this feature for compliance, or for actual additional security? If it is just compliance, are we willing to add all of this code just for that, even if it has limited security value? We should answer this question now, and if we don't want it, let's document that so users know and can consider alternatives. FYI, I don't think we can detect or protect against writers modifying the data files --- even if we could do it on a block level, they could remove trailing pages (might cause index lookup failures) or copy pages from other tables at the same offset. Therefore, I think we can only offer viewing security, not modification detection/prevention. > I appreciate the recent discussion and reviews of the KMS in particular, > and of the patches which have been sent enabling TDE based on the KMS > patches. Having them be relatively independent seems to be an ongoing I was thinking some more and I have received productive feedback from at least eight people on the key management patch, which is very good. > concern and perhaps we should figure out a way to more clearly put them > together. That is- the KMS patches have been posted on one thread, and > TDE PoC patches which use the KMS patches have been on another thread, > leading some to not realize that there's already been TDE PoC work done > based on the KMS patches. Seems like it might make sense to get one > patch set which goes all the way from the KMS and includes the TDE PoC, > even if they don't all go in at once. Uh, it is worse than that. Some people saw comments about the TDE PoC patch (e.g., buffer pins) and thought they were related to the KMS patch, so they thought the KMS patch wasn't ready. Now, I am not saying the KMS patch is ready, but comments on the TDE PoC patch are unrelated to the KMS patch being ready. I think the TDE PoC was a big positive because it showed the KMS patch being used for the actual use-case we are planning, so it was truly a proof-of-concept. > I'm happy to go look over the KMS patches again if that'd be helpful and > to comment on the TDE PoC.
> I can also spend some time trying to improve I think we eventually need a full review of the TDE PoC, combined with the Cybertec patch, and the wiki, to get them all aligned. However, as I said already, let's get the KMS patch approved, even if we don't apply it now, so we know we are on an approved foundation. > on each, as I've already done. A few of the larger concerns that I have > revolve around how to store integrity information (I've tried to find a > way to make room for such information in our existing page layout and, > perhaps unsurprisingly, it's far from trivial to do so in a way that will > avoid breaking the existing page layout, or where the same set of > binaries could work on both unencrypted pages and encrypted pages with > integrity validation information, and that's a problem that we really As stated above, I think we only need a byte or two for the hint bit counter (used in the IV), as I don't think the GCM verification bytes will add any additional security, and I bet we can find a byte or two. We do need a separate discussion on this, either here or privately. > should consider trying to solve...), and how to automate key rotation > (one of the nice things about Bruce's approach to storing the keys is > that we're leveraging the filesystem as an index- it's easy to see how > we might extend the key-per-file approach to allow us to, say, have a > different key for every 32GB of LSN, but if we tried to put all of the > keys into a single file then we'd have to figure out an indexing > solution for it which would allow us to find the key we need to decrypt > a given page...). I tend to agree with Bruce that we need to take Yeah, yuck on that plan. I was very happy with how the per-version directory worked with scripts that needed to store matching state. > these things in steps, getting each piece implemented as we go. Maybe > we can do that in a separate repo for a time and then bring it all > together, as a few on this thread have voiced, but there's no doubt that > this is a large project and it's hard to see how we could possibly > commit all of it at once. I was putting stuff in a git tree/URL; you can see it here: https://github.com/postgres/postgres/compare/master...bmomjian:key.diff https://github.com/postgres/postgres/compare/master...bmomjian:key.patch https://github.com/postgres/postgres/compare/master...bmomjian:key However, people wanted persistent patches attached, so I started doing that. Attached is the current patch set. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Attachment
On Sat, Jan 30, 2021 at 08:23:11AM -0500, Tom Kincaid wrote: > I propose that we meet to discuss what approach we want to use to move TDE > forward. We then start a new thread with a proposal on the approach > and finalize it via community consensus. I will invite Bruce, Stephen and > Masahiko to this meeting. If anybody else would like to participate in this > discussion and subsequently in the effort to get TDE in PG1x, please let me > know. Assuming Bruce, Stephen and Masahiko are down for this, I (or a volunteer > from this meeting) will post the proposal for how we move this patch forward in > another thread. Hopefully, we can get consensus on that and subsequently > restart the execution of delivering this feature. We got complaints that decisions were not publicly discussed, or were too long, so I am not sure this helps. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Fri, Jan 29, 2021 at 05:40:37PM -0500, Stephen Frost wrote: > > I hope it's pretty clear that I'm also very much in support of both this > > effort with the KMS and of TDE in general- TDE is specifically, > > Yes, thanks. I know we have privately talked about this recently, but > it is nice to have it in public like this. Certainly happy to lend my support and to spend some time working on this to move it forward. > > repeatedly, called out as a capability whose lack is blocking PG from > > being able to be used for certain use-cases that it would otherwise be > > well suited for, and that's really unfortunate. > > So, below, I am going to copy two doc paragraphs from the patch: > > The purpose of cluster file encryption is to prevent users with read > access to the directories used to store database files and write-ahead > log files from being able to access the data stored in those files. > For example, when using cluster file encryption, users who have read > access to the cluster directories for backup purposes will not be able > to decrypt the data stored in these files. It also protects against > decrypted data access after media theft. That's one valid use-case and it particularly makes sense to consider, now that we support group read-access to the data cluster. The last line seems a bit unclear- I would update it to say: Cluster file encryption also provides data-at-rest security, protecting users from data loss should the physical media on which the cluster is stored be stolen, improperly deprovisioned (not wiped or destroyed), or otherwise end up in the hands of an attacker. > File system write access can allow for unauthorized file system data > decryption if the writes can be used to weaken the system's security > and this weakened system is later supplied with externally-stored keys. This isn't very clear as to exactly what the concern is or how an attacker would be able to thwart the system if they had write access to it. An attacker with write access could possibly attempt to replace the existing keys, but with the key wrapping that we're using, that should result in just a decryption failure (unless, of course, the attacker has the actual KEK that was used, but that's not terribly interesting to worry about since then they could just go access the files directly). Until and unless we solve the issue around storing the GCM tags for each page, we will have the risk that an attacker could modify a page in a manner that we wouldn't detect. This is the biggest concern that I have currently with the existing TDE patch sets. There are two options that I see around how to address that issue- either we arrange to create space in the page for the tag, such as by making the 'special' space on a page a bit bigger and making sure that everything understands that, or we'll need to add another fork in which we store the tags (and possibly other TDE/encryption related information). If we go with a fork then it should be possible to do WAL streaming from an unencrypted cluster to an encrypted one, which would be pretty neat, but it means another fork and another page that has to be read/written every time we modify a page. Getting some input into the trade-offs here would be really helpful. I don't think it's really reasonable to go out with TDE without having figured out the integrity side. Certainly, when I review things like NIST 800-53, it's very clear that the requirement is for both confidentiality *and* integrity.
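For concreteness, here is a minimal sketch of what the GCM tag buys, using OpenSSL's EVP API (the function names and buffer handling are illustrative, not the patch's actual code; error checking omitted):

    /*
     * Illustrative sketch: AES-256-GCM page encryption/decryption.
     * Encryption emits a 16-byte authentication tag; decryption fails
     * (EVP_DecryptFinal_ex() returns <= 0) if the ciphertext or tag was
     * modified. Without the tag stored and checked, GCM provides only
     * counter-mode confidentiality, which is the integrity gap at issue.
     */
    #include <stdbool.h>
    #include <openssl/evp.h>

    #define GCM_IV_LEN  12
    #define GCM_TAG_LEN 16

    static int
    gcm_encrypt_page(const unsigned char *key, const unsigned char *iv,
                     const unsigned char *page, int page_len,
                     unsigned char *ct, unsigned char *tag)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int     len, ct_len;

        EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, NULL, NULL);
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_IVLEN, GCM_IV_LEN, NULL);
        EVP_EncryptInit_ex(ctx, NULL, NULL, key, iv);
        EVP_EncryptUpdate(ctx, ct, &len, page, page_len);
        ct_len = len;
        EVP_EncryptFinal_ex(ctx, ct + len, &len);
        ct_len += len;
        /* the tag must be stored somewhere: special space, or a fork */
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, GCM_TAG_LEN, tag);
        EVP_CIPHER_CTX_free(ctx);
        return ct_len;
    }

    static bool
    gcm_decrypt_page(const unsigned char *key, const unsigned char *iv,
                     const unsigned char *ct, int ct_len,
                     unsigned char *tag, unsigned char *page)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int     len;
        bool    ok;

        EVP_DecryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, NULL, NULL);
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_IVLEN, GCM_IV_LEN, NULL);
        EVP_DecryptInit_ex(ctx, NULL, NULL, key, iv);
        EVP_DecryptUpdate(ctx, page, &len, ct, ct_len);
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_TAG, GCM_TAG_LEN, tag);
        ok = EVP_DecryptFinal_ex(ctx, page + len, &len) > 0;  /* tag check */
        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }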
> This also does not protect from users who have read access to system > memory. This also does not detect or protect against users with write > access from removing or modifying database files. The last seems a bit obvious, but the first sentence quoted above is important to make clear. I might even say: All of the pages in memory and all of the keys which are used for the encryption and decryption are stored in the clear in memory and therefore an attacker who is able to read the memory allocated by PostgreSQL would be able to decrypt the entire cluster. > Given what I said above, is the value of this feature for compliance, or > for actual additional security? If it is just compliance, are we willing > to add all of this code just for that, even if it has limited security > value? We should answer this question now, and if we don't want it, > let's document that so users know and can consider alternatives. The feature is for both compliance and additional security. While there are other ways to achieve data-at-rest encryption, they are not always available, for a variety of reasons. > FYI, I don't think we can detect or protect against writers modifying > the data files --- even if we could do it on a block level, they could > remove trailing pages (might cause index lookup failures) or copy > pages from other tables at the same offset. Therefore, I think we can > only offer viewing security, not modification detection/prevention. Protecting against file modification isn't about finding some way to make it so that an attacker isn't able to modify the files, it's about detecting the case where an unauthorized modification has happened. Clearly if an attacker has gained write access to the system then we can't protect against the attacker using the access they've gained, but we can in most cases detect it and that's what we should be doing. It would be really unfortunate to end up with a solution here that only provides confidentiality and doesn't address integrity at all, and I don't really think it's *that* hard to do both. That said, if we must work at this in pieces and we can get agreement to handle confidentiality initially and then add integrity later, that might be reasonable. > > I appreciate the recent discussion and reviews of the KMS in particular, > > and of the patches which have been sent enabling TDE based on the KMS > > patches. Having them be relatively independent seems to be an ongoing > > I was thinking some more and I have received productive feedback from at > least eight people on the key management patch, which is very good. Agreed. > > concern and perhaps we should figure out a way to more clearly put them > > together. That is- the KMS patches have been posted on one thread, and > > TDE PoC patches which use the KMS patches have been on another thread, > > leading some to not realize that there's already been TDE PoC work done > > based on the KMS patches. Seems like it might make sense to get one > > patch set which goes all the way from the KMS and includes the TDE PoC, > > even if they don't all go in at once. > > Uh, it is worse than that. Some people saw comments about the TDE PoC > patch (e.g., buffer pins) and thought they were related to the KMS > patch, so they thought the KMS patch wasn't ready. Now, I am not saying > the KMS patch is ready, but comments on the TDE PoC patch are unrelated > to the KMS patch being ready. I do agree with that and that it can lend to some confusion.
I'm not sure what the right solution there is except to continue to try and work with those who are interested and to clarify the separation. > I think the TDE PoC was a big positive because it showed the KMS patch > being used for the actual use-case we are planning, so it was truly a > proof-of-concept. Agreed. > > I'm happy to go look over the KMS patches again if that'd be helpful and > > to comment on the TDE PoC. I can also spend some time trying to improve > > I think we eventually need a full review of the TDE PoC, combined with > the Cybertec patch, and the wiki, to get them all aligned. However, as > I said already, let's get the KMS patch approved, even if we don't apply > it now, so we know we are on an approved foundation. While the Cybertec patch is interesting, I'd really like to see something that's a bit less invasive when it comes to how temporary files are handled. In particular, I think it'd be possible to have an API that's very similar to the existing one for serial reading and writing of files which wouldn't require nearly as many changes to things like reorderbuffer.c. I also believe there's some things we could do to avoid having to modify quite as many places when it comes to LSN assignment, so the base patch isn't as big. > > on each, as I've already done. A few of the larger concerns that I have > > revolve around how to store integrity information (I've tried to find a > > way to make room for such information in our existing page layout and, > > perhaps unsurprisingly, it's far from trivial to do so in a way that will > > avoid breaking the existing page layout, or where the same set of > > binaries could work on both unencrypted pages and encrypted pages with > > integrity validation information, and that's a problem that we really > > As stated above, I think we only need a byte or two for the hint bit > counter (used in the IV), as I don't think the GCM verification bytes > will add any additional security, and I bet we can find a byte or two. > We do need a separate discussion on this, either here or privately. I have to disagree here- the GCM tag adds integrity which is really quite important. Happy to chat about it independently, of course. > > should consider trying to solve...), and how to automate key rotation > > (one of the nice things about Bruce's approach to storing the keys is > > that we're leveraging the filesystem as an index- it's easy to see how > > we might extend the key-per-file approach to allow us to, say, have a > > different key for every 32GB of LSN, but if we tried to put all of the > > keys into a single file then we'd have to figure out an indexing > > solution for it which would allow us to find the key we need to decrypt > > a given page...). I tend to agree with Bruce that we need to take > > Yeah, yuck on that plan. I was very happy with how the per-version directory > worked with scripts that needed to store matching state. I don't know that it's going to ultimately be the best answer, as we're essentially using the filesystem as an index, as I mentioned above, but, yeah, trying to do all of that ourselves during WAL replay doesn't seem like it would be fun to try and figure out.
This is an area that I would think we'd be able to improve on in the future too- if someone wants to spend the time coming up with a single-file format that is indexed in some manner and still provides the guarantees that we need, we could very likely teach pg_upgrade how to handle that and the data set we're talking about here is quite small, even if we've got a bunch of key rotation that's happened. > > these things in steps, getting each piece implemented as we go. Maybe > > we can do that in a separate repo for a time and then bring it all > > together, as a few on this thread have voiced, but there's no doubt that > > this is a large project and it's hard to see how we could possibly > > commit all of it at once. > > I was putting stuff in a git tree/URL; you can see it here: > > https://github.com/postgres/postgres/compare/master...bmomjian:key.diff > https://github.com/postgres/postgres/compare/master...bmomjian:key.patch > https://github.com/postgres/postgres/compare/master...bmomjian:key > > However, people wanted persistent patches attached, so I started doing that. > Attached is the current patch set. Doing both seems likely to be the best option and hopefully will help everyone see the complete picture. Thanks, Stephen
Attachment
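As an aside on the filesystem-as-index point above, key lookup under key-per-LSN-range rotation is cheap precisely because the directory name encodes the range; a hypothetical sketch (the 32GB granularity and the pg_cryptokeys layout are assumptions for illustration, not the patch's actual scheme):

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical: one wrapped-key file per 32GB of WAL; the key needed
     * to decrypt a page is located from the page LSN alone, with the
     * directory tree acting as the index. */
    #define KEY_LSN_RANGE ((uint64_t) 32 * 1024 * 1024 * 1024)

    static void
    key_path_for_lsn(uint64_t lsn, char *buf, size_t buflen)
    {
        snprintf(buf, buflen, "pg_cryptokeys/lsn-range-%llu",
                 (unsigned long long) (lsn / KEY_LSN_RANGE));
    }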
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Sat, Jan 30, 2021 at 08:23:11AM -0500, Tom Kincaid wrote: > > I propose that we meet to discuss what approach we want to use to move TDE > > forward. We then start a new thread with a proposal on the approach > > and finalize it via community consensus. I will invite Bruce, Stephen and > > Masahiko to this meeting. If anybody else would like to participate in this > > discussion and subsequently in the effort to get TDE in PG1x, please let me > > know. Assuming Bruce, Stephen and Masahiko are down for this, I (or a volunteer > > from this meeting) will post the proposal for how we move this patch forward in > > another thread. Hopefully, we can get consensus on that and subsequently > > restart the execution of delivering this feature. > > We got complaints that decisions were not publicly discussed, or were > too long, so I am not sure this helps. If the notes are published afterwards as an explanation of why certain choices were made, I suspect it'd be reasonably well received. The concern about back-room discussions is more that decisions are made without explanation as to why; provided we avoid that, I believe they can be helpful. So, +1 for my part to have the conversation. Thanks, Stephen
Attachment
On Mon, Feb 1, 2021 at 06:34:53PM -0500, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Sat, Jan 30, 2021 at 08:23:11AM -0500, Tom Kincaid wrote: > > > I propose that we meet to discuss what approach we want to use to move TDE > > > forward. We then start a new thread with a proposal on the approach > > > and finalize it via community consensus. I will invite Bruce, Stephen and > > > Masahiko to this meeting. If anybody else would like to participate in this > > > discussion and subsequently in the effort to get TDE in PG1x, please let me > > > know. Assuming Bruce, Stephen and Masahiko are down for this, I (or a volunteer > > > from this meeting) will post the proposal for how we move this patch forward in > > > another thread. Hopefully, we can get consensus on that and subsequently > > > restart the execution of delivering this feature. > > > > We got complaints that decisions were not publicly discussed, or were > > too long, so I am not sure this helps. > > If the notes are published afterwards as an explanation of why certain > choices were made, I suspect it'd be reasonably well received. The > concern about back-room discussions is more that decisions are made > without explanation as to why; provided we avoid that, I believe they > can be helpful. Well, I thought that was what the wiki was, but I guess not. I did remove some of the decision logic recently since we had made a final decision. However, most of the questions were not covered on the wiki, since, as I said, everyone comes with a different need for details. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Mon, Feb 1, 2021 at 06:31:32PM -0500, Stephen Frost wrote: > * Bruce Momjian (bruce@momjian.us) wrote: > > The purpose of cluster file encryption is to prevent users with read > > access to the directories used to store database files and write-ahead > > log files from being able to access the data stored in those files. > > For example, when using cluster file encryption, users who have read > > access to the cluster directories for backup purposes will not be able > > to decrypt the data stored in these files. It also protects against > > decrypted data access after media theft. > > That's one valid use-case and it particularly makes sense to consider, > now that we support group read-access to the data cluster. The last Do enough people use group read-access for it to be useful? > line seems a bit unclear- I would update it to say: > Cluster file encryption also provides data-at-rest security, protecting > users from data loss should the physical media on which the cluster is > stored be stolen, improperly deprovisioned (not wiped or destroyed), or > otherwise end up in the hands of an attacker. I have split the section into three paragraphs, trimmed down some of the suggested text, and added it. Full version below. > > File system write access can allow for unauthorized file system data > > decryption if the writes can be used to weaken the system's security > > and this weakened system is later supplied with externally-stored keys. > > This isn't very clear as to exactly what the concern is or how an > attacker would be able to thwart the system if they had write access to > it. An attacker with write access could possibly attempt to replace the > existing keys, but with the key wrapping that we're using, that should > result in just a decryption failure (unless, of course, the attacker has > the actual KEK that was used, but that's not terribly interesting to > worry about since then they could just go access the files directly). Uh, well, they could modify postgresql.conf to change the script to save the secret returned by the script before returning it to the PG server. We could require postgresql.conf to be somewhere secure, but then how do we know that is secure? I just don't see a clean solution here, but the idea that you write and then wait for the key to show up seems like a very valid way of attack, and it took me a while to be able to articulate it. > Until and unless we solve the issue around storing the GCM tags for each > page, we will have the risk that an attacker could modify a page in a > manner that we wouldn't detect. This is the biggest concern that I have > currently with the existing TDE patch sets. Well, GCM certainly can detect page modification, but it can't detect removing pages from the end of the table, or, since the nonce is LSN/pageno, you could copy a page at the same offset from one table into another, particularly with partitioning where the tables have the same columns. We might be able to protect against the latter with some kind of table-id in the nonce, but I don't see how table truncation can be detected without adding a whole lot of overhead and complexity. And if we can't protect against those two, why bother with detecting single-page modifications? We have to do a full job for it to be useful.
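To spell out the nonce issue, here is a sketch of the kind of IV construction under discussion, assuming (purely for illustration) a 12-byte GCM IV packed from the page LSN and block number; nothing in it identifies the relation, which is what makes the cross-table page-copy attack possible:

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* page LSN, as in PostgreSQL */
    typedef uint32_t BlockNumber;

    /* Pack the LSN (8 bytes) and block number (4 bytes) into a 12-byte
     * GCM IV. Folding a relation identifier into the IV, or into GCM's
     * additional authenticated data, is one way to make a page copied
     * from another table at the same block number fail decryption. */
    static void
    page_iv(XLogRecPtr lsn, BlockNumber blkno, unsigned char iv[12])
    {
        for (int i = 0; i < 8; i++)
            iv[i] = (unsigned char) (lsn >> (56 - 8 * i));
        for (int i = 0; i < 4; i++)
            iv[8 + i] = (unsigned char) (blkno >> (24 - 8 * i));
    }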
> There are two options that I see around how to address that issue- either > we arrange to create space in the page for the tag, such as by making > the 'special' space on a page a bit bigger and making sure that > everything understands that, or we'll need to add another fork in which > we store the tags (and possibly other TDE/encryption related > information). If we go with a fork then it should be possible to do WAL > streaming from an unencrypted cluster to an encrypted one, which would > be pretty neat, but it means another fork and another page that has to > be read/written every time we modify a page. Getting some input into > the trade-offs here would be really helpful. I don't think it's really > reasonable to go out with TDE without having figured out the integrity > side. Certainly, when I review things like NIST 800-53, it's very clear > that the requirement is for both confidentiality *and* integrity. Wow, well, if they are both required, and we can't do both, is it valuable to do just one? Yes, we can do something later, but what if we have no idea how to implement the second part? Your fork idea above might need to store some table-id used for the nonce (to prevent copying from another table) and the number of pages in the table, which fixes the integrity check issue, but adds a lot of complexity and perhaps overhead. > > This also does not protect from users who have read access to system > > memory. This also does not detect or protect against users with write > > access from removing or modifying database files. > > The last seems a bit obvious, but the first sentence quoted above is > important to make clear. I might even say: > > All of the pages in memory and all of the keys which are used for the > encryption and decryption are stored in the clear in memory and > therefore an attacker who is able to read the memory allocated by > PostgreSQL would be able to decrypt the entire cluster. Same as above, full version below. > > Given what I said above, is the value of this feature for compliance, or > > for actual additional security? If it is just compliance, are we willing > > to add all of this code just for that, even if it has limited security > > value? We should answer this question now, and if we don't want it, > > let's document that so users know and can consider alternatives. > > The feature is for both compliance and additional security. While there > are other ways to achieve data-at-rest encryption, they are not always > available, for a variety of reasons. True. > > FYI, I don't think we can detect or protect against writers modifying > > the data files --- even if we could do it on a block level, they could > > remove trailing pages (might cause index lookup failures) or copy > > pages from other tables at the same offset. Therefore, I think we can > > only offer viewing security, not modification detection/prevention. > > Protecting against file modification isn't about finding some way to > make it so that an attacker isn't able to modify the files, it's about > detecting the case where an unauthorized modification has happened. > Clearly if an attacker has gained write access to the system then we > can't protect against the attacker using the access they've gained, but > we can in most cases detect it and that's what we should be doing. It > would be really unfortunate to end up with a solution here that only > provides confidentiality and doesn't address integrity at all, and I > don't really think it's *that* hard to do both.
> That said, if we must > work at this in pieces and we can get agreement to handle > confidentiality initially and then add integrity later, that might be > reasonable. See above. > > > I'm happy to go look over the KMS patches again if that'd be helpful and > > > to comment on the TDE PoC. I can also spend some time trying to improve > > > > I think we eventually need a full review of the TDE PoC, combined with > > the Cybertec patch, and the wiki, to get them all aligned. However, as > > I said already, let's get the KMS patch approved, even if we don't apply > > it now, so we know we are on an approved foundation. > > While the Cybertec patch is interesting, I'd really like to see > something that's a bit less invasive when it comes to how temporary > files are handled. In particular, I think it'd be possible to have an > API that's very similar to the existing one for serial reading and > writing of files which wouldn't require nearly as many changes to things > like reorderbuffer.c. I also believe there's some things we could do to > avoid having to modify quite as many places when it comes to LSN > assignment, so the base patch isn't as big. Yes, I think we would get the best ideas from all patches. > > > on each, as I've already done. A few of the larger concerns that I have > > > revolve around how to store integrity information (I've tried to find a > > > way to make room for such information in our existing page layout and, > > > perhaps unsurprisingly, it's far from trivial to do so in a way that will > > > avoid breaking the existing page layout, or where the same set of > > > binaries could work on both unencrypted pages and encrypted pages with > > > integrity validation information, and that's a problem that we really > > > > As stated above, I think we only need a byte or two for the hint bit > > counter (used in the IV), as I don't think the GCM verification bytes > > will add any additional security, and I bet we can find a byte or two. > > We do need a separate discussion on this, either here or privately. > > I have to disagree here- the GCM tag adds integrity which is really > quite important. Happy to chat about it independently, of course. Yeah, see above. > > > should consider trying to solve...), and how to automate key rotation > > > (one of the nice things about Bruce's approach to storing the keys is > > > that we're leveraging the filesystem as an index- it's easy to see how > > > we might extend the key-per-file approach to allow us to, say, have a > > > different key for every 32GB of LSN, but if we tried to put all of the > > > keys into a single file then we'd have to figure out an indexing > > > solution for it which would allow us to find the key we need to decrypt > > > a given page...). I tend to agree with Bruce that we need to take > > > > Yeah, yuck on that plan. I was very happy with how the per-version directory > > worked with scripts that needed to store matching state. > > I don't know that it's going to ultimately be the best answer, as we're > essentially using the filesystem as an index, as I mentioned above, but, > yeah, trying to do all of that ourselves during WAL replay doesn't seem > like it would be fun to try and figure out.
> This is an area that I > would think we'd be able to improve on in the future too- if someone > wants to spend the time coming up with a single-file format that is > indexed in some manner and still provides the guarantees that we need, > we could very likely teach pg_upgrade how to handle that and the data > set we're talking about here is quite small, even if we've got a bunch > of key rotation that's happened. I thought we were going to use failover to a standby as our data key rotation method. Here is the full doc part you wanted improved: The purpose of cluster file encryption is to prevent users with read access to the directories used to store database files and write-ahead log files from being able to access the data stored in those files. For example, when using cluster file encryption, users who have read access to the cluster directories for backup purposes will not be able to decrypt the data stored in these files. It also provides data-at-rest security, protecting users from data loss should the physical storage media be stolen or improperly erased before disposal. File system write access can allow for unauthorized file system data decryption if the writes can be used to weaken the system's security and this weakened system is later supplied with externally-stored keys. This also does not always detect if users with write access remove or modify database files. This also does not protect from users who have read access to system memory — all in-memory data pages and data encryption keys are stored unencrypted in memory, so an attacker who is able to read the PostgreSQL process's memory can decrypt the entire cluster. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Mon, Feb 1, 2021 at 07:47:57PM -0500, Bruce Momjian wrote: > On Mon, Feb 1, 2021 at 06:31:32PM -0500, Stephen Frost wrote: > > * Bruce Momjian (bruce@momjian.us) wrote: > > > The purpose of cluster file encryption is to prevent users with read > > > access to the directories used to store database files and write-ahead > > > log files from being able to access the data stored in those files. > > > For example, when using cluster file encryption, users who have read > > > access to the cluster directories for backup purposes will not be able > > > to decrypt the data stored in these files. It also protects against > > > decrypted data access after media theft. > > > > That's one valid use-case and it particularly makes sense to consider, > > now that we support group read-access to the data cluster. The last > > Do enough people use group read-access to be useful? I am thinking group read-access might be a requirement for cluster file encryption to be effective. > > line seems a bit unclear- I would update it to say: > > Cluster file encryption also provides data-at-rest security, protecting > > users from data loss should the physical media on which the cluster is > > stored be stolen, improperly deprovisioned (not wiped or destroyed), or > > otherwise ends up in the hands of an attacker. > > I have split the section into three paragraphs, trimmed down some of the > suggested text, and added it. Full version below. Here is an updated doc description of memory reading: This also does not protect against users who have read access to database process memory — all in-memory data pages and data encryption keys are stored unencrypted in memory, so an attacker who --> is able to read memory can decrypt the entire cluster. The Postgres --> operating system user and the operating system administrator, e.g., --> the <literal>root</literal> user, have such access. > > > File system write access can allow for unauthorized file system data > > > decryption if the writes can be used to weaken the system's security > > > and this weakened system is later supplied with externally-stored keys. > > > > This isn't very clear as to exactly what the concern is or how an > > attacker would be able to thwart the system if they had write access to > > it. An attacker with write access could possibly attempt to replace the > > existing keys, but with the key wrapping that we're using, that should > > result in just a decryption failure (unless, of course, the attacker has > > the actual KEK that was used, but that's not terribly interesting to > > worry about since then they could just go access the files directly). > > Uh, well, they could modify postgresql.conf to change the script to save > the secret returned by the script before returning it to the PG server. > We could require postgresql.conf to be somewhere secure, but then how do > we know that is secure? I just don't see a clean solution here, but the > idea that you write and then wait for the key to show up seems like a > very valid way of attack, and it took me a while to be able to > articulate it. Let's suppose you lock down your cluster --- the non-PGDATA files are owned by root, postgresql.conf and pg_hba.conf are moved out of PGDATA and are not writable by the database OS user, or we have the PGDATA directory on another server, so the adversary can only write to the remote PGDATA directory. What can they do? 
Well, they can't modify pg_proc to add a shared library since pg_proc is encrypted, so we have to focus on files needed before encryption starts or files that can't be easily encrypted. They could create postgresql.conf.auto in PGDATA, and modify cluster_key_command to capture the key, or they could modify preload libraries or archive command to call a command to read memory as the PG OS user and write the key out somewhere, or use the key to rewrite the database files --- those wouldn't even need a database restart, just a reload. They could also modify pg_xact files so that, even though the heap/index files are encrypted, how the contents of those files are interpreted would change. In summary, to detect malicious user writes, you would need to protect the files used before encryption starts (root owned or owned by another user?), and encrypt all files after encryption starts --- any other approach would probably leave open attack vectors, and I don't think there is sufficient community desire to add such boundaries. How do other database systems guarantee to detect malicious writes? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Mon, Feb 1, 2021 at 07:47:57PM -0500, Bruce Momjian wrote: > > On Mon, Feb 1, 2021 at 06:31:32PM -0500, Stephen Frost wrote: > > > * Bruce Momjian (bruce@momjian.us) wrote: > > > > The purpose of cluster file encryption is to prevent users with read > > > > access to the directories used to store database files and write-ahead > > > > log files from being able to access the data stored in those files. > > > > For example, when using cluster file encryption, users who have read > > > > access to the cluster directories for backup purposes will not be able > > > > to decrypt the data stored in these files. It also protects against > > > > decrypted data access after media theft. > > > > > > That's one valid use-case and it particularly makes sense to consider, > > > now that we support group read-access to the data cluster. The last > > > > Do enough people use group read-access to be useful? > > I am thinking group read-access might be a requirement for cluster file > encryption to be effective. People certainly do use group read-access, but I don't see that as being a requirement for cluster file encryption to be effective, it's just one thing TDE can address, among others, as discussed. > > > line seems a bit unclear- I would update it to say: > > > Cluster file encryption also provides data-at-rest security, protecting > > > users from data loss should the physical media on which the cluster is > > > stored be stolen, improperly deprovisioned (not wiped or destroyed), or > > > otherwise ends up in the hands of an attacker. > > > > I have split the section into three paragraphs, trimmed down some of the > > suggested text, and added it. Full version below. > > Here is an updated doc description of memory reading: > > This also does not protect against users who have read access to > database process memory — all in-memory data pages and data > encryption keys are stored unencrypted in memory, so an attacker who > --> is able to read memory can decrypt the entire cluster. The Postgres > --> operating system user and the operating system administrator, e.g., > --> the <literal>root</literal> user, have such access. That's helpful, +1. > > > > File system write access can allow for unauthorized file system data > > > > decryption if the writes can be used to weaken the system's security > > > > and this weakened system is later supplied with externally-stored keys. > > > > > > This isn't very clear as to exactly what the concern is or how an > > > attacker would be able to thwart the system if they had write access to > > > it. An attacker with write access could possibly attempt to replace the > > > existing keys, but with the key wrapping that we're using, that should > > > result in just a decryption failure (unless, of course, the attacker has > > > the actual KEK that was used, but that's not terribly interesting to > > > worry about since then they could just go access the files directly). > > > > Uh, well, they could modify postgresql.conf to change the script to save > > the secret returned by the script before returning it to the PG server. > > We could require postgresql.conf to be somewhere secure, but then how do > > we know that is secure? I just don't see a clean solution here, but the > > idea that you write and then wait for the key to show up seems like a > > very valid way of attack, and it took me a while to be able to > > articulate it. 
postgresql.conf isn't always writable by the postgres user, though postgresql.auto.conf is likely to always be. I'm not sure how much of a concern that is, but if we wanted to take steps to explicitly address this issue, we could have some kind of 'secure' postgresql.conf file which we would encourage users to make owned by root and whose values wouldn't be allowed to be overridden once set. > Let's suppose you lock down your cluster --- the non-PGDATA files are > owned by root, postgresql.conf and pg_hba.conf are moved out of PGDATA > and are not writable by the database OS user, or we have the PGDATA > directory on another server, so the adversary can only write to the > remote PGDATA directory. > > What can they do? Well, they can't modify pg_proc to add a shared > library since pg_proc is encrypted, so we have to focus on files needed > before encryption starts or files that can't be easily encrypted. This isn't accurate- just because it's encrypted doesn't mean they can't modify it. That's exactly why integrity is important, because an attacker absolutely could modify the files directly and potentially exploit the system through those modifications. > They could create postgresql.conf.auto in PGDATA, and modify > cluster_key_command to capture the key, or they could modify preload > libraries or archive command to call a command to read memory as the PG > OS user and write the key out somewhere, or use the key to rewrite the > database files --- those wouldn't even need a database restart, just a > reload. They would need to actually be able to effect that reload though. This is where the question comes up as to just what attack vector we're trying to address. It's certainly possible that an attacker has only access to the stored data in an off-line fashion (eg: a hard drive that was mistakenly thrown away without being properly wiped) and that's one of the cases which is addressed by cluster encryption.
While such mitigations aren't perfect, they can be enough to allow approval of a system to go operational (ultimately it comes down to what the relevant security officer is willing to accept). > How do other database systems guarantee to detect malicious writes? I doubt anyone would actually stipulate that they *guarantee* detection of malicious writes, and I don't think we should either, but certainly the other systems which provide TDE do so in a manner that provides both confidentiality and integrity. The big O, at least, documents that they use SHA-1 for their integrity checking, though they also provide an option which disables it. If we used an additional fork to provide the integrity then we could also give users the option of either having integrity included or not. Thanks, Stephen
Attachment
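To make the fork option concrete, a sketch of how per-page GCM tags could be laid out in a separate relation fork (the layout is an assumption for illustration, not from any posted patch):

    #include <stdint.h>

    #define BLCKSZ        8192
    #define GCM_TAG_LEN   16
    #define TAGS_PER_PAGE (BLCKSZ / GCM_TAG_LEN)    /* 512 */

    /* Fork page and byte offset holding the tag for data block blkno.
     * Each fork page covers 512 data pages, so the fork adds roughly
     * 0.2% to the relation's on-disk size, at the cost of one extra
     * fork-page read/write per modified data page. */
    static inline void
    tag_location(uint32_t blkno, uint32_t *fork_page, uint32_t *offset)
    {
        *fork_page = blkno / TAGS_PER_PAGE;
        *offset = (blkno % TAGS_PER_PAGE) * GCM_TAG_LEN;
    }

That extra read/write per modified page is the amplification cost weighed above against squeezing the tag into the page's special space.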
On Wed, Feb 3, 2021 at 10:33:57AM -0500, Stephen Frost wrote: > > I am thinking group read-access might be a requirement for cluster file > > encryption to be effective. > > People certainly do use group read-access, but I don't see that as being > a requirement for cluster file encryption to be effective, it's just one > thing TDE can address, among others, as discussed. Agreed. > > This also does not protect against users who have read access to > > database process memory — all in-memory data pages and data > > encryption keys are stored unencrypted in memory, so an attacker who > > --> is able to read memory can decrypt the entire cluster. The Postgres > > --> operating system user and the operating system administrator, e.g., > > --> the <literal>root</literal> user, have such access. > > That's helpful, +1. Good. > > > Uh, well, they could modify postgresql.conf to change the script to save > > > the secret returned by the script before returning it to the PG server. > > > We could require postgresql.conf to be somewhere secure, but then how do > > > we know that is secure? I just don't see a clean solution here, but the > > > idea that you write and then wait for the key to show up seems like a > > > very valid way of attack, and it took me a while to be able to > > > articulate it. > > postgresql.conf isn't always writable by the postgres user, though > postgresql.auto.conf is likely to always be. I'm not sure how much of a > concern that is, but if we wanted to take steps to explicitly address > this issue, we could have some kind of 'secure' postgresql.conf file > which we would encourage users to make owned by root and whose values > wouldn't be allowed to be overridden once set. Well, I think there is a lot more than postgresql.conf to worry about --- see below. > > Let's suppose you lock down your cluster --- the non-PGDATA files are > > owned by root, postgresql.conf and pg_hba.conf are moved out of PGDATA > > and are not writable by the database OS user, or we have the PGDATA > > directory on another server, so the adversary can only write to the > > remote PGDATA directory. > > > > What can they do? Well, they can't modify pg_proc to add a shared > > library since pg_proc is encrypted, so we have to focus on files needed > > before encryption starts or files that can't be easily encrypted. > > This isn't accurate- just because it's encrypted doesn't mean they can't > modify it. That's exactly why integrity is important, because an > attacker absolutely could modify the files directly and potentially > exploit the system through those modifications. They can't easily modify it to inject a shared object reference into a system column, was my point --- also see below. > > They could create postgresql.conf.auto in PGDATA, and modify > > cluster_key_command to capture the key, or they could modify preload > > libraries or archive command to call a command to read memory as the PG > > OS user and write the key out somewhere, or use the key to rewrite the > > database files --- those wouldn't even need a database restart, just a > > reload. > > They would need to actually be able to effect that reload though. This > is where the question comes up as to just what attack vector we're > trying to address. It's certainly possible that an attacker has only > access to the stored data in an off-line fashion (eg: a hard drive that > was mistakenly thrown away without being properly wiped) and that's one > of the cases which is addressed by cluster encryption.
> An attacker > might have access to the LUN that PG is running on but not to the > running server itself, which it seems like is what you're contemplating > here. That's a much harder attack vector to fully protect against and > we might need to do more than we're currently contemplating to address > it- but I don't think we necessarily must solve for all cases in the > first pass at this. See below. > > They could also modify pg_xact files so that, even though the heap/index > > files are encrypted, how the contents of those files are interpreted > > would change. > > Yes, ideally, we'd encrypt/integrity check just about every part of the > running system and that's one area the patch doesn't address- things > like temporary files and other parts. It is worse than that --- see below. > > In summary, to detect malicious user writes, you would need to protect > > the files used before encryption starts (root owned or owned by another > > user?), and encrypt all files after encryption starts --- any other > > approach would probably leave open attack vectors, and I don't think > > there is sufficient community desire to add such boundaries. > > There's going to be some attack vectors that TDE doesn't address. We > should identify and document those where we're able to. We could offer > up some mitigations (eg: strongly suggest monitoring of key utilization > such that if the KEK is used without a reboot of the system or similar > happening that it is reported and someone goes to look into it). While > such mitigations aren't perfect, they can be enough to allow approval of > a system to go operational (ultimately it comes down to what the > relevant security officer is willing to accept). I ended up adding to the feature description in the docs to clearly outline what this feature provides, and what it does not: The purpose of cluster file encryption is to prevent users with read access to the directories used to store database files and write-ahead log files from being able to access the data stored in those files. For example, when using cluster file encryption, users who have read access to the cluster directories for backup purposes will not be able to decrypt the data stored in these files. Read-only access for a group of users can be enabled using the <application>initdb</application> <option>--allow-group-access</option> option. Cluster file encryption also provides data-at-rest security, protecting users from data loss should the physical storage media be stolen or improperly erased before disposal. Cluster file encryption does not protect against unauthorized file system writes. Such writes can allow data decryption if used to weaken the system's security and the weakened system is later supplied with the externally-stored cluster encryption key. This also does not always detect if users with write access remove or modify database files. This also does not protect against users who have read access to database process memory because all in-memory data pages and data encryption keys are stored unencrypted in memory. Therefore, an attacker who is able to read memory can read the data encryption keys and decrypt the entire cluster. The Postgres operating system user and the operating system administrator, e.g., the <literal>root</literal> user, have such access. > > How do other database systems guarantee to detect malicious writes?
> > I doubt anyone would actually stipulate that they *guarantee* detection > of malicious writes, and I don't think we should either, but certainly > the other systems which provide TDE do so in a manner that provides both > confidentiality and integrity. The big O, at least, documents that they > use SHA-1 for their integrity checking, though they also provide an > option which disables it. If we used an additional fork to provide the > integrity then we could also give users the option of either having > integrity included or not. I thought more about this at an abstract level. If you are worried about malicious users _reading_ data, you can encrypt the sensitive parts, e.g., heap/index/WAL/temp, and leave some unencrypted, like pg_xact. Reading pg_xact is pretty useless if you can't read the heap pages. Reading postgresql.conf.auto, the external key retrieval scripts, etc. are useless too. However, when you are trying to protect against write access, you have to really encrypt _everything_, because the system is very interdependent, and changing one part where _reading_ is safe can affect other parts that must remain secure. You can modify postgresql.conf.auto to capture the cluster key, or maybe even change something to dump out the data keys from memory. You can modify pg_xact to affect how heap pages are interpreted. My point is that being able to detect malicious heap/index writes really doesn't gain us any security since there are much more serious writes that can be made, and protecting against those more serious writes would cause unacceptable Postgres source code changes which will probably never be implemented. My summary point is that we should clearly spell out exactly what protections we are offering, and an estimate of the code impact, before moving forward so the community can agree it is worthwhile to add this. Also, looking at the PCI DSS 3.2.1 spec from May 2018 (click-through required): https://www.pcisecuritystandards.org/document_library?category=pcidss&document=pci_dss#agreement or open PDF link here: https://commerce.uwo.ca/pdf/PCI_DSS_v3-2-1.pdf Page 41 covers what they expect from an encrypted file system, and from key encryption key and data encryption keys. There is a v4.0 spec in draft but I can't find a PDF available online. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
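For reference on the KEK/DEK split that PCI DSS describes, and the key wrapping mentioned earlier in this thread, a minimal sketch using OpenSSL's RFC 3394 AES key wrap (the function name and fixed key sizes are illustrative; error checking omitted):

    #include <openssl/evp.h>

    /* Wrap a 32-byte data encryption key (DEK) under a key encryption
     * key (KEK); only the wrapped form is stored in the cluster, and the
     * KEK is supplied externally (e.g., by cluster_key_command). Key
     * wrap output is the input length plus 8 bytes of integrity data,
     * so unwrapping with a wrong or tampered KEK fails cleanly. */
    static int
    wrap_dek(const unsigned char kek[32], const unsigned char dek[32],
             unsigned char wrapped[40])
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int     len, outlen;

        /* EVP refuses wrap modes unless this flag is set */
        EVP_CIPHER_CTX_set_flags(ctx, EVP_CIPHER_CTX_FLAG_WRAP_ALLOW);
        EVP_EncryptInit_ex(ctx, EVP_aes_256_wrap(), NULL, kek, NULL);
        EVP_EncryptUpdate(ctx, wrapped, &len, dek, 32);
        outlen = len;
        EVP_EncryptFinal_ex(ctx, wrapped + len, &len);
        outlen += len;
        EVP_CIPHER_CTX_free(ctx);
        return outlen;      /* 40 on success */
    }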
On Wed, Feb 3, 2021 at 01:16:32PM -0500, Bruce Momjian wrote: > On Wed, Feb 3, 2021 at 10:33:57AM -0500, Stephen Frost wrote: > > I doubt anyone would actually stipulate that they *guarantee* detection > > of malicious writes, and I don't think we should either, but certainly > > the other systems which provide TDE do so in a manner that provides both > > confidentiality and integrity. The big O, at least, documents that they > > use SHA-1 for their integrity checking, though they also provide an > > option which disables it. If we used an additional fork to provide the > > integrity then we could also give users the option of either having > > integrity included or not. > > I thought more about this at an abstract level. If you are worried > about malicious users _reading_ data, you can encrypt the sensitive > parts, e.g., heap/index/WAL/temp, and leave some unencrypted, like > pg_xact. Reading pg_xact is pretty useless if you can't read the heap > pages. Reading postgresql.conf.auto, the external key retrieval > scripts, etc. are useless too. > > However, when you are trying to protect against write access, you have > to really encrypt _everything_, because the system is very > interdependent, and changing one part where _reading_ is safe can affect > other parts that must remain secure. You can modify > postgresql.conf.auto to capture the cluster key, or maybe even change > something to dump out the data keys from memory. You can modify pg_xact > to affect how heap pages are interpreted. > > My point is that being able to detect malicious heap/index writes really > doesn't gain us any security since there are much more serious writes > that can be made, and protecting against those more serious writes would > cause unacceptable Postgres source code changes which will probably > never be implemented. I looked further. First, I don't think we are going to be able to protect at all against users who have _write_ access on the OS running Postgres. It would be too easy to just read process memory, or modify ~/.profile. I think the only possible option would be to try to give some protection against users with write access to PGDATA, where PGDATA is on another server, e.g., via NFS. We can't protect against all db modifications, for reasons outlined above, but we might be able to protect against write users being able to _read_ the keys and therefore decrypt data. Looking at PGDATA, we have, at least:

postgresql.conf
pg_hba.conf
postmaster.opts
postgresql.conf.auto

which could be exploited to cause reading of the cluster key or process memory. The first two can be located outside of PGDATA but the last two currently cannot. The problem is that this is a limited use-case, and there are probably other problems I am not considering. It seems too error-prone to even try to protect against this, but it does limit the value of this feature. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Wed, Feb 3, 2021 at 01:16:32PM -0500, Bruce Momjian wrote: > > On Wed, Feb 3, 2021 at 10:33:57AM -0500, Stephen Frost wrote: > > > I doubt anyone would actually stipulate that they *guarantee* detection > > > of malicious writes, and I don't think we should either, but certainly > > > the other systems which provide TDE do so in a manner that provides both > > > confidentiality and integrity. The big O, at least, documents that they > > > use SHA-1 for their integrity checking, though they also provide an > > > option which disables it. If we used an additional fork to provide the > > > integrity then we could also give users the option of either having > > > integrity included or not. > > > > I thought more about this at an abstract level. If you are worried > > about malicious users _reading_ data, you can encrypt the sensitive > > parts, e.g., heap/index/WAL/temp, and leave some unencrypted, like > > pg_xact. Reading pg_xact is pretty useless if you can't read the heap > > pages. Reading postgresql.conf.auto, the external key retrieval > > scripts, etc. are useless too. > > > > However, when you are trying to protect against write access, you have > > to really encrypt _everything_, because the system is very > > interdependent, and changing one part where _reading_ is safe can affect > > other parts that must remain secure. You can modify > > postgresql.conf.auto to capture the cluster key, or maybe even change > > something to dump out the data keys from memory. You can modify pg_xact > > to affect how heap pages are interpreted. > > > > My point is that being able to detect malicious heap/index writes really > > doesn't gain us any security since there are much more serious writes > > that can be made, and protecting against those more serious writes would > > cause unacceptable Postgres source code changes which will probably > > never be implemented. > > I looked further. First, I don't think we are going to be able to > protect at all against users who have _write_ access on the OS running > Postgres. It would be too easy to just read process memory, or modify > ~/.profile. I don't think anyone is really expecting that we'll be able to come up with a way to protect against attackers who have fully compromised the OS to the point where they can read/write OS memory, or even the PG unix account. I'm certainly not suggesting that there is a way to do that or that it's an attack vector we are trying to address here. > I think the only possible option would be to try to give some protection > against users with write access to PGDATA, where PGDATA is on another > server, e.g., via NFS. We can't protect against all db modifications, > for reasons outlined above, but we might be able to protect against > write users being able to _read_ the keys and therefore decrypt data. That certainly seems like a worthy goal. I also really want to stress that I don't think anyone is expecting us to be able to "protect" against users who have write access to the system- write access to files is really an OS level issue and there's not much we can do once someone has found a way to circumvent that (we can try to help the OS by doing things like using SELinux, of course, but that's a different discussion). At the point that an attacker has gotten write access, the best we can do is complain loudly if we detect unexpected modifications. 
Ideally, we would be able to do that for everything, but certainly doing it for the principal data would go a long way and is far better than nothing. Now, that said, I don't know that we absolutely must have that in the first release of TDE support for PG. In thinking about this, I would say we have two basic options:

- Keep the same page layout, requiring that integrity data must be stored elsewhere, eg: another fork
- Use a different page layout when TDE is enabled, making room for integrity information to be included on each page

There's a set of pros and cons for these:

Same page layout pros:
- Simpler and less impactful on the overall system
- With integrity data stored elsewhere, could possibly be something that's optional to enable/disable on a per-table basis
- Potential to do things like have an unencrypted primary and an encrypted replica, providing an easier migration path

Same page layout cons:
- Integrity information must be stored elsewhere
- Increases the reads/memory that is needed, since we have to look up the integrity information on every read.
- Increases the writes that have to be done since we'd be dirtying multiple pages instead of just the main fork (though this isn't exactly unusual- there's the vis map, and indexes, etc, but it'd be yet another thing we're updating)

Different page layout pros:
- Avoids extra reads/writes for the integrity information
- Once done, this might provide us with a way to add other page level information in the future while still being able to work with older page formats

Different page layout cons:
- Wouldn't be able to have an encrypted replica follow an unencrypted primary, migration would require logical replication or similar
- More core code changes, and extensions, to handle a different page layout when cluster is initialized with TDE+integrity

While I've been thinking about this, I have to admit that either approach could be done later and it's probably best to accept that and push it off until we have the initial TDE work done. I had been thinking that changing the page layout would be better to do in the same release as TDE, but having been playing around with that approach for a while it just seems like it's too much to try and include at the same time. We should be sure to be clear and document that though. > Looking at PGDATA, we have, at least: > > postgresql.conf > pg_hba.conf > postmaster.opts > postgresql.conf.auto > > which could be exploited to cause reading of the cluster key or process > memory. The first two can be located outside of PGDATA but the last two > currently cannot. There are certainly already users out there who intentionally make postgresql.auto.conf owned by root/root, zero-sized, and monitor it to make sure that it isn't updated. postgresql.conf actually is also often monitored for changes by a change management system of some kind and may also be owned by root/root already. I suspect that postmaster.opts is not monitored as closely, but that's probably due more to the fact that we don't really document it as a configuration system file and it can't be put outside of PGDATA. Having a way to move it outside of PGDATA or just not have it be used at all (do we really need it..?) would be another way to address that risk though. > The problem is that this is a limited use-case, and there are probably > other problems I am not considering. It seems too error-prone to even > try protect against this, but it does limit the value of this feature.
I don't think we need to consider it a failing of the capability every time we think of something else that really should be addressed when considering this attack vector. We aren't going to be releasing this and saying "we guarantee that this protects against an attacker who has write access to PGDATA". Instead, we would be documenting "XYZ, when enabled, is used to validate the integrity of ABC data. Individuals concerned with unexpected modifications to their system should consider independently monitoring files D, E, F. Note that there is currently no explicit protection against or detection of unexpected or malicious modification of other parts of the system such as the transaction record.", or something along those lines. Hardening guidelines would also recommend things like having postgresql.conf moved out of PGDATA and owned by root/root, etc. Users would then have the ability to evaluate if what we're providing is sufficient for their requirements or not, and to then provide us with feedback about what they feel is still missing before they would be able to use PG for their use-case. To that end, I would hope that we'd eventually develop a way to detect unexpected modifications in other parts of the system, both as a way to discover filesystem corruption earlier but also in the case of a malicious attacker. The latter would involve more work, of course, but it doesn't seem insurmountable. I don't think it's necessary to get into that today though. I am concerned when statements are made that we are just never going to do something-or-other because we think it'd be a lot of source code changes or won't be completely perfect against every attack we can think of. There was a good bit of that with RLS which also made it a particularly difficult feature to push forward, but, thanks to clearly documenting what was and wasn't addressed, clearly admitting that there are covert channel attacks that might be possible due to how it works, it's been pretty well accepted and there hasn't been some huge number of issues or CVEs that have been associated with it or mismatched expectations that users of it have had regarding what it does and doesn't protect against. Thanks, Stephen
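As a concrete illustration of the "integrity data in another fork" option above, a per-page MAC could be computed roughly as below. This is only a minimal sketch using OpenSSL's HMAC; the key handling and the fork layout are assumptions for illustration, not code from the posted patches.

    #include <openssl/evp.h>
    #include <openssl/hmac.h>

    /*
     * Sketch: compute an HMAC-SHA256 over a page image, to be stored in a
     * hypothetical integrity fork.  "key" and "keylen" stand in for a
     * dedicated integrity key; "mac" must have room for at least 32 bytes.
     */
    static void
    compute_page_mac(const unsigned char *page, size_t pagesize,
                     const unsigned char *key, int keylen,
                     unsigned char *mac, unsigned int *maclen)
    {
        HMAC(EVP_sha256(), key, keylen, page, pagesize, mac, maclen);
    }

Verification on read would recompute the MAC and compare it with the stored copy; a mismatch would indicate an unexpected modification.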
On Fri, Feb 5, 2021 at 01:14:35PM -0500, Stephen Frost wrote: > > I looked further. First, I don't think we are going to be able to > > protect at all against users who have _write_ access on the OS running > > Postgres. It would be too easy to just read process memory, or modify > > ~/.profile. > > I don't think anyone is really expecting that we'll be able to come up > with a way to protect against attackers who have fully compromised the > OS to the point where they can read/write OS memory, or even the PG unix > account. I'm certainly not suggesting that there is a way to do that or > that it's an attack vector we are trying to address here. OK, that's good. > > I think the only possible option would be to try to give some protection > > against users with write access to PGDATA, where PGDATA is on another > > server, e.g., via NFS. We can't protect against all db modifications, > > for reasons outlined above, but we might be able to protect against > > write users being able to _read_ the keys and therefore decrypt data. > > That certainly seems like a worthy goal. I also really want to stress > that I don't think anyone is expecting us to be able to "protect" > against users who have write access to the system- write access to files > is really an OS level issue and there's not much we can do once someone > has found a way to circumvent that (we can try to help the OS by doing > things like using SELinux, of course, but that's a different > discussion). At the point that an attacker has gotten write access, the Agreed. > best we can do is complain loudly if we detect unexpected modifications. > Ideally, we would be able to do that for everything, but certainly doing > it for the principal data would go a long way and is far better than > nothing. I disagree. If we only warn about some parts, attackers will just attack other parts. It will also give users a false sense of security. If you can get the keys, it doesn't matter if there is one or ten ways of getting them, if they are all of equal difficulty. Same with modifying the system files. > Now, that said, I don't know that we absolutely must have that in the > first release of TDE support for PG. In thinking about this, I would > say we have two basic options: I skipped this part since I think we need a fully secure plan before considering page format changes. We don't need it for our currently outlined feature-set. > > Looking at PGDATA, we have, at least: > > > > postgresql.conf > > pg_hba.conf > > postmaster.opts > > postgresql.conf.auto > > > > which could be exploited to cause reading of the cluster key or process > > memory. The first two can be located outside of PGDATA but the last two > > currently cannot. > > There are certainly already users out there who intentionally make > postgresql.auto.conf owned by root/root, zero-sized, and monitor it to > make sure that it isn't updated. postgresql.conf actually is also often > monitored for changes by a change management system of some kind and may > also be owned by root/root already. I suspect that postmaster.opts is > not monitored as closely, but that's probably due more to the fact that > we don't really document it as a configuration system file and it can't > be put outside of PGDATA. Having a way to move it outside of PGDATA or > just not have it be used at all (do we really need it..?) would be > another way to address that risk though. I think postmaster.opts is used for pg_ctl reload. 
I think the question is whether the value of maliciously writable PGDATA being able to read the keys, while not protecting or detecting all malicious writes/db-modifications, is worth it. And, while I listed the files above, there are probably many more ways to break the system. > > The problem is that this is a limited use-case, and there are probably > > other problems I am not considering. It seems too error-prone to even > > try protect against this, but it does limit the value of this feature. > > I don't think we need to consider it a failing of the capability every > time we think of something else that really should be addressed when > considering this attack vector. We aren't going to be releasing this > and saying "we guarantee that this protects against an attacker who has > write access to PGDATA". Instead, we would be documenting "XYZ, when > enabled, is used to validate the integrity of ABC data. Individuals > concerned with unexpected modifications to their system should consider > independently monitoring files D, E, F. Note that there is currently no > explicit protection against or detection of unexpected or malicious > modification of other parts of the system such as the transaction > record.", or something along those lines. Hardening guidelines would > also recommend things like having postgresql.conf moved out of PGDATA > and owned by root/root, etc. Users would then have the ability to > evaluate if what we're providing is sufficient for their requirements > or not, and to then provide us with feedback about what they feel is > still missing before they would be able to use PG for their use-case. See above --- I think we can't just say we close _most_ of the doors here, and I am afraid there will be more and more cases we miss. It feels too open-ended. For example, imagine modifying a PGDATA file so it is a symbolic link to another file that is not in PGDATA? Seems that would break all sorts of security restrictions, and that's just a new idea I came up with today. What I don't want to do is to add a lot of complexity to the system, and not really gain any meaningful security. > To that end, I would hope that we'd eventually develop a way to detect > unexpected modifications in other parts of the system, both as a way to > discover filesystem corruption earlier but also in the case of a > malicious attacker. The latter would involve more work, of course, but > it doesn't seem insurmountable. I don't think it's necessary to get > into that today though. > > I am concerned when statements are made that we are just never going to > do something-or-other because we think it'd be a lot of source code > changes or won't be completely perfect against every attack we can think > of. There was a good bit of that with RLS which also made it a > particularly difficult feature to push forward, but, thanks to clearly > documenting what was and wasn't addressed, clearly admitting that there > are covert channel attacks that might be possible due to how it works, > it's been pretty well accepted and there hasn't been some huge number of > issues or CVEs that have been associated with it or mismatched > expectations that users of it have had regarding what it does and > doesn't protect against. Oh, that is a very meaningful lesson. I do think that for cluster file encryption, if we have a vulnerability, someone will write a script for it, and it could be widely exploited. I think RLS gets a little more flexibility since someone is already in the database when using it. 
I am not against adding more security features, but I need agreement that the existing features/protections, with the planned source code impact, is acceptable. I don't want to go down the road of getting the feature with the _hope_ that later changes will make the feature acceptable --- for me, either what we are planning now is acceptable given its code impact, or it is not. If the feature is not sufficient, then I would not move forward until we had a reasonable plan of when the feature would have acceptable usefulness, and acceptable source code impact. The big problem, as you outlined above, is that adding to the protections, like malicious write detection for a remote PGDATA, greatly increases the code impact, and ultimately, might be unsolvable. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Fri, Feb 5, 2021 at 01:14:35PM -0500, Stephen Frost wrote: > > > I looked further. First, I don't think we are going to be able to > > > protect at all against users who have _write_ access on the OS running > > > Postgres. It would be too easy to just read process memory, or modify > > > ~/.profile. > > > > I don't think anyone is really expecting that we'll be able to come up > > with a way to protect against attackers who have fully compromised the > > OS to the point where they can read/write OS memory, or even the PG unix > > account. I'm certainly not suggesting that there is a way to do that or > > that it's an attack vector we are trying to address here. > > OK, that's good. > > > > I think the only possible option would be to try to give some protection > > > against users with write access to PGDATA, where PGDATA is on another > > > server, e.g., via NFS. We can't protect against all db modifications, > > > for reasons outlined above, but we might be able to protect against > > > write users being able to _read_ the keys and therefore decrypt data. > > > > That certainly seems like a worthy goal. I also really want to stress > > that I don't think anyone is expecting us to be able to "protect" > > against users who have write access to the system- write access to files > > is really an OS level issue and there's not much we can do once someone > > has found a way to circumvent that (we can try to help the OS by doing > > things like using SELinux, of course, but that's a different > > discussion). At the point that an attacker has gotten write access, the > > Agreed. > > > best we can do is complain loudly if we detect unexpected modifications. > > Ideally, we would be able to do that for everything, but certainly doing > > it for the principal data would go a long way and is far better than > > nothing. > > I disagree. If we only warn about some parts, attackers will just > attack other parts. It will also give users a false sense of security. > If you can get the keys, it doesn't matter if there is one or ten ways > of getting them, if they are all of equal difficulty. Same with > modifying the system files. I agree that there's an additional concern around the keys and that we would want to have a solid way to avoid having them be compromised. We might not be able to guarantee that attackers who can write to PGDATA can't gain access to the keys in the first implementation, but I don't see that as a problem- the TDE capability would still provide protection against improper disposal and some other use-cases, which is useful. I do think it'd be useful to consider how we could provide protection against an attacker who has write access from being able to acquire the keys, but that seems like a tractable problem. Following that, we could look at how to provide integrity checking for principal data, using one of the outlined approaches or maybe something else entirely. Lastly, perhaps we can find a way to provide confidentiality and integrity for other parts of the system. Each of these steps is a useful improvement in its own right and will open up more opportunities for PG to be used. It wasn't my intent to suggest otherwise, but rather to see if there was an opportunity to get a few things done at once if it wasn't too impactful. I agree now that it makes sense to focus on the first step, so we can hopefully get that accomplished. 
> > There are certainly already users out there who intentionally make > > postgresql.auto.conf owned by root/root, zero-sized, and monitor it to > > make sure that it isn't updated. postgresql.conf actually is also often > > monitored for changes by a change management system of some kind and may > > also be owned by root/root already. I suspect that postmaster.opts is > > not monitored as closely, but that's probably due more to the fact that > > we don't really document it as a configuration system file and it can't > > be put outside of PGDATA. Having a way to move it outside of PGDATA or > > just not have it be used at all (do we really need it..?) would be > > another way to address that risk though. > > I think postmaster.opts is used for pg_ctl reload. I think the question > is whether the value of maliciously writable PGDATA being able to read > the keys, while not protecting or detecting all malicious > writes/db-modifications, is worth it. And, while I listed the files > above, there are probably many more ways to break the system. postmaster.opts is used for pg_ctl restart, just to be clear. As I try to state above- I don't think we need to provide any specific protections against a malicious writer for plain encryption to be useful for some important use-cases. Providing protections against a malicious writer being able to access the keys is certainly important as, if they acquire the keys, they would be able to trivially both decrypt the data and modify any other data they wished to, so it seems likely that solving that would be the first step towards protecting against a malicious writer, after which it's useful to think about what else we could provide integrity checking of, and principal data strikes me as the next sensible step, followed by what's essentially metadata. > > > The problem is that this is a limited use-case, and there are probably > > > other problems I am not considering. It seems too error-prone to even > > > try protect against this, but it does limit the value of this feature. > > > > I don't think we need to consider it a failing of the capability every > > time we think of something else that really should be addressed when > > considering this attack vector. We aren't going to be releasing this > > and saying "we guarantee that this protects against an attacker who has > > write access to PGDATA". Instead, we would be documenting "XYZ, when > > enabled, is used to validate the integrity of ABC data. Individuals > > concerned with unexpected modifications to their system should consider > > independently monitoring files D, E, F. Note that there is currently no > > explicit protection against or detection of unexpected or malicious > > modification of other parts of the system such as the transaction > > record.", or something along those lines. Hardening guidelines would > > also recommend things like having postgresql.conf moved out of PGDATA > > and owned by root/root, etc. Users would then have the ability to > > evaluate if what we're providing is sufficient for their requirements > > or not, and to then provide us with feedback about what they feel is > > still missing before they would be able to use PG for their use-case. > > See above --- I think we can't just say we close _most_ of the doors > here, and I am afraid there will be more and more cases we miss. It > feels too open-ended. For example, imagine modifying a PGDATA file so > it is a symbolic link to another file that is not in PGDATA? 
Seems that > would break all sorts of security restrictions, and that's just a new > idea I came up with today. It's not clear how that would provide the attacker with much, if anything. > What I don't want to do is to add a lot of complexity to the system, and > not really gain any meaningful security. Integrity is very meaningful to security, but key management would certainly come first because if an attacker is able to acquire the keys then they can circumvent any integrity check being done by simply using the key. I appreciate that protecting the keys is non-trivial but it's absolutely critical as everything else falls apart if the key is compromised. I don't think we should be thinking that we're going to be done with key management or with providing ways to acquire keys even if the currently proposed patches go in- we'll undoubtedly need to provide other options in the future. There's an interesting point in this regarding how the flexibility of the shell-script based approach also introduces this risk that an attacker could modify it and write the key out to somewhere that they could get at pretty easily. Having support for directly fetching the key from the Linux kernel or the various vaulting systems would avoid this risk, I would think. Maybe there's a way to get PG to dump the key out of system memory by modifying other files in PGDATA but that's surely quite a bit more difficult. Ultimately, I don't think this voids the proposed approach but I do think it means we'll want to improve on this in the future. > > To that end, I would hope that we'd eventually develop a way to detect > > unexpected modifications in other parts of the system, both as a way to > > discover filesystem corruption earlier but also in the case of a > > malicious attacker. The latter would involve more work, of course, but > > it doesn't seem insurmountable. I don't think it's necessary to get > > into that today though. > > > > I am concerned when statements are made that we are just never going to > > do something-or-other because we think it'd be a lot of source code > > changes or won't be completely perfect against every attack we can think > > of. There was a good bit of that with RLS which also made it a > > particularly difficult feature to push forward, but, thanks to clearly > > documenting what was and wasn't addressed, clearly admitting that there > > are covert channel attacks that might be possible due to how it works, > > it's been pretty well accepted and there hasn't been some huge number of > > issues or CVEs that have been associated with it or mismatched > > expectations that users of it have had regarding what it does and > > doesn't protect against. > > Oh, that is a very meaningful lesson. I do think that for cluster file > encryption, if we have a vulnerability, someone will write a script for > it, and it could be widely exploited. I think RLS gets a little more > flexibility since someone is already in the database when using it. In the current attack we're contemplating, the attacker's got write access to the filesystem and if that's happening then they've managed to get through a few layers already, I would think, so it seems unlikely that it would be widely exploited. Of course, we'd like to avoid having vulnerabilities where we can, but a particular behavior is only a vulnerability if there's an expectation that we protect against that kind of attack, which is why documentation is extremely important, which is what I was trying to get at with the RLS example. 
> I am not against adding more security features, but I need agreement > that the existing features/protections, with the planned source code > impact, is acceptable. I don't want to go down the road of getting the > feature with the _hope_ that later changes will make the feature > acceptable --- for me, either what we are planning now is acceptable > given its code impact, or it is not. If the feature is not sufficient, > then I would not move forward until we had a reasonable plan of when the > feature would have acceptable usefulness, and acceptable source code > impact. See above. I do think that the proposed approach is a valuable capability and improvement in its own right. It seems likely that this first step, as proposed, would allow us to support use-cases such as the PCI one you mentioned previously. Taking it further and adding integrity validation would move us into even more use-cases as it would address NIST requirements which explicitly call for confidentiality and integrity. > The big problem, as you outlined above, is that adding to the > protections, like malicious write detection for a remote PGDATA, greatly > increases the code impact, and ultimately, might be unsolvable. I don't think we really know that it increases the code impact hugely or is unsolvable, but ultimately those are really debates for another day at this point. Thanks, Stephen
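To make the "directly fetching the key from the Linux kernel" idea concrete, a minimal sketch using the kernel keyring follows; the key type and description are hypothetical, and this illustrates the approach rather than anything from the posted patches (it requires libkeyutils).

    #include <keyutils.h>

    /*
     * Sketch: read a cluster key from the kernel keyring instead of
     * running a shell script that an attacker with write access could
     * modify.  The "user" key type and "pg:cluster_key" description are
     * placeholders.
     */
    static long
    fetch_cluster_key(unsigned char *buf, size_t buflen)
    {
        key_serial_t id = request_key("user", "pg:cluster_key", NULL,
                                      KEY_SPEC_USER_KEYRING);

        if (id < 0)
            return -1;          /* key missing or permission denied */

        return keyctl_read(id, (char *) buf, buflen);
    }

Because the key never passes through a user-modifiable script, an attacker who can only write to PGDATA has no obvious file to subvert in order to capture it.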
On Fri, Feb 5, 2021 at 05:21:22PM -0500, Stephen Frost wrote: > > I disagree. If we only warn about some parts, attackers will just > > attack other parts. It will also give users a false sense of security. > > If you can get the keys, it doesn't matter if there is one or ten ways > > of getting them, if they are all of equal difficulty. Same with > > modifying the system files. > > I agree that there's an additional concern around the keys and that we > would want to have a solid way to avoid having them be compromised. We > might not be able to guarantee that attackers who can write to PGDATA > can't gain access to the keys in the first implementation, but I don't > see that as a problem- the TDE capability would still provide protection > against improper disposal and some other use-cases, which is useful. I Agreed. > do think it'd be useful to consider how we could provide protection > against an attacker who has write access from being able to acquire the > keys, but that seems like a tractable problem. Following that, we could > look at how to provide integrity checking for principal data, using one > of the outlined approaches or maybe something else entirely. Lastly, > perhaps we can find a way to provide confidentiality and integrity for > other parts of the system. Yes, we should consider it, and I want to have this discussion. Ideally we could implement that now, because it might be harder later. However, I don't see how we can add additional security protections without adding a lot more complexity. You are right we might have better ideas later. > Each of these steps is a useful improvement in its own right and will > open up more opportunities for PG to be used. It wasn't my intent to > suggest otherwise, but rather to see if there was an opportunity to get > a few things done at once if it wasn't too impactful. I agree now that > it makes sense to focus on the first step, so we can hopefully get that > accomplished. OK, good. > > I think postmaster.opts is used for pg_ctl reload. I think the question > > is whether the value of maliciously writable PGDATA being able to read > > the keys, while not protecting or detecting all malicious > > writes/db-modifications, is worth it. And, while I listed the files > > above, there are probably many more ways to break the system. > > postmaster.opts is used for pg_ctl restart, just to be clear. Yes, sorry, "restart". > As I try to state above- I don't think we need to provide any specific > protections against a malicious writer for plain encryption to be > useful for some important use-cases. Providing protections against a > malicious writer being able to access the keys is certainly important > as, if they acquire the keys, they would be able to trivially both > decrypt the data and modify any other data they wished to, so it seems > likely that solving that would be the first step towards protecting > against a malicious writer, after which it's useful to think about what > else we could provide integrity checking of, and principal data strikes > me as the next sensible step, followed by what's essentially metadata. Agreed. > > See above --- I think we can't just say we close _most_ of the doors > > here, and I am afraid there will be more and more cases we miss. It > > feels too open-ended. For example, imagine modifying a PGDATA file so > > it is a symbolic link to another file that is not in PGDATA? Seems that > > would break all sorts of security restrictions, and that's just a new > > idea I came up with today. 
> > It's not clear how that would provide the attacker with much, if > anything. Not sure myself either. > > What I don't want to do is to add a lot of complexity to the system, and > > not really gain any meaningful security. > > Integrity is very meaningful to security, but key management would > certainly come first because if an attacker is able to acquire the keys > then they can circumvent any integrity check being done by simply using > the key. I appreciate that protecting the keys is non-trivial but it's > absolutely critical as everything else falls apart if the key is > compromised. I don't think we should be thinking that we're going to be Agreed, > done with key management or with providing ways to acquire keys even if > the currently proposed patches go in- we'll undoubtably need to provide > other options in the future. There's an interesting point in this > regarding how the flexibility of the shell-script based approach also > introduces this risk that an attacker could modify it and write the key > out to somewhere that they could get at pretty easily. Having support > for directly fetching the key from the Linux kernel or the various > vaulting systems would avoid this risk, I would think. Maybe there's a Agreed. > way to get PG to dump the key out of system memory by modifying other > files in PGDATA but that's surely quite a bit more difficult. > Ultimately, I don't think this voids the proposed approach but I do > think it means we'll want to improve on this in the future. OK. I was just saying we can't be sure we can improve it. > > Oh, that is a very meaningful lesson. I do think that for cluster file > > encryption, if we have a vulnerability, someone will write a script for > > it, and it could be widely exploited. I think RLS gets a little more > > flexibility since someone is already in the database when using it. > > In the current attack we're contemplating, the attacker's got write > access to the filesystem and if that's happening then they've managed to > get through a few layers already, I would think, so it seems unlikely > that it would be widely exploited. Of course, we'd like to avoid having Agreed. > vulnerabilities where we can, but a particular behavior is only a > vulnerabiliy if there's an expectation that we protect against that kind > of attack, which is why documentation is extremely important, which is > what I was trying to get at with the RLS example. True. > > I am not against adding more security features, but I need agreement > > that the existing features/protections, with the planned source code > > impact, is acceptable. I don't want to go down the road of getting the > > feature with the _hope_ that later changes will make the feature > > acceptable --- for me, either what we are planning now is acceptable > > given its code impact, or it is not. If the feature is not sufficient, > > then I would not move forward until we had a reasonable plan of when the > > feature would have acceptable usefulness, and acceptable source code > > impact. > > See above. I do think that the proposed approach is a valuable > capability and improvement in its own right. It seems likely that this > first step, as proposed, would allow us to support use-cases such as the > PCI one you mentioned previously. Taking it further and adding > integrity validation would move us into even more use-cases as it would > address NIST requirements which explicitly call for confidentiality and > integrity. Good. 
I wanted to express this so everyone is clear on what we are doing, and what we are not doing but might be able to do in the future. > > The big problem, as you outlined above, is that adding to the > > protections, like malicious write detection for a remote PGDATA, greatly > > increases the code impact, and ultimately, might be unsolvable. > > I don't think we really know that it increases the code impact hugely or > is unsolveable, but ultimately those are really debates for another day > at this point. True. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Fri, Feb 5, 2021 at 07:53:18PM -0500, Bruce Momjian wrote: > On Fri, Feb 5, 2021 at 05:21:22PM -0500, Stephen Frost wrote: > > > I disagree. If we only warn about some parts, attackers will just > > > attack other parts. It will also give users a false sense of security. > > > If you can get the keys, it doesn't matter if there is one or ten ways > > > of getting them, if they are all of equal difficulty. Same with > > > modifying the system files. > > > > I agree that there's an additional concern around the keys and that we > > would want to have a solid way to avoid having them be compromised. We > > might not be able to guarantee that attackers who can write to PGDATA > > can't gain access to the keys in the first implementation, but I don't > > see that as a problem- the TDE capability would still provide protection > > against improper disposal and some other use-cases, which is useful. I > > Agreed. > > > do think it'd be useful to consider how we could provide protection > > against an attacker who has write access from being able to acquire the > > keys, but that seems like a tractable problem. Following that, we could > > look at how to provide integrity checking for principal data, using one > > of the outlined approaches or maybe something else entirely. Lastly, > > perhaps we can find a way to provide confidentiality and integrity for > > other parts of the system. > > Yes, we should consider it, and I want to have this discussion. Ideally > we could implement that now, because it might be harder later. However, > I don't see how we can add additional security protections without > adding a lot more complexity. You are right we might have better ideas > later. I added a Limitations section so we can consider future improvements: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Limitations -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Thu, Mar 11, 2021 at 10:31:28PM -0500, Bruce Momjian wrote: > > I have made significant progress on the cluster file encryption feature so > > it is time for me to post a new set of patches. > > Here is a rebase, to keep the cfbot green. Good stuff. > From 110358c9ce8764f0c41c12dd37dabde57a92cf1f Mon Sep 17 00:00:00 2001 > From: Bruce Momjian <bruce@momjian.us> > Date: Mon, 15 Mar 2021 10:20:32 -0400 > Subject: [PATCH] cfe-11-persistent_over_cfe-10-hint squash commit > > --- > src/backend/access/gist/gistutil.c | 2 +- > src/backend/access/heap/heapam_handler.c | 2 +- > src/backend/catalog/pg_publication.c | 2 +- > src/backend/commands/tablecmds.c | 10 +++++----- > src/backend/optimizer/util/plancat.c | 3 +-- > src/backend/utils/cache/relcache.c | 2 +- > src/include/utils/rel.h | 10 ++++++++-- > src/include/utils/snapmgr.h | 3 +-- > 8 files changed, 19 insertions(+), 15 deletions(-) This particular patch (introducing the RelationIsPermanent() macro) seems like it'd be a nice thing to commit independently of the rest, reducing the size of this patch set..? Thanks! Stephen
On Thu, Mar 18, 2021 at 11:31:34AM -0400, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Thu, Mar 11, 2021 at 10:31:28PM -0500, Bruce Momjian wrote: > > > I have made significant progress on the cluster file encryption feature so > > > it is time for me to post a new set of patches. > > > > Here is a rebase, to keep the cfbot green. > > Good stuff. Yes, I was happy I got to a stage where the encryption actually did something useful. > > >From 110358c9ce8764f0c41c12dd37dabde57a92cf1f Mon Sep 17 00:00:00 2001 > > From: Bruce Momjian <bruce@momjian.us> > > Date: Mon, 15 Mar 2021 10:20:32 -0400 > > Subject: [PATCH] cfe-11-persistent_over_cfe-10-hint squash commit > > > > --- > > src/backend/access/gist/gistutil.c | 2 +- > > src/backend/access/heap/heapam_handler.c | 2 +- > > src/backend/catalog/pg_publication.c | 2 +- > > src/backend/commands/tablecmds.c | 10 +++++----- > > src/backend/optimizer/util/plancat.c | 3 +-- > > src/backend/utils/cache/relcache.c | 2 +- > > src/include/utils/rel.h | 10 ++++++++-- > > src/include/utils/snapmgr.h | 3 +-- > > 8 files changed, 19 insertions(+), 15 deletions(-) > > This particular patch (introducing the RelationIsPermanent() macro) > seems like it'd be a nice thing to commit independently of the rest, > reducing the size of this patch set..? OK, if no one objects I will apply it in the next few days. The macro is used more in my later patches, which I will not apply now. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Patch 10 uses the term "WAL-skip relations". What does that mean? Is it "relations that are not WAL-logged"? I suppose we already have a term for this; I'm not sure it's a good idea to invent a different term that is only used in this new place. -- Álvaro Herrera 39°49'30"S 73°17'W
Greetings, * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote: > Patch 10 uses the term "WAL-skip relations". What does that mean? Is > it "relations that are not WAL-logged"? I suppose we already have a > term for this; I'm not sure it's a good idea to invent a different term > that is only used in this new place. This is discussed in src/backend/access/transam/README, specifically the section that talks about Skipping WAL for New RelFileNode. Basically, it's the 'wal_level=minimal' optimization which allows WAL to be skipped. Thanks! Stephen
On 2021-Mar-18, Stephen Frost wrote: > * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote: > > Patch 10 uses the term "WAL-skip relations". What does that mean? Is > > it "relations that are not WAL-logged"? I suppose we already have a > > term for this; I'm not sure it's a good idea to invent a different term > > that is only used in this new place. > > This is discussed in src/backend/access/transam/README, specifically the > section that talks about Skipping WAL for New RelFileNode. Basically, > it's the 'wal_level=minimal' optimization which allows WAL to be > skipped. Hmm ... that talks about WAL-skipping *changes*, not WAL-skipping *relations*. I thought WAL-skipping meant unlogged relations, but I understand now that that's unrelated. In the transam/README, WAL-skip means a change in a transaction in a relfilenode that, if rolled back, would disappear; and I'm not sure I understand how the code is handling the case that a relation is under that condition. This caught my attention because a comment says "encryption does not support WAL-skipped relations", but there's no direct change to the definition of RelFileNodeSkippingWAL() to account for that. Perhaps I am just overlooking something, since I'm just skimming anyway. -- Álvaro Herrera Valdivia, Chile
Greetings, * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote: > On 2021-Mar-18, Stephen Frost wrote: > > > * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote: > > > Patch 10 uses the term "WAL-skip relations". What does that mean? Is > > > it "relations that are not WAL-logged"? I suppose we already have a > > > term for this; I'm not sure it's a good idea to invent a different term > > > that is only used in this new place. > > > > This is discussed in src/backend/access/transam/README, specifically the > > section that talks about Skipping WAL for New RelFileNode. Basically, > > it's the 'wal_level=minimal' optimization which allows WAL to be > > skipped. > > Hmm ... that talks about WAL-skipping *changes*, not WAL-skipping > *relations*. I thought WAL-skipping meant unlogged relations, but > I understand now that that's unrelated. In the transam/README, WAL-skip > means a change in a transaction in a relfilenode that, if rolled back, > would disappear; and I'm not sure I understand how the code is handling > the case that a relation is under that condition. > > This caught my attention because a comment says "encryption does not > support WAL-skipped relations", but there's no direct change to the > definition of RelFileNodeSkippingWAL() to account for that. Perhaps I > am just overlooking something, since I'm just skimming anyway. This is relatively current activity and so it's entirely possible comments and perhaps code need further updating in this area, but to explain what's going on in a bit more detail- Ultimately, we need to make sure that LSNs aren't re-used. There's two sources of LSNs today: those for relations which are being written into the WAL and those for relations which are not (UNLOGGED relations, specifically). The 'minimal' WAL level introduces complications with this requirement because tables created (or truncated) inside a transaction are considered permanent once they're committed, but the data pages in those relations don't go into the WAL and the LSNs on the pages of those relations aren't guaranteed to be either unique or even necessarily set. If we were to generate LSNs for those, it would have to be done by actually advancing the WAL LSN, which would require writing into the WAL and therefore wouldn't be quite the optimization that's expected. I'm not sure if it's been explicitly done yet but I believe the idea is, based on my last discussion with Bruce, at least initially, simply to disallow encrypted clusters from running with wal_level=minimal to avoid this issue. Thanks, Stephen
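The restriction mentioned at the end could be as simple as the following startup-time check; the FileEncryptionEnabled flag is a stand-in name for illustration, not necessarily what the patches use.

    /* Sketch: refuse to run an encrypted cluster with wal_level=minimal */
    if (FileEncryptionEnabled && wal_level == WAL_LEVEL_MINIMAL)
        ereport(FATAL,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("cluster file encryption is not supported when wal_level is minimal")));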
On Thu, Mar 18, 2021 at 02:37:43PM -0300, Álvaro Herrera wrote: > On 2021-Mar-18, Stephen Frost wrote: > > This is discussed in src/backend/access/transam/README, specifically the > > section that talks about Skipping WAL for New RelFileNode. Basically, > > it's the 'wal_level=minimal' optimization which allows WAL to be > > skipped. > > Hmm ... that talks about WAL-skipping *changes*, not WAL-skipping > *relations*. I thought WAL-skipping meant unlogged relations, but > I understand now that that's unrelated. In the transam/README, WAL-skip > means a change in a transaction in a relfilenode that, if rolled back, > would disappear; and I'm not sure I understand how the code is handling > the case that a relation is under that condition. > > This caught my attention because a comment says "encryption does not > support WAL-skipped relations", but there's no direct change to the > definition of RelFileNodeSkippingWAL() to account for that. Perhaps I > am just overlooking something, since I'm just skimming anyway. First, thanks for looking at these patches --- I know it isn't easy. Second, you are right that I equated WAL-skipping relfilenodes with relations, and this was wrong. I have updated the attached patch to use the term WAL-skipping "relfilenodes", and checked the rest of the patches for any incorrect 'skipping' term, but didn't find any. If "WAL-skipping relfilenodes" is not clear enough, we should probably rename RelFileNodeSkippingWAL(). -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Thu, Mar 18, 2021 at 01:46:28PM -0400, Stephen Frost wrote: > * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote: > > This caught my attention because a comment says "encryption does not > > support WAL-skipped relations", but there's no direct change to the > > definition of RelFileNodeSkippingWAL() to account for that. Perhaps I > > am just overlooking something, since I'm just skimming anyway. > > This is relatively current activity and so it's entirely possible > comments and perhaps code need further updating in this area, but to > explain what's going on in a bit more detail- > > Ultimately, we need to make sure that LSNs aren't re-used. There's two > sources of LSNs today: those for relations which are being written into > the WAL and those for relations which are not (UNLOGGED relations, > specifically). The 'minimal' WAL level introduces complications with Well, the story is a little more complex than that --- we currently have four LSN uses:

1. real LSNs for WAL-logged relfilenodes
2. real LSNs for GiST indexes for non-WAL-logged relfilenodes of permanent relations
3. fake LSNs for GiST indexes for relfilenodes of non-permanent relations
4. zero LSNs for non-GiST non-permanent relations

This patch changes it so #4 gets fake LSNs, and slightly adjusts #2 & #3 so the LSNs are always unique. > I'm not sure if it's been explicitly done yet but I believe the idea is, > based on my last discussion with Bruce, at least initially, simply > disallow encrypted clusters from running with wal_level=minimal to avoid > this issue. I adjusted the hint bit code so it potentially could work with wal_level minimal (just for safety), but the code disallows wal_level minimal, and is documented as such. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
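A minimal sketch of what giving case #4 a fake LSN might look like, reusing the GetFakeLSNForUnloggedRel() counter that already backs case #3; the helper itself is hypothetical and not taken from the patches.

    #include "postgres.h"
    #include "access/xlog.h"        /* GetFakeLSNForUnloggedRel() */
    #include "storage/bufpage.h"    /* PageGetLSN() / PageSetLSN() */

    /*
     * Sketch: before encrypting a page whose writes skip WAL, make sure it
     * carries a unique LSN, since the LSN feeds the per-page nonce.
     */
    static void
    ensure_unique_page_lsn(Page page)
    {
        if (PageGetLSN(page) == InvalidXLogRecPtr)
            PageSetLSN(page, GetFakeLSNForUnloggedRel());
    }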
On Thu, Mar 18, 2021 at 11:31:34AM -0400, Stephen Frost wrote: > > src/backend/access/gist/gistutil.c | 2 +- > > src/backend/access/heap/heapam_handler.c | 2 +- > > src/backend/catalog/pg_publication.c | 2 +- > > src/backend/commands/tablecmds.c | 10 +++++----- > > src/backend/optimizer/util/plancat.c | 3 +-- > > src/backend/utils/cache/relcache.c | 2 +- > > src/include/utils/rel.h | 10 ++++++++-- > > src/include/utils/snapmgr.h | 3 +-- > > 8 files changed, 19 insertions(+), 15 deletions(-) > > This particular patch (introducing the RelationIsPermanent() macro) > seems like it'd be a nice thing to commit independently of the rest, > reducing the size of this patch set..? Committed as suggested. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Mon, Mar 22, 2021 at 08:38:37PM -0400, Bruce Momjian wrote: > > This particular patch (introducing the RelationIsPermanent() macro) > > seems like it'd be a nice thing to commit independently of the rest, > > reducing the size of this patch set..? > > Committed as suggested. Also, I have written a short presentation on where I think we are with cluster file encryption: https://momjian.us/main/writings/pgsql/cfe.pdf -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Hi Bruce,
I went through these patches and executed the test script you added for the KMS section, which looks all good.
This is a point that looks like a bug - in patch 10, you changed the location and use of *RelFileNodeSkippingWAL()*, but the modified code logic seems different from the original when encryption is not enabled. After applying this patch, it will still execute the set-LSN code flow when RelFileNodeSkippingWAL returns true and encryption is not enabled.
On Thu, Apr 1, 2021 at 2:47 PM Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Mar 11, 2021 at 10:31:28PM -0500, Bruce Momjian wrote:
> I have made significant progress on the cluster file encryption feature so
> it is time for me to post a new set of patches.
Here is a rebase, to keep the cfbot green.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
There is no royal road to learning.
HighGo Software Co.
On Tue, Apr 6, 2021 at 04:56:36PM +0800, Neil Chen wrote: > Hi Bruce, > > I went through these patches and executed the test script you added for the KMS > section, which looks all good. Thank you for checking it. The src/test/crypto/t/003_clusterkey.pl test is one of the craziest tests I have ever written, so I am glad it worked for you. > This is a point that looks like a bug - in patch 10, you changed the location > and use of *RelFileNodeSkippingWAL()*, but the modified code logic seems > different from the original when encryption is not enabled. After applying this > patch, it still will execute the set LSN code flow when RelFileNodeSkippingWAL > returns true, and encryption not enabled. You are very correct. That 'return' inside the 'if' statement gave me trouble, and MarkBufferDirtyHint() was the hardest function I had to deal with. Attached is an updated version of patches with a rebase; the GitHub links listed on the wiki are updated too. Thanks for your help. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Attachments:
- cfe-05-crypto_over_cfe-04-common.diff
- cfe-01-doc_over_master.diff
- cfe-02-internaldoc_over_cfe-01-doc.diff
- cfe-03-scripts_over_cfe-02-internaldoc.diff
- cfe-04-common_over_cfe-03-scripts.diff
- cfe-06-backend_over_cfe-05-crypto.diff
- cfe-07-bin_over_cfe-06-backend.diff
- cfe-08-pg_alterckey_over_cfe-07-bin.diff
- cfe-09-test_over_cfe-08-pg_alterckey.diff
- cfe-10-hint_over_cfe-09-test.diff
- cfe-11-gist_over_cfe-10-hint.diff
- cfe-12-rel_over_cfe-11-gist.diff
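For clarity, the intended control flow around the MarkBufferDirtyHint() fix discussed above is roughly the following simplified sketch; FileEncryptionEnabled is a stand-in name, not necessarily what the patches use. Without encryption, a WAL-skipping relfilenode keeps the old behavior and gets no LSN at all; with encryption it still needs a fake LSN so the page nonce stays unique.

    /* Simplified sketch of the hint-bit LSN logic, not the actual patch */
    if (RelFileNodeSkippingWAL(bufHdr->tag.rnode))
    {
        if (!FileEncryptionEnabled)
            return;                         /* original behavior: no LSN */
        lsn = GetFakeLSNForUnloggedRel();   /* encrypted: unique LSN needed */
    }
    else
        lsn = XLogSaveBufferForHint(buffer, buffer_std);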
On Thu, Mar 18, 2021 at 2:59 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Ultimately, we need to make sure that LSNs aren't re-used. There's two
> > sources of LSNs today: those for relations which are being written into
> > the WAL and those for relations which are not (UNLOGGED relations,
> > specifically). The 'minimal' WAL level introduces complications with
>
> Well, the story is a little more complex than that --- we currently have
> four LSN uses:
>
> 1. real LSNs for WAL-logged relfilenodes
> 2. real LSNs for GiST indexes for non-WAL-logged relfilenodes of permanent relations
> 3. fake LSNs for GiST indexes for relfilenodes of non-permanent relations
> 4. zero LSNs for non-GiST non-permanent relations
>
> This patch changes it so #4 gets fake LSNs, and slightly adjusts #2 & #3
> so the LSNs are always unique.

Hi!

This approach has a few disadvantages. For example, right now, we only need to WAL log hints for the first write to each page after a checkpoint, but in this approach, if the same page is written multiple times per checkpoint cycle, we'd need to log hints every time. In some workloads that could be quite expensive, especially if we log an FPI every time.

Also, I think that all sorts of non-permanent relations currently get zero LSNs, not just GiST. Every unlogged table and every temporary table would need to use fake LSNs. Moreover, for unlogged tables, the buffer manager would need changes, because it is otherwise going to assume that anything it sees in the pd_lsn field other than a zero is a real LSN.

So I would like to propose an alternative: store the nonce in the page. Now the next question is where to put it. I think that putting it into the page header would be far too invasive, so I propose that we instead store it at the end of the page, as part of the special space. That makes an awful lot of code not really notice that anything is different, because it always thought that the usable space on the page ended where the special space begins, and it doesn't really care where that is exactly. The code that knows about the special space might care a little bit, but whatever private data it's storing is going to be at the beginning of the special space, and the nonce would be stored - in this proposal - at the end of the special space. So it turns out that it doesn't really care that much either.

Attached are a few WIP/POC patches from my colleague Bharath implementing this. There are certainly some implementation deficiencies here, which can be corrected if we decide this approach is worth pursuing, but I think they are sufficient to show that the approach is viable and also some of the consequences of going this way.

One thing that happens is that a bunch of values that used to be constant - like TOAST_INDEX_TARGET and GinDataPageMaxDataSize - become non-constant. I suggested to Bharath that he handle this by changing those macros to take the nonce size as an argument, which is what the patch does, although it missed pushing that idea down all the way in some obscure case (e.g. SIGLEN_MAX). That has the down side that we will now have more computation to do at runtime vs. compile-time. I am unclear whether there would be enough impact to get exercised about, but I'm hopeful that the answer is "no".

As written, the patch makes initdb take a --tde-nonce-size argument, but that's really just for demonstration purposes. I assume that, if we decide to go this way, we'd have an initdb option that selects whether to use encryption, or perhaps the specific encryption algorithm to be used, and then the nonce size would be computed based on that, or else set to 0 if encryption is not in use.

Comments?

-- Robert Haas EDB: http://www.enterprisedb.com
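To make the proposed layout concrete, here is a minimal C sketch of a page whose nonce lives in the last bytes of the special space. This is not code from the posted patches; the names (TDE_NONCE_SIZE, PageGetTdeNonce, PageUsableSpecialSize) and the fixed 16-byte size are illustrative assumptions only:

    /*
     * Sketch: nonce stored at the very end of the page, inside the special
     * space.  Code that only asks "where does usable space end?" keeps using
     * pd_special and is unaffected; only special-space-aware code must know
     * that its private data is TDE_NONCE_SIZE bytes shorter than before.
     */
    #include <stdint.h>

    #define BLCKSZ 8192
    #define TDE_NONCE_SIZE 16       /* hypothetical fixed nonce size */

    typedef struct PageHeaderSketch
    {
        uint16_t pd_lower;          /* offset to start of free space */
        uint16_t pd_upper;          /* offset to end of free space */
        uint16_t pd_special;        /* offset to start of special space */
    } PageHeaderSketch;

    /* The nonce occupies the last TDE_NONCE_SIZE bytes of the page. */
    static inline uint8_t *
    PageGetTdeNonce(char *page)
    {
        return (uint8_t *) (page + BLCKSZ - TDE_NONCE_SIZE);
    }

    /* Special space still starts at pd_special; the AM's private data can
     * use everything up to where the nonce begins. */
    static inline uint16_t
    PageUsableSpecialSize(const PageHeaderSketch *hdr)
    {
        return (uint16_t) (BLCKSZ - TDE_NONCE_SIZE - hdr->pd_special);
    }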
Hi,

On 2021-05-25 12:46:45 -0400, Robert Haas wrote:
> This approach has a few disadvantages. For example, right now, we only
> need to WAL log hints for the first write to each page after a
> checkpoint, but in this approach, if the same page is written multiple
> times per checkpoint cycle, we'd need to log hints every time. In some
> workloads that could be quite expensive, especially if we log an FPI
> every time.

Yes. I think it'd likely be prohibitively expensive in some situations.

> So I would like to propose an alternative: store the nonce in the
> page. Now the next question is where to put it. I think that putting
> it into the page header would be far too invasive, so I propose that
> we instead store it at the end of the page, as part of the special
> space. That makes an awful lot of code not really notice that anything
> is different, because it always thought that the usable space on the
> page ended where the special space begins, and it doesn't really care
> where that is exactly. The code that knows about the special space
> might care a little bit, but whatever private data it's storing is
> going to be at the beginning of the special space, and the nonce would
> be stored - in this proposal - at the end of the special space. So it
> turns out that it doesn't really care that much either.

The obvious concerns are issues around binary upgrades for cases that already use the special space? Are you planning to address that by not having that path? Or by storing the nonce at the "start" of the special space (i.e. [normal data][nonce][existing special])?

Is there an argument for generalizing the nonce approach to replace fake LSNs for unlogged relations?

Why is using pd_special better than finding space for a flag bit in the header indicating whether it has a nonce? Using pd_special will burden all code using special space, and maybe even some that does not (think empty pages now occasionally having a non-zero pd_special), whereas implementing it on the page level wouldn't quite have the same concerns.

> One thing that happens is that a bunch of values that used to be
> constant - like TOAST_INDEX_TARGET and GinDataPageMaxDataSize - become
> non-constant. I suggested to Bharath that he handle this by changing
> those macros to take the nonce size as an argument, which is what the
> patch does, although it missed pushing that idea down all the way in
> some obscure case (e.g. SIGLEN_MAX). That has the down side that we
> will now have more computation to do at runtime vs. compile-time. I am
> unclear whether there would be enough impact to get exercised about,
> but I'm hopeful that the answer is "no".
>
> As written, the patch makes initdb take a --tde-nonce-size argument,
> but that's really just for demonstration purposes. I assume that, if
> we decide to go this way, we'd have an initdb option that selects
> whether to use encryption, or perhaps the specific encryption
> algorithm to be used, and then the nonce size would be computed based
> on that, or else set to 0 if encryption is not in use.

I do suspect having only the "no nonce" or "nonce is a compile time constant" cases would be good performance-wise. Stuff like

> +#define MaxHeapTupleSizeLimit (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + \
> +         sizeof(ItemIdData)))
> +#define MaxHeapTupleSize(tdeNonceSize) (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + \
> +         sizeof(ItemIdData)) - MAXALIGN(tdeNonceSize))

won't be free.

Greetings, Andres Freund
On Tue, May 25, 2021 at 1:37 PM Andres Freund <andres@anarazel.de> wrote:
> The obvious concerns are issues around binary upgrades for cases that
> already use the special space? Are you planning to address that by not
> having that path? Or by storing the nonce at the "start" of the special
> space (i.e. [normal data][nonce][existing special])?

Well, there aren't any existing encrypted clusters, so what is the scenario exactly? Perhaps you are thinking that we'd have a pg_upgrade option that would take an unencrypted cluster and encrypt all the pages, without any other page format changes. If so, this design would preclude that choice, because there might be no free space available.

> Is there an argument for generalizing the nonce approach to replace
> fake LSNs for unlogged relations?

I hadn't thought about that. Maybe. But that would require including the nonce always, rather than only when TDE is selected, or including it always in some kinds of pages and only conditionally in others, which seems more complex.

> Why is using pd_special better than finding space for a flag bit in the
> header indicating whether it has a nonce? Using pd_special will burden
> all code using special space, and maybe even some that does not (think
> empty pages now occasionally having a non-zero pd_special), whereas
> implementing it on the page level wouldn't quite have the same concerns.

Well, I think there's a lot of code that knows where the line pointer array starts, and all those calculations will have to become more complex at runtime if we put the nonce anywhere near the start of the page. I think there are way fewer things that care about the end of the page. I dislike the idea that every call to PageGetItem() would need to know the nonce size - there are hundreds of those calls, and making them more expensive seems a lot worse than the stuff this patch changes. It's always possible that I'm confused here, either about what you are proposing or how impactful it would actually be...

> I do suspect having only the "no nonce" or "nonce is a compile time
> constant" cases would be good performance-wise. Stuff like
>
> > +#define MaxHeapTupleSizeLimit (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + \
> > +         sizeof(ItemIdData)))
> > +#define MaxHeapTupleSize(tdeNonceSize) (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + \
> > +         sizeof(ItemIdData)) - MAXALIGN(tdeNonceSize))
>
> won't be free.

One question here is whether we're comfortable saying that the nonce is entirely constant. I wasn't sure about that. It seems possible to me that different encryption algorithms might want nonces of different sizes, either now or in the future. I am not a cryptographer, but that seemed like a bit of a limiting assumption. So Bharath and I decided to make the POC cater to a fully variable-size nonce rather than zero-or-some-constant. However, if the consensus is that zero-or-some-constant is better, fair enough! The patch can certainly be adjusted to work that way.

-- Robert Haas EDB: http://www.enterprisedb.com
On Tue, May 25, 2021 at 12:46:45PM -0400, Robert Haas wrote:
> On Thu, Mar 18, 2021 at 2:59 PM Bruce Momjian <bruce@momjian.us> wrote:
> > > Ultimately, we need to make sure that LSNs aren't re-used. There's two
> > > sources of LSNs today: those for relations which are being written into
> > > the WAL and those for relations which are not (UNLOGGED relations,
> > > specifically). The 'minimal' WAL level introduces complications with
> >
> > Well, the story is a little more complex than that --- we currently have
> > four LSN uses:
> >
> > 1. real LSNs for WAL-logged relfilenodes
> > 2. real LSNs for GiST indexes for non-WAL-logged relfilenodes of permanent relations
> > 3. fake LSNs for GiST indexes for relfilenodes of non-permanent relations
> > 4. zero LSNs for non-GiST non-permanent relations
> >
> > This patch changes it so #4 gets fake LSNs, and slightly adjusts #2 & #3
> > so the LSNs are always unique.
>
> Hi!
>
> This approach has a few disadvantages. For example, right now, we only
> need to WAL log hints for the first write to each page after a
> checkpoint, but in this approach, if the same page is written multiple
> times per checkpoint cycle, we'd need to log hints every time. In some
> workloads that could be quite expensive, especially if we log an FPI
> every time.

Well, if we create a separate nonce counter, we still need to make sure it doesn't go backwards during a crash, so we have to WAL log it somehow, perhaps at a certain interval like 1k, and advance the counter by 1k in case of crash recovery, like we do with the oid counter now, I think. The buffer encryption overhead is 2-4%, and WAL encryption is going to add to that, so I thought hint bit logging overhead would be minimal in comparison.

> Also, I think that all sorts of non-permanent relations currently get
> zero LSNs, not just GiST. Every unlogged table and every temporary
> table would need to use fake LSNs. Moreover, for unlogged tables, the
> buffer manager would need changes, because it is otherwise going to
> assume that anything it sees in the pd_lsn field other than a zero is
> a real LSN.

Have you looked at the code, specifically EncryptPage():

https://github.com/postgres/postgres/compare/bmomjian:cfe-11-gist..bmomjian:_cfe-12-rel.patch

+ if (!relation_is_permanent && !is_gist_page_or_similar)
+ PageSetLSN(page, LSNForEncryption(relation_is_permanent));

It assigns an LSN to unlogged pages. As far as the buffer manager seeing fake LSNs, that already happens for GiST indexes, so I just built on that --- seemed to work fine.

> So I would like to propose an alternative: store the nonce in the
> page. Now the next question is where to put it. I think that putting
> it into the page header would be far too invasive, so I propose that
> we instead store it at the end of the page, as part of the special
> space. That makes an awful lot of code not really notice that anything
> is different, because it always thought that the usable space on the
> page ended where the special space begins, and it doesn't really care
> where that is exactly. The code that knows about the special space
> might care a little bit, but whatever private data it's storing is
> going to be at the beginning of the special space, and the nonce would
> be stored - in this proposal - at the end of the special space. So it
> turns out that it doesn't really care that much either.
I think the big problem with that is that it adds a new counter, with new code, and it makes adding encryption offline, like we do for adding checksums, pretty much impossible, since the page might not have space for a nonce. It also makes the idea of adding encryption as part of a pg_upgrade non-link mode impossible, at least for me. ;-)

I have to ask why we should consider adding it to the special space, since my current version seems fine, has minimal code impact, and has some advantages over using the special space. Is it because of the WAL hint overhead, or for a cleaner API, or something else?

Also, I need help with all the XXX comments I have in my patches before I can move forward:

https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Patches

I stopped working on this to get beta out the door, but next week it would be nice to continue on this. However, I want to get this patch into a state where everyone is happy with it, rather than adding more code with an unclear future.

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, May 25, 2021 at 02:25:21PM -0400, Robert Haas wrote:
> One question here is whether we're comfortable saying that the nonce
> is entirely constant. I wasn't sure about that. It seems possible to
> me that different encryption algorithms might want nonces of different
> sizes, either now or in the future. I am not a cryptographer, but that
> seemed like a bit of a limiting assumption. So Bharath and I decided
> to make the POC cater to a fully variable-size nonce rather than
> zero-or-some-constant. However, if the consensus is that
> zero-or-some-constant is better, fair enough! The patch can certainly
> be adjusted to work that way.

A 16-byte nonce is sufficient for AES, and I doubt we will need anything stronger than AES256 anytime soon. Making the nonce variable length seems to just add complexity for little purpose.

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, May 25, 2021 at 10:37:32AM -0700, Andres Freund wrote:
> The obvious concerns are issues around binary upgrades for cases that
> already use the special space? Are you planning to address that by not
> having that path? Or by storing the nonce at the "start" of the special
> space (i.e. [normal data][nonce][existing special])?
>
> Is there an argument for generalizing the nonce approach to replace
> fake LSNs for unlogged relations?
>
> Why is using pd_special better than finding space for a flag bit in the
> header indicating whether it has a nonce? Using pd_special will burden
> all code using special space, and maybe even some that does not (think
> empty pages now occasionally having a non-zero pd_special), whereas
> implementing it on the page level wouldn't quite have the same concerns.

My code can already identify if the LSN is fake or not --- why can't we build on that? Can someone show that WAL-logging hint bits causes unacceptable overhead beyond the encryption overhead? I don't think we even know that, since we don't know the overhead of encrypting WAL.

One crazy idea would be to not log WAL hints, but rather use an LSN range that will never be valid for real LSNs, like the high bit being set. That special range would need to be WAL-logged, but again, perhaps every 1k, incrementing by 1k on a crash.

This discussion has cemented what I had already considered --- that doing a separate nonce will make this feature less usable/upgradable, and take it beyond my ability or desire to complete. Ideally, what I would like to do is to resolve my XXX questions in my patches, get everyone happy with what we have, then let me do the WAL encryption. We can then see if logging hint bits is significant overhead, and if it is, go with a special LSN range for fake LSNs.

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
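The "reserved LSN range" idea can be sketched in a few lines of C. This is illustrative only; FAKE_LSN_FLAG and IsFakeLSN are hypothetical names, not from any posted patch:

    /*
     * Sketch: treat LSNs with the high bit set as fake, so real and fake
     * LSNs can never collide.  A fake-LSN allocator would hand out
     * FAKE_LSN_FLAG | counter values, WAL-logging the counter only every
     * N allocations (say 1k) and bumping it by N after a crash, as Bruce
     * describes for the oid counter.
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtrSketch;

    #define FAKE_LSN_FLAG ((XLogRecPtrSketch) 1 << 63)

    static inline bool
    IsFakeLSN(XLogRecPtrSketch lsn)
    {
        return (lsn & FAKE_LSN_FLAG) != 0;
    }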
On Tue, May 25, 2021 at 2:45 PM Bruce Momjian <bruce@momjian.us> wrote:
> Well, if we create a separate nonce counter, we still need to make sure
> it doesn't go backwards during a crash, so we have to WAL log it

I think we don't really need a global counter, do we? We could simply increment the nonce every time we write the page. If we want to avoid using the same IV for different pages, then 8 bytes of the nonce could store a value that's different for every page, and the other 8 bytes could store a counter. Presumably we won't manage to write the same page more than 2^64 times, since LSNs are limited to be <2^64, and those are consumed more than 1 byte at a time for every change to any page anywhere.

> The buffer encryption overhead is 2-4%, and WAL encryption is going to
> add to that, so I thought hint bit logging overhead would be minimal
> in comparison.

I think it depends. If buffer evictions are rare, then it won't matter much. But if they are common, then using the LSN as the nonce will add a lot of overhead.

> Have you looked at the code, specifically EncryptPage():
>
> https://github.com/postgres/postgres/compare/bmomjian:cfe-11-gist..bmomjian:_cfe-12-rel.patch
>
> + if (!relation_is_permanent && !is_gist_page_or_similar)
> + PageSetLSN(page, LSNForEncryption(relation_is_permanent));
>
> It assigns an LSN to unlogged pages. As far as the buffer manager
> seeing fake LSNs, that already happens for GiST indexes, so I just built
> on that --- seemed to work fine.

I had not, but I don't see why this issue is specific to GiST rather than common to every kind of unlogged and temporary relation.

> I have to ask why we should consider adding it to the special space,
> since my current version seems fine, has minimal code impact, and
> has some advantages over using the special space. Is it because of the
> WAL hint overhead, or for a cleaner API, or something else?

My concern is about the overhead, and also the code complexity. I think that making sure that the LSN gets changed in all cases may be fairly tricky.

-- Robert Haas EDB: http://www.enterprisedb.com
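Robert's split-nonce construction can be sketched as follows; all names here are hypothetical, and the choice of what goes into the per-page identifier is an assumption for illustration:

    /*
     * Sketch: a 16-byte IV built from an 8-byte per-page identifier (e.g.
     * derived from relfilenode + block number) and an 8-byte per-write
     * counter.  The same key never sees the same (page_id, write_count)
     * pair twice, so the IV is never reused as long as write_count only
     * grows across crashes.
     */
    #include <stdint.h>
    #include <string.h>

    static void
    build_iv(uint8_t iv[16], uint64_t page_id, uint64_t write_count)
    {
        /* First 8 bytes: which page; last 8 bytes: which write of it. */
        memcpy(iv, &page_id, sizeof(page_id));
        memcpy(iv + sizeof(page_id), &write_count, sizeof(write_count));
    }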
On Tue, May 25, 2021 at 03:09:03PM -0400, Robert Haas wrote:
> On Tue, May 25, 2021 at 2:45 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Well, if we create a separate nonce counter, we still need to make sure
> > it doesn't go backwards during a crash, so we have to WAL log it
>
> I think we don't really need a global counter, do we? We could simply
> increment the nonce every time we write the page. If we want to avoid
> using the same IV for different pages, then 8 bytes of the nonce could
> store a value that's different for every page, and the other 8 bytes
> could store a counter. Presumably we won't manage to write the same
> page more than 2^64 times, since LSNs are limited to be <2^64, and
> those are consumed more than 1 byte at a time for every change to any
> page anywhere.

The issue we had here is what do you use as a special value for each relation? Where do you store it if it is not computed? You can use a global counter for the per-page nonce that doesn't change when the page is updated, but that would still need to be a global counter.

Also, when you change hint bits, either you don't change the nonce/LSN and don't re-encrypt the page (and the hint bit changes are visible), or you change the nonce and re-encrypt the page, and you are then WAL logging the page. I don't see how having a nonce different from the LSN helps here.

> > The buffer encryption overhead is 2-4%, and WAL encryption is going to
> > add to that, so I thought hint bit logging overhead would be minimal
> > in comparison.
>
> I think it depends. If buffer evictions are rare, then it won't matter
> much. But if they are common, then using the LSN as the nonce will add
> a lot of overhead.

Well, see above. A separate nonce somewhere else doesn't help much, as I see it.

> > Have you looked at the code, specifically EncryptPage():
> >
> > https://github.com/postgres/postgres/compare/bmomjian:cfe-11-gist..bmomjian:_cfe-12-rel.patch
> >
> > + if (!relation_is_permanent && !is_gist_page_or_similar)
> > + PageSetLSN(page, LSNForEncryption(relation_is_permanent));
> >
> > It assigns an LSN to unlogged pages. As far as the buffer manager
> > seeing fake LSNs, that already happens for GiST indexes, so I just built
> > on that --- seemed to work fine.
>
> I had not, but I don't see why this issue is specific to GiST rather
> than common to every kind of unlogged and temporary relation.
>
> > I have to ask why we should consider adding it to the special space,
> > since my current version seems fine, has minimal code impact, and
> > has some advantages over using the special space. Is it because of the
> > WAL hint overhead, or for a cleaner API, or something else?
>
> My concern is about the overhead, and also the code complexity. I
> think that making sure that the LSN gets changed in all cases may be
> fairly tricky.

Please look over the patch to see if I missed anything --- for me, it seemed quite clear, and I am not an expert in that area of the code.

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, May 25, 2021 at 03:20:06PM -0400, Bruce Momjian wrote:
> Also, when you change hint bits, either you don't change the nonce/LSN
> and don't re-encrypt the page (and the hint bit changes are visible), or
> you change the nonce and re-encrypt the page, and you are then WAL
> logging the page. I don't see how having a nonce different from the LSN
> helps here.

Let me go into more detail here. The general rule is that you never encrypt _different_ data with the same key/nonce. Now, since a hint bit change changes the data, it should get a new nonce, and since it is a newly encrypted page (using a new nonce), it should be WAL logged because a torn page would make the data unreadable.

Now, if we want to consult some security experts and have them tell us the hint bit visibility is not a problem, we could get by without using a new nonce for hint bit changes, and in that case it doesn't matter if we have a separate LSN or custom nonce --- it doesn't get changed for hint bit changes.

My point is that we have to full-page-write the cases where we change the nonce --- we get a new LSN/nonce for free if we are using the LSN as the nonce. What has made this approach much easier is that you basically tie a change of the nonce to require a change of LSN, since you are WAL logging it and every nonce change has to be full-page-write WAL logged. This makes the LSN-as-nonce less fragile to breakage than a custom nonce, in my opinion, which may explain why my patch is so small.

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, May 25, 2021 at 03:34:04PM -0400, Bruce Momjian wrote:
> Let me go into more detail here. The general rule is that you never
> encrypt _different_ data with the same key/nonce. Now, since a hint bit
> change changes the data, it should get a new nonce, and since it is a
> newly encrypted page (using a new nonce), it should be WAL logged
> because a torn page would make the data unreadable.
>
> Now, if we want to consult some security experts and have them tell us
> the hint bit visibility is not a problem, we could get by without using a
> new nonce for hint bit changes, and in that case it doesn't matter if we
> have a separate LSN or custom nonce --- it doesn't get changed for hint
> bit changes.
>
> My point is that we have to full-page-write the cases where we change the
> nonce --- we get a new LSN/nonce for free if we are using the LSN as the
> nonce. What has made this approach much easier is that you basically
> tie a change of the nonce to require a change of LSN, since you are WAL
> logging it and every nonce change has to be full-page-write WAL logged.
> This makes the LSN-as-nonce less fragile to breakage than a custom
> nonce, in my opinion, which may explain why my patch is so small.

This issue is covered at the bottom of this patch to the README file:

https://github.com/postgres/postgres/compare/bmomjian:cfe-01-doc..bmomjian:_cfe-02-internaldoc.patch

Hint Bits
- - - - -

For hint bit changes, the LSN normally doesn't change, which is a problem. By enabling wal_log_hints, you get full page writes to the WAL after the first hint bit change of the checkpoint. This is useful for two reasons. First, it generates a new LSN, which is needed for the IV to be secure. Second, full page images protect against torn pages, which is an even bigger requirement for encryption because the new LSN is re-encrypting the entire page, not just the hint bit changes. You can safely lose the hint bit changes, but you need to use the same LSN to decrypt the entire page, so a torn page with an LSN change cannot be decrypted. To prevent this, wal_log_hints guarantees that the pre-hint-bit version (and previous LSN version) of the page is restored.

However, if a hint-bit-modified page is written to the file system during a checkpoint, and there is a later hint bit change switching the same page from clean to dirty during the same checkpoint, we need a new LSN, and wal_log_hints doesn't give us a new LSN here. The fix for this is to update the page LSN by writing a dummy WAL record via xloginsert.c::LSNForEncryption() in such cases.

Let me know if it needs more detail.

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings,
On Tue, May 25, 2021 at 14:56 Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, May 25, 2021 at 02:25:21PM -0400, Robert Haas wrote:
> > One question here is whether we're comfortable saying that the nonce
> > is entirely constant. I wasn't sure about that. It seems possible to
> > me that different encryption algorithms might want nonces of different
> > sizes, either now or in the future. I am not a cryptographer, but that
> > seemed like a bit of a limiting assumption. So Bharath and I decided
> > to make the POC cater to a fully variable-size nonce rather than
> > zero-or-some-constant. However, if the consensus is that
> > zero-or-some-constant is better, fair enough! The patch can certainly
> > be adjusted to work that way.
>
> A 16-byte nonce is sufficient for AES, and I doubt we will need anything
> stronger than AES256 anytime soon. Making the nonce variable length
> seems to just add complexity for little purpose.
I’d like to review this more and make sure using the special space is possible, but if it is, then it opens up a huge new possibility: we could use it for both the nonce AND an appropriately sized tag, giving us integrity along with encryption, which would be a very significant additional feature. I’d considered using a fork instead, but having it on the page would be far better.
I’ll also note that we could possibly even find an alternative use for the space currently used for checksums, or leave them as they are today, though at that point they’d be redundant with the tag.
Lastly, if the special space is actually able to be variable in size and we could, say, store a flag in pg_class which tells us what’s in the special space, then we could give users the option of including the tag on each page, or a choice of tag size, or possibly use the space for other interesting things in the future outside of encryption and data integrity.
Overall, I’m quite interested in the idea of making the special space variable. I do accept that this would make things like physical replication between an unencrypted cluster and an encrypted one impossible, but the advantages seem worthwhile, and users would still be able to leverage logical replication to perform such a migration with relatively little downtime.
Thanks!
Stephen
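Stephen's nonce-plus-tag layout can be sketched by extending the earlier special-space idea. The sizes chosen here (12-byte IV, 16-byte tag, matching common AES-GCM conventions) and all names are illustrative assumptions, not from any posted patch:

    /*
     * Sketch: reserve room at the end of the special space for both a nonce
     * and an authentication tag.  The AM's own special data still starts at
     * pd_special; the reserved region occupies the final bytes of the page.
     */
    #include <stdint.h>

    #define BLCKSZ 8192
    #define TDE_NONCE_SIZE 12       /* hypothetical GCM-style IV size */
    #define TDE_TAG_SIZE 16         /* hypothetical GCM tag size */
    #define TDE_RESERVED (TDE_NONCE_SIZE + TDE_TAG_SIZE)

    /* Layout: [page data][special data][nonce][tag] */
    static inline uint8_t *
    PageGetTdeNonce(char *page)
    {
        return (uint8_t *) (page + BLCKSZ - TDE_RESERVED);
    }

    static inline uint8_t *
    PageGetTdeTag(char *page)
    {
        return (uint8_t *) (page + BLCKSZ - TDE_TAG_SIZE);
    }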
Greetings,
On Tue, May 25, 2021 at 15:09 Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, May 25, 2021 at 2:45 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Well, if we create a separate nonce counter, we still need to make sure
> > it doesn't go backwards during a crash, so we have to WAL log it
>
> I think we don't really need a global counter, do we? We could simply
> increment the nonce every time we write the page. If we want to avoid
> using the same IV for different pages, then 8 bytes of the nonce could
> store a value that's different for every page, and the other 8 bytes
> could store a counter. Presumably we won't manage to write the same
> page more than 2^64 times, since LSNs are limited to be <2^64, and
> those are consumed more than 1 byte at a time for every change to any
> page anywhere.
The nonce does need to be absolutely unique for a given encryption key and therefore needs to be global in some form.
Thanks!
Stephen
Hi,

On 2021-05-25 15:34:04 -0400, Bruce Momjian wrote:
> My point is that we have to full-page-write the cases where we change the
> nonce --- we get a new LSN/nonce for free if we are using the LSN as the
> nonce. What has made this approach much easier is that you basically
> tie a change of the nonce to require a change of LSN, since you are WAL
> logging it and every nonce change has to be full-page-write WAL logged.
> This makes the LSN-as-nonce less fragile to breakage than a custom
> nonce, in my opinion, which may explain why my patch is so small.

This disregards that we need to be able to increment nonces on standbys / during crash recovery.

It may look like that's not needed, with a (wrong!) argument like: the only writes come from crash recovery, which are always associated with a WAL record, guaranteeing nonce increases. Hint bits are not an issue because they don't mark the buffer dirty.

But unfortunately that analysis is wrong. Consider the following sequence:

1) replay record LSN X affecting page Y (FPI replay)
2) write out Y, encrypt Y using X as nonce
3) crash
4) replay record LSN X affecting page Y (FPI replay)
5) hint bit update to Y, resulting in Y'
6) write out Y', encrypt Y' using X as nonce

While 5) did not mark the page as dirty, it still modified the page contents. Which means that we'd encrypt different content with the same nonce - which is not allowed.

I'm pretty sure that there are several other ways to end up with page contents that differ, despite the LSN not changing.

Greetings, Andres Freund
On Tue, May 25, 2021 at 01:54:21PM -0700, Andres Freund wrote:
> Hi,
>
> On 2021-05-25 15:34:04 -0400, Bruce Momjian wrote:
> > My point is that we have to full-page-write the cases where we change the
> > nonce --- we get a new LSN/nonce for free if we are using the LSN as the
> > nonce. What has made this approach much easier is that you basically
> > tie a change of the nonce to require a change of LSN, since you are WAL
> > logging it and every nonce change has to be full-page-write WAL logged.
> > This makes the LSN-as-nonce less fragile to breakage than a custom
> > nonce, in my opinion, which may explain why my patch is so small.
>
> This disregards that we need to be able to increment nonces on standbys
> / during crash recovery.
>
> It may look like that's not needed, with a (wrong!) argument like: the
> only writes come from crash recovery, which are always associated with a
> WAL record, guaranteeing nonce increases. Hint bits are not an issue
> because they don't mark the buffer dirty.
>
> But unfortunately that analysis is wrong. Consider the following
> sequence:
>
> 1) replay record LSN X affecting page Y (FPI replay)
> 2) write out Y, encrypt Y using X as nonce
> 3) crash
> 4) replay record LSN X affecting page Y (FPI replay)
> 5) hint bit update to Y, resulting in Y'
> 6) write out Y', encrypt Y' using X as nonce
>
> While 5) did not mark the page as dirty, it still modified the page
> contents. Which means that we'd encrypt different content with the same
> nonce - which is not allowed.
>
> I'm pretty sure that there are several other ways to end up with page
> contents that differ, despite the LSN not changing.

Yes, I can see that happening. I think occasional leakage of hint bit changes is acceptable. We might decide they are all acceptable.

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 03:20:06PM -0400, Bruce Momjian wrote:
> > Also, when you change hint bits, either you don't change the nonce/LSN
> > and don't re-encrypt the page (and the hint bit changes are visible), or
> > you change the nonce and re-encrypt the page, and you are then WAL
> > logging the page. I don't see how having a nonce different from the LSN
> > helps here.
>
> Let me go into more detail here. The general rule is that you never
> encrypt _different_ data with the same key/nonce. Now, since a hint bit
> change changes the data, it should get a new nonce, and since it is a
> newly encrypted page (using a new nonce), it should be WAL logged
> because a torn page would make the data unreadable.

Right.

> Now, if we want to consult some security experts and have them tell us
> the hint bit visibility is not a problem, we could get by without using a
> new nonce for hint bit changes, and in that case it doesn't matter if we
> have a separate LSN or custom nonce --- it doesn't get changed for hint
> bit changes.

I do think it's reasonable to consider having hint bits not included in the encrypted part of the page and therefore remove the need to produce a new nonce for each hint bit change. Naturally, there's always an increased risk when any data in the system isn't encrypted, but given the other parts of the system which aren't being encrypted as part of this effort, it hardly seems like a significant increase of overall risk. I don't believe that any of the auditors and security teams I've discussed TDE with would have issue with hint bits not being encrypted - the principal concern has always been the primary data.

Naturally, being able to encrypt more, and to provide more data integrity validation, may open up the possibility for PG to be used in even more places, which argues for having some way of making these choices be options which a user could decide at initdb time, or at least contemplating a road map to where we could offer users the option to have other parts of the system be encrypted and ideally have data integrity checks, but I don't think we necessarily have to solve everything right now in that regard - just having TDE in some form will open up quite a few new possibilities for v15, even if it doesn't include data integrity validation beyond our existing checksums and doesn't encrypt hint bits.

Thanks, Stephen
On Tue, May 25, 2021 at 04:29:08PM -0400, Stephen Frost wrote:
> Greetings,
>
> On Tue, May 25, 2021 at 14:56 Bruce Momjian <bruce@momjian.us> wrote:
> > On Tue, May 25, 2021 at 02:25:21PM -0400, Robert Haas wrote:
> > > One question here is whether we're comfortable saying that the nonce
> > > is entirely constant. I wasn't sure about that. It seems possible to
> > > me that different encryption algorithms might want nonces of different
> > > sizes, either now or in the future. I am not a cryptographer, but that
> > > seemed like a bit of a limiting assumption. So Bharath and I decided
> > > to make the POC cater to a fully variable-size nonce rather than
> > > zero-or-some-constant. However, if the consensus is that
> > > zero-or-some-constant is better, fair enough! The patch can certainly
> > > be adjusted to work that way.
> >
> > A 16-byte nonce is sufficient for AES, and I doubt we will need anything
> > stronger than AES256 anytime soon. Making the nonce variable length
> > seems to just add complexity for little purpose.
>
> I’d like to review this more and make sure using the special space is possible,
> but if it is, then it opens up a huge new possibility: we could use it for
> both the nonce AND an appropriately sized tag, giving us integrity along with
> encryption, which would be a very significant additional feature. I’d
> considered using a fork instead, but having it on the page would be far better.

We already discussed that there are too many other ways to break system integrity that are not encrypted/integrity-checked, e.g., changes to clog. Do you disagree?

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, May 25, 2021 at 05:04:50PM -0400, Stephen Frost wrote:
> > Now, if we want to consult some security experts and have them tell us
> > the hint bit visibility is not a problem, we could get by without using a
> > new nonce for hint bit changes, and in that case it doesn't matter if we
> > have a separate LSN or custom nonce --- it doesn't get changed for hint
> > bit changes.
>
> I do think it's reasonable to consider having hint bits not included in
> the encrypted part of the page and therefore remove the need to produce
> a new nonce for each hint bit change. Naturally, there's always an
> increased risk when any data in the system isn't encrypted, but given
> the other parts of the system which aren't being encrypted as part of
> this effort, it hardly seems like a significant increase of overall risk.
> I don't believe that any of the auditors and security teams I've
> discussed TDE with would have issue with hint bits not being encrypted -
> the principal concern has always been the primary data.

OK, this is good to know. I know the never-reuse rule, so it is good to know it can be relaxed for certain data without causing problems in other places. Should I modify my patch to do this?

FYI, technically, the hint bit is still encrypted, but could _flip_ in the encrypted file if changed, so that's why we say it is visible. If we used a block cipher instead of a streaming one (CTR), this might not work, because the earlier blocks can be based on the output of later blocks.

> Naturally, being able to encrypt more, and to provide more data
> integrity validation, may open up the possibility for PG to
> be used in even more places, which argues for having some way of making
> these choices be options which a user could decide at initdb time, or at
> least contemplating a road map to where we could offer users the option
> to have other parts of the system be encrypted and ideally have data
> integrity checks, but I don't think we necessarily have to solve
> everything right now in that regard - just having TDE in some form will
> open up quite a few new possibilities for v15, even if it doesn't
> include data integrity validation beyond our existing checksums and
> doesn't encrypt hint bits.

I am thinking full-filesystem encryption should still be used by people needing that. I am concerned that if we add too many restrictions/additions on this feature, it will not be very useful.

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 01:54:21PM -0700, Andres Freund wrote:
> > On 2021-05-25 15:34:04 -0400, Bruce Momjian wrote:
> > > My point is that we have to full-page-write the cases where we change the
> > > nonce --- we get a new LSN/nonce for free if we are using the LSN as the
> > > nonce. What has made this approach much easier is that you basically
> > > tie a change of the nonce to require a change of LSN, since you are WAL
> > > logging it and every nonce change has to be full-page-write WAL logged.
> > > This makes the LSN-as-nonce less fragile to breakage than a custom
> > > nonce, in my opinion, which may explain why my patch is so small.
> >
> > This disregards that we need to be able to increment nonces on standbys
> > / during crash recovery.
> >
> > It may look like that's not needed, with a (wrong!) argument like: the
> > only writes come from crash recovery, which are always associated with a
> > WAL record, guaranteeing nonce increases. Hint bits are not an issue
> > because they don't mark the buffer dirty.
> >
> > But unfortunately that analysis is wrong. Consider the following
> > sequence:
> >
> > 1) replay record LSN X affecting page Y (FPI replay)
> > 2) write out Y, encrypt Y using X as nonce
> > 3) crash
> > 4) replay record LSN X affecting page Y (FPI replay)
> > 5) hint bit update to Y, resulting in Y'
> > 6) write out Y', encrypt Y' using X as nonce
> >
> > While 5) did not mark the page as dirty, it still modified the page
> > contents. Which means that we'd encrypt different content with the same
> > nonce - which is not allowed.
> >
> > I'm pretty sure that there are several other ways to end up with page
> > contents that differ, despite the LSN not changing.
>
> Yes, I can see that happening. I think occasional leakage of hint bit
> changes is acceptable. We might decide they are all acceptable.

I don't think that I agree with the idea that this would ultimately only leak the hint bits - I'm fairly sure that this would make it relatively trivial for an attacker to deduce the contents of the entire 8k page. I don't know that we should be willing to accept that as a part of regular operation (which we generally view crashes as being). I had thought there was something in place to address this, though. If not, it does seem like there should be.

Thanks, Stephen
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 04:29:08PM -0400, Stephen Frost wrote:
> > On Tue, May 25, 2021 at 14:56 Bruce Momjian <bruce@momjian.us> wrote:
> > > On Tue, May 25, 2021 at 02:25:21PM -0400, Robert Haas wrote:
> > > > One question here is whether we're comfortable saying that the nonce
> > > > is entirely constant. I wasn't sure about that. It seems possible to
> > > > me that different encryption algorithms might want nonces of different
> > > > sizes, either now or in the future. I am not a cryptographer, but that
> > > > seemed like a bit of a limiting assumption. So Bharath and I decided
> > > > to make the POC cater to a fully variable-size nonce rather than
> > > > zero-or-some-constant. However, if the consensus is that
> > > > zero-or-some-constant is better, fair enough! The patch can certainly
> > > > be adjusted to work that way.
> > >
> > > A 16-byte nonce is sufficient for AES, and I doubt we will need anything
> > > stronger than AES256 anytime soon. Making the nonce variable length
> > > seems to just add complexity for little purpose.
> >
> > I’d like to review this more and make sure using the special space is possible,
> > but if it is, then it opens up a huge new possibility: we could use it for
> > both the nonce AND an appropriately sized tag, giving us integrity along with
> > encryption, which would be a very significant additional feature. I’d
> > considered using a fork instead, but having it on the page would be far better.
>
> We already discussed that there are too many other ways to break system
> integrity that are not encrypted/integrity-checked, e.g., changes to
> clog. Do you disagree?

We had agreed that this wasn't something that was strictly required in the first version and I continue to agree with that. On the other hand, if we decide that we ultimately need to use an independent nonce, and further that we can make room in the special space for it, then it's trivial to also include the tag, and we absolutely should (or make it optional to do so) in that case.

Thanks, Stephen
On Tue, May 25, 2021 at 05:14:24PM -0400, Stephen Frost wrote:
> * Bruce Momjian (bruce@momjian.us) wrote:
> > Yes, I can see that happening. I think occasional leakage of hint bit
> > changes is acceptable. We might decide they are all acceptable.
>
> I don't think that I agree with the idea that this would ultimately only
> leak the hint bits - I'm fairly sure that this would make it relatively
> trivial for an attacker to deduce the contents of the entire
> 8k page. I don't know that we should be willing to accept that as a
> part of regular operation (which we generally view crashes as being). I
> had thought there was something in place to address this, though. If
> not, it does seem like there should be.

Uh, can you please explain more? Would the hint bits leak? In another email you said hint bit leaking was OK.

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, May 25, 2021 at 05:15:55PM -0400, Stephen Frost wrote:
> > We already discussed that there are too many other ways to break system
> > integrity that are not encrypted/integrity-checked, e.g., changes to
> > clog. Do you disagree?
>
> We had agreed that this wasn't something that was strictly required in
> the first version and I continue to agree with that. On the other hand,
> if we decide that we ultimately need to use an independent nonce, and
> further that we can make room in the special space for it, then it's
> trivial to also include the tag, and we absolutely should (or make it
> optional to do so) in that case.

Well, if we can't really say the data has integrity, what do the validation bytes accomplish? And if we are going to encrypt everything that would allow integrity, we would need to encrypt almost the entire file system.

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 05:04:50PM -0400, Stephen Frost wrote:
> > > Now, if we want to consult some security experts and have them tell us
> > > the hint bit visibility is not a problem, we could get by without using a
> > > new nonce for hint bit changes, and in that case it doesn't matter if we
> > > have a separate LSN or custom nonce --- it doesn't get changed for hint
> > > bit changes.
> >
> > I do think it's reasonable to consider having hint bits not included in
> > the encrypted part of the page and therefore remove the need to produce
> > a new nonce for each hint bit change. Naturally, there's always an
> > increased risk when any data in the system isn't encrypted, but given
> > the other parts of the system which aren't being encrypted as part of
> > this effort, it hardly seems like a significant increase of overall risk.
> > I don't believe that any of the auditors and security teams I've
> > discussed TDE with would have issue with hint bits not being encrypted -
> > the principal concern has always been the primary data.
>
> OK, this is good to know. I know the never-reuse rule, so it is good to
> know it can be relaxed for certain data without causing problems in
> other places. Should I modify my patch to do this?

Err, to be clear, I was saying that we could exclude the hint bits *entirely* from what's being encrypted, and I don't think that would be a huge issue. We still absolutely need to continue to implement a never-reuse rule when it comes to nonces, and to make sure that we don't encrypt different sets of data with the same key+nonce; it's just that if we exclude the hint bits from encryption then we don't need to worry about making sure to use a different nonce each time the hint bits change - because they're no longer relevant.

> FYI, technically, the hint bit is still encrypted, but could _flip_ in
> the encrypted file if changed, so that's why we say it is visible. If
> we used a block cipher instead of a streaming one (CTR), this might not
> work, because the earlier blocks can be based on the output of later
> blocks.

No, in what I'm talking about, the hint bits would be entirely excluded and therefore not encrypted. I don't think we should keep the hint bits as part of what's encrypted but not increase the nonce; that's dangerous imv.

> > Naturally, being able to encrypt more, and to provide more data
> > integrity validation, may open up the possibility for PG to
> > be used in even more places, which argues for having some way of making
> > these choices be options which a user could decide at initdb time, or at
> > least contemplating a road map to where we could offer users the option
> > to have other parts of the system be encrypted and ideally have data
> > integrity checks, but I don't think we necessarily have to solve
> > everything right now in that regard - just having TDE in some form will
> > open up quite a few new possibilities for v15, even if it doesn't
> > include data integrity validation beyond our existing checksums and
> > doesn't encrypt hint bits.
>
> I am thinking full-filesystem encryption should still be used by people
> needing that. I am concerned that if we add too many
> restrictions/additions on this feature, it will not be very useful.

I disagree in the long term, but I'm fine with paring down what we specifically work to address for v15.

Thanks, Stephen
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 05:14:24PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > Yes, I can see that happening. I think occasional leakage of hint bit
> > > changes is acceptable. We might decide they are all acceptable.
> >
> > I don't think that I agree with the idea that this would ultimately only
> > leak the hint bits - I'm fairly sure that this would make it relatively
> > trivial for an attacker to deduce the contents of the entire
> > 8k page. I don't know that we should be willing to accept that as a
> > part of regular operation (which we generally view crashes as being). I
> > had thought there was something in place to address this, though. If
> > not, it does seem like there should be.
>
> Uh, can you please explain more? Would the hint bits leak? In another
> email you said hint bit leaking was OK.

See my recent email; I think I clarified it well over there.

Thanks, Stephen
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 05:15:55PM -0400, Stephen Frost wrote:
> > > We already discussed that there are too many other ways to break system
> > > integrity that are not encrypted/integrity-checked, e.g., changes to
> > > clog. Do you disagree?
> >
> > We had agreed that this wasn't something that was strictly required in
> > the first version and I continue to agree with that. On the other hand,
> > if we decide that we ultimately need to use an independent nonce, and
> > further that we can make room in the special space for it, then it's
> > trivial to also include the tag, and we absolutely should (or make it
> > optional to do so) in that case.
>
> Well, if we can't really say the data has integrity, what do the
> validation bytes accomplish? And if we are going to encrypt everything
> that would allow integrity, we would need to encrypt almost the entire file
> system.

I'm not following this logic. The primary data would be guaranteed to be unchanged, and there is absolutely value in that, even if the metadata is not guaranteed to be unmolested. Security always comes with a lot of tradeoffs. RLS doesn't prevent certain side-channel attacks, but it still is extremely useful in a great many cases.

Thanks, Stephen
On Tue, May 25, 2021 at 05:22:43PM -0400, Stephen Frost wrote:
> * Bruce Momjian (bruce@momjian.us) wrote:
> > OK, this is good to know. I know the never-reuse rule, so it is good to
> > know it can be relaxed for certain data without causing problems in
> > other places. Should I modify my patch to do this?
>
> Err, to be clear, I was saying that we could exclude the hint bits
> *entirely* from what's being encrypted, and I don't think that would be a
> huge issue. We still absolutely need to continue to implement a
> never-reuse rule when it comes to nonces, and to make sure that we don't
> encrypt different sets of data with the same key+nonce; it's just that
> if we exclude the hint bits from encryption then we don't need to worry
> about making sure to use a different nonce each time the hint bits
> change - because they're no longer relevant.

So, let me ask --- I thought CTR basically took an encrypted stream of bits and XOR'ed them with the data. If that is true, then why are changing hint bits a problem? We already can see some of the bit stream by knowing some bytes of the page. I do think skipping encryption of just the hint bits is more complex, so I want to understand why it is needed. (This is a question I eventually wanted to discuss, just like my XXX questions.)

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, May 25, 2021 at 05:25:36PM -0400, Stephen Frost wrote:
> Greetings,
>
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Tue, May 25, 2021 at 05:15:55PM -0400, Stephen Frost wrote:
> > > > We already discussed that there are too many other ways to break system
> > > > integrity that are not encrypted/integrity-checked, e.g., changes to
> > > > clog. Do you disagree?
> > >
> > > We had agreed that this wasn't something that was strictly required in
> > > the first version and I continue to agree with that. On the other hand,
> > > if we decide that we ultimately need to use an independent nonce, and
> > > further that we can make room in the special space for it, then it's
> > > trivial to also include the tag, and we absolutely should (or make it
> > > optional to do so) in that case.
> >
> > Well, if we can't really say the data has integrity, what do the
> > validation bytes accomplish? And if we are going to encrypt everything
> > that would allow integrity, we would need to encrypt almost the entire file
> > system.
>
> I'm not following this logic. The primary data would be guaranteed to
> be unchanged, and there is absolutely value in that, even if the metadata
> is not guaranteed to be unmolested. Security always comes with a lot of
> tradeoffs. RLS doesn't prevent certain side-channel attacks, but it
> still is extremely useful in a great many cases.

Well, changing the clog would change how the integrity-protected data is interpreted, so I don't see much value in it.

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Hi,

On 2021-05-25 16:34:10 -0400, Stephen Frost wrote:
> The nonce does need to be absolutely unique for a given encryption key and
> therefore needs to be global in some form.

You can achieve that without a global counter though, by combining a per-relation nonce with some local counter.

I'm doubtful it's worth it though - compared to all the other costs, one shared atomic increment is a pretty OK price to pay, I think.

Greetings, Andres Freund
Hi,

On 2021-05-25 17:04:50 -0400, Stephen Frost wrote:
> I do think it's reasonable to consider having hint bits not included in
> the encrypted part of the page and therefore remove the need to produce
> a new nonce for each hint bit change.

Huh. How are you going to track that efficiently? Do you want to mask them out before writing? As far as I understand, you can't just re-encrypt a page with the same nonce but different contents without leaking information that must not be leaked, even if the differences are not of a secret nature.

I don't think hint bits are the only way to end up needing to re-write a page with slightly different content, but the same LSN, during recovery after a crash. I think it's just not going to fly to use LSNs as nonces, and it's not worth butchering all kinds of aspects of the system to make it appear to work.

Greetings, Andres Freund
Hi,

On 2021-05-25 17:22:43 -0400, Stephen Frost wrote:
> Err, to be clear, I was saying that we could exclude the hint bits
> *entirely* from what's being encrypted.

It's a *huge* issue. For one, the computational effort of doing so would be a problem. But there's a more fundamental issue: We don't even know the type of the page at the time we write data out! We can't do a lookup of pg_class in the checkpointer to see whether the page is a heap page where we need to mask out hint bits.

Greetings, Andres Freund
Hi,

On 2021-05-25 17:29:03 -0400, Bruce Momjian wrote:
> So, let me ask --- I thought CTR basically took an encrypted stream of
> bits and XOR'ed them with the data. If that is true, then why are
> changing hint bits a problem? We already can see some of the bit stream
> by knowing some bytes of the page.

A *single* reuse of the nonce in CTR reveals nearly all of the plaintext. As you say, the data is XORed with the key stream. Reusing the nonce means that you reuse the key stream. Which in turn allows you to do:

  (data ^ stream) ^ (data' ^ stream)

which can be simplified to

  (data ^ data')

thereby leaking all of data except the difference between data and data'. That's why it's so crucial to ensure that the stream *always* differs between two rounds of encrypting "related" data. We can't just "hope" that data doesn't change and use CTR.

Greetings, Andres Freund
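To make the leak concrete, here is a small self-contained C sketch. The fixed byte array stands in for the AES-CTR keystream; with a real cipher, reusing key+nonce reproduces the same keystream, so the effect is identical. The sample page contents are invented for illustration:

    /*
     * Demonstrates the CTR nonce-reuse leak: XORing two ciphertexts that
     * were encrypted with the same keystream yields data ^ data', with no
     * knowledge of the key at all.
     */
    #include <stdio.h>

    int
    main(void)
    {
        const unsigned char stream[16] = "0123456789abcdef"; /* keystream */
        const unsigned char data[16]   = "tuple: balance=1"; /* write 1 */
        const unsigned char data2[16]  = "tuple: balance=9"; /* write 2 */
        unsigned char c1[16], c2[16], leak[16];

        for (int i = 0; i < 16; i++)
        {
            c1[i] = data[i] ^ stream[i];    /* first write of the page */
            c2[i] = data2[i] ^ stream[i];   /* rewrite, same key+nonce */
            leak[i] = c1[i] ^ c2[i];        /* equals data[i] ^ data2[i] */
        }

        /* Zero bytes mean the plaintext byte is identical across both
         * writes; nonzero bytes pinpoint exactly what changed, which is
         * the starting point for crib-dragging the rest of the page. */
        for (int i = 0; i < 16; i++)
            printf("%02x ", leak[i]);
        printf("\n");
        return 0;
    }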
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 05:22:43PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > OK, this is good to know. I know the never-reuse rule, so it is good to
> > > know it can be relaxed for certain data without causing problems in
> > > other places. Should I modify my patch to do this?
> >
> > Err, to be clear, I was saying that we could exclude the hint bits
> > *entirely* from what's being encrypted, and I don't think that would be a
> > huge issue. We still absolutely need to continue to implement a
> > never-reuse rule when it comes to nonces, and to make sure that we don't
> > encrypt different sets of data with the same key+nonce; it's just that
> > if we exclude the hint bits from encryption then we don't need to worry
> > about making sure to use a different nonce each time the hint bits
> > change - because they're no longer relevant.
>
> So, let me ask --- I thought CTR basically took an encrypted stream of
> bits and XOR'ed them with the data. If that is true, then why are
> changing hint bits a problem? We already can see some of the bit stream
> by knowing some bytes of the page. I do think skipping encryption of
> just the hint bits is more complex, so I want to understand why it is
> needed. (This is a question I eventually wanted to discuss, just like
> my XXX questions.)

That's how CTR works, yes. The issue that you run into is that once you've got two pages which have different data but were encrypted with the same key and nonce, you can use crib-dragging. A good example of how this works is here:

http://travisdazell.blogspot.com/2012/11/many-time-pad-attack-crib-drag.html

Once you've got the two different pages which had the same key+nonce used, you can XOR them together and then start cribbing, scanning the page for legitimate data which doesn't have to be in the part of the data that was different between the two original pages.

Not sure what you're referring to in the second half ... simply knowing that some of the data has a given plaintext (such as having a really good idea that the word 'the' exists in a given message) doesn't provide you the same level of information as two pages encrypted with the same key+nonce but having different data. Indeed, AES is generally believed to be quite effective against even given-plaintext attacks:

https://math.stackexchange.com/questions/51960/is-it-possible-to-guess-an-aes-key-from-a-series-of-messages-encrypted-with-that/57428

Thanks, Stephen
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, May 25, 2021 at 05:25:36PM -0400, Stephen Frost wrote: > > * Bruce Momjian (bruce@momjian.us) wrote: > > > On Tue, May 25, 2021 at 05:15:55PM -0400, Stephen Frost wrote: > > > > > We already discussed that there are too many other ways to break system > > > > > integrity that are not encrypted/integrity-checked, e.g., changes to > > > > > clog. Do you disagree? > > > > > > > > We had agreed that this wasn't something that was strictly required in > > > > the first version and I continue to agree with that. On the other hand, > > > > if we decide that we ultimately need to use an independent nonce and > > > > further that we can make room in the special space for it, then it's > > > > trivial to also include the tag and we absolutely should (or make it > > > > optional to do so) in that case. > > > > > > Well, if we can't really say the data has integrity, what do the > > > validation bytes accomplish? And if we are going to encrypt everything > > > that would allow integrity, we need to encrypt almost the entire file > > > system. > > > > I'm not following this logic. The primary data would be guaranteed to > > be unchanged and there is absolutely value in that, even if the metadata > > is not guaranteed to be unmolested. Security always comes with a lot of > > tradeoffs. RLS doesn't prevent certain side-channel attacks but it > > still is extremely useful in a great many cases. > > Well, changing the clog would change how the integrity-protected data is > interpreted, so I don't see much value in it. I hate to have to say it, but no, it's simply not correct to presume that the ability to manipulate any data means that it's not valuable to protect anything. Further, while clog could be manipulated today, hopefully one day it would become quite difficult to do so. I'm not asking for that today, or to be in v15, but if we do come down on the side of making space in the special area for a nonce, then, even if you don't feel it's useful, I would strongly argue that there should be an option for space for a tag as well. Even if your claim that it's useless until clog is addressed were correct, which I dispute, surely if we do one day have such validation of clog we would also need a tag in the regular user pages, so why not add the option while it's easy to do and let users decide if it's useful to them or not? This does presume that we ultimately agree on the approach which involves the special area, of course. Thanks, Stephen
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2021-05-25 16:34:10 -0400, Stephen Frost wrote: > > The nonce does need to be absolutely unique for a given encryption key and > > therefore needs to be global in some form. > > You can achieve that without a global counter though, by prepending a > per-relation nonce with some local counter. > > I'm doubtful it's worth it though - compared to all the other costs, one > shared atomic increment is pretty OK price to pay I think. Yes, I tend to agree. Thanks, Stephen
On 2021-05-25 19:48:54 -0400, Stephen Frost wrote: > That's how CTR works, yes. The issue that you run into is that once > you've got two pages which have different data but were encrypted with > the same key and nonce then you can use crib-dragging. > > A good example of how this works is here: > > http://travisdazell.blogspot.com/2012/11/many-time-pad-attack-crib-drag.html > > Once you've got the two different pages which had the same key+nonce > used, you can XOR them together and then start cribbing, scanning the > page for legitimate data which doesn't have to be in the part of the > data that was different between the two original pages. IOW, purely hint bit changes are the *dream* case for an attacker, because any difference can just be ignored. All an attacker has to do is to look at the writes, see if an IV repeats for a block, and the attacker will get the *entire* page's worth of data. Either minus the hint bits (which are irrelevant), or, with a trivial bit of inference, even those (because hint bits can only change in one direction). Greetings, Andres Freund
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2021-05-25 17:04:50 -0400, Stephen Frost wrote: > > I do think it's reasonable to consider having hint bits not included in > > the encrypted part of the page and therefore remove the need to produce > > a new nonce for each hint bit change. > > Huh. How are you going to track that efficiently? Do you want to mask > them out before writing? As far as I understand you can't just > re-encrypt a page with the same nonce, but different contents, without > leaking information that you can't have leaked, even if the differences > are not of a secret nature. The simple thought I had was masking them out, yes. No, you can't re-encrypt a different page with the same nonce. (Re-encrypting the exact same page with the same nonce, however, just yields the same cryptotext and therefore is fine). > I don't think hint bits are the only way to end up with needing to > re-write a page with slightly different content, but the same LSN, > during recovery, after a crash. Any other cases would have to be addressed if we were to use LSNs, of course. > I think it's just not going to fly to use LSNs as nonces, and that it's > not worth butchering all kinds of aspect of the system to make it appear > to work. I do agree that we'd want to avoid "butchering all kinds of aspects of the system" if possible. :) Thanks! Stephen
On 2021-05-25 17:15:55 -0400, Stephen Frost wrote: > * Bruce Momjian (bruce@momjian.us) wrote: > > We already discussed that there are too many other ways to break system > > integrity that are not encrypted/integrity-checked, e.g., changes to > > clog. Do you disagree? > > We had agreed that this wasn't something that was strictly required in > the first version and I continue to agree with that. On the other hand, > if we decide that we ultimately need to use an independent nonce and > further that we can make room in the special space for it, then it's > trivial to also include the tag and we absolutely should (or make it > optional to do so) in that case. The page formats for clog and for relation data are unrelated.
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2021-05-25 17:22:43 -0400, Stephen Frost wrote: > > Err, to be clear, I was saying that we could exclude the hint bits > > *entirely* from what's being encrypted and I don't think that would be a > > huge issue. > > It's a *huge* issue. For one, the computational effort of doing so would > be a problem. But there's a more fundamental issue: We don't even know > the type of the page at the time we write data out! We can't do a lookup > of pg_class in the checkpointer to see whether the page is a heap page > where we need to mask out hint bits. Yeah, I hadn't been contemplating the challenge in figuring out if the changes were hint bit changes or if it was some other page- merely reflecting on the question of if hint bits, themselves, could possibly be excluded. Thanks, Stephen
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2021-05-25 17:15:55 -0400, Stephen Frost wrote: > > * Bruce Momjian (bruce@momjian.us) wrote: > > > We already discussed that there are too many other ways to break system > > > integrity that are not encrypted/integrity-checked, e.g., changes to > > > clog. Do you disagree? > > > > We had agreed that this wasn't something that was strictly required in > > the first version and I continue to agree with that. On the other hand, > > if we decide that we ultimately need to use an independent nonce and > > further that we can make room in the special space for it, then it's > > trivial to also include the tag and we absolutely should (or make it > > optional to do so) in that case. > > The page formats for clog and for relation data are unrelated. Indeed they are, but that's not relevant to the thrust of this specific debate. Bruce is arguing that because clog is unprotected that it's not useful to protect relation data, with regard to data integrity validation as provided by AES-GCM using/storing tags. I dispute this, as relation data is primary data while clog, for all its value, is still metadata. Yes, impacting the metadata has an impact on the primary data, but it doesn't *change* that primary data at its core (and it's also more likely to be detected than random bit flipping in the relation data would be, which is possible if you're only encrypting and not providing any integrity validation). Thanks, Stephen
On Tue, May 25, 2021 at 08:03:14PM -0400, Stephen Frost wrote: > Indeed they are, but that's not relevant to the thrust of this specific > debate. > > Bruce is arguing that because clog is unprotected that it's not useful > to protect relation data, with regard to data integrity validation as > provided by AES-GCM using/storing tags. I dispute this, as relation > data is primary data while clog, for all its value, is still metadata. > Yes, impacting the metadata has an impact on the primary data, but it > doesn't *change* that primary data at its core (and it's also more > likely to be detected than random bit flipping in the relation data > would be, which is possible if you're only encrypting and not providing > any integrity validation). Even if you can protect clog, this documentation paragraph makes it clear that if you can modify the cluster, you can weaken security enough to read and write any data you want: https://github.com/postgres/postgres/compare/master..bmomjian:_cfe-01-doc.patch Cluster file encryption does not protect against unauthorized file system writes. Such writes can allow data decryption if used to weaken the system's security and the weakened system is later supplied with the externally-stored cluster encryption key. This also does not always detect if users with write access remove or modify database files. I know of no way to make that safer, so again, I don't see the value in modification detection. Maybe someday we would find a way, but it seems so remote as to not warrant consideration. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote: > Hi, > > On 2021-05-25 17:29:03 -0400, Bruce Momjian wrote: > > So, let me ask --- I thought CTR basically took an encrypted stream of > > bits and XOR'ed them with the data. If that is true, then why are > > changing hint bits a problem? We already can see some of the bit stream > > by knowing some bytes of the page. > > A *single* reuse of the nonce in CTR reveals nearly all of the > plaintext. As you say, the data is XORed with the key stream. Reusing > the nonce means that you reuse the key stream. Which in turn allows you > to do: > (data ^ stream) ^ (data' ^ stream) > which can be simplified to > (data ^ data') > thereby leaking all of data except the difference between data and > data'. That's why it's so crucial to ensure that stream *always* differs > between two rounds of encrypting "related" data. > > We can't just "hope" that data doesn't change and use CTR. My point was about whether we need to change the nonce, and hence WAL-log full page images if we change hint bits. If we don't and reencrypt the page with the same nonce, don't we only expose the hint bits? I was not suggesting we avoid changing the nonce in non-hint-bit cases. I don't understand your computation above. You decrypt the page into shared buffers, you change a hint bit, and rewrite the page. You are re-XOR'ing the buffer copy with the same key and nonce. Doesn't that only change the hint bits in the new write? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, May 25, 2021 at 08:03:14PM -0400, Stephen Frost wrote: > > Indeed they are, but that's not relevant to the thrust of this specific > > debate. > > > > Bruce is arguing that because clog is unprotected that it's not useful > > to protect relation data, with regard to data integrity validation as > > provided by AES-GCM using/storing tags. I dispute this, as relation > > data is primary data while clog, for all its value, is still metadata. > > Yes, impacting the metadata has an impact on the primary data, but it > > doesn't *change* that primary data at its core (and it's also more > > likely to be detected than random bit flipping in the relation data > > would be, which is possible if you're only encrypting and not providing > > any integrity validation). > > Even if you can protect clog, this documentation paragraph makes it > clear that if you can modify the cluster, you can weaken security enough > to read and write any data you want: > > https://github.com/postgres/postgres/compare/master..bmomjian:_cfe-01-doc.patch > > Cluster file encryption does not protect against unauthorized > file system writes. Such writes can allow data decryption if > used to weaken the system's security and the weakened system is > later supplied with the externally-stored cluster encryption key. > This also does not always detect if users with write access remove > or modify database files. This is clearly a different consideration than the concern around clog and speaks to the issues with how we fetch and maintain the key- things which we can and really should be better about than what is currently being done, and which I do believe we will improve upon. > I know of no way to make that safer, so again, I don't see the value in > modification detection. Maybe someday we would find a way, but it seems > so remote as to not warrant consideration. I'm rather baffled by the comment that there's 'no way to make that safer'. Giving users a way to segregate actual data from configuration and commands would greatly improve the situation by making it much more difficult for a user who only has access to the data directory, where much of the data is encrypted and protected against data manipulation using proper tags, to capture the encryption key. The concerns which are not actually discussed in the paragraph above relate to how the key is handled- specifically that we run some external command that the user provides to fetch it, and that command can be overridden via postgresql.auto.conf that lives in the data directory. That's not a terribly safe thing to do and we can certainly do better, and without all that much difficulty if we actually look at doing so. A very simple approach would be to just require that the command to fetch the encryption key come from postgresql.conf and then simply encrypt+protect postgresql.auto.conf. We'd then document that the user needs to ensure they have appropriate protection of postgresql.conf, which could and probably should live elsewhere. I'd like to see us incrementally move in the direction of providing a way for users- probably advanced ones to start, but hopefully eventually anyone- to implement a reasonably secure solution which provides both confidentiality and integrity. 
We do not have to solve all of these things in the first release, but I don't think we should be talking today about tossing out the idea that, some day down the road, we could have a robust system which provides both. Thanks, Stephen
On Tue, May 25, 2021 at 07:48:54PM -0400, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Tue, May 25, 2021 at 05:22:43PM -0400, Stephen Frost wrote: > > > * Bruce Momjian (bruce@momjian.us) wrote: > > > > OK, this is good to know. I know the never-reuse rule, so it is good to > > > > know it can be relaxed for certain data without causing problems in > > > > other places. Should I modify my patch to do this? > > > > > > Err, to be clear, I was saying that we could exclude the hint bits > > > *entirely* from what's being encrypted and I don't think that would be a > > > huge issue. We still absolutely need to continue to implement a > > > never-reuse rule when it comes to nonces and making sure that we don't > > > encrypt different sets of data with the same key+nonce, it's just that > > > if we exclude the hint bits from encryption then we don't need to worry > > > about making sure to use a different nonce each time the hint bits > > > change- because they're no longer relevant. > > > > So, let me ask --- I thought CTR basically took an encrypted stream of > > bits and XOR'ed them with the data. If that is true, then why are > > changing hint bits a problem? We already can see some of the bit stream > > by knowing some bytes of the page. I do think skipping encryption of > > just the hint bits is more complex, so I want to understand why it is > > needed. (This is a question I eventually wanted to discuss, just like > > my XXX questions.) > > That's how CTR works, yes. The issue that you run into is that once > you've got two pages which have different data but were encrypted with > the same key and nonce then you can use crib-dragging. > > A good example of how this works is here: > > http://travisdazell.blogspot.com/2012/11/many-time-pad-attack-crib-drag.html > > Once you've got the two different pages which had the same key+nonce > used, you can XOR them together and then start cribbing, scanning the > page for legitimate data which doesn't have to be in the part of the > data that was different between the two original pages. > > Not sure what you're referring to in the second half ... simply knowing > that some of the data has a given plaintext (such as having a really > good idea that the word 'the' exists in a given message) doesn't provide > you the same level of information as two pages encrypted with the same > key+nonce but having different data. Indeed, AES is generally believed > to be quite effective against even known-plaintext attacks: > > https://math.stackexchange.com/questions/51960/is-it-possible-to-guess-an-aes-key-from-a-series-of-messages-encrypted-with-that/57428 Agreed. I was just reinforcing that, and trying to say that hint bit changes might also be considered known information. Anyway, if you think the hint bit changes would leak, I can accept that. It means we need to WAL-log hint bit changes, no matter if the nonce is the LSN or a custom one. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, May 25, 2021 at 07:48:54PM -0400, Stephen Frost wrote: > > Not sure what you're referring to in the second half ... simply knowing > > that some of the data has a given plaintext (such as having a really > > good idea that the word 'the' exists in a given message) doesn't provide > > you the same level of information as two pages encrypted with the same > > key+nonce but having different data. Indeed, AES is generally believed > > to be quite effective against even given plaintext attacks: > > > > https://math.stackexchange.com/questions/51960/is-it-possible-to-guess-an-aes-key-from-a-series-of-messages-encrypted-with-that/57428 > > Agreed. I was just reinforcing that, and trying to say that hint bit > change might also be considered known information. > > Anyway, if you think the hint bit changes would leak, I an accept that. > It means we need to wal log hit bit changes, no matter if the nonce is > the LSN or a custom one. The nonce needs to be a new one, if we include the hint bits in the set of data which is encrypted. However, what I believe folks are getting at here is that we could keep the LSN the same, but increase the nonce when the hint bits change, but *not* WAL log either the nonce change or the hint bit change (unless it's being logged for some other reason, in which case log both), thus reducing the amount of WAL being produced. What would matter is that both the hint bit change and the new nonce hit disk at the same time, or neither do, or we replay back to some state where the nonce and the hint bits 'match up' so that the page decrypts (and the integrity check works). That generally seems pretty reasonable to me and basically makes the increase in nonce work very much in the same manner that the hint bits themselves do- sometimes it changes even when the LSN doesn't but, in such cases, we don't actually WAL it, and that's ok because we don't actually care about it being updated- what's in the WAL when the page is replayed is perfectly fine and we'll just update the hint bits again when and if we decide we need to based on the actual visibility information at that time. Now, making sure that we don't end up re-using the same nonce over again is a concern and we'd want to address that somehow, as suggested earlier perhaps by simply incrementing it making sure to durably note whenever we'd crossed some threshold (each 1k or whatever) and then on crash recovery making sure we bump past that, but that seems entirely doable. Thanks, Stephen
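The threshold scheme Stephen sketches could look roughly like this (a minimal sketch with hypothetical names; locking and error handling are elided, and persist_high_water() is an assumed helper that durably records the value, e.g. in the control file):

    #include <stdint.h>

    #define NONCE_FLUSH_INTERVAL 1024	/* durably record every 1k values */

    static uint64_t next_nonce;			/* next counter value to hand out */
    static uint64_t durable_high_water;	/* bound known to be safely on disk */

    extern void persist_high_water(uint64_t value);	/* assumed helper */

    /*
     * Hand out a nonce counter value.  Whenever we reach the durable bound,
     * persist a new bound NONCE_FLUSH_INTERVAL ahead before returning the
     * value, so every value ever handed out is below the recorded bound.
     */
    static uint64_t
    allocate_nonce(void)
    {
    	if (next_nonce >= durable_high_water)
    	{
    		durable_high_water = next_nonce + NONCE_FLUSH_INTERVAL;
    		persist_high_water(durable_high_water);
    	}
    	return next_nonce++;
    }

    /* after a crash, resume at the recorded bound: nothing below it is reused */
    static void
    nonce_startup(uint64_t recorded_bound)
    {
    	next_nonce = recorded_bound;
    	durable_high_water = recorded_bound;	/* forces a persist on first use */
    }

On crash recovery this restarts the counter just past anything that could have been handed out, at the cost of skipping at most NONCE_FLUSH_INTERVAL values, which is harmless for uniqueness.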
On Tue, May 25, 2021 at 09:42:48PM -0400, Stephen Frost wrote: > The nonce needs to be a new one, if we include the hint bits in the set > of data which is encrypted. > > However, what I believe folks are getting at here is that we could keep > the LSN the same, but increase the nonce when the hint bits change, but > *not* WAL log either the nonce change or the hint bit change (unless > it's being logged for some other reason, in which case log both), thus > reducing the amount of WAL being produced. What would matter is that > both the hint bit change and the new nonce hit disk at the same time, or > neither do, or we replay back to some state where the nonce and the hint > bits 'match up' so that the page decrypts (and the integrity check > works). How do we prevent torn pages if we are writing the page with a new nonce, and no WAL-logged full page image? > That generally seems pretty reasonable to me and basically makes the > increase in nonce work very much in the same manner that the hint bits > themselves do- sometimes it changes even when the LSN doesn't but, in > such cases, we don't actually WAL it, and that's ok because we don't > actually care about it being updated- what's in the WAL when the page is > replayed is perfectly fine and we'll just update the hint bits again > when and if we decide we need to based on the actual visibility > information at that time. We get away with this because hint-bit-only changes only modify single bytes on the page, and a single byte can't be torn, but if we change the nonce, the entire page will have different bytes. What am I missing here? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On 2021-05-25 21:51:31 -0400, Bruce Momjian wrote: > How do we prevent torn pages if we are writing the page with a new > nonce, and no WAL-logged full page image? That should only arise if we are guaranteed to replay from a redo point that is followed by at least one FPI for the page we're about to write? - Andres
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, May 25, 2021 at 09:42:48PM -0400, Stephen Frost wrote: > > The nonce needs to be a new one, if we include the hint bits in the set > > of data which is encrypted. > > > > However, what I believe folks are getting at here is that we could keep > > the LSN the same, but increase the nonce when the hint bits change, but > > *not* WAL log either the nonce change or the hint bit change (unless > > it's being logged for some other reason, in which case log both), thus > > reducing the amount of WAL being produced. What would matter is that > > both the hint bit change and the new nonce hit disk at the same time, or > > neither do, or we replay back to some state where the nonce and the hint > > bits 'match up' so that the page decrypts (and the integrity check > > works). > > How do we prevent torn pages if we are writing the page with a new > nonce, and no WAL-logged full page image? err, we'd still WAL the FPI, same as we do for checksums, that's what I would expect and would think we'd need. As long as the FPI is in the WAL since the last checkpoint, later changes to hint bits or the nonce wouldn't matter- we'll replay the FPI and that'll have the right nonce for the hint bits that were part of the FPI. Any subsequent changes to the hint bits wouldn't be WAL'd though and neither would the changes to the nonce and that all should be fine because we'll blow away the entire page on crash recovery to push it back to what it was when we first wrote the page after the last checkpoint. Naturally, other changes which have to be WAL'd would still be done but those would be replayed in shared buffers on top of the prior FPI and the nonce set to some $new value (one which we know couldn't have been used prior, by incrementing by some value) when we go to write out that new page. Thanks, Stephen
On Tue, May 25, 2021 at 09:58:22PM -0400, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Tue, May 25, 2021 at 09:42:48PM -0400, Stephen Frost wrote: > > > The nonce needs to be a new one, if we include the hint bits in the set > > > of data which is encrypted. > > > > > > However, what I believe folks are getting at here is that we could keep > > > the LSN the same, but increase the nonce when the hint bits change, but > > > *not* WAL log either the nonce change or the hint bit change (unless > > > it's being logged for some other reason, in which case log both), thus > > > reducing the amount of WAL being produced. What would matter is that > > > both the hint bit change and the new nonce hit disk at the same time, or > > > neither do, or we replay back to some state where the nonce and the hint > > > bits 'match up' so that the page decrypts (and the integrity check > > > works). > > > > How do we prevent torn pages if we are writing the page with a new > > nonce, and no WAL-logged full page image? > > err, we'd still WAL the FPI, same as we do for checksums, that's what I > would expect and would think we'd need. As long as the FPI is in the > WAL since the last checkpoint, later changes to hint bits or the nonce > wouldn't matter- we'll replay the FPI and that'll have the right nonce > for the hint bits that were part of the FPI. > > Any subsequent changes to the hint bits wouldn't be WAL'd though and > neither would the changes to the nonce and that all should be fine > because we'll blow away the entire page on crash recovery to push it > back to what it was when we first wrote the page after the last > checkpoint. Naturally, other changes which have to be WAL'd would still > be done but those would be replayed in shared buffers on top of the > prior FPI and the nonce set to some $new value (one which we know > couldn't have been used prior, by incrementing by some value) when we go > to write out that new page. OK, I see what you are saying. If we use a nonce that is not the full page write LSN then we can use it for hint bit changes _after_ the first full page write during the checkpoint, and we don't need to WAL log that since it isn't a real LSN and we can throw it away on crash recovery. This is not possible if we are using the LSN for the full page write LSN for the hint bit nonce, though we could use a dummy WAL record to generate an LSN for this, right? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On 2021-05-25 22:11:46 -0400, Bruce Momjian wrote: > This is not possible if we are using the LSN for the full page write LSN > for the hint bit nonce, though we could use a dummy WAL record to > generate an LSN for this, right? We cannot use a dummy WAL record, see my explanation about the standby / crash recovery issues.
Greetings,
On Tue, May 25, 2021 at 22:11 Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, May 25, 2021 at 09:58:22PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > On Tue, May 25, 2021 at 09:42:48PM -0400, Stephen Frost wrote:
> > > > The nonce needs to be a new one, if we include the hint bits in the set
> > > > of data which is encrypted.
> > > >
> > > > However, what I believe folks are getting at here is that we could keep
> > > > the LSN the same, but increase the nonce when the hint bits change, but
> > > > *not* WAL log either the nonce change or the hint bit change (unless
> > > > it's being logged for some other reason, in which case log both), thus
> > > > reducing the amount of WAL being produced. What would matter is that
> > > > both the hint bit change and the new nonce hit disk at the same time, or
> > > > neither do, or we replay back to some state where the nonce and the hint
> > > > bits 'match up' so that the page decrypts (and the integrity check
> > > > works).
> > >
> > > How do we prevent torn pages if we are writing the page with a new
> > > nonce, and no WAL-logged full page image?
> >
> > err, we'd still WAL the FPI, same as we do for checksums, that's what I
> > would expect and would think we'd need. As long as the FPI is in the
> > WAL since the last checkpoint, later changes to hint bits or the nonce
> > wouldn't matter- we'll replay the FPI and that'll have the right nonce
> > for the hint bits that were part of the FPI.
> >
> > Any subsequent changes to the hint bits wouldn't be WAL'd though and
> > neither would the changes to the nonce and that all should be fine
> > because we'll blow away the entire page on crash recovery to push it
> > back to what it was when we first wrote the page after the last
> > checkpoint. Naturally, other changes which have to be WAL'd would still
> > be done but those would be replayed in shared buffers on top of the
> > prior FPI and the nonce set to some $new value (one which we know
> > couldn't have been used prior, by incrementing by some value) when we go
> > to write out that new page.
> OK, I see what you are saying. If we use a nonce that is not the full
> page write LSN then we can use it for hint bit changes _after_ the first
> full page write during the checkpoint, and we don't need to WAL log that
> since it isn't a real LSN and we can throw it away on crash recovery.
> This is not possible if we are using the LSN for the full page write LSN
> for the hint bit nonce, though we could use a dummy WAL record to
> generate an LSN for this, right?
Yes, I think you've got it. To do it using LSNs while ensuring the nonce is always unique, we'd have to generate dummy WAL just to get new LSNs, and that wouldn't be great.
Andres mentioned other possible cases where the LSN doesn't change even though we change the page and, since he's probably right, we would have to figure out a solution in those cases too (potentially including cases like crash recovery or replay on a replica, where we can't really just go around creating dummy WAL records to get new LSNs...). If the nonce isn't the LSN then suddenly those cases are fine: the LSN can stay the same, and it doesn't matter that the nonce is changed when we write out the page during crash recovery, because it's not tied to the WAL/LSN stream.
If I’ve got it right, that does mean that the nonces on the replica might differ from those on the primary though and I’m not completely sure how I feel about that. We might wish to explicitly document that, due to such risk, users should use unique and distinct keys on each replica that are different from the primary and each other (not a bad idea in general anyway, but would be quite important with this strategy).
Thanks,
Stephen
On Tue, May 25, 2021 at 10:23:46PM -0400, Stephen Frost wrote: > If I’ve got it right, that does mean that the nonces on the replica might > differ from those on the primary though and I’m not completely sure how I feel > about that. We might wish to explicitly document that, due to such risk, users > should use unique and distinct keys on each replica that are different from the > primary and each other (not a bad idea in general anyway, but would be quite > important with this strategy). I have to think more about this, but we were planning to allow different primary and replica relation encryption keys to allow for relation key rotation. The WAL key has to be the same for both. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, May 25, 2021 at 09:31:02PM -0400, Bruce Momjian wrote: > I don't understand your computation above. You decrypt the page into > shared buffers, you change a hint bit, and rewrite the page. You are > re-XOR'ing the buffer copy with the same key and nonce. Doesn't that > only change the hint bits in the new write? Can someone explain the hint bit exploit using the process I describe here? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Bruce Momjian <bruce@momjian.us> wrote: > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote: > > Hi, > > > > On 2021-05-25 17:29:03 -0400, Bruce Momjian wrote: > > > So, let me ask --- I thought CTR basically took an encrypted stream of > > > bits and XOR'ed them with the data. If that is true, then why are > > > changing hint bits a problem? We already can see some of the bit stream > > > by knowing some bytes of the page. > > > > A *single* reuse of the nonce in CTR reveals nearly all of the > > plaintext. As you say, the data is XORed with the key stream. Reusing > > the nonce means that you reuse the key stream. Which in turn allows you > > to do: > > (data ^ stream) ^ (data' ^ stream) > > which can be simplified to > > (data ^ data') > > thereby leaking all of data except the difference between data and > > data'. That's why it's so crucial to ensure that stream *always* differs > > between two rounds of encrypting "related" data. > > > > We can't just "hope" that data doesn't change and use CTR. > > My point was about whether we need to change the nonce, and hence > WAL-log full page images if we change hint bits. If we don't and > reencrypt the page with the same nonce, don't we only expose the hint > bits? I was not suggesting we avoid changing the nonce in non-hint-bit > cases. > > I don't understand your computation above. You decrypt the page into > shared buffers, you change a hint bit, and rewrite the page. You are > re-XOR'ing the buffer copy with the same key and nonce. Doesn't that > only change the hint bits in the new write? The way I view things is that the CTR mode encrypts each individual bit, independently of any other bit on the page. For non-hint bits data=data', so (data ^ data') is always zero, regardless of the actual values of the data. So I agree with you that by reusing the nonce we only expose the hint bits. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Tue, May 25, 2021 at 7:58 PM Stephen Frost <sfrost@snowman.net> wrote: > The simple thought I had was masking them out, yes. No, you can't > re-encrypt a different page with the same nonce. (Re-encrypting the > exact same page with the same nonce, however, just yields the same > cryptotext and therefore is fine). In the interest of not being viewed as too much of a naysayer, let me first reiterate that I am generally in favor of TDE going forward and am not looking to throw up unnecessary obstacles in the way of making that happen. That said, I don't see how this particular idea can work. When we want to write a page out to disk, we need to identify which bits in the page are hint bits, so that we can avoid including them in what is encrypted, which seems complicated and expensive. But even worse, when we then read a page back off of disk, we'd need to decrypt everything except for the hint bits, but how do we know which bits are hint bits if the page isn't decrypted yet? We can't annotate an 8kB page that might be full with enough extra information to say where the non-encrypted parts are and still have the result be guaranteed to fit within 8kB. Also, it's not just hint bits per se, but anything that would cause us to use MarkBufferDirtyHint(). For a btree index, per _bt_check_unique and _bt_killitems, that includes the entire line pointer array, because of how ItemIdMarkDead() is used. Even apart from the problem of how decryption would know which things we encrypted and which things we didn't, I really have a hard time believing that it's OK to exclude the entire line pointer array in every btree page from encryption from a security perspective. Among other potential problems, that's leaking all the information an attacker could possibly want to have about where their known plaintext might occur in the page. However, I believe that if we store the nonce in the page explicitly, as proposed here, rather than trying to derive it from the LSN, then we don't need to worry about this kind of masking, which I think is better from both a security perspective and a performance perspective. There is one thing I'm not quite sure about, though. I had previously imagined that each page would have a nonce and we could just do nonce++ each time we write the page. But that doesn't quite work if the standby can do more writes of the same page than the master. One vague idea I have for fixing this is: let each page's 16-byte nonce consist of 8 random bytes and an 8-byte counter that will be incremented on every write. But, the first time a standby writes each page, force a "key rotation" where the 8-byte random value is replaced with a new one, different from what the master is using for that page. Detecting this is a bit expensive, because it probably means we need to store the TLI that last wrote each page on every page too, but maybe it could be made to work; we're talking about a feature that is expensive by nature. However, I'm a little worried about the cryptographic properties of this approach. It would often mean that an attacker who has full filesystem access can get multiple encrypted images of the same data, each encrypted with a different nonce. I don't know whether that's a hazard or not, but it feels like the sort of thing that, if I were a cryptographer, I would be pleased to have. Another idea might be - instead of doing nonce++ every time we write the page, do nonce=random(). 
That's eventually going to repeat a value, but it's extremely likely to take a *super* long time if there are enough bits. A potentially rather large problem, though, is that generating random numbers in large quantities isn't very cheap. Anybody got a better idea? I really like your (Stephen's) idea of including something in the special space that permits integrity checking. One thing that is quite nice about that is we could do it first, as an independent patch, before we did TDE. It would be an independently useful feature, and it would mean that if there are any problems with the code that injects stuff into the special space, we could try to track those down in a non-TDE context. That's really good, because in a TDE context, the pages are going to be garbled and unreadable (we hope, anyway). If we have a problem that we can reproduce with just an integrity-checking token shoved into every page, you can look at the page and try to understand what went wrong. So I really like this direction both from the point of view of improving integrity checking, and also from the point of view of being able to debug problems. Now, one downside of this approach is that if we have the ability to turn integrity-checking tokens on and off, and separately we can turn encryption on and off, then we can't simplify down to two cases as Andres was advocating above; you have to cater to a variety of possible values of how-much-stuff-we-squeezed-into-the-special space. At that point you kind of end up with the approach the draft patches were already taking, which Andres was worried would be expensive. I am not entirely certain, however, that I understand what the proposal is here exactly for integrity verification. I Googled "AES-GCM using/storing tags" but it didn't help me that much, because I don't really know the subject area. A really simple integrity verifier for a page would be to store the db OID, ts OID, relfilenode, and block number in the page, and check them on read, preventing blocks from moving around without us noticing. But I gather that perhaps the idea here is to store something like hash(db_oid||ts_oid||relfilenode||block||block_contents) in each page, basically a beefed-up checksum that is too wide to fake easily. It's probably more complicated than that, though: I admit to having limited knowledge of modern cryptography. -- Robert Haas EDB: http://www.enterprisedb.com
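For illustration, the per-page nonce layout Robert describes might look like this (hypothetical names, not from any posted patch; strong_random() is an assumed helper standing in for whatever source of cryptographically strong randomness is used):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /*
     * Hypothetical 16-byte AES-GCM nonce stored in each page: an 8-byte
     * random "generation" plus an 8-byte write counter.  A standby re-rolls
     * the generation the first time it writes the page, so its counter
     * values can never collide with the primary's.
     */
    typedef struct PageNonce
    {
    	uint64_t	generation;		/* random; re-rolled on "key rotation" */
    	uint64_t	write_count;	/* incremented on every page write */
    } PageNonce;

    extern void strong_random(void *buf, size_t len);	/* assumed helper */

    static void
    nonce_for_next_write(PageNonce *nonce, bool first_write_on_this_server)
    {
    	if (first_write_on_this_server)
    	{
    		/* the detection Robert notes is expensive (e.g. TLI tracking) */
    		strong_random(&nonce->generation, sizeof(nonce->generation));
    		nonce->write_count = 0;
    	}
    	nonce->write_count++;	/* unique within one generation */
    }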
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Tue, May 25, 2021 at 7:58 PM Stephen Frost <sfrost@snowman.net> wrote: > > The simple thought I had was masking them out, yes. No, you can't > > re-encrypt a different page with the same nonce. (Re-encrypting the > > exact same page with the same nonce, however, just yields the same > > cryptotext and therefore is fine). > > In the interest of not being viewed as too much of a naysayer, let me > first reiterate that I am generally in favor of TDE going forward and > am not looking to throw up unnecessary obstacles in the way of making > that happen. Quite glad to hear that. Hopefully we'll all be able to get on the same page to move TDE forward. > That said, I don't see how this particular idea can work. When we want > to write a page out to disk, we need to identify which bits in the > page are hint bits, so that we can avoid including them in what is > encrypted, which seems complicated and expensive. But even worse, when > we then read a page back off of disk, we'd need to decrypt everything > except for the hint bits, but how do we know which bits are hint bits > if the page isn't decrypted yet? We can't annotate an 8kB page that > might be full with enough extra information to say where the > non-encrypted parts are and still have the result be guaranteed to fit > within 8kB. Yeah, Andres pointed that out and it's certainly an issue with this general idea. > Also, it's not just hint bits per se, but anything that would cause us > to use MarkBufferDirtyHint(). For a btree index, per _bt_check_unique > and _bt_killitems, that includes the entire line pointer array, > because of how ItemIdMarkDead() is used. Even apart from the problem > of how decryption would know which things we encrypted and which > things we didn't, I really have a hard time believing that it's OK to > exclude the entire line pointer array in every btree page from > encryption from a security perspective. Among other potential > problems, that's leaking all the information an attacker could > possibly want to have about where their known plaintext might occur in > the page. Also a good point. > However, I believe that if we store the nonce in the page explicitly, > as proposed here, rather than trying to derive it from the LSN, then we > don't need to worry about this kind of masking, which I think is > better from both a security perspective and a performance perspective. > There is one thing I'm not quite sure about, though. I had previously > imagined that each page would have a nonce and we could just do > nonce++ each time we write the page. But that doesn't quite work if > the standby can do more writes of the same page than the master. One > vague idea I have for fixing this is: let each page's 16-byte nonce > consist of 8 random bytes and an 8-byte counter that will be > incremented on every write. But, the first time a standby writes each > page, force a "key rotation" where the 8-byte random value is replaced > with a new one, different from what the master is using for that > page. Detecting this is a bit expensive, because it probably means we > need to store the TLI that last wrote each page on every page too, but > maybe it could be made to work; we're talking about a feature that is > expensive by nature. However, I'm a little worried about the > cryptographic properties of this approach. 
> It would often mean that an > attacker who has full filesystem access can get multiple encrypted > images of the same data, each encrypted with a different nonce. I > don't know whether that's a hazard or not, but it feels like the sort > of thing that, if I were a cryptographer, I would be pleased to have. I do agree that, in general, this is a feature that's expensive to begin with and folks are generally going to be accepting of that. Encrypting the same data with different nonces will produce different results and shouldn't be an issue. The nonces really do need to be unique for a given key though. > Another idea might be - instead of doing nonce++ every time we write > the page, do nonce=random(). That's eventually going to repeat a > value, but it's extremely likely to take a *super* long time if there > are enough bits. A potentially rather large problem, though, is that > generating random numbers in large quantities isn't very cheap. There's specific discussion about how to choose a nonce in NIST publications and using a properly random one that's large enough is one accepted approach, though my recollection was that the preference was to use an incrementing guaranteed-unique nonce and using a random one was more of a "if you can't coordinate using an incrementing one then you can do this". I can try to hunt for the specifics on that though. The issue of getting large amounts of cryptographically random numbers seems very likely to make this not work so well though. > Anybody got a better idea? If we stipulate (and document) that all replicas need their own keys then we no longer need to worry about nonce re-use between the primary and the replica. Not sure that's *better*, per se, but I do think it's worth consideration. Teaching pg_basebackup how to decrypt and then re-encrypt with a different key wouldn't be challenging. > I really like your (Stephen's) idea of including something in the > special space that permits integrity checking. One thing that is quite > nice about that is we could do it first, as an independent patch, > before we did TDE. It would be an independently useful feature, and it > would mean that if there are any problems with the code that injects > stuff into the special space, we could try to track those down in a > non-TDE context. That's really good, because in a TDE context, the > pages are going to be garbled and unreadable (we hope, anyway). If we > have a problem that we can reproduce with just an integrity-checking > token shoved into every page, you can look at the page and try to > understand what went wrong. So I really like this direction both from > the point of view of improving integrity checking, and also from the > point of view of being able to debug problems. I agree with all of this. > Now, one downside of this approach is that if we have the ability to > turn integrity-checking tokens on and off, and separately we can turn > encryption on and off, then we can't simplify down to two cases as > Andres was advocating above; you have to cater to a variety of > possible values of how-much-stuff-we-squeezed-into-the-special space. > At that point you kind of end up with the approach the draft patches > were already taking, which Andres was worried would be expensive. Yes, if the amount of space available is variable then there's an added cost for that. 
While I appreciate the concern about having that be expensive, for my 2c at least, I like to think that having this sudden space that's available for use may lead to other really interesting capabilities beyond the ones we're talking about here, so I'm not really thrilled with the idea of boiling it down to just two cases. > I am not entirely certain, however, that I understand what the > proposal is here exactly for integrity verification. I Googled > "AES-GCM using/storing tags" but it didn't help me that much, because > I don't really know the subject area. A really simple integrity > verifier for a page would be to store the db OID, ts OID, relfilenode, > and block number in the page, and check them on read, preventing > blocks from moving around without us noticing. But I gather that > perhaps the idea here is to store something like > hash(db_oid||ts_oid||relfilenode||block||block_contents) in each page, > basically a beefed-up checksum that is too wide to fake easily. It's > probably more complicated than that, though: I admit to having limited > knowledge of modern cryptography. Happy to help on this bit. Probably the simplest way to explain what's going on here is that you have two functions- encrypt and decrypt. The encrypt function takes: (key, nonce, plaintext) and returns (ciphertext, tag). The decrypt function takes: (key, nonce, ciphertext, tag) and returns: (plaintext) ... OR an error saying "data integrity check failed". As an example, here's a test case from NIST for AES GCM *encryption*: Key = 31bdadd96698c204aa9ce1448ea94ae1fb4a9a0b3c9d773b51bb1822666b8f22 IV = 0d18e06c7c725ac9e362e1ce PT = 2db5168e932556f8089a0622981d017d AAD = CT = fa4362189661d163fcd6a56d8bf0405a Tag = d636ac1bbedd5cc3ee727dc2ab4a9489 key/IV (aka nonce)/PT are inputs, CT and Tag are outputs. Then an example for AES GCM *decryption*: Key = 4c8ebfe1444ec1b2d503c6986659af2c94fafe945f72c1e8486a5acfedb8a0f8 IV = 473360e0ad24889959858995 CT = d2c78110ac7e8f107c0df0570bd7c90c AAD = Tag = c26a379b6d98ef2852ead8ce83a833a7 PT = 7789b41cb3ee548814ca0b388c10b343 Key/IV/CT/Tag are inputs, PT is the output ... but, a more interesting one when considering the tag is: Key = c997768e2d14e3d38259667a6649079de77beb4543589771e5068e6cd7cd0b14 IV = 835090aed9552dbdd45277e2 CT = 9f6607d68e22ccf21928db0986be126e AAD = Tag = f32617f67c574fd9f44ef76ff880ab9f FAIL Again, Key/IV/CT/Tag are inputs, but there's no PT output and instead you just get FAIL and that's because the data integrity check failed. Exactly how the tag is generated is discussed here if you're really curious: https://en.wikipedia.org/wiki/Galois/Counter_Mode but the gist of that is that it's done as part of the encryption. Note that you can include additional data beyond just what you're encrypting in the tag. In our case, we would probably include the LSN, which would mean that the LSN would be confirmed to be correct additional information that wasn't actually encrypted. The "AAD" above is "Additional Authenticated Data". One thing to be absolutely clear about here though is that simply taking a hash() of the ciphertext and storing that with the data does *not* provide cryptographic data integrity validation for the page because it doesn't involve the actual key or IV at all and the hash is done after the ciphertext is generated- therefore an attacker can change the data and just change the hash to match and you'd never know. 
Now, when it comes to hashing the *plaintext* data and storing that, you have to be careful there because you can very easily fall into the trap of giving away information about the plaintext data that way if an attacker can reason about what the plaintext might look like. If I know the block contains just a single English word and all we've done is sha256'd it then I can just run sha256 on all English words and figure out what it is, so to protect the data you need to incorporate the key, nonce, etc., somehow into the hash (that is- something that should be very hard for the attacker to discover) and suddenly you're doing what AES-GCM *already* does for you, except you're trying to hack it yourself instead of using the tools available which were written by experts. The way that I tend to look at this area is that everyone used to try and do encryption and data integrity independently and the result was a bunch of different implementations, some good, some bad (and therefore leaked sensitive information) and the crypto folks basically said "ok, let's take the *good* implementations and bake that in, because otherwise people are going to just keep screwing up and using bad approaches for this." What this means for your proposal above is that the actual data validation information will be generated in two different ways depending on if we're using AES-GCM and doing TDE, or if we're doing just the data validation piece and not encrypting anything. That's maybe not ideal but I don't think it's a huge issue either and your proposal will still address the question of if we end up missing anything when it comes to how the special area is handled throughout the code. If it'd help, I'd be happy to jump on a call to discuss further. Also happy to continue on this thread too, of course. Thanks, Stephen
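For readers unfamiliar with the API, here is a compact OpenSSL sketch of the encrypt/decrypt-with-tag flow Stephen describes (error handling stripped, buffer sizes arbitrary, and using the LSN as AAD is just one illustrative choice, not the patch's actual code):

    #include <openssl/evp.h>

    /* returns 1 if the round trip succeeds, 0 if the tag check FAILs */
    static int
    gcm_roundtrip(const unsigned char key[32], const unsigned char iv[12],
                  const unsigned char *pt, int pt_len,
                  const unsigned char *aad, int aad_len)
    {
    	unsigned char ct[8192], out[8192], tag[16];
    	int			len, ok;
    	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

    	/* encrypt: produces both ciphertext and authentication tag */
    	EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);
    	EVP_EncryptUpdate(ctx, NULL, &len, aad, aad_len);	/* AAD, e.g. the LSN */
    	EVP_EncryptUpdate(ctx, ct, &len, pt, pt_len);
    	EVP_EncryptFinal_ex(ctx, ct + len, &len);
    	EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);

    	/* decrypt: flipping any bit of ct, aad or tag makes this fail */
    	EVP_DecryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);
    	EVP_DecryptUpdate(ctx, NULL, &len, aad, aad_len);
    	EVP_DecryptUpdate(ctx, out, &len, ct, pt_len);
    	EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_TAG, 16, tag);
    	ok = EVP_DecryptFinal_ex(ctx, out + len, &len);	/* 0 on tag mismatch */

    	EVP_CIPHER_CTX_free(ctx);
    	return ok > 0;
    }

The FAIL line in the NIST test vector above corresponds to EVP_DecryptFinal_ex() returning 0 here.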
On Wed, May 26, 2021 at 07:14:47AM +0200, Antonin Houska wrote: > Bruce Momjian <bruce@momjian.us> wrote: > > > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote: > > > Hi, > > > > > > On 2021-05-25 17:29:03 -0400, Bruce Momjian wrote: > > > > So, let me ask --- I thought CTR basically took an encrypted stream of > > > > bits and XOR'ed them with the data. If that is true, then why are > > > > changing hint bits a problem? We already can see some of the bit stream > > > > by knowing some bytes of the page. > > > > > > A *single* reuse of the nonce in CTR reveals nearly all of the > > > plaintext. As you say, the data is XORed with the key stream. Reusing > > > the nonce means that you reuse the key stream. Which in turn allows you > > > to do: > > > (data ^ stream) ^ (data' ^ stream) > > > which can be simplified to > > > (data ^ data') > > > thereby leaking all of data except the difference between data and > > > data'. That's why it's so crucial to ensure that stream *always* differs > > > between two rounds of encrypting "related" data. > > > > > > We can't just "hope" that data doesn't change and use CTR. > > > > My point was about whether we need to change the nonce, and hence > > WAL-log full page images if we change hint bits. If we don't and > > reencrypt the page with the same nonce, don't we only expose the hint > > bits? I was not suggesting we avoid changing the nonce in non-hint-bit > > cases. > > > > I don't understand your computation above. You decrypt the page into > > shared buffers, you change a hint bit, and rewrite the page. You are > > re-XOR'ing the buffer copy with the same key and nonce. Doesn't that > > only change the hint bits in the new write? > > The way I view things is that the CTR mode encrypts each individual bit, > independent from any other bit on the page. For non-hint bits data=data', so > (data ^ data') is always zero, regardless the actual values of the data. So I > agree with you that by reusing the nonce we only expose the hint bits. OK, that's what I thought. We already expose the clog and fsm, so exposing the hint bits seems acceptable. If everyone agrees, I will adjust my patch to not WAL log hint bit changes. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Stephen Frost (sfrost@snowman.net) wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: > > Another idea might be - instead of doing nonce++ every time we write > > the page, do nonce=random(). That's eventually going to repeat a > > value, but it's extremely likely to take a *super* long time if there > > are enough bits. A potentially rather large problem, though, is that > > generating random numbers in large quantities isn't very cheap. > > There's specific discussion about how to choose a nonce in NIST > publications and using a properly random one that's large enough is > one accepted approach, though my recollection was that the preference > was to use an incrementing guaranteed-unique nonce and using a random > one was more of a "if you can't coordinate using an incrementing one > then you can do this". I can try to hunt for the specifics on that > though. Discussion of generating IVs here: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf section 8.2 specifically. Note that 8.3 also discusses subsequent limitations which one should follow when using a random nonce, to reduce the chances of a collision. Thanks, Stephen
On Wed, May 26, 2021 at 2:37 PM Stephen Frost <sfrost@snowman.net> wrote: > > Anybody got a better idea? > > If we stipulate (and document) that all replicas need their own keys > then we no longer need to worry about nonce re-use between the primary > and the replica. Not sure that's *better*, per se, but I do think it's > worth consideration. Teaching pg_basebackup how to decrypt and then > re-encrypt with a different key wouldn't be challenging. I agree that we could do that and that it's possibly worth considering. However, it would be easy - and tempting - for users to violate the no-nonce-reuse principle. For example, consider a hypothetical user who takes a backup on Monday via a filesystem snapshot - which might be either (a) a snapshot of the cluster while it is stopped, or (b) a snapshot of the cluster while it's running, from which crash recovery can be safely performed as long as it's a true atomic snapshot, or (c) a snapshot taken between pg_start_backup and pg_stop_backup which will be used just like a backup taken by pg_basebackup. In any of these cases, there's no opportunity for a tool we provide to intervene and re-key. Now, we could provide a tool that re-keys in such situations and tell people to be sure they run it before using any of those backups, and maybe that's the best we can do. However, that tool is going to run for a good long time because it has to rewrite the entire cluster, so someone with a terabyte-scale database is going to be sorely tempted to skip this "unnecessary" and time-consuming step. If it were possible to set things up so that good things happen automatically and without user action, that would be swell. Here's another idea: suppose that a nonce is 128 bits, 64 of which are randomly generated at server startup, and the other 64 of which are a counter. If you're willing to assume that the 64 bits generated randomly at server startup are not going to collide in practice, because the number of server lifetimes per key should be very small compared to 2^64, then this gets you the benefits of a randomly-generated nonce without needing to keep on generating new cryptographically strong random numbers, and pretty much regardless of what users do with their backups. If you replay an FPI, you can write out the page exactly as you got it from the master, without re-encrypting. If you modify and then write a page, you generate a nonce for it containing your own server lifetime identifier. > Yes, if the amount of space available is variable then there's an added > cost for that. While I appreciate the concern about having that be > expensive, for my 2c at least, I like to think that having this sudden > space that's available for use may lead to other really interesting > capabilities beyond the ones we're talking about here, so I'm not really > thrilled with the idea of boiling it down to just two cases. Although I'm glad you like some things about this idea, I think the proposed system will collapse if we press it too hard. We're going to need to be judicious. > One thing to be absolutely clear about here though is that simply taking > a hash() of the ciphertext and storing that with the data does *not* > provide cryptographic data integrity validation for the page because it > doesn't involve the actual key or IV at all and the hash is done after > the ciphertext is generated- therefore an attacker can change the data > and just change the hash to match and you'd never know. Ah, right. 
So you'd actually want something more like hash(dboid||tsoid||relfilenode||blockno||block_contents||secret). Maybe not generated exactly that way: perhaps the secret is really the IV for the hash function rather than part of the hashed data, or whatever. However you do it exactly, it prevents someone from verifying - or faking - a signature unless they have the secret.

> very hard for the attacker to discover) and suddenly you're doing what AES-GCM *already* does for you, except you're trying to hack it yourself instead of using the tools available which were written by experts.

I am all in favor of using the expert-written tools provided we can figure out how to do it in a way we all agree is correct.

> What this means for your proposal above is that the actual data validation information will be generated in two different ways depending on if we're using AES-GCM and doing TDE, or if we're doing just the data validation piece and not encrypting anything. That's maybe not ideal but I don't think it's a huge issue either and your proposal will still address the question of if we end up missing anything when it comes to how the special area is handled throughout the code.

Hmm. Is there no expert-written method for this sort of thing without encryption? One thing that I think would be really helpful is to be able to take a TDE-ified cluster and run it through decryption, ending up with a cluster that still has extra special space but which isn't actually encrypted any more. Ideally it can end up in a state where integrity validation still works. This might be something people just Want To Do, and they're willing to sacrifice the space. But it would also be real nice for testing and debugging. Imagine for example that the data on page X is physiologically corrupted, i.e. decryption produces something that looks like a page, but there's stuff wrong with it, like the item pointers point to a page offset greater than the page size. Well, what you really want to do with this page is run pg_filedump on it, or hexdump, or od, or pg_hexedit, or whatever your favorite tool is, so that you can figure out what's going on, but that's going to be hard if the pages are all encrypted. I guess nothing in what you are saying really precludes that, but I agree that if we have to switch up the method for creating the integrity verifier thing in this situation, that's not great.

> If it'd help, I'd be happy to jump on a call to discuss further. Also happy to continue on this thread too, of course.

I am finding the written discussion to be helpful right now, and it has the advantage of being easy to refer back to later, so my vote would be to keep doing this for now and we can always reassess if it seems to make sense.

--
Robert Haas
EDB: http://www.enterprisedb.com
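For what it's worth, there is an expert-written construction for precisely this "hash plus secret" pattern: an HMAC, which mixes the key into the hash the way the primitive was designed for and avoids the known pitfalls of ad-hoc hash(data||secret) schemes. A hedged sketch using OpenSSL follows; the field list mirrors the hash(dboid||tsoid||relfilenode||blockno||block_contents||secret) idea above, but the helper name and types are illustrative assumptions, not anything from a proposed patch:

#include <stddef.h>
#include <stdint.h>
#include <openssl/hmac.h>

/*
 * Sketch: keyed integrity tag over a page plus its physical address,
 * using HMAC-SHA256 rather than a bare hash with the secret appended.
 * Error checking elided.
 */
static void
page_hmac(const unsigned char *secret, int secret_len,
		  uint32_t dboid, uint32_t tsoid, uint32_t relfilenode,
		  uint32_t blockno, const unsigned char *page, size_t page_len,
		  unsigned char tag[32])
{
	unsigned int taglen;
	HMAC_CTX   *ctx = HMAC_CTX_new();

	HMAC_Init_ex(ctx, secret, secret_len, EVP_sha256(), NULL);
	HMAC_Update(ctx, (unsigned char *) &dboid, sizeof(dboid));
	HMAC_Update(ctx, (unsigned char *) &tsoid, sizeof(tsoid));
	HMAC_Update(ctx, (unsigned char *) &relfilenode, sizeof(relfilenode));
	HMAC_Update(ctx, (unsigned char *) &blockno, sizeof(blockno));
	HMAC_Update(ctx, page, page_len);
	HMAC_Final(ctx, tag, &taglen);
	HMAC_CTX_free(ctx);
}

Without the secret, an attacker can neither verify nor forge the tag, which is the property a bare hash of the ciphertext lacks.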
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> OK, that's what I thought. We already expose the clog and fsm, so exposing the hint bits seems acceptable. If everyone agrees, I will adjust my patch to not WAL log hint bit changes.

Robert pointed out that it's not just hint bits where this is happening though, but it can also happen with btree line pointer arrays. Even if we were entirely comfortable accepting that the hint bits are leaked because of this, leaking the btree line pointer array doesn't seem like it could possibly be acceptable.

I've not run down that code myself, but I don't have any reason to doubt Robert's assessment.

Thanks,

Stephen
On Wed, May 26, 2021 at 01:56:38PM -0400, Robert Haas wrote:
> However, I believe that if we store the nonce in the page explicitly, as proposed here, rather than trying to derive it from the LSN, then we don't need to worry about this kind of masking, which I think is better from both a security perspective and a performance perspective.

You are saying that by using a non-LSN nonce, you can write out the page with a new nonce, but the same LSN, and also discard the page during crash recovery and use the WAL copy?

I am confused why checksums, which are widely used, acceptably require wal_log_hints, but there is concern that file encryption, which is heavier, cannot acceptably require wal_log_hints. I must be missing something.

Why can't checksums also throw away hint bit changes like you want to do for file encryption and not require wal_log_hints?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, May 26, 2021 at 2:37 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > Anybody got a better idea?
> >
> > If we stipulate (and document) that all replicas need their own keys then we no longer need to worry about nonce re-use between the primary and the replica. Not sure that's *better*, per se, but I do think it's worth consideration. Teaching pg_basebackup how to decrypt and then re-encrypt with a different key wouldn't be challenging.
>
> I agree that we could do that and that it's possibly worth considering. However, it would be easy - and tempting - for users to violate the no-nonce-reuse principle. For example, consider a

(guessing you meant no-key-reuse above)

> hypothetical user who takes a backup on Monday via a filesystem snapshot - which might be either (a) a snapshot of the cluster while it is stopped, or (b) a snapshot of the cluster while it's running, from which crash recovery can be safely performed as long as it's a true atomic snapshot, or (c) a snapshot taken between pg_start_backup and pg_stop_backup which will be used just like a backup taken by pg_basebackup. In any of these cases, there's no opportunity for a tool we provide to intervene and re-key. Now, we could provide a tool that re-keys in such situations and tell people to be sure they run it before using any of those backups, and maybe that's the best we can do. However, that tool is going to run for a good long time because it has to rewrite the entire cluster, so someone with a terabyte-scale database is going to be sorely tempted to skip this "unnecessary" and time-consuming step. If it were possible to set things up so that good things happen automatically and without user action, that would be swell.

Yes, if someone were to use a snapshot and set up a replica from it they'd end up with the same key being used and potentially have an issue with the key+nonce combination being re-used between the primary and replica with different data leading to a possible data leak.

> Here's another idea: suppose that a nonce is 128 bits, 64 of which are randomly generated at server startup, and the other 64 of which are a counter. If you're willing to assume that the 64 bits generated randomly at server startup are not going to collide in practice, because the number of server lifetimes per key should be very small compared to 2^64, then this gets you the benefits of a randomly-generated nonce without needing to keep on generating new cryptographically strong random numbers, and pretty much regardless of what users do with their backups. If you replay an FPI, you can write out the page exactly as you got it from the master, without re-encrypting. If you modify and then write a page, you generate a nonce for it containing your own server lifetime identifier.

Yes, this kind of approach is discussed in the NIST publication in section 8.2.2. We'd have to keep track of what nonce we used for which page, of course, but that should be alright using the special space as discussed.

> > Yes, if the amount of space available is variable then there's an added cost for that.
> > While I appreciate the concern about having that be expensive, for my 2c at least, I like to think that having this sudden space that's available for use may lead to other really interesting capabilities beyond the ones we're talking about here, so I'm not really thrilled with the idea of boiling it down to just two cases.
>
> Although I'm glad you like some things about this idea, I think the proposed system will collapse if we press it too hard. We're going to need to be judicious.

Sure.

> > One thing to be absolutely clear about here though is that simply taking a hash() of the ciphertext and storing that with the data does *not* provide cryptographic data integrity validation for the page because it doesn't involve the actual key or IV at all and the hash is done after the ciphertext is generated- therefore an attacker can change the data and just change the hash to match and you'd never know.
>
> Ah, right. So you'd actually want something more like hash(dboid||tsoid||relfilenode||blockno||block_contents||secret). Maybe not generated exactly that way: perhaps the secret is really the IV for the hash function rather than part of the hashed data, or whatever. However you do it exactly, it prevents someone from verifying - or faking - a signature unless they have the secret.
>
> > very hard for the attacker to discover) and suddenly you're doing what AES-GCM *already* does for you, except you're trying to hack it yourself instead of using the tools available which were written by experts.
>
> I am all in favor of using the expert-written tools provided we can figure out how to do it in a way we all agree is correct.

In the patch set that Bruce has which uses the OpenSSL functions to do AES GCM with tag there is included a test suite which works with the NIST published test vectors to verify that it all works correctly with the key, nonce/IV, plaintext, tag, ciphertext, etc. The patch set includes a subset of the NIST tests since we rely on OpenSSL for the heavy lifting there, but the entire test suite passes if you pull down the test vectors and run them.

> > What this means for your proposal above is that the actual data validation information will be generated in two different ways depending on if we're using AES-GCM and doing TDE, or if we're doing just the data validation piece and not encrypting anything. That's maybe not ideal but I don't think it's a huge issue either and your proposal will still address the question of if we end up missing anything when it comes to how the special area is handled throughout the code.
>
> Hmm. Is there no expert-written method for this sort of thing without encryption? One thing that I think would be really helpful is to be able to take a TDE-ified cluster and run it through decryption, ending up with a cluster that still has extra special space but which isn't actually encrypted any more. Ideally it can end up in a state where integrity validation still works. This might be something people just Want To Do, and they're willing to sacrifice the space. But it would also be real nice for testing and debugging. Imagine for example that the data on page X is physiologically corrupted, i.e. decryption produces something that looks like a page, but there's stuff wrong with it, like the item pointers point to a page offset greater than the page size.
> Well, what you really want to do with this page is run pg_filedump on it, or hexdump, or od, or pg_hexedit, or whatever your favorite tool is, so that you can figure out what's going on, but that's going to be hard if the pages are all encrypted.

So ... yes and no. If you want to actually verify that the data is valid and unmolested by virtue of a key being involved, then you can actually use AES GCM and simply only feed it AAD. The NIST examples have test cases for exactly this too:

Count = 0
Key = 78dc4e0aaf52d935c3c01eea57428f00ca1fd475f5da86a49c8dd73d68c8e223
IV = d79cf22d504cc793c3fb6c8a
PT =
AAD = b96baa8c1c75a671bfb2d08d06be5f36
CT =
Tag = 3e5d486aa2e30b22e040b85723a06e76

Note that in this case there's a key and an IV/nonce, but there isn't any plaintext while there *is* AAD ("Additional Authenticated Data"). We could certainly do that too; the downside there is mostly that we'd still need a key and an IV, and those seem like odd parameters to require when we aren't doing encryption, but it would mean we'd be using the exact same functions with OpenSSL that we would be in the TDE case, just passing in the block as AAD instead of as plaintext to be encrypted, so there is that advantage to it.

> I guess nothing in what you are saying really precludes that, but I agree that if we have to switch up the method for creating the integrity verifier thing in this situation, that's not great.

I had been imagining that we wouldn't want to require a key and have to calculate an IV/nonce for the "not doing TDE" case, so I was figuring we'd just use a hash, and it'd be very much like our existing checksum and not provide any real protection against an attacker intentionally molesting the page (since they can just calculate a new checksum that includes whatever their changes were). At the end of the day though, I'm fine with either (or both, for that matter; I don't see any of these aspects being the difficult-to-implement bits - the question is mainly what do we give our users the ability to do, what do we just use for development, etc).

> > If it'd help, I'd be happy to jump on a call to discuss further. Also happy to continue on this thread too, of course.
>
> I am finding the written discussion to be helpful right now, and it has the advantage of being easy to refer back to later, so my vote would be to keep doing this for now and we can always reassess if it seems to make sense.

Sure.

Thanks!

Stephen
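To make the AAD-only idea above concrete, here is a hedged sketch of how that NIST case maps onto OpenSSL's EVP calls: the whole page goes in as AAD, nothing is encrypted, and the 16-byte tag is the integrity value that would be stored in the special space. The function name and fixed key/IV lengths are assumptions for illustration, not anything from the actual patch set; error checking is elided:

#include <openssl/evp.h>

/*
 * Sketch: authenticate (but do not encrypt) a buffer with AES-256-GCM,
 * matching the NIST vectors above that have AAD but no plaintext.
 */
static int
gcm_tag_only(const unsigned char key[32], const unsigned char iv[12],
			 const unsigned char *aad, int aad_len, unsigned char tag[16])
{
	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
	unsigned char dummy[16];
	int			len;

	EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, NULL, NULL);
	EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_IVLEN, 12, NULL);
	EVP_EncryptInit_ex(ctx, NULL, NULL, key, iv);
	/* a NULL output buffer tells OpenSSL "this is AAD, don't encrypt it" */
	EVP_EncryptUpdate(ctx, NULL, &len, aad, aad_len);
	EVP_EncryptFinal_ex(ctx, dummy, &len);
	EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
	EVP_CIPHER_CTX_free(ctx);
	return 0;
}

Verification on read would run the decrypt path instead, supplying the stored tag via EVP_CTRL_GCM_SET_TAG and checking the return value of EVP_DecryptFinal_ex.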
On Wed, May 26, 2021 at 03:49:43PM -0400, Stephen Frost wrote:
> Greetings,
>
> * Bruce Momjian (bruce@momjian.us) wrote:
> > OK, that's what I thought. We already expose the clog and fsm, so exposing the hint bits seems acceptable. If everyone agrees, I will adjust my patch to not WAL log hint bit changes.
>
> Robert pointed out that it's not just hint bits where this is happening though, but it can also happen with btree line pointer arrays. Even if we were entirely comfortable accepting that the hint bits are leaked because of this, leaking the btree line pointer array doesn't seem like it could possibly be acceptable.
>
> I've not run down that code myself, but I don't have any reason to doubt Robert's assessment.

OK, I guess we could split out log_hints to maybe just FPW-log btree changes or something, but my recent email questions why wal_log_hints is an issue anyway.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
On Wed, May 26, 2021 at 04:40:48PM -0400, Bruce Momjian wrote:
> On Wed, May 26, 2021 at 01:56:38PM -0400, Robert Haas wrote:
> > However, I believe that if we store the nonce in the page explicitly, as proposed here, rather than trying to derive it from the LSN, then we don't need to worry about this kind of masking, which I think is better from both a security perspective and a performance perspective.
>
> You are saying that by using a non-LSN nonce, you can write out the page with a new nonce, but the same LSN, and also discard the page during crash recovery and use the WAL copy?
>
> I am confused why checksums, which are widely used, acceptably require wal_log_hints, but there is concern that file encryption, which is heavier, cannot acceptably require wal_log_hints. I must be missing something.
>
> Why can't checksums also throw away hint bit changes like you want to do for file encryption and not require wal_log_hints?

One detail might be this extra hint bit FPW case:

https://github.com/postgres/postgres/compare/bmomjian:cfe-01-doc..bmomjian:_cfe-02-internaldoc.patch

	However, if a hint-bit-modified page is written to the file system
	during a checkpoint, and there is a later hint bit change switching
	the same page from clean to dirty during the same checkpoint, we
	need a new LSN, and wal_log_hints doesn't give us a new LSN here.
	The fix for this is to update the page LSN by writing a dummy WAL
	record via xloginsert.c::LSNForEncryption() in such cases.

Is this how file encryption is different from checksum wal_log_hints, and the big concern?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
On Wed, May 26, 2021 at 01:56:38PM -0400, Robert Haas wrote:
> In the interest of not being viewed as too much of a naysayer, let me first reiterate that I am generally in favor of TDE going forward and am not looking to throw up unnecessary obstacles in the way of making that happen.

Rather than surprise anyone, I might as well just come out and say some things. First, I have always admitted this feature has limited usefulness.

I think a non-LSN nonce adds a lot of code complexity, which adds a code and maintenance burden. It also prevents the creation of an encrypted replica from a non-encrypted primary using binary replication, which makes deployment harder.

Take a feature of limited usefulness, add code complexity and deployment difficulty, and the feature becomes even less useful.

For these reasons, if we decide to go in the direction of using a non-LSN nonce, I no longer plan to continue working on this feature. I would rather work on things that have a more positive impact. Maybe a non-LSN nonce is a better long-term plan, but there are too many unknowns and too much complexity for me to feel comfortable with it.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
Hi,

On 2021-05-26 07:14:47 +0200, Antonin Houska wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
> > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote:
> > My point was about whether we need to change the nonce, and hence WAL-log full page images if we change hint bits. If we don't and reencrypt the page with the same nonce, don't we only expose the hint bits? I was not suggesting we avoid changing the nonce in non-hint-bit cases.
> >
> > I don't understand your computation above. You decrypt the page into shared buffers, you change a hint bit, and rewrite the page. You are re-XOR'ing the buffer copy with the same key and nonce. Doesn't that only change the hint bits in the new write?

Yea, I had a bit of a misfire there. Sorry.

I suspect that if we try to not disclose data if an attacker has write access, this still leaves us with issues around nonce reuse, unless we also employ integrity measures. Particularly due to CTR mode, which makes it easy to manipulate individual parts of the encrypted page without causing the decrypted page to be invalid. E.g. the attacker can just update pd_upper on the page by a small offset, and suddenly the replay will insert the tuple at a slightly shifted offset - which then seems to leak enough data to actually analyze things?

As the patch stands that seems trivially doable, because as I read it, most of the page header is not encrypted, and checksums are done of the already encrypted data. But even if that weren't the case, brute forcing 16 bits' worth of checksum isn't too bad, even though it would obviously make an attack a lot more noisy.

https://github.com/bmomjian/postgres/commit/7b43d37a5edb91c29ab6b4bb00def05def502c33#diff-0dcb5b2f36c573e2a7787994690b8fe585001591105f78e58ae3accec8f998e0R92

/*
 * Check if the page has a special size == GISTPageOpaqueData, a valid
 * GIST_PAGE_ID, no invalid GiST flag bits are set, and a valid LSN. This
 * is true for all GiST pages, and perhaps a few pages that are not. The
 * only downside of guessing wrong is that we might not update the LSN for
 * some non-permanent relation page changes, and therefore reuse the IV,
 * which seems acceptable.
 */

Huh?

Regards,

Andres
Hi,

On 2021-05-25 22:23:46 -0400, Stephen Frost wrote:
> Andres mentioned other possible cases where the LSN doesn’t change even though we change the page and, as he’s probably right, we would have to figure out a solution in those cases too (potentially including cases like crash recovery or replay on a replica where we can’t really just go around creating dummy WAL records to get new LSNs..).

Yea, I think there's quite a few of those. For one, we don't guarantee that the hole between pd_lower/upper is zeroes. It e.g. contains old tuple data after deleted tuples are pruned away. But when logging an FPI, we omit that range. Which means that after crash recovery the area is zeroed out. There's several cases where padding can result in the same.

Just look at checkXLogConsistency(), heap_mask() et al for all the differences that can occur and that need to be ignored for the recovery consistency checking to work.

Particularly the hole issue seems trivial to exploit, because we know the plaintext of the hole after crash recovery (0s).

I don't see how using the LSN alone is salvageable.

Greetings,

Andres Freund
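Spelling out why the known-plaintext hole is so dangerous under CTR: ciphertext is plaintext XOR keystream, so any region whose plaintext is known (here, the post-recovery zeros) hands an attacker the keystream for those byte offsets, and that keystream decrypts anything else encrypted at the same offsets under the same key and IV. A toy illustration of the arithmetic only, not real attack tooling:

#include <stddef.h>

/*
 * CTR nonce-reuse illustration: if page A's plaintext is known to be
 * zeros, its ciphertext *is* the keystream (x XOR 0 == x), so XOR
 * against page B's ciphertext yields page B's plaintext.
 */
static void
recover_plaintext(const unsigned char *ct_of_zero_page,
				  const unsigned char *ct_of_victim_page,
				  unsigned char *victim_plaintext, size_t len)
{
	for (size_t i = 0; i < len; i++)
		victim_plaintext[i] = ct_of_zero_page[i] ^ ct_of_victim_page[i];
}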
Hi,

On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> If we used a block cipher instead of a streaming one (CTR), this might not work because the earlier blocks can be based on the output of later blocks.

What made us choose CTR for WAL & data file encryption? I checked the README in the patchset and the wiki page, and neither seem to discuss that.

The dangers around nonce reuse, the space overhead of storing the nonce, the fact that single bit changes in the encrypted data don't propagate seem not great? Why aren't we using something like XTS? It has obvious issues as well, but CTR's weaknesses seem at least as great. And if we want a MAC, then we don't want CTR either.

Greetings,

Andres Freund
Greetings,
On Thu, May 27, 2021 at 4:52 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> I am confused why checksums, which are widely used, acceptably require
> wal_log_hints, but there is concern that file encryption, which is
> heavier, cannot acceptably require wal_log_hints. I must be missing
> something.
>
> Why can't checksums also throw away hint bit changes like you want to do
> for file encryption and not require wal_log_hints?
I'm really confused about it, too. I read the discussion above, and I'm not sure my understanding is correct... What we are facing is not only changes to flags such as *pd_flags*, but also other changes like the btree line pointer array changes Robert mentioned, and we don't want those to have to write a WAL record.
I have an immature idea: could we use LSN+blkno+checksum as the nonce when checksums are enabled? And when checksums are disabled, we could use a global counter to generate a fake checksum value, and then use LSN+blkno+fake_checksum as the nonce. Is there anything wrong with that?
There is no royal road to learning.
HighGo Software Co.
On Wed, May 26, 2021 at 4:40 PM Bruce Momjian <bruce@momjian.us> wrote:
> You are saying that by using a non-LSN nonce, you can write out the page with a new nonce, but the same LSN, and also discard the page during crash recovery and use the WAL copy?

I don't know what "discard the page during crash recovery and use the WAL copy" means.

> I am confused why checksums, which are widely used, acceptably require wal_log_hints, but there is concern that file encryption, which is heavier, cannot acceptably require wal_log_hints. I must be missing something.

I explained this in the first complete paragraph of my first email with this subject line: "For example, right now, we only need to WAL log hints for the first write to each page after a checkpoint, but in this approach, if the same page is written multiple times per checkpoint cycle, we'd need to log hints every time." That's a huge difference. Page eviction in some workloads can push the same pages out of shared buffers every few seconds, whereas something that has to be done once per checkpoint cycle cannot affect each page nearly so often. A checkpoint is only going to occur every 5 minutes by default, or more realistically every 10-15 minutes in a well-tuned production system.

In other words, we're not holding up some kind of double standard, where the existing feature is allowed to depend on doing a certain thing but your feature isn't allowed to depend on the same thing. Your design depends on doing something which is potentially 100x+ more expensive than the existing thing. It's not always going to be that expensive, but it can be.

> Why can't checksums also throw away hint bit changes like you want to do for file encryption and not require wal_log_hints?

Well, I don't want to throw away hint bit changes, just like we don't throw them away right now. And I want to do that by making sure that each time the page is written, we use a different nonce, but without the expense of having to advance the LSN.

Now, another option is to do what you suggest here. We could say that if a dirty page is evicted, but the page is only dirty because of hint-type changes, we don't actually write it out. That does avoid using the same nonce for multiple writes, because now there's only one write. It also fixes the problem on standbys that Andres was complaining about, because on a standby, the only way a page can possibly be dirtied without an associated WAL record is through a hint-type change. However, I think we'd find that this, too, is pretty expensive in certain workloads. It's useful to write hint bits - that's why we do it.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, May 26, 2021 at 04:26:01PM -0700, Andres Freund wrote:
> Hi,
>
> On 2021-05-26 07:14:47 +0200, Antonin Houska wrote:
> > Bruce Momjian <bruce@momjian.us> wrote:
> > > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote:
> > > My point was about whether we need to change the nonce, and hence WAL-log full page images if we change hint bits. If we don't and reencrypt the page with the same nonce, don't we only expose the hint bits? I was not suggesting we avoid changing the nonce in non-hint-bit cases.
> > >
> > > I don't understand your computation above. You decrypt the page into shared buffers, you change a hint bit, and rewrite the page. You are re-XOR'ing the buffer copy with the same key and nonce. Doesn't that only change the hint bits in the new write?
>
> Yea, I had a bit of a misfire there. Sorry.
>
> I suspect that if we try to not disclose data if an attacker has write access, this still leaves us with issues around nonce reuse, unless we also employ integrity measures. Particularly due to CTR mode, which makes it easy to manipulate individual parts of the encrypted page without causing the decrypted page to be invalid. E.g. the attacker can just update pd_upper on the page by a small offset, and suddenly the replay will insert the tuple at a slightly shifted offset - which then seems to leak enough data to actually analyze things?

Yes, I don't think protecting from write access is a realistic goal at this point, and frankly ever will be. I think write access protection needs all-cluster-file encryption. This is documented:

https://github.com/postgres/postgres/compare/master..bmomjian:_cfe-01-doc.patch

	Cluster file encryption does not protect against unauthorized
	file system writes. Such writes can allow data decryption if
	used to weaken the system's security and the weakened system is
	later supplied with the externally-stored cluster encryption key.
	This also does not always detect if users with write access remove
	or modify database files.

If this needs more text, let me know.

> As the patch stands that seems trivially doable, because as I read it, most of the page header is not encrypted, and checksums are done of the already encrypted data. But even if that weren't the case, brute forcing 16 bits' worth of checksum isn't too bad, even though it would obviously make an attack a lot more noisy.
>
> https://github.com/bmomjian/postgres/commit/7b43d37a5edb91c29ab6b4bb00def05def502c33#diff-0dcb5b2f36c573e2a7787994690b8fe585001591105f78e58ae3accec8f998e0R92
>
> /*
>  * Check if the page has a special size == GISTPageOpaqueData, a valid
>  * GIST_PAGE_ID, no invalid GiST flag bits are set, and a valid LSN. This
>  * is true for all GiST pages, and perhaps a few pages that are not. The
>  * only downside of guessing wrong is that we might not update the LSN for
>  * some non-permanent relation page changes, and therefore reuse the IV,
>  * which seems acceptable.
>  */
>
> Huh?

Are you asking about this C comment in relation to the discussion above, or is it an independent question? Are you asking what it means?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
On Wed, May 26, 2021 at 04:46:29PM -0700, Andres Freund wrote:
> Hi,
>
> On 2021-05-25 22:23:46 -0400, Stephen Frost wrote:
> > Andres mentioned other possible cases where the LSN doesn’t change even though we change the page and, as he’s probably right, we would have to figure out a solution in those cases too (potentially including cases like crash recovery or replay on a replica where we can’t really just go around creating dummy WAL records to get new LSNs..).
>
> Yea, I think there's quite a few of those. For one, we don't guarantee that the hole between pd_lower/upper is zeroes. It e.g. contains old tuple data after deleted tuples are pruned away. But when logging an FPI, we omit that range. Which means that after crash recovery the area is zeroed out. There's several cases where padding can result in the same.
>
> Just look at checkXLogConsistency(), heap_mask() et al for all the differences that can occur and that need to be ignored for the recovery consistency checking to work.
>
> Particularly the hole issue seems trivial to exploit, because we know the plaintext of the hole after crash recovery (0s).
>
> I don't see how using the LSN alone is salvageable.

OK, so you are saying the replica would have all zeros because of crash recovery, so XOR'ing that with the encryption stream makes the encryption stream visible, and you could use that to decrypt the dead data on the primary. That is an interesting case that we would need to fix.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote:
> Hi,
>
> On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> > If we used a block cipher instead of a streaming one (CTR), this might not work because the earlier blocks can be based on the output of later blocks.
>
> What made us choose CTR for WAL & data file encryption? I checked the README in the patchset and the wiki page, and neither seem to discuss that.
>
> The dangers around nonce reuse, the space overhead of storing the nonce, the fact that single bit changes in the encrypted data don't propagate seem not great? Why aren't we using something like XTS? It has obvious issues as well, but CTR's weaknesses seem at least as great. And if we want a MAC, then we don't want CTR either.

We chose CTR because it was fast, and we could use the same method for WAL, which needs a streaming, not block, cipher.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
On Thu, May 27, 2021 at 05:45:21PM +0800, Neil Chen wrote:
> Greetings,
>
> On Thu, May 27, 2021 at 4:52 PM Bruce Momjian <bruce@momjian.us> wrote:
> > I am confused why checksums, which are widely used, acceptably require wal_log_hints, but there is concern that file encryption, which is heavier, cannot acceptably require wal_log_hints. I must be missing something.
> >
> > Why can't checksums also throw away hint bit changes like you want to do for file encryption and not require wal_log_hints?
>
> I'm really confused about it, too. I read the discussion above, and I'm not sure my understanding is correct... What we are facing is not only changes to flags such as *pd_flags*, but also other changes like the btree line pointer array changes Robert mentioned, and we don't want those to have to write a WAL record.

Well, the code now does write full page images for hint bit changes, so it should work fine.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
On Thu, May 27, 2021 at 10:47:13AM -0400, Robert Haas wrote:
> On Wed, May 26, 2021 at 4:40 PM Bruce Momjian <bruce@momjian.us> wrote:
> > You are saying that by using a non-LSN nonce, you can write out the page with a new nonce, but the same LSN, and also discard the page during crash recovery and use the WAL copy?
>
> I don't know what "discard the page during crash recovery and use the WAL copy" means.

I was asking how decoupling the nonce from the LSN allows for us to avoid full page writes for hint bit changes. I am guessing you are saying that on recovery, if we see a hint-bit-only change in the WAL (with a new nonce), we just throw away the page because it could be torn and use the WAL full page write version.

> > I am confused why checksums, which are widely used, acceptably require wal_log_hints, but there is concern that file encryption, which is heavier, cannot acceptably require wal_log_hints. I must be missing something.
>
> I explained this in the first complete paragraph of my first email with this subject line: "For example, right now, we only need to WAL log hints for the first write to each page after a checkpoint, but in this approach, if the same page is written multiple times per checkpoint cycle, we'd need to log hints every time." That's a huge difference. Page eviction in some workloads can push the same pages out of shared buffers every few seconds, whereas something that has to be done once per checkpoint cycle cannot affect each page nearly so often. A checkpoint is only going to occur every 5 minutes by default, or more realistically every 10-15 minutes in a well-tuned production system. In other words, we're not holding up some kind of double standard, where the existing feature is allowed to depend on doing a certain thing but your feature isn't allowed to depend on the same thing. Your design depends on doing something which is potentially 100x+ more expensive than the existing thing. It's not always going to be that expensive, but it can be.

Yes, it might be 1e100+++ more expensive too, but we don't know, and I am not ready to add a lot of complexity for such an unknown.

> > Why can't checksums also throw away hint bit changes like you want to do for file encryption and not require wal_log_hints?
>
> Well, I don't want to throw away hint bit changes, just like we don't throw them away right now. And I want to do that by making sure that each time the page is written, we use a different nonce, but without the expense of having to advance the LSN.
>
> Now, another option is to do what you suggest here. We could say that if a dirty page is evicted, but the page is only dirty because of hint-type changes, we don't actually write it out. That does avoid using the same nonce for multiple writes, because now there's only one write. It also fixes the problem on standbys that Andres was complaining about, because on a standby, the only way a page can possibly be dirtied without an associated WAL record is through a hint-type change. However, I think we'd find that this, too, is pretty expensive in certain workloads. It's useful to write hint bits - that's why we do it.

Oh, that does sound nice. It is kind of an escape hatch if we are evicting pages often for hint bit changes. I like it.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
Hi,

On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote:
> On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> > > If we used a block cipher instead of a streaming one (CTR), this might not work because the earlier blocks can be based on the output of later blocks.
> >
> > What made us choose CTR for WAL & data file encryption? I checked the README in the patchset and the wiki page, and neither seem to discuss that.
> >
> > The dangers around nonce reuse, the space overhead of storing the nonce, the fact that single bit changes in the encrypted data don't propagate seem not great? Why aren't we using something like XTS? It has obvious issues as well, but CTR's weaknesses seem at least as great. And if we want a MAC, then we don't want CTR either.
>
> We chose CTR because it was fast, and we could use the same method for WAL, which needs a streaming, not block, cipher.

The WAL is block oriented too.

Andres
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote:
> > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote:
> > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> > > > If we used a block cipher instead of a streaming one (CTR), this might not work because the earlier blocks can be based on the output of later blocks.
> > >
> > > What made us choose CTR for WAL & data file encryption? I checked the README in the patchset and the wiki page, and neither seem to discuss that.
> > >
> > > The dangers around nonce reuse, the space overhead of storing the nonce, the fact that single bit changes in the encrypted data don't propagate seem not great? Why aren't we using something like XTS? It has obvious issues as well, but CTR's weaknesses seem at least as great. And if we want a MAC, then we don't want CTR either.
> >
> > We chose CTR because it was fast, and we could use the same method for WAL, which needs a streaming, not block, cipher.
>
> The WAL is block oriented too.

I'm curious what you'd suggest for the heap where we wouldn't be able to have block chaining (at least, I presume we aren't talking about rewriting entire segments whenever we change something in a heap).

Thanks,

Stephen
On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote:
> Rather than surprise anyone, I might as well just come out and say some things. First, I have always admitted this feature has limited usefulness.
>
> I think a non-LSN nonce adds a lot of code complexity, which adds a code and maintenance burden. It also prevents the creation of an encrypted replica from a non-encrypted primary using binary replication, which makes deployment harder.
>
> Take a feature of limited usefulness, add code complexity and deployment difficulty, and the feature becomes even less useful.
>
> For these reasons, if we decide to go in the direction of using a non-LSN nonce, I no longer plan to continue working on this feature. I would rather work on things that have a more positive impact. Maybe a non-LSN nonce is a better long-term plan, but there are too many unknowns and complexity for me to feel comfortable with it.

I had some more time to think about this. The big struggle for this feature has not been writing it, but rather keeping it lean enough that its code complexity will be acceptable for a feature of limited usefulness. (The Windows port and pg_upgrade took similar approaches.)

Thinking about the feature to add checksums online, it seems to have failed due to us over-complexifying the feature. If we had avoided the checksum restart requirement, the patch would probably be part of Postgres today. However, a few people asked for restart-ability, and since we don't really have much infrastructure to do online whole-cluster changes, it added a lot of code. Once the patch was done, we looked at the code size and the benefits of the feature, and decided it wasn't worth it.

I suspect that if we start adding a non-LSN nonce and malicious write detection, we will end up with the same problem --- a complex patch for a feature that has limited usefulness, and one that requires dump/restore or logical replication to add it to a cluster. I think such a patch would be rejected, and I would probably even vote against it myself.

I don't want this to sound like I only want to do this my way, but I also don't want to be silent when I smell failure, and if the probability of failure gets too high, I am willing to abandon a feature rather than continue.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
Hi,

On 2021-05-27 11:49:33 -0400, Stephen Frost wrote:
> * Andres Freund (andres@anarazel.de) wrote:
> > On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote:
> > > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote:
> > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> > > > > If we used a block cipher instead of a streaming one (CTR), this might not work because the earlier blocks can be based on the output of later blocks.
> > > >
> > > > What made us choose CTR for WAL & data file encryption? I checked the README in the patchset and the wiki page, and neither seem to discuss that.
> > > >
> > > > The dangers around nonce reuse, the space overhead of storing the nonce, the fact that single bit changes in the encrypted data don't propagate seem not great? Why aren't we using something like XTS? It has obvious issues as well, but CTR's weaknesses seem at least as great. And if we want a MAC, then we don't want CTR either.
> > >
> > > We chose CTR because it was fast, and we could use the same method for WAL, which needs a streaming, not block, cipher.
> >
> > The WAL is block oriented too.
>
> I'm curious what you'd suggest for the heap where we wouldn't be able to have block chaining (at least, I presume we aren't talking about rewriting entire segments whenever we change something in a heap).

What prevents us from using something like XTS? I'm not saying that that is the right approach, due to the fact that it leaks information about a block being the same as an earlier version of the same block. But right now we are talking about using CTR without addressing the weaknesses CTR has, where a failure to increase the nonce is fatal (the code even documents known cases where that could happen!), and where there's no error propagation within a block.

Greetings,

Andres Freund
On Thu, May 27, 2021 at 08:34:51AM -0700, Andres Freund wrote:
> Hi,
>
> On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote:
> > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> > > > If we used a block cipher instead of a streaming one (CTR), this might not work because the earlier blocks can be based on the output of later blocks.
> > >
> > > What made us choose CTR for WAL & data file encryption? I checked the README in the patchset and the wiki page, and neither seem to discuss that.
> > >
> > > The dangers around nonce reuse, the space overhead of storing the nonce, the fact that single bit changes in the encrypted data don't propagate seem not great? Why aren't we using something like XTS? It has obvious issues as well, but CTR's weaknesses seem at least as great. And if we want a MAC, then we don't want CTR either.
> >
> > We chose CTR because it was fast, and we could use the same method for WAL, which needs a streaming, not block, cipher.
>
> The WAL is block oriented too.

Well, AES block mode only does 16-byte blocks, as far as I know, and I assume WAL is more granular than that. Also, you need to know the bytes _before_ the WAL to write a new 16-byte block, so it seems overly complex for our usage too.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
On Thu, May 27, 2021 at 11:19 AM Bruce Momjian <bruce@momjian.us> wrote:
> On Thu, May 27, 2021 at 10:47:13AM -0400, Robert Haas wrote:
> > On Wed, May 26, 2021 at 4:40 PM Bruce Momjian <bruce@momjian.us> wrote:
> > > You are saying that by using a non-LSN nonce, you can write out the page with a new nonce, but the same LSN, and also discard the page during crash recovery and use the WAL copy?
> >
> > I don't know what "discard the page during crash recovery and use the WAL copy" means.
>
> I was asking how decoupling the nonce from the LSN allows for us to avoid full page writes for hint bit changes. I am guessing you are saying that on recovery, if we see a hint-bit-only change in the WAL (with a new nonce), we just throw away the page because it could be torn and use the WAL full page write version.

Well, in the design where the nonce is stored in the page, there is no need for every hint-type change to appear in the WAL at all. Once per checkpoint cycle, you need to write a full page image, as we do for checksums or wal_log_hints. The rest of the time, you can just bump the nonce and rewrite the page, same as we do today.

> Yes, it might be 1e100+++ more expensive too, but we don't know, and I am not ready to add a lot of complexity for such an unknown.

No, it can't be 1e100+++ more expensive, because it's not realistically possible for a page to be written to disk 1e100+++ times per checkpoint cycle. It is however entirely possible for it to be written 100 times per checkpoint cycle. That is not something unknown about which we need to speculate; it is easy to see that this can happen, even on a simple test like pgbench with a data set larger than shared buffers.

It is not right to confuse "we have no idea whether this will be expensive" with "how expensive this will be is workload-dependent," which is what you seem to be doing here. If we had no idea whether something would be expensive, then I agree that it might not be worth adding complexity for it, or maybe some testing should be done first to find out. But if we know for certain that in some workloads something can be very expensive, then we had better at least talk about whether it is worth adding complexity in order to resolve the problem. And that is the situation here.

I am not even convinced that storing the nonce in the block is going to be more complex, because it seems to me that the patches I posted upthread worked out pretty cleanly. There are some things to discuss and think about there, for sure, but it is not like we are talking about inventing warp drive.

--
Robert Haas
EDB: http://www.enterprisedb.com
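To illustrate where the once-per-checkpoint FPI and the cheap per-write nonce bump would sit in that design, here is a rough sketch. Every type and function in it is an illustrative stand-in - none of these helpers exist in PostgreSQL today - and it glosses over locking and error handling entirely:

#include <stdbool.h>
#include <stdint.h>

typedef struct PageSketch
{
	uint64_t	lsn;			/* stand-in for the real page LSN */
	uint64_t	nonce;			/* per-page nonce kept in the special space */
} PageSketch;

/* Stand-ins for the real machinery. */
extern bool first_write_since_checkpoint(const PageSketch *page);
extern void log_full_page_image(PageSketch *page);	/* advances page->lsn */
extern void encrypt_and_write(const PageSketch *page);

static void
evict_page(PageSketch *page)
{
	/* FPI once per checkpoint cycle, as with checksums/wal_log_hints */
	if (first_write_since_checkpoint(page))
		log_full_page_image(page);

	/* every subsequent write gets a fresh nonce, with no new WAL at all */
	page->nonce++;
	encrypt_and_write(page);
}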
Hi,

On 2021-05-27 12:01:16 -0400, Bruce Momjian wrote:
> On Thu, May 27, 2021 at 08:34:51AM -0700, Andres Freund wrote:
> > On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote:
> > > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote:
> > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> > > > > If we used a block cipher instead of a streaming one (CTR), this might not work because the earlier blocks can be based on the output of later blocks.
> > > >
> > > > What made us choose CTR for WAL & data file encryption? I checked the README in the patchset and the wiki page, and neither seem to discuss that.
> > > >
> > > > The dangers around nonce reuse, the space overhead of storing the nonce, the fact that single bit changes in the encrypted data don't propagate seem not great? Why aren't we using something like XTS? It has obvious issues as well, but CTR's weaknesses seem at least as great. And if we want a MAC, then we don't want CTR either.
> > >
> > > We chose CTR because it was fast, and we could use the same method for WAL, which needs a streaming, not block, cipher.
> >
> > The WAL is block oriented too.
>
> Well, AES block mode only does 16-byte blocks, as far as I know, and I assume WAL is more granular than that.

WAL is 8kB blocks by default. We only ever write it out with at least that granularity.

> Also, you need to know the bytes _before_ the WAL to write a new 16-byte block, so it seems overly complex for our usage too.

See the XTS reference. Yes, it needs the previous 16 bytes, but only within the 8kB page.

Greetings,

Andres Freund
On Thu, May 27, 2021 at 12:03:00PM -0400, Robert Haas wrote:
> On Thu, May 27, 2021 at 11:19 AM Bruce Momjian <bruce@momjian.us> wrote:
> > I was asking how decoupling the nonce from the LSN allows for us to avoid full page writes for hint bit changes. I am guessing you are saying that on recovery, if we see a hint-bit-only change in the WAL (with a new nonce), we just throw away the page because it could be torn and use the WAL full page write version.
>
> Well, in the design where the nonce is stored in the page, there is no need for every hint-type change to appear in the WAL at all. Once per checkpoint cycle, you need to write a full page image, as we do for checksums or wal_log_hints. The rest of the time, you can just bump the nonce and rewrite the page, same as we do today.

What is it about having the nonce be the LSN that doesn't allow that to happen? Could we just create a dummy WAL record, assign its LSN to the page, and use that as a nonce?

> > Yes, it might be 1e100+++ more expensive too, but we don't know, and I am not ready to add a lot of complexity for such an unknown.
>
> No, it can't be 1e100+++ more expensive, because it's not realistically possible for a page to be written to disk 1e100+++ times per checkpoint cycle. It is however entirely possible for it to be written 100 times per checkpoint cycle. That is not something unknown about which we need to speculate; it is easy to see that this can happen, even on a simple test like pgbench with a data set larger than shared buffers.

I guess you didn't get my joke on that one. ;-)

> It is not right to confuse "we have no idea whether this will be expensive" with "how expensive this will be is workload-dependent," which is what you seem to be doing here. If we had no idea whether something would be expensive, then I agree that it might not be worth adding complexity for it, or maybe some testing should be done first to find out. But if we know for certain that in some workloads something can be very expensive, then we had better at least talk about whether it is worth adding complexity in order to resolve the problem. And that is the situation here.

Sure, but the downsides of avoiding it seem very high to me, not only in code complexity but in requiring dump/reload or logical replication to deploy.

> I am not even convinced that storing the nonce in the block is going to be more complex, because it seems to me that the patches I posted upthread worked out pretty cleanly. There are some things to discuss and think about there, for sure, but it is not like we are talking about inventing warp drive.

See above.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
Hi,

On 2021-05-27 11:10:00 -0400, Bruce Momjian wrote:
> On Wed, May 26, 2021 at 04:46:29PM -0700, Andres Freund wrote:
> > On 2021-05-25 22:23:46 -0400, Stephen Frost wrote:
> > > Andres mentioned other possible cases where the LSN doesn’t change even though we change the page and, as he’s probably right, we would have to figure out a solution in those cases too (potentially including cases like crash recovery or replay on a replica where we can’t really just go around creating dummy WAL records to get new LSNs..).
> >
> > Yea, I think there's quite a few of those. For one, we don't guarantee that the hole between pd_lower/upper is zeroes. It e.g. contains old tuple data after deleted tuples are pruned away. But when logging an FPI, we omit that range. Which means that after crash recovery the area is zeroed out. There's several cases where padding can result in the same.
> >
> > Just look at checkXLogConsistency(), heap_mask() et al for all the differences that can occur and that need to be ignored for the recovery consistency checking to work.
> >
> > Particularly the hole issue seems trivial to exploit, because we know the plaintext of the hole after crash recovery (0s).
> >
> > I don't see how using the LSN alone is salvageable.
>
> OK, so you are saying the replica would have all zeros because of crash recovery, so XOR'ing that with the encryption stream makes the encryption stream visible, and you could use that to decrypt the dead data on the primary. That is an interesting case that we would need to fix.

I don't see how it's a viable security model to assume that you can ensure that we never write different data with the same LSN. Yes, you can fix a few cases, but how can we be confident that we're actually doing a good job, when the consequences are pretty dramatic?

Nor do I think it's architecturally OK to impose a significant new hurdle against doing any sort of "changing" writes on standbys.

It's time to move on from the idea of using the LSN as the nonce.

Greetings,

Andres Freund
Hi,

On 2021-05-27 10:57:24 -0400, Bruce Momjian wrote:
> On Wed, May 26, 2021 at 04:26:01PM -0700, Andres Freund wrote:
> > I suspect that if we try to not disclose data if an attacker has write access, this still leaves us with issues around nonce reuse, unless we also employ integrity measures. Particularly due to CTR mode, which makes it easy to manipulate individual parts of the encrypted page without causing the decrypted page to be invalid. E.g. the attacker can just update pd_upper on the page by a small offset, and suddenly the replay will insert the tuple at a slightly shifted offset - which then seems to leak enough data to actually analyze things?
>
> Yes, I don't think protecting from write access is a realistic goal at this point, and frankly ever will be. I think write access protection needs all-cluster-file encryption. This is documented:
>
> https://github.com/postgres/postgres/compare/master..bmomjian:_cfe-01-doc.patch
>
> 	Cluster file encryption does not protect against unauthorized
> 	file system writes. Such writes can allow data decryption if
> 	used to weaken the system's security and the weakened system is
> 	later supplied with the externally-stored cluster encryption key.
> 	This also does not always detect if users with write access remove
> 	or modify database files.
>
> If this needs more text, let me know.

Well, it's one thing to say that it's not a complete protection, and another that a few byte-sized writes to a single page are sufficient to get access to encrypted data. And "all-cluster-file" encryption won't help against the type of scenario I outlined.

> > https://github.com/bmomjian/postgres/commit/7b43d37a5edb91c29ab6b4bb00def05def502c33#diff-0dcb5b2f36c573e2a7787994690b8fe585001591105f78e58ae3accec8f998e0R92
> >
> > /*
> >  * Check if the page has a special size == GISTPageOpaqueData, a valid
> >  * GIST_PAGE_ID, no invalid GiST flag bits are set, and a valid LSN. This
> >  * is true for all GiST pages, and perhaps a few pages that are not. The
> >  * only downside of guessing wrong is that we might not update the LSN for
> >  * some non-permanent relation page changes, and therefore reuse the IV,
> >  * which seems acceptable.
> >  */
> >
> > Huh?
>
> Are you asking about this C comment in relation to the discussion above, or is it an independent question? Are you asking what it means?

The comment is blithely waving away a fundamental no-no (reusing nonces) when using CTR mode as "acceptable".

Greetings,

Andres Freund
On Thu, May 27, 2021 at 12:01 PM Andres Freund <andres@anarazel.de> wrote:
> What prevents us from using something like XTS? I'm not saying that that is the right approach, due to the fact that it leaks information about a block being the same as an earlier version of the same block. But right now we are talking about using CTR without addressing the weaknesses CTR has, where a failure to increase the nonce is fatal (the code even documents known cases where that could happen!), and where there's no error propagation within a block.

I spent some time this morning reading up on XTS in general and also on previous discussions on this list. It seems like XTS is considered state-of-the-art for full disk encryption, and what we're doing seems to me to be similar in concept. The most useful on-list discussion that I found was on this thread:

https://www.postgresql.org/message-id/flat/c878de71-a0c3-96b2-3e11-9ac2c35357c3%40joeconway.com#19d3b7c37b9f84798f899360393584df

There are a lot of things that people said on that thread, but then Bruce basically proposes CBC and/or CTR and I couldn't clearly understand the reasons for that choice. Maybe there was some off-list discussion of this that wasn't captured in the email traffic?

All that having been said, I am pretty sure I don't fully understand what any of these modes involve. I gather that XTS requires two keys, but it seems like it doesn't require a nonce. It seems to use a "tweak" that is generated from the block number and the position within the block (since an e.g. 8kB database block is being encrypted as a bunch of 16-byte AES blocks), but apparently there's no problem with the tweak being the same every time the block is encrypted? If no nonce is required, that seems like a massive advantage, since then we don't need to worry about how to get one or about how to make sure it's never reused.

--
Robert Haas
EDB: http://www.enterprisedb.com
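For concreteness, here is a hedged sketch of what an XTS call looks like through OpenSSL's EVP interface: the "two keys" are supplied as a single double-length key, and the tweak is passed where the IV would normally go. Deriving the tweak purely from the block number, and the helper name itself, are assumptions for illustration; error checking is elided:

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

/*
 * Sketch: encrypt one relation block with AES-256-XTS.  The 64-byte
 * key is the concatenation of the two 256-bit XTS keys.  The 16-byte
 * tweak needs no secrecy and may repeat across rewrites of the same
 * location, so the block number alone suffices here.
 */
static int
xts_encrypt_block(const unsigned char key[64], uint64_t blkno,
				  const unsigned char *in, unsigned char *out, int len)
{
	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
	unsigned char tweak[16] = {0};
	int			outlen;

	memcpy(tweak, &blkno, sizeof(blkno));	/* low 8 bytes = block number */
	EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak);
	EVP_EncryptUpdate(ctx, out, &outlen, in, len);
	EVP_EncryptFinal_ex(ctx, out + outlen, &outlen);
	EVP_CIPHER_CTX_free(ctx);
	return 0;
}

Internally, XTS derives a distinct per-16-byte-block masking value from the tweak and the position within the block, which is why no separate nonce has to be stored or coordinated.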
Hi, On 2021-05-27 10:47:13 -0400, Robert Haas wrote: > Now, another option is to do what you suggest here. We could say that > if a dirty page is evicted, but the page is only dirty because of > hint-type changes, we don't actually write it out. That does avoid > using the same nonce for multiple writes, because now there's only one > write. It also fixes the problem on standbys that Andres was > complaining about, because on a standby, the only way a page can > possibly be dirtied without an associated WAL record is through a > hint-type change. What does that protect against that I was concerned about? That still allows hint bits to be leaked, via 1) replay WAL record with FPI 2) hint bit change during read 3) incremental page change and then comparing the ciphertexts of 1) and 3). Even if we declare that OK, it doesn't actually address the whole issue of WAL replay not necessarily re-creating bit-identical page contents. Greetings, Andres Freund
Hi, On 2021-05-27 12:28:39 -0400, Robert Haas wrote: > All that having been said, I am pretty sure I don't fully understand > what any of these modes involve. I gather that XTS requires two keys, > but it seems like it doesn't require a nonce. It needs a second secret, but that second secret can - as far as I understand it - be generated using a strong prng and encrypted with the "main" key, and stored in a central location. > It seems to use a "tweak" that is generated from the block number and > the position within the block (since an e.g. 8kB database block is > being encrypted as a bunch of 16-byte AES blocks) but apparently > there's no problem with the tweak being the same every time the block > is encrypted? Right. That comes with a price however: It leaks the information that a block "version" is identical to an earlier version of the block. That's obviously better than leaking information that allows decryption like with the nonce reuse issue. Nor does it provide integrity - which does seem like a significant issue going forward. Which does require storing additional per-page data... Greetings, Andres Freund
On Thu, May 27, 2021 at 12:28:39PM -0400, Robert Haas wrote: > On Thu, May 27, 2021 at 12:01 PM Andres Freund <andres@anarazel.de> wrote: > > What prevents us from using something like XTS? I'm not saying that that > > is the right approach, due to the fact that it leaks information about a > > block being the same as an earlier version of the same block. But right > > now we are talking about using CTR without addressing the weaknesses CTR > > has, where a failure to increase the nonce is fatal (the code even > > documents known cases where that could happen!), and where there's no > > error propagation within a block. > > I spent some time this morning reading up on XTS in general and also > on previous discussions on this list on the list. It seems like XTS is > considered state-of-the-art for full disk encryption, and what we're > doing seems to me to be similar in concept. The most useful on-list > discussion that I found was on this thread: > > https://www.postgresql.org/message-id/flat/c878de71-a0c3-96b2-3e11-9ac2c35357c3%40joeconway.com#19d3b7c37b9f84798f899360393584df > > There are a lot of things that people said on that thread, but then > Bruce basically proposes CBC and/or CTR and I couldn't clearly > understand the reasons for that choice. Maybe there was some off-list > discussion of this that wasn't captured in the email traffic? There was no other discussion about XTS that I know of. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Thu, May 27, 2021 at 12:31 PM Andres Freund <andres@anarazel.de> wrote: > What does that protect against that I was concerned about? That still > allows hint bits to be leaked, via > > 1) replay WAL record with FPI > 2) hint bit change during read > 3) incremental page change > > and then comparing the ciphertexts of 1) and 3). Even if we declare that OK, it doesn't actually address the > whole issue of WAL replay not necessarily re-creating bit-identical page > contents. You're right. That seems fatal, as it would lead to encrypting the different versions of the page with the same IV on the master and the standby, and the differences would consist of old data that could be recovered by XORing the two encrypted page versions. To be clear, it is tuple data that would be recovered, not just hint bits. -- Robert Haas EDB: http://www.enterprisedb.com
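The XOR recovery Robert describes takes nothing but the two ciphertexts. Under CTR, ciphertext = plaintext XOR keystream, so when the key and IV repeat, c1 XOR c2 = p1 XOR p2, and any region that is known plaintext on one side (say, zeroes in the page hole) is disclosed outright on the other. A self-contained illustration, assuming OpenSSL's EVP API (illustrative only, obviously not patch code):

#include <stdio.h>
#include <openssl/evp.h>

/*
 * Encrypt with AES-128-CTR under a fixed key/IV - the IV is deliberately
 * reused across calls to model two page versions written with one nonce.
 * (Error checks omitted for brevity.)
 */
static void
ctr_encrypt(const unsigned char *in, unsigned char *out, int len)
{
    static const unsigned char key[16] = "0123456789abcdef";
    static const unsigned char iv[16];  /* all zeroes: the reused IV */
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int n;

    EVP_EncryptInit_ex(ctx, EVP_aes_128_ctr(), NULL, key, iv);
    EVP_EncryptUpdate(ctx, out, &n, in, len);
    EVP_EncryptFinal_ex(ctx, out + n, &n);
    EVP_CIPHER_CTX_free(ctx);
}

int
main(void)
{
    unsigned char p1[16] = "tuple version 1";   /* master's page bytes */
    unsigned char p2[16] = "tuple version 2";   /* standby's page bytes */
    unsigned char c1[16], c2[16];
    int i;

    ctr_encrypt(p1, c1, sizeof(p1));
    ctr_encrypt(p2, c2, sizeof(p2));

    /* The keystream cancels out: c1 ^ c2 == p1 ^ p2. */
    for (i = 0; i < 16; i++)
        if ((c1[i] ^ c2[i]) != (p1[i] ^ p2[i]))
            return 1;
    printf("ciphertext XOR equals plaintext XOR: old data recoverable\n");
    return 0;
}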
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2021-05-27 12:28:39 -0400, Robert Haas wrote: > > All that having been said, I am pretty sure I don't fully understand > > what any of these modes involve. I gather that XTS requires two keys, > > but it seems like it doesn't require a nonce. > > It needs a second secret, but that second secret can - as far as I > understand it - be generated using a strong prng and encrypted with the > "main" key, and stored in a central location. Yes, I'm fairly confident this is the case. > > It seems to use a "tweak" that is generated from the block number and > > the position within the block (since an e.g. 8kB database block is > > being encrypted as a bunch of 16-byte AES blocks) but apparently > > there's no problem with the tweak being the same every time the block > > is encrypted? > > Right. That comes with a price however: It leaks the information that a > block "version" is identical to an earlier version of the block. That's > obviously better than leaking information that allows decryption like > with the nonce reuse issue. Right, if we simply can't solve the nonce-reuse concern then that would be better. > Nor does it provide integrity - which does seem like a significant issue > going forward. Which does require storing additional per-page data... Yeah, this is one of the reasons that I hadn't really been thrilled with XTS- I've really been looking down the road at eventually having GCM and having actual integrity validation included. That's not really a reason to rule it out though and Bruce's point about having a way to get to an encrypted cluster from an unencrypted one is certainly worth consideration. Naturally, we'd need to document everything appropriately but there isn't anything saying that we couldn't, say, have XTS in v15 without any adjustments to the page layout, accepting that there's no data integrity validation and focusing just on encryption, and then returning to the question about adding in data integrity validation for a future version, perhaps using the special area for a nonce+tag with GCM or maybe something else. Users who wish to move to a cluster with encryption and data integrity validation would have to get there through some other means than replication, but that's going to always be the case because we have to have space to store the tag, even if we can figure out some other solution for the nonce. Thanks, Stephen
Hi, On 2021-05-27 12:49:15 -0400, Stephen Frost wrote: > That's not really a reason to rule it out though and Bruce's point about > having a way to get to an encrypted cluster from an unencrypted one is > certainly worth consideration. Naturally, we'd need to document > everything appropriately but there isn't anything saying that we > couldn't, say, have XTS in v15 without any adjustments to the page > layout, accepting that there's no data integrity validation and focusing > just on encryption, and then returning to the question about adding in > data integrity validation for a future version, perhaps using the > special area for a nonce+tag with GCM or maybe something else. Users > who wish to move to a cluster with encryption and data integrity > validation would have to get there through some other means than > replication, but that's going to always be the case because we have to > have space to store the tag, even if we can figure out some other > solution for the nonce. But won't we then end up with a different set of requirements around nonce assignment durability when introducing GCM support? That's not actually entirely trivial to do correctly on a standby. I guess we can use AES-GCM-SIV and be ok with living with edge cases leading to nonce reuse, but ... Greetings, Andres Freund
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2021-05-27 12:49:15 -0400, Stephen Frost wrote: > > That's not really a reason to rule it out though and Bruce's point about > > having a way to get to an encrypted cluster from an unencrypted one is > > certainly worth consideration. Naturally, we'd need to document > > everything appropriately but there isn't anything saying that we > > couldn't, say, have XTS in v15 without any adjustments to the page > > layout, accepting that there's no data integrity validation and focusing > > just on encryption, and then returning to the question about adding in > > data integrity validation for a future version, perhaps using the > > special area for a nonce+tag with GCM or maybe something else. Users > > who wish to move to a cluster with encryption and data integrity > > validation would have to get there through some other means than > > replication, but that's going to always be the case because we have to > > have space to store the tag, even if we can figure out some other > > solution for the nonce. > > But won't we then end up with a different set of requirements around > nonce assignment durability when introducing GCM support? That's not > actually entirely trivial to do correctly on a standby. I guess we can > use AES-GCM-SSIV and be ok with living with edge cases leading to nonce > reuse, but ... Not sure if I'm entirely following the question but I would have thought the up-thread idea of generating a random part of the nonce for each start up and then a global counter for the rest, which would be written whenever the page is updated (meaning it wouldn't have anything to do with the LSN and would be stored in the special area as Robert contemplated) would work for both primaries and replicas. Taking a step back, while I like the idea of trying to think through these complications in a future world where we add GCM support, if we're actually agreed on seriously looking at XTS for v15 then maybe we should focus on that for the moment. As Bruce says, there's a lot of moving parts in this patch that likely need discussion and agreement in order for us to be able to move forward with it. For one, we'd probably want to get agreement on what we'd use to construct the tweak, for starters. Thanks, Stephen
On Thu, May 27, 2021 at 12:15 PM Bruce Momjian <bruce@momjian.us> wrote: > > Well, in the design where the nonce is stored in the page, there is no > > need for every hint-type change to appear in the WAL at all. Once per > > checkpoint cycle, you need to write a full page image, as we do for > > checksums or wal_log_hints. The rest of the time, you can just bump > > the nonce and rewrite the page, same as we do today. > > What is it about having the nonce be the LSN that doesn't allow that to > happen? Could we just create a dummy LSN record and assign that to the > page and use that as a nonce. I can't tell which of two possible proposals you are describing here. If the LSN is used to derive the nonce, then one option is to just log a WAL record every time we need a new nonce. As I understand it, that's basically what you've already implemented, and we've discussed the disadvantages of that approach at some length already. The basic problems seem to be: - It's potentially very expensive if page evictions are frequent, which they will be whenever the workload is write-heavy and the working set is larger than shared_buffers. - If there's ever a situation where we need to write a page image different from any page image written previously and we cannot at that time write a WAL record to generate a new LSN for use as the nonce, then the algorithm is broken entirely. Andres's latest post points out - I think correctly - that this happens on standbys, because WAL replay does not generate byte-identical results on standbys even if you ignore hint bits. The first point strikes me as a sufficiently serious performance problem to justify giving up on this design, but that's a judgement call. The second one seems like it breaks it entirely. Now, there's another possible direction that is also suggested by your remarks here: maybe you meant using a fake LSN in cases where we can't use a real one. For example, suppose you decide to reserve half of the LSN space - all LSNs with the high bit set, for example - for this purpose. Well, you somehow need to ensure that you never use one of those values more than once, so you might think of putting a counter in shared memory. But now imagine a master with two standbys. How would you avoid having the same counter value used on one standby and also on the other standby? Even if they use the same counter for different pages, it's a critical security flaw. And since those standbys don't even need to know that the other one exists, that seems pretty well impossible to avoid. Now you might ask why we don't have the same problem if we store the nonce in the special space. One difference is that if you store the nonce explicitly, you can allow however much bit space you need in order to guarantee uniqueness, whereas reserving half the LSN space only gives you 63 bits. That's not enough to achieve uniqueness without tight coordination. With 128 bits, you can do things like just generate random values and assume they're vanishingly unlikely to collide, or randomly generate half the value and use the other half as a counter and be pretty safe. With 63 bits you just don't have enough bit space available to reliably avoid collisions using algorithms of that type, due to the birthday paradox. I think it would be adequate for uniqueness if there were a single shared counter and every allocation came from it, but again, as soon as you imagine a master and a bunch of standbys, that breaks down. 
Also, it's not entirely clear to me that you can avoid needing the LSN space on the page for a real LSN at the same time you also need it for a fake-LSN-being-used-as-a-nonce. We rely on the LSN field containing the LSN of the last WAL record for the page in order to obey the WAL-before-data rule, without which crash recovery will not work reliably. Now, if you sometimes have to use that field for a nonce that is a fake LSN, that means you no longer always have a place to store the real LSN. I can't convince myself off-hand that it's completely impossible to work around that problem, but it seems like any attempt to do so would be complicated and fragile at best. I don't think that's a direction that we want to go. Making crash recovery work reliably is a hard problem where we've had lots of bugs despite years of dedicated effort. TDE is also complex and has lots of pitfalls of its own. If we take two things which are individually complicated and hard to get right and intertwine them by making them share bit-space, I think it drives the complexity up to a level where we don't have much hope of getting things right. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2021-05-27 12:00:03 -0400, Bruce Momjian wrote: > On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote: > > Rather than surprise anyone, I might as well just come out and say some > > things. First, I have always admitted this feature has limited > > usefulness. > > > > I think a non-LSN nonce adds a lot of code complexity, which adds a code > > and maintenance burden. It also prevents the creation of an encrypted > > replica from a non-encrypted primary using binary replication, which > > makes deployment harder. > > > > Take a feature of limited usefulness, add code complexity and deployment > > difficulty, and the feature becomes even less useful. > > > > For these reasons, if we decide to go in the direction of using a > > non-LSN nonce, I no longer plan to continue working on this feature. I > > would rather work on things that have a more positive impact. Maybe a > > non-LSN nonce is a better long-term plan, but there are too many > > unknowns and complexity for me to feel comfortable with it. > > [...] > I suspect that if we start adding a non-LSN nonce and malicious write > detection, we will end up with the same problem --- a complex patch for > a feature that has limited usefulness, and requires dump/restore or > logical replication to add it to a cluster. I think such a patch would > be rejected, and I would probably even vote against it myself. I think it's diametrically the opposite. Using the LSN as the nonce requires that all code modifying pages be audited (which clearly hasn't been done yet), whereas an independent nonce can be maintained in a few central places. And that's not just a one-off issue, it's a forevermore issue. Greetings, Andres Freund
On Thu, May 27, 2021 at 12:49 PM Stephen Frost <sfrost@snowman.net> wrote: > Right, if we simply can't solve the nonce-reuse concern then that would > be better. Given the issues that Andres raised about standbys and the treatment of the "hole," I see using the LSN for the nonce as a dead-end. I think it's pretty bad on performance grounds too, for reasons already discussed, but you could always hypothesize that people care so much about security that they will ignore any amount of trouble with performance. You can hardly hypothesize that those same people also won't mind security vulnerabilities that expose tuple data, though. I don't think the idea of storing the nonce at the end of the page is dead. There seem to be some difficulties there, but I think there are reasonable prospects of solving them. At the very least there's the brute-force approach of generating a ton of cryptographically strong random numbers, and there seems to be some possibility of doing better than that. However, I'm pretty excited by this idea of using XTS. Now granted I didn't have the foggiest idea what XTS was before today, but I hear you and Andres saying that we can use that approach without needing a nonce at all. That seems to make a lot of the problems we're talking about here just go away. > > Nor does it provide integrity - which does seem like a significant issue > > going forward. Which does require storing additional per-page data... > > Yeah, this is one of the reasons that I hadn't really been thrilled with > XTS- I've really been looking down the road at eventually having GCM and > having actual integrity validation included. > > That's not really a reason to rule it out though and Bruce's point about > having a way to get to an encrypted cluster from an unencrypted one is > certainly worth consideration. Naturally, we'd need to document > everything appropriately but there isn't anything saying that we > couldn't, say, have XTS in v15 without any adjustments to the page > layout, accepting that there's no data integrity validation and focusing > just on encryption, and then returning to the question about adding in > data integrity validation for a future version, perhaps using the > special area for a nonce+tag with GCM or maybe something else. Users > who wish to move to a cluster with encryption and data integrity > validation would have to get there through some other means than > replication, but that's going to always be the case because we have to > have space to store the tag, even if we can figure out some other > solution for the nonce. +1 from me to all of this except the idea of foreclosing present discussion on how data-integrity validation could be made to work. I think it would great to have more discussion of that problem now, in case it informs our decisions about anything else, especially because, based on your earlier remarks, it seems like there is some coupling between the two problems. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2021-05-27 13:26:11 -0400, Stephen Frost wrote: > * Andres Freund (andres@anarazel.de) wrote: > > On 2021-05-27 12:49:15 -0400, Stephen Frost wrote: > > > That's not really a reason to rule it out though and Bruce's point about > > > having a way to get to an encrypted cluster from an unencrypted one is > > > certainly worth consideration. Naturally, we'd need to document > > > everything appropriately but there isn't anything saying that we > > > couldn't, say, have XTS in v15 without any adjustments to the page > > > layout, accepting that there's no data integrity validation and focusing > > > just on encryption, and then returning to the question about adding in > > > data integrity validation for a future version, perhaps using the > > > special area for a nonce+tag with GCM or maybe something else. Users > > > who wish to move to a cluster with encryption and data integrity > > > validation would have to get there through some other means than > > > replication, but that's going to always be the case because we have to > > > have space to store the tag, even if we can figure out some other > > > solution for the nonce. > > > > But won't we then end up with a different set of requirements around > > nonce assignment durability when introducing GCM support? That's not > > actually entirely trivial to do correctly on a standby. I guess we can > > use AES-GCM-SIV and be ok with living with edge cases leading to nonce > > reuse, but ... > > Not sure if I'm entirely following the question It seems like going for XTS in the short term might end up causing lots of duplicated effort if, in the medium term, we then have to solve all the issues around how to maintain nonces efficiently and correctly anyway, because we want integrity support. > but I would have thought the up-thread idea of generating a random > part of the nonce for each start up and then a global counter for the > rest, which would be written whenever the page is updated (meaning it > wouldn't have anything to do with the LSN and would be stored in the > special area as Robert contemplated) would work for both primaries and > replicas. Yea, it's not a bad approach. Particularly because it removes the need to ensure that "global nonce counter" increments are guaranteed to be durable. > For one, we'd probably want to get agreement on what we'd use to > construct the tweak, for starters. Hm, isn't that just a pg_strong_random() and storing it encrypted? Greetings, Andres Freund
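A minimal sketch of the random-part-plus-counter scheme being discussed (hypothetical names; locking and where the nonce is stored on the page are deliberately left out):

#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <openssl/rand.h>

/* Hypothetical sketch; atomics/locking omitted for brevity. */
static unsigned char startup_rand[8];   /* drawn fresh on every server start */
static uint64_t nonce_counter;

void
nonce_startup_init(void)
{
    if (RAND_bytes(startup_rand, sizeof(startup_rand)) != 1)
        abort();                        /* would be elog(PANIC, ...) */
    nonce_counter = 0;
}

/* 16-byte nonce: 8 random bytes fixed at startup + an 8-byte counter. */
void
nonce_next(unsigned char nonce[16])
{
    uint64_t c = ++nonce_counter;       /* would need to be atomic/locked */

    memcpy(nonce, startup_rand, 8);
    memcpy(nonce + 8, &c, 8);
}

Because the counter only has to be unique within a single server start, its current value never needs to be made durable, and a primary and its standbys each draw an independent random half, so they cannot collide except with negligible probability.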
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2021-05-27 13:26:11 -0400, Stephen Frost wrote: > > * Andres Freund (andres@anarazel.de) wrote: > > > On 2021-05-27 12:49:15 -0400, Stephen Frost wrote: > > > > That's not really a reason to rule it out though and Bruce's point about > > > > having a way to get to an encrypted cluster from an unencrypted one is > > > > certainly worth consideration. Naturally, we'd need to document > > > > everything appropriately but there isn't anything saying that we > > > > couldn't, say, have XTS in v15 without any adjustments to the page > > > > layout, accepting that there's no data integrity validation and focusing > > > > just on encryption, and then returning to the question about adding in > > > > data integrity validation for a future version, perhaps using the > > > > special area for a nonce+tag with GCM or maybe something else. Users > > > > who wish to move to a cluster with encryption and data integrity > > > > validation would have to get there through some other means than > > > > replication, but that's going to always be the case because we have to > > > > have space to store the tag, even if we can figure out some other > > > > solution for the nonce. > > > > > > But won't we then end up with a different set of requirements around > > > nonce assignment durability when introducing GCM support? That's not > > > actually entirely trivial to do correctly on a standby. I guess we can > > > use AES-GCM-SIV and be ok with living with edge cases leading to nonce > > > reuse, but ... > > > > Not sure if I'm entirely following the question > > It seems like going for XTS in the short term might end up causing lots > of duplicated effort if, in the medium term, we then have to solve all > the issues around how to maintain nonces efficiently and correctly > anyway, because we want integrity support. You and Robert both seem to be going in that direction, one which I tend to share, while Bruce is very hard set against it from the perspective that he doesn't view integrity as important (I disagree quite strongly with that; even if we can't protect everything, I see it as certainly valuable to protect the primary data) and that this approach adds complexity (the amount of which doesn't seem to be agreed upon). I'm also not sure how much of the effort would really be duplicated. Were we to start with XTS, that's almost drop-in with what Bruce has (actually, it should simplify some parts since we no longer need to deal with making sure we always increase the LSN, etc), gives users more flexibility in terms of getting to an encrypted cluster, and solves certain use-cases. Very little of that seems like it would be ripped out if we were to (also) provide a GCM option. Now, if we were to *only* provide a GCM option then maybe we wouldn't need to think about the XTS case of having to come up with a tweak (though that seems like a rather small amount of code) but that would also mean we need to change the page format and we can't do any kind of binary/page-level transition to an encrypted cluster, like we could with XTS. Trying to break it down, the end-goal states look like: GCM-only: no binary upgrade path due to having to store the tag XTS-only: no data integrity option GCM+XTS: binary upgrade path for XTS, data integrity with GCM If we want both a binary upgrade path, and a data integrity option, then it seems like the only end state which provides both is GCM+XTS, in which case I don't think there's a lot of actual duplication. 
Perhaps there's an "XTS + some other data integrity approach" option where we could preserve the page format by stuffing information into another fork or maybe telling users to hash their data and store that hash as another column which would allow us to avoid implementing GCM, but I don't see a way to avoid having XTS if we are going to provide a binary upgrade path. Perhaps AES-GCM-SIV would be interesting to consider in general, but that still means we need to find space for the tag and that still precludes a binary upgrade path. > > but I would have thought the up-thread idea of generating a random > > part of the nonce for each start up and then a global counter for the > > rest, which would be written whenever the page is updated (meaning it > > wouldn't have anything to do with the LSN and would be stored in the > > special area as Robert contemplated) would work for both primaries and > > replicas. > > Yea, it's not a bad approach. Particularly because it removes the need > to ensure that "global nonce counter" increments are guaranteed to be > durable. Right. > > For one, we'd probably want to get agreement on what we'd use to > > construct the tweak, for starters. > > Hm, isn't that just a pg_strong_random() and storing it encrypted? Perhaps it is, but at least in some other cases it's generated based on sector and block (which maybe could be relfilenode and block for us?): https://medium.com/asecuritysite-when-bob-met-alice/who-needs-a-tweak-meet-full-disk-encryption-437e720879ac Thanks, Stephen
On Thu, May 27, 2021 at 1:07 PM Andres Freund <andres@anarazel.de> wrote: > But won't we then end up with a different set of requirements around > nonce assignment durability when introducing GCM support? That's not > actually entirely trivial to do correctly on a standby. I guess we can > use AES-GCM-SIV and be ok with living with edge cases leading to nonce > reuse, but ... All these different encryption modes are hard for me to grok. That said, I want to mention a point which I think may be relevant here. As far as I know, in the case of a permanent table page, we never write X then X' then X again. If the change is WAL-logged, then the LSN advances, and it will never thereafter go backward. Otherwise, it's something that uses MarkBufferDirtyHint(). As far as I know, all of those changes are one-way. For example, we set hint bits without logging the change, but anything that clears hint bits is logged. We mark btree index items dead as a type of hint, but they never come back to life again; instead, they get cleaned out of the page entirely as a WAL-logged operation. So I don't know that an adversary seeing the same exact ciphertext multiple times is really likely to occur. Well, it could certainly occur for temporary or unlogged tables, since those have LSN = 0. And in cases where we currently copy pages around, like creating a new database, it could happen. I suspect those cases could be fixed, if we cared enough, and there are independent reasons to want to fix the create-new-database case. It would be fairly easy to put fake LSNs in temporary buffers, since they're in a separate pool of buffers in backend-private memory with a separate buffer manager. And it could probably even be done for unlogged tables, though not as easily. Or we could use the special-space technique to put some unpredictable garbage into each page and then change the garbage every time we write the page. I read the discussion so far to say that maybe these kinds of measures aren't even needed, and if so, great. But even without doing anything, I don't think it's going to happen very much. Another case where this sort of thing might happen is a standby doing whatever the master did. I suppose that could be avoided if the standby always has its own encryption keys, but that forces a key rotation when you create a standby, and it doesn't seem like a lot of fun to insist on that. But the information leak seems minor. If we get to a point where an adversary with full filesystem access on all our systems can't do better than assessing our replication lag, we'll be a lot better off then than we are now. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2021-05-27 15:22:21 -0400, Stephen Frost wrote: > I'm also not sure how much of the effort would really be duplicated. > > Were we to start with XTS, that's almost drop-in with what Bruce has > (actually, it should simplify some parts since we no longer need to deal > with making sure we always increase the LSN, etc), gives users more > flexibility in terms of getting to an encrypted cluster, and solves > certain use-cases. Very little of that seems like it would be ripped > out if we were to (also) provide a GCM option. > Now, if we were to *only* provide a GCM option then maybe we wouldn't > need to think about the XTS case of having to come up with a tweak > (though that seems like a rather small amount of code) but that would > also mean we need to change the page format and we can't do any kind of > binary/page-level transition to an encrypted cluster, like we could > with XTS. > Trying to break it down, the end-goal states look like: > > GCM-only: no binary upgrade path due to having to store the tag > XTS-only: no data integrity option > GCM+XTS: binary upgrade path for XTS, data integrity with GCM Why would GCM + XTS make sense? Especially if we were to go with AES-GCM-SIV or something, drastically reducing the danger of nonce reuse? And I don't think there's an easy way to do both using openssl, without double encrypting, which we'd obviously not want for performance reasons. And I don't think we'd want to implement either ourselves - leaving other dangers aside, I don't think we want to do the optimization work necessary to get good performance. > If we want both a binary upgrade path, and a data integrity option, then > it seems like the only end state which provides both is GCM+XTS, in > which case I don't think there's a lot of actual duplication. I honestly feel that Bruce's point about trying to shoot for the moon, and thus not getting the basic feature done, applies much more to the binary upgrade path than anything else. I think we should just stop aiming for that for now. If we later want to add code that goes through the cluster to ensure that there's enough space on each page for integrity data, to provide a migration path, fine. But we shouldn't make the binary upgrade path for TDE a hard requirement. > > > For one, we'd probably want to get agreement on what we'd use to > > > construct the tweak, for starters. > > > > Hm, isn't that just a pg_strong_random() and storing it encrypted? > > Perhaps it is, but at least in some other cases it's generated based on > sector and block (which maybe could be relfilenode and block for us?): > > https://medium.com/asecuritysite-when-bob-met-alice/who-needs-a-tweak-meet-full-disk-encryption-437e720879ac My understanding is that you'd use tweak_secret + block_offset or someop(tweak_secret, relfilenode) + block_offset to generate the actual per-block (in the 8192-byte, not 128-bit, sense) tweak. Greetings, Andres Freund
On Thu, May 27, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote: > Trying to break it down, the end-goal states look like: > > GCM-only: no binary upgrade path due to having to store the tag > XTS-only: no data integrity option > GCM+XTS: binary upgrade path for XTS, data integrity with GCM > > If we want both a binary upgrade path, and a data integrity option, then > it seems like the only end state which provides both is GCM+XTS, in > which case I don't think there's a lot of actual duplication. > > Perhaps there's an "XTS + some other data integrity approach" option > where we could preserve the page format by stuffing information into > another fork or maybe telling users to hash their data and store that > hash as another column which would allow us to avoid implementing GCM, > but I don't see a way to avoid having XTS if we are going to provide a > binary upgrade path. > > Perhaps AES-GCM-SIV would be interesting to consider in general, but > that still means we need to find space for the tag and that still > precludes a binary upgrade path. Anything that decouples features without otherwise losing ground is a win. If there are things A and B, such that A does encryption and B does integrity validation, and A and B can be turned on and off independently of each other, that is better than some otherwise-comparable C that provides both features. But I'm going to have to defer to you and Andres and whoever else on whether that's true for any encryption methods/modes in particular. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2021-05-27 15:48:09 -0400, Robert Haas wrote: > That said, I want to mention a point which I think may be relevant > here. As far as I know, in the case of a permanent table page, we > never write X then X' then X again. Well, there's crash recovery / restarts. And as previously explained they can end up with different page contents than before. > And in cases where we currently copy pages around, like creating a new > database, it could happen. As long as it's identical data that should be fine, except leaking that the data is identical. Which doesn't make me really concerned in case of template databases. > I suspect those cases could be fixed, if we cared enough, and there > are independent reasons to want to fix the create-new-database > case. It would be fairly easy to put fake LSNs in temporary buffers, > since they're in a separate pool of buffers in backend-private memory > with a separate buffer manager. And it could probably even be done for > unlogged tables, though not as easily. [...] I read > the discussion so far to say that maybe these kinds of measures aren't > even needed, and if so, great. But even without doing anything, I > don't think it's going to happen very much. What precisely are you referring to with "aren't even needed"? I don't see how the fake LSN approach can work for the crash recovery issues? > Or we could use the special-space technique to put some unpredictable > garbage into each page and then change the garbage every time we write > the page Unfortunately with CTR mode that doesn't provide much protection, if it's part of the encrypted data (vs IV/nonce). A one-bit change in the encrypted data only changes one bit in the unencrypted data, as the data is just XORd with the cipher stream. So random changes in one place don't prevent disclosure in other parts of the data if the nonce doesn't also change. And one can easily predict the effect of flipping certain bits. > Another case where this sort of thing might happen is a standby doing > whatever the master did. I suppose that could be avoided if the > standby always has its own encryption keys, but that forces a key > rotation when you create a standby, and it doesn't seem like a lot of > fun to insist on that. But the information leak seems minor. Which leaks seem minor? The "hole" issues leak all the prior contents of the hole, without needing any complicated analysis of the data, because one plain text is known (zeroes). Greetings, Andres Freund
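To spell out the malleability point: because CTR just XORs the plaintext with a keystream, flipping bit b of ciphertext byte i flips exactly bit b of plaintext byte i after decryption, with no key required. That is what makes the pd_upper manipulation from up-thread practical. A sketch (assuming PostgreSQL's PageHeaderData layout and a little-endian machine):

/*
 * Sketch of the bit-flipping attack. pd_upper sits at byte offset 14 of
 * PageHeaderData (8-byte pd_lsn + 2-byte pd_checksum + 2-byte pd_flags +
 * 2-byte pd_lower). Under CTR, flipping a ciphertext bit flips exactly
 * the corresponding plaintext bit, so this moves the decrypted pd_upper
 * by +/-8 - no key needed, and the rest of the page decrypts normally.
 */
static void
nudge_pd_upper(unsigned char *encrypted_page)
{
    encrypted_page[14] ^= 0x08;
}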
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2021-05-27 15:22:21 -0400, Stephen Frost wrote: > > I'm also not sure how much of the effort would really be duplicated. > > > > Were we to start with XTS, that's almost drop-in with what Bruce has > > (actually, it should simplify some parts since we no longer need to deal > > with making sure we always increase the LSN, etc) gives users more > > flexibility in terms of getting to an encrypted cluster and solves > > certain use-cases. Very little of that seems like it would be ripped > > out if we were to (also) provide a GCM option. > > > Now, if we were to *only* provide a GCM option then maybe we wouldn't > > need to think about the XTS case of having to come up with a tweak > > (though that seems like a rather small amount of code) but that would > > also mean we need to change the page format and we can't do any kind of > > binary/page-level transistion to an encrypted cluster, like we could > > with XTS. > > > Trying to break it down, the end-goal states look like: > > > > GCM-only: no binary upgrade path due to having to store the tag > > XTS-only: no data integrity option > > GCM+XTS: binary upgrade path for XTS, data integrity with GCM > > Why would GCM + XTS make sense? Especially if we were to go with > AES-GCM-SIV or something, drastically reducing the danger of nonce > reuse? You can't get to a GCM-based solution without changing the page format and therefore you can't get there using streaming replication or a pg_upgrade that does an encrypt step along with the copy. > And I don't think there's an easy way to do both using openssl, without > double encrypting, which we'd obviously not want for performance > reasons. And I don't think we'd want to implement either ourselves - > leaving other dangers aside, I don't think we want to do the > optimization work necessary to get good performance. Errrr, clearly a misunderstanding here- what I'm suggesting is that we'd have initdb options where someone could initdb and say they want XTS, OR they could initdb and say they want AES-GCM (or maybe AES-GCM-SIV). I'm not talking about doing both in the cluster at the same time.. Or, with XTS, we could have an option to pg_basebackup + encrypt into XTS to build an encrypted replica from an unencrypted cluster. There isn't any way we could do that with GCM though since we wouldn't have any place to put the tag. > > If we want both a binary upgrade path, and a data integrity option, then > > it seems like the only end state which provides both is GCM+XTS, in > > which case I don't think there's a lot of actual duplication. > > I honestly feel that Bruce's point about trying to shoot for the moon, > and thus not getting the basic feature done, applies much more to the > binary upgrade path than anything else. I think we should just stop > aiming for that for now. If we later want to add code that goes through > the cluster to ensure that there's enough space on each page for > integrity data, to provide a migration path, fine. But we shouldn't make > the binary upgrade path for TED a hard requirement. Ok, that's a pretty clear fundamental disagreement between you and Bruce. For my 2c, I tend to agree with you that the binary upgrade path isn't that critical. If we agree to forgo the binary upgrade requirement and are willing to accept Robert's approach to use the special area for the nonce+tag, or similar, then we could perhaps avoid the work of supporting XTS. 
> > > > For one, we'd probably want to get agreement on what we'd use to > > > > construct the tweak, for starters. > > > > > > Hm, isn't that just a pg_strong_random() and storing it encrypted? > > > > Perhaps it is, but at least in some other cases it's generated based on > > sector and block (which maybe could be relfilenode and block for us?): > > > > https://medium.com/asecuritysite-when-bob-met-alice/who-needs-a-tweak-meet-full-disk-encryption-437e720879ac > > My understanding is that you'd use > tweak_secret + block_offset > or > someop(tweak_secret, relfilenode) block_offset > > to generate the actual per-block (in the 8192 byte, not 128bit sense) tweak. The above article, at least, suggested encrypting the sector number using the second key and then multiplying that times 2^(block number), where those blocks were actually AES 128bit blocks. The article further claims that this is what's used in things like Bitlocker, TrueCrypt, VeraCrypt and OpenSSL. While the documentation isn't super clear, I'm taking that to mean that when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it with a 256-bit key (twice the size of the AES key length function), and you give it a 'tweak', that what you would actually be passing in would be the "sector number" in the above method, or for us perhaps it would be relfilenode+block number, or maybe just block number but it seems like it'd be better to include the relfilenode to me. OpenSSL docs: https://www.openssl.org/docs/man1.1.1/man3/EVP_aes_256_cbc.html Naturally, we would implement testing and use the NIST AES-XTS test vectors to verify that we're getting the correct results from OpenSSL based on this understanding. Still leaves us with the question of what exactly we should pass into OpenSSL as the 'tweak', if it should be the block offset inside the file only, or the block offset + relfilenode, or something else. Thanks, Stephen
On 2021-May-27, Andres Freund wrote: > On 2021-05-27 15:48:09 -0400, Robert Haas wrote: > > Another case where this sort of thing might happen is a standby doing > > whatever the master did. I suppose that could be avoided if the > > standby always has its own encryption keys, but that forces a key > > rotation when you create a standby, and it doesn't seem like a lot of > > fun to insist on that. But the information leak seems minor. > > Which leaks seem minor? The "hole" issues leak all the prior contents of > the hole, without needing any complicated analysis of the data, because > one plain text is known (zeroes). Maybe that problem could be solved by having PageRepairFragmentation, compactify_tuples et al always fill the hole with zeroes, in encrypted databases. -- Álvaro Herrera Valdivia, Chile
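For what it's worth, a minimal sketch of that suggestion (illustrative only; the real change would sit alongside the compaction code in bufpage.c):

#include <string.h>

/*
 * Sketch: zero the hole between pd_lower and pd_upper after compaction,
 * so stale tuple data never survives into the next encrypted write.
 */
static void
zero_page_hole(char *page, unsigned pd_lower, unsigned pd_upper)
{
    if (pd_upper > pd_lower)
        memset(page + pd_lower, 0, pd_upper - pd_lower);
}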
Hi, On 2021-05-27 16:09:13 -0400, Stephen Frost wrote: > * Andres Freund (andres@anarazel.de) wrote: > > On 2021-05-27 15:22:21 -0400, Stephen Frost wrote: > > > I'm also not sure how much of the effort would really be duplicated. > > > > > > Were we to start with XTS, that's almost drop-in with what Bruce has > > > (actually, it should simplify some parts since we no longer need to deal > > > with making sure we always increase the LSN, etc) gives users more > > > flexibility in terms of getting to an encrypted cluster and solves > > > certain use-cases. Very little of that seems like it would be ripped > > > out if we were to (also) provide a GCM option. > > > > > Now, if we were to *only* provide a GCM option then maybe we wouldn't > > > need to think about the XTS case of having to come up with a tweak > > > (though that seems like a rather small amount of code) but that would > > > also mean we need to change the page format and we can't do any kind of > > > binary/page-level transistion to an encrypted cluster, like we could > > > with XTS. > > > > > Trying to break it down, the end-goal states look like: > > > > > > GCM-only: no binary upgrade path due to having to store the tag > > > XTS-only: no data integrity option > > > GCM+XTS: binary upgrade path for XTS, data integrity with GCM > > > [...] > > And I don't think there's an easy way to do both using openssl, without > > double encrypting, which we'd obviously not want for performance > > reasons. And I don't think we'd want to implement either ourselves - > > leaving other dangers aside, I don't think we want to do the > > optimization work necessary to get good performance. > > Errrr, clearly a misunderstanding here- what I'm suggesting is that we'd > have initdb options where someone could initdb and say they want XTS, OR > they could initdb and say they want AES-GCM (or maybe AES-GCM-SIV). I'm > not talking about doing both in the cluster at the same time.. Ah, that makes more sense ;). So the end goal states are the different paths we could take? > Still leaves us with the question of what exactly we should pass into > OpenSSL as the 'tweak', if it should be the block offset inside the > file only, or the block offset + relfilenode, or something else. I think it has to include the relfilenode as a minimum. It'd not be great if you could identify equivalent blocks in different tables. It might even be worth complicating createdb() a bit and including the dboid as well. Greetings, Andres Freund
Hi, On 2021-05-27 16:13:44 -0400, Alvaro Herrera wrote: > Maybe that problem could be solved by having PageRepairFragmentation, > compactify_tuples et al always fill the hole with zeroes, in encrypted > databases. If that were the only issue, maybe. But there are plenty of other places where similar things happen. Look at all the stuff that needs to be masked out for wal consistency checking (checkXLogConsistency() + all the things it calls). And there's no way proposed to actually have a maintainable way of detecting omissions around this. Greetings, Andres Freund
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2021-05-27 16:09:13 -0400, Stephen Frost wrote: > > * Andres Freund (andres@anarazel.de) wrote: > > > On 2021-05-27 15:22:21 -0400, Stephen Frost wrote: > > > > I'm also not sure how much of the effort would really be duplicated. > > > > > > > > Were we to start with XTS, that's almost drop-in with what Bruce has > > > > (actually, it should simplify some parts since we no longer need to deal > > > > with making sure we always increase the LSN, etc) gives users more > > > > flexibility in terms of getting to an encrypted cluster and solves > > > > certain use-cases. Very little of that seems like it would be ripped > > > > out if we were to (also) provide a GCM option. > > > > > > > Now, if we were to *only* provide a GCM option then maybe we wouldn't > > > > need to think about the XTS case of having to come up with a tweak > > > > (though that seems like a rather small amount of code) but that would > > > > also mean we need to change the page format and we can't do any kind of > > > > binary/page-level transistion to an encrypted cluster, like we could > > > > with XTS. > > > > > > > Trying to break it down, the end-goal states look like: > > > > > > > > GCM-only: no binary upgrade path due to having to store the tag > > > > XTS-only: no data integrity option > > > > GCM+XTS: binary upgrade path for XTS, data integrity with GCM > > > > > [...] > > > And I don't think there's an easy way to do both using openssl, without > > > double encrypting, which we'd obviously not want for performance > > > reasons. And I don't think we'd want to implement either ourselves - > > > leaving other dangers aside, I don't think we want to do the > > > optimization work necessary to get good performance. > > > > Errrr, clearly a misunderstanding here- what I'm suggesting is that we'd > > have initdb options where someone could initdb and say they want XTS, OR > > they could initdb and say they want AES-GCM (or maybe AES-GCM-SIV). I'm > > not talking about doing both in the cluster at the same time.. > > Ah, that makes more sense ;). So the end goal states are the different > paths we could take? The end goals are different possible things we could provide support for, not in one cluster, but in one build of PG. That is, we could add support in v15 (or whatever) for: initdb --encryption-type=AES-XTS and then in v16 add support for: initdb --encryption-type=AES-GCM (or AES-GCM-SIV, whatever) while keeping support for AES-XTS. Users who just want encryption could go do a binary upgrade of some kind to a cluster which has AES-XTS encryption, but to get GCM they'd have to initialize a new cluster and migrate data to it using logical replication or pg_dump/restore. There's also been requests for other possible encryption options, so I don't think these would even be the only options eventually, though I do think we'd probably have them broken down into "just encryption" or "encryption + data integrity" with the same resulting limitations regarding the ability to do binary upgrades. > > Still leaves us with the question of what exactly we should pass into > > OpenSSL as the 'tweak', if it should be the block offset inside the > > file only, or the block offset + relfilenode, or something else. > > I think it has to include the relfilenode as a minimum. It'd not be > great if you could identify equivalent blocks in different tables. It > might even be worth complicating createdb() a bit and including the > dboid as well. 
At this point I'm wondering if it's just: dboid/relfilenode:block-offset and then we hash it to whatever size EVP_CIPHER_iv_length(AES-XTS-128) (or -256, whatever we're using based on what was passed to initdb) returns. Thanks, Stephen
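A sketch of that derivation (hypothetical helper; assumes OpenSSL's SHA256() and the 16-byte tweak the AES-XTS ciphers expect, which is what EVP_CIPHER_iv_length() reports for them):

#include <stdio.h>
#include <inttypes.h>
#include <string.h>
#include <openssl/sha.h>

/*
 * Hypothetical sketch: derive the per-block XTS tweak by hashing
 * "dboid/relfilenode:blkno" and truncating the 32-byte SHA-256 digest
 * to the 16-byte tweak size.
 */
static void
make_tweak(unsigned char tweak[16], uint32_t dboid,
           uint32_t relfilenode, uint32_t blkno)
{
    char buf[64];
    unsigned char hash[SHA256_DIGEST_LENGTH];

    snprintf(buf, sizeof(buf), "%" PRIu32 "/%" PRIu32 ":%" PRIu32,
             dboid, relfilenode, blkno);
    SHA256((const unsigned char *) buf, strlen(buf), hash);
    memcpy(tweak, hash, 16);    /* truncate the digest */
}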
On Thu, May 27, 2021 at 4:04 PM Andres Freund <andres@anarazel.de> wrote: > On 2021-05-27 15:48:09 -0400, Robert Haas wrote: > > That said, I want to mention a point which I think may be relevant > > here. As far as I know, in the case of a permanent table page, we > > never write X then X' then X again. > > Well, there's crash recovery / restarts. And as previously explained > they can end up with different page contents than before. Right, I'm not trying to oversell this point ... if in system XYZ there's a serious security exposure from ever repeating a page write, we should not use system XYZ unless we do some work to make sure that every page write is different. But if we just think it would be nicer if page writes didn't repeat, that's probably *mostly* true today already. > I don't see how the fake LSN approach can work for the crash recovery > issues? I wasn't trying to say it could. You've convinced me on that point. > > Or we could use the special-space technique to put some unpredictable > > garbage into each page and then change the garbage every time we write > > the page > > Unfortunately with CTR mode that doesn't provide much protection, if > it's part of the encrypted data (vs IV/nonce). A one-bit change in the > encrypted data only changes one bit in the unencrypted data, as the data > is just XORd with the cipher stream. So random changes in one place > don't prevent disclosure in other parts of the data if the nonce > doesn't also change. And one can easily predict the effect of flipping > certain bits. Yeah, I wasn't talking about CTR mode there. I was just saying if we wanted to avoid ever repeating a write. > > Another case where this sort of thing might happen is a standby doing > > whatever the master did. I suppose that could be avoided if the > > standby always has its own encryption keys, but that forces a key > > rotation when you create a standby, and it doesn't seem like a lot of > > fun to insist on that. But the information leak seems minor. > > Which leaks seem minor? The "hole" issues leak all the prior contents of > the hole, without needing any complicated analysis of the data, because > one plain text is known (zeroes). No. You're confusing what I was saying here, in the context of your comments about the limitations of AES-GCM-SIV, with the discussion with Bruce about nonce generation. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, May 27, 2021 at 04:09:13PM -0400, Stephen Frost wrote: > The above article, at least, suggested encrypting the sector number > using the second key and then multiplying that times 2^(block number), > where those blocks were actually AES 128bit blocks. The article further > claims that this is what's used in things like Bitlocker, TrueCrypt, > VeraCrypt and OpenSSL. > > While the documentation isn't super clear, I'm taking that to mean that > when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it > with a 256-bit key (twice the size of the AES key length function), and > you give it a 'tweak', that what you would actually be passing in would > be the "sector number" in the above method, or for us perhaps it would > be relfilenode+block number, or maybe just block number but it seems > like it'd be better to include the relfilenode to me. If you go in that direction, you should make sure pg_upgrade preserves what you use (it does not preserve relfilenode, just pg_class.oid), and CREATE DATABASE still works with a simple file copy. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Thu, May 27, 2021 at 04:09:13PM -0400, Stephen Frost wrote: > > The above article, at least, suggested encrypting the sector number > > using the second key and then multiplying that times 2^(block number), > > where those blocks were actually AES 128bit blocks. The article further > > claims that this is what's used in things like Bitlocker, TrueCrypt, > > VeraCrypt and OpenSSL. > > > > While the documentation isn't super clear, I'm taking that to mean that > > when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it > > with a 256-bit key (twice the size of the AES key length function), and > > you give it a 'tweak', that what you would actually be passing in would > > be the "sector number" in the above method, or for us perhaps it would > > be relfilenode+block number, or maybe just block number but it seems > > like it'd be better to include the relfilenode to me. > > If you go in that direction, you should make sure pg_upgrade preserves > what you use (it does not preserve relfilenode, just pg_class.oid), and > CREATE DATABASE still works with a simple file copy. Ah, yes, good point, if we support in-place pg_upgrade of an encrypted cluster then the tweak has to be consistent between the old and new. I tend to agree with Andres that it'd be reasonable to make CREATE DATABASE do a bit more work for an encrypted cluster though, so I'm less concerned about that. Using pg_class.oid instead of relfilenode seems likely to complicate things like crash recovery though, wouldn't it? I wonder if there's something else we could use. Thanks, Stephen
Hi, On 2021-05-27 16:55:29 -0400, Robert Haas wrote: > No. You're confusing what I was saying here, in the context of your > comments about the limitations of AES-GCM-SIV, with the discussion > with Bruce about nonce generation. Ah. I think the focus on LSNs confused me a bit. FWIW: NIST guidance on IVs for AES GCM (surprisingly readable): https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf AES-GCM-SIV (harder to read): https://eprint.iacr.org/2017/168.pdf Greetings, Andres Freund
On Thu, May 27, 2021 at 11:12 PM Bruce Momjian <bruce@momjian.us> wrote:
Well, the code now does write full page images for hint bit changes, so
it should work fine.
Yes, indeed it works well and I have tested it. But here I want to set out my understanding of the argument; if there is any problem, please help me correct it.
1. Why couldn't we just throw away the hint bit changes and simply not encrypt them?
Maybe we could leave *pd_flags* unencrypted; we wouldn't need to re-encrypt when it changes, and there would be no security risk. But many other changes also call *MarkBufferDirtyHint* without needing to be WAL-logged, and we can't leave all of them unencrypted, so the "throw them away, don't encrypt them" approach is not feasible.
2. Why can we accept the performance degradation that checksums cause in this way, but not for TDE?
Checksums must be implemented this way, but for TDE perhaps we can find another way to avoid that cost.
3. Another benefit of using the special space is that it can also be used by AES-GCM to support integrity checking.
I'm just a beginner with PG and may not have considered some obvious problems, but please let me put forward my rough idea again -- why can't we simply use LSN+blockNum+checksum as the nonce?
When checksums are enabled, every call to *MarkBufferDirtyHint* generates a new LSN, so we can simply use LSN+blockNum+0000 as the nonce.
When checksums are disabled, we can use the unused checksum field as a counter, so that we get a different nonce even when we don't write a new WAL record.
There is no royal road to learning.
HighGo Software Co.
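A minimal sketch of the layout proposed above, assuming a 14-byte nonce (8-byte LSN, 4-byte block number, 2-byte counter) and ignoring endianness concerns for brevity:

    #include <string.h>
    #include <stdint.h>

    /*
     * Sketch of the proposed nonce: 8-byte LSN, 4-byte block number, and a
     * 2-byte counter (zero when checksums are enabled, since each
     * MarkBufferDirtyHint then advances the LSN; otherwise the unused
     * checksum field supplies the counter).  GCM accepts IV lengths other
     * than 12 bytes when the IV length is set explicitly.
     */
    static void
    build_nonce(unsigned char nonce[14],
                uint64_t lsn, uint32_t blkno, uint16_t counter)
    {
        memcpy(nonce, &lsn, sizeof(lsn));
        memcpy(nonce + 8, &blkno, sizeof(blkno));
        memcpy(nonce + 12, &counter, sizeof(counter));
    }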
On Fri, May 28, 2021 at 2:12 PM Neil Chen <carpenter.nail.cz@gmail.com> wrote:
> When checksums are disabled, we can use the unused checksum field as a counter, so that we get a different nonce even when we don't write a new WAL record.
Ah, well, I think I've figured it out for myself. In this way, we can't protect against torn pages...
There is no royal road to learning.
HighGo Software Co.
On Thu, May 27, 2021 at 04:36:23PM -0400, Stephen Frost wrote: > At this point I'm wondering if it's just: > > dboid/relfilenode:block-offset > > and then we hash it to whatever size EVP_CIPHER_iv_length(AES-XTS-128) > (or -256, whatever we're using based on what was passed to initdb) > returns. FYI, the dboid is not preserved by pg_upgrade. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
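As a sketch, the hashing step described above might look like this; the identifier format follows Stephen's text, and the choice of SHA-256 for the hash is an assumption:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <openssl/evp.h>

    /*
     * Sketch only: derive a tweak by hashing "dboid/relfilenode:block" and
     * truncating the digest to the cipher's IV length.  SHA-256 is an
     * assumption; any stable hash of at least that many bytes would do.
     */
    static int
    derive_tweak(uint32_t dboid, uint32_t relfilenode, uint32_t blkno,
                 unsigned char *tweak_out)
    {
        char        ident[64];
        unsigned char md[EVP_MAX_MD_SIZE];
        unsigned int mdlen;
        int         ivlen = EVP_CIPHER_iv_length(EVP_aes_128_xts());

        snprintf(ident, sizeof(ident), "%u/%u:%u", dboid, relfilenode, blkno);
        if (!EVP_Digest(ident, strlen(ident), md, &mdlen, EVP_sha256(), NULL))
            return 0;
        memcpy(tweak_out, md, ivlen);   /* truncate 32-byte digest to 16 */
        return ivlen;
    }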
Hi, On 2021-05-27 17:00:23 -0400, Bruce Momjian wrote: > If you go in that direction, you should make sure pg_upgrade preserves > what you use (it does not preserve relfilenode, just pg_class.oid) Is there a reason for pg_upgrade not to maintain relfilenode, aside from implementation simplicity (which is a good reason!). The fact that the old and new clusters have different relfilenodes does make inspecting some things a bit harder. It'd be harder to adjust the relfilenode to match between old/new cluster if pg_upgrade needed to deal with relmapper using relations (i.e. ones where pg_class.relfilenode isn't used because they need to be accessed to read pg_class, or because they're shared), but it doesn't need to. Greetings, Andres Freund
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2021-05-27 17:00:23 -0400, Bruce Momjian wrote: > > If you go in that direction, you should make sure pg_upgrade preserves > > what you use (it does not preserve relfilenode, just pg_class.oid) > > Is there a reason for pg_upgrade not to maintain relfilenode, aside from > implementation simplicity (which is a good reason!). The fact that the old and > new clusters have different relfilenodes does make inspecting some things a > bit harder. This was discussed for a bit during the Unconference (though it was related to backups and major upgrades which involves replicas) and the general consensus seemed to be that, no, it wasn't for any specific reason beyond that pg_upgrade didn't need to preserve relfilenode and therefore didn't. There was a discussion around if there were possibly any pitfalls that we might run into, should we try to have pg_upgrade preserve relfilenodes but I don't *think* there were any actual show stoppers that came up. The simplest approach, I would think, would be to have it do the same thing that it does for OIDs today- basically have pg_dump in binary mode emit a function call to inform the backend of what relfilenode to use for the next CREATE statement. We would need to also pass into that function if the table should have a TOAST table and what the relfilenode for that should be too, for the base table. We'd need to also handle indexes, mat views, etc, of course. > It'd be harder to adjust the relfilenode to match between old/new cluster if > pg_upgrade needed to deal with relmapper using relations (i.e. ones where > pg_class.relfilenode isn't used because they need to be accessed to read > pg_class, or because they're shared), but it doesn't need to. Right, and we generally shouldn't need to worry about conflicts arising from relfilenodes used by catalog tables since the new cluster should be a freshly initdb'd cluster and everything in the fresh catalog should be below the relfilenode values we use for user relations. There did seem to generally be some usefulness to having relfilenodes preserved across major version upgrades beyond TDE and that's a pretty independent project that could be tackled independently of TDE efforts. Thanks, Stephen
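A sketch of that approach on the pg_dump side; binary_upgrade_set_next_heap_relfilenode() is a hypothetical function named here for illustration, while the pg_class OID setter is the existing one, and upgrade_buffer/pg_class_oid/relfilenode are assumed to come from pg_dump's binary-upgrade context:

    /*
     * Sketch only: in pg_dump's binary-upgrade path, emit a call telling the
     * backend which relfilenode the next CREATE should use, alongside the
     * existing next-OID call.  binary_upgrade_set_next_heap_relfilenode() is
     * hypothetical; today only the pg_class OID setters exist.
     */
    appendPQExpBufferStr(upgrade_buffer,
                         "\n-- For binary upgrade, must preserve pg_class oids and relfilenodes\n");
    appendPQExpBuffer(upgrade_buffer,
                      "SELECT pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('%u'::pg_catalog.oid);\n",
                      pg_class_oid);
    appendPQExpBuffer(upgrade_buffer,
                      "SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('%u'::pg_catalog.oid);\n",
                      relfilenode);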
On Mon, May 31, 2021 at 4:16 PM Stephen Frost <sfrost@snowman.net> wrote: > There did seem to generally be some usefulness to having relfilenodes > preserved across major version upgrades beyond TDE and that's a pretty > independent project that could be tackled independently of TDE efforts. +1. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, May 31, 2021 at 04:16:52PM -0400, Stephen Frost wrote: > Greetings, > > * Andres Freund (andres@anarazel.de) wrote: > > On 2021-05-27 17:00:23 -0400, Bruce Momjian wrote: > > > If you go in that direction, you should make sure pg_upgrade preserves > > > what you use (it does not preserve relfilenode, just pg_class.oid) > > > > Is there a reason for pg_upgrade not to maintain relfilenode, aside from > > implementation simplicity (which is a good reason!). The fact that the old and > > new clusters have different relfilenodes does make inspecting some things a > > bit harder. > > This was discussed for a bit during the Unconference (though it was > related to backups and major upgrades which involves replicas) and the > general consensus seemed to be that, no, it wasn't for any specific > reason beyond that pg_upgrade didn't need to preserve relfilenode and > therefore didn't. Yes, David Steele wanted it so incremental backups after pg_upgrade were smaller, which makes sense. > There was a discussion around if there were possibly any pitfalls that > we might run into, should we try to have pg_upgrade preserve > relfilenodes but I don't *think* there were any actual show stoppers > that came up. The simplest approach, I would think, would be to have it > do the same thing that it does for OIDs today- basically have pg_dump in > binary mode emit a function call to inform the backend of what > relfilenode to use for the next CREATE statement. We would need to also > pass into that function if the table should have a TOAST table and what > the relfilenode for that should be too, for the base table. We'd need > to also handle indexes, mat views, etc, of course. Yes, exactly. The pg_upgrade.c paragraph says: * We control all assignments of pg_class.oid (and relfilenode) so toast * oids are the same between old and new clusters. This is important * because toast oids are stored as toast pointers in user tables. * * While pg_class.oid and pg_class.relfilenode are initially the same * in a cluster, they can diverge due to CLUSTER, REINDEX, or VACUUM * FULL. In the new cluster, pg_class.oid and pg_class.relfilenode will * be the same and will match the old pg_class.oid value. Because of * this, old/new pg_class.relfilenode values will not match if CLUSTER, * REINDEX, or VACUUM FULL have been performed in the old cluster. One tricky case is pg_largeobject, which is copied from the old to new cluster since it has user data. To preserve that relfilenode, you would need to have pg_upgrade perform cluster surgery in each database to renumber its relfilenode to match since it is created by initdb. I can't think of a case where pg_upgrade already does something like that. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote: > For these reasons, if we decide to go in the direction of using a > non-LSN nonce, I no longer plan to continue working on this feature. I > would rather work on things that have a more positive impact. Maybe a > non-LSN nonce is a better long-term plan, but there are too many > unknowns and complexity for me to feel comfortable with it. As stated above, I have no plans to continue working on this feature. I am attaching my final patches here in case anyone wants to make use of them; it passes check-world and all my private tests. I have removed my patches from the feature wiki page: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption and replaced it with a link to this email. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Attachments:
- cfe-01-doc_over_master.diff.gz
- cfe-02-internaldoc_over_cfe-01-doc.diff.gz
- cfe-03-scripts_over_cfe-02-internaldoc.diff.gz
- cfe-04-common_over_cfe-03-scripts.diff.gz
- cfe-05-crypto_over_cfe-04-common.diff.gz
- cfe-06-backend_over_cfe-05-crypto.diff.gz
- cfe-07-bin_over_cfe-06-backend.diff.gz
- cfe-08-pg_alterckey_over_cfe-07-bin.diff.gz
- cfe-09-test_over_cfe-08-pg_alterckey.diff.gz
- cfe-10-hint_over_cfe-09-test.diff.gz
- cfe-11-gist_over_cfe-10-hint.diff
- cfe-12-rel_over_cfe-11-gist.diff.gz
On Sat, Jun 26, 2021 at 2:52 AM Bruce Momjian <bruce@momjian.us> wrote: > > On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote: > > For these reasons, if we decide to go in the direction of using a > > non-LSN nonce, I no longer plan to continue working on this feature. I > > would rather work on things that have a more positive impact. Maybe a > > non-LSN nonce is a better long-term plan, but there are too many > > unknowns and complexity for me to feel comfortable with it. > > As stated above, I have no plans to continue working on this feature. I > am attaching my final patches here in case anyone wants to make use of > them; it passes check-world and all my private tests. I have removed > my patches from the feature wiki page: > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption > > and replaced it with a link to this email. The patch does not apply on Head anymore, could you rebase and post a patch? I'm changing the status to "Waiting for Author". Regards, Vignesh
On Wed, Jul 14, 2021 at 09:45:12PM +0530, vignesh C wrote: > On Sat, Jun 26, 2021 at 2:52 AM Bruce Momjian <bruce@momjian.us> wrote: > > > > On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote: > > > For these reasons, if we decide to go in the direction of using a > > > non-LSN nonce, I no longer plan to continue working on this feature. I > > > would rather work on things that have a more positive impact. Maybe a > > > non-LSN nonce is a better long-term plan, but there are too many > > > unknowns and complexity for me to feel comfortable with it. > > > > As stated above, I have no plans to continue working on this feature. I > > am attaching my final patches here in case anyone wants to make use of > > them; it passes check-world and all my private tests. I have removed > > my patches from the feature wiki page: > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption > > > > and replaced it with a link to this email. > > The patch does not apply on Head anymore, could you rebase and post a > patch? I'm changing the status to "Waiting for Author". Oh, I forgot this was in the commitfest. I have marked it as Withdrawn. Sorry for the confusion. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Fri, May 28, 2021 at 2:39 AM Stephen Frost <sfrost@snowman.net> wrote: > > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Thu, May 27, 2021 at 04:09:13PM -0400, Stephen Frost wrote: > > > The above article, at least, suggested encrypting the sector number > > > using the second key and then multiplying that times 2^(block number), > > > where those blocks were actually AES 128bit blocks. The article further > > > claims that this is what's used in things like Bitlocker, TrueCrypt, > > > VeraCrypt and OpenSSL. > > > > > > While the documentation isn't super clear, I'm taking that to mean that > > > when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it > > > with a 256-bit key (twice the size of the AES key length function), and > > > you give it a 'tweak', that what you would actually be passing in would > > > be the "sector number" in the above method, or for us perhaps it would > > > be relfilenode+block number, or maybe just block number but it seems > > > like it'd be better to include the relfilenode to me. > > > > If you go in that direction, you should make sure pg_upgrade preserves > > what you use (it does not preserve relfilenode, just pg_class.oid), and > > CREATE DATABASE still works with a simple file copy. > > Ah, yes, good point, if we support in-place pg_upgrade of an encrypted > cluster then the tweak has to be consistent between the old and new. > > I tend to agree with Andres that it'd be reasonable to make CREATE > DATABASE do a bit more work for an encrypted cluster though, so I'm less > concerned about that. > > Using pg_class.oid instead of relfilenode seems likely to complicate > things like crash recovery though, wouldn't it? I wonder if there's > something else we could use. > Hi, I have extracted the preserving-relfilenode-and-dboid work from [1] and rebased it on the current head. While testing it I found a few issues. - Variable 'dbDumpId' was not initialized before being passed to ArchiveEntry() in the dumpDatabase() function, due to which pg_upgrade was failing with a 'bad dumpId' error - The 'create_storage' flag was set to TRUE irrespective of relkind, which resulted in hitting an assert when the source cluster had a TYPE in it. - In the createdb() flow, 'dboid' was set to the preserved dboid in the wrong place. It was eventually overwritten and caused problems while restoring the DB - Removed the restriction on dumping the postgres DB OID I have fixed all the issues and now the patch is working as expected. [1] https://www.postgresql.org/message-id/7082.1562337694@localhost Regards, Shruthi KC EnterpriseDB: http://www.enterprisedb.com
preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Wed, Aug 11, 2021 at 3:41 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > I have fixed all the issues and now the patch is working as expected. Hi, I'm changing the subject line since the patch does something which was discussed on that thread but isn't really related to the old email subject. In general, I think this patch is uncontroversial and in reasonably good shape. However, there's one part that I'm not too sure about. If Tom Lane happens to be paying attention to this thread, I think his feedback would be particularly useful, since he knows a lot about the inner workings of pg_dump. Opinions from anybody else would be great, too. Anyway, here's the hunk that worries me: + + /* + * Need a separate entry, otherwise the command will be run in the + * same transaction as the CREATE DATABASE command, which is not + * allowed. + */ + ArchiveEntry(fout, + dbCatId, /* catalog ID */ + dbDumpId, /* dump ID */ + ARCHIVE_OPTS(.tag = datname, + .owner = dba, + .description = "SET_DB_OID", + .section = SECTION_PRE_DATA, + .createStmt = setDBIdQry->data, + .dropStmt = NULL)); + To me, adding a separate TOC entry for a thing that is not really a separate object seems like a scary hack that might come back to bite us. Unfortunately, I don't know enough about pg_dump to say exactly how it might come back to bite us, which leaves wide open the possibility that I am completely wrong.... I just think it's the intention that archive entries correspond to actual objects in the database, not commands that we want executed in some particular order. If that criticism is indeed correct, then my proposal would be to instead add a WITH OID = nnn option to CREATE DATABASE and allow it to be used only in binary upgrade mode. That has the disadvantage of being inconsistent with the way that we preserve OIDs everywhere else, but the only other alternatives are (1) do something like the above, (2) remove the requirement that CREATE DATABASE run in its own transaction, and (3) give up. (2) sounds hard and (3) is unappealing. The rest of this email will be detailed review comments on the patch as presented, and thus probably only interesting to someone actually working on the patch. Feel free to skip if that's not you. - I suggest splitting the patch into one portion that deals with database OID and another portion that deals with tablespace OID and relfilenode OID, or maybe splitting it all the way into three separate patches, one for each. This could allow the uncontroversial parts to get committed first while we're wondering what to do about the problem described above. - There are two places in the patch, one in dumpDatabase() and one in generate_old_dump() where blank lines are removed with no other changes. Please avoid whitespace-only hunks. - If possible, please try to pgindent the new code. It's pretty good what you did, but e.g. the declaration of binary_upgrade_next_pg_tablespace_oid probably has less whitespace than pgindent is going to want. - The comments in dumpDatabase() claim that "postgres" and "template1" are handled specially in some way, but there seems to be no code that matches those comments. - heap_create()'s logic around setting create_storage looks slightly redundant. I'm not positive what would be better, but ... suppose you just took the part that's currently gated by if (!IsBinaryUpgrade) and did it unconditionally. Then put if (IsBinaryUpgrade) around the else clause, but delete the last bit from there that sets create_storage. 
Maybe we'd still want a comment saying that it's intentional that create_storage = true even though it will be overwritten later, but then, I think, we wouldn't need to set create_storage in two different places. Maybe I'm wrong. - If we're not going to do that, then I think you should swap the if and else clauses and reverse the sense of the test. In createdb(), CreateTableSpace(), and a bunch of existing places, we do if (IsBinaryUpgrade) { ... } else { ... } so I don't think it makes sense for this one to instead do if (!IsBinaryUpgrade) { ... } else { ... }. - I'm not sure that I'd bother renaming binary_upgrade_set_pg_class_oids_and_relfilenodes(). It's such a long name, and a relfilenode is kind of an OID, so the current name isn't even really wrong. I'd probably drop the header comment too, since it seems rather obvious. But both of these things are judgement calls. - Inside that function, there is a comment that says "Indexes cannot have toast tables, so we need not make this probe in the index code path." However, you have moved the code from someplace where it didn't happen for indexes to someplace where it happens for both tables and indexes. Therefore the comment, which was true when the code was where it was before, is now false. So you need to update it. - It is not clear to me why pg_upgrade's Makefile needs to be changed to include -DFRONTEND in CPPFLAGS. All of the .c files in this directory include postgres_fe.h rather than postgres.h, and that file has #define FRONTEND 1. Moreover, there are no actual code changes in this directory, so why should the Makefile need any change? - A couple of comment changes - and the commit message - mention data encryption, but that's not a feature that this patch implements, nor are we committed to adding it in the immediate future (or ever, really). So I think those places should be revised to say that we do this because we want the filenames to match between the old and new clusters, and leave the reasons why that might be a good thing up to the reader's imagination. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes: > To me, adding a separate TOC entry for a thing that is not really a > separate object seems like a scary hack that might come back to bite > us. Unfortunately, I don't know enough about pg_dump to say exactly > how it might come back to bite us, which leaves wide open the > possibility that I am completely wrong.... I just think it's the > intention that archive entries correspond to actual objects in the > database, not commands that we want executed in some particular order. I agree, this seems like a moderately bad idea. It could get broken either by executing only one of the TOC entries during restore, or by executing them in the wrong order. The latter possibility could be forestalled by adding a dependency, which I do not see this hunk doing, which is clearly a bug. The former possibility would require user intervention, so maybe it's in the category of "if you break this you get to keep both pieces". Still, it's ugly. > If that criticism is indeed correct, then my proposal would be to > instead add a WITH OID = nnn option to CREATE DATABASE and allow it to > be used only in binary upgrade mode. If it's not too complicated to implement, that seems like an OK idea from here. I don't have any great love for the way we handle OID preservation in binary upgrade mode, so not doing it exactly the same way for databases doesn't seem like a disadvantage. regards, tom lane
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Stephen Frost
Greetings, * Tom Lane (tgl@sss.pgh.pa.us) wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > To me, adding a separate TOC entry for a thing that is not really a > > separate object seems like a scary hack that might come back to bite > > us. Unfortunately, I don't know enough about pg_dump to say exactly > > how it might come back to bite us, which leaves wide open the > > possibility that I am completely wrong.... I just think it's the > > intention that archive entries correspond to actual objects in the > > database, not commands that we want executed in some particular order. > > I agree, this seems like a moderately bad idea. It could get broken > either by executing only one of the TOC entries during restore, or > by executing them in the wrong order. The latter possibility could > be forestalled by adding a dependency, which I do not see this hunk > doing, which is clearly a bug. The former possibility would require > user intervention, so maybe it's in the category of "if you break > this you get to keep both pieces". Still, it's ugly. Yeah, agreed. > > If that criticism is indeed correct, then my proposal would be to > > instead add a WITH OID = nnn option to CREATE DATABASE and allow it to > > be used only in binary upgrade mode. > > If it's not too complicated to implement, that seems like an OK idea > from here. I don't have any great love for the way we handle OID > preservation in binary upgrade mode, so not doing it exactly the same > way for databases doesn't seem like a disadvantage. Also agreed on this, though I wonder- do we actually need to explicitly make CREATE DATABASE q WITH OID = 1234; only work during binary upgrade mode in the backend? That strikes me as perhaps doing more work than we really need to while also preventing something that users might actually like to do. Either way, we'll need to check that the OID given to us can be used for the database, I'd think. Having pg_dump only include it in binary upgrade mode is fine though. Thanks, Stephen
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Tom Lane
Stephen Frost <sfrost@snowman.net> writes: > Also agreed on this, though I wonder- do we actually need to explicitly > make CREATE DATABASE q WITH OID = 1234; only work during binary upgrade > mode in the backend? That strikes me as perhaps doing more work than we > really need to while also preventing something that users might actually > like to do. There should be adequate defenses against a duplicate OID already, so +1 --- no reason to insist this only be used during binary upgrade. Actually though ... I've not read the patch, but what does it do about the fact that the postgres and template0 DBs do not have stable OIDs? I cannot imagine any way to force those to match across PG versions that would not be an unsustainable crock. regards, tom lane
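A sketch of how createdb() might absorb such an option; the option name, the reserved-OID guard, and its interaction with binary upgrade mode are all assumptions here, and the duplicate-OID defense Tom mentions is the existing unique index on pg_database.oid:

    /*
     * Sketch only: a fragment of createdb() accepting an explicit OID.
     * The option name "oid" and the reserved-range check are assumptions
     * for illustration.
     */
    Oid         dboid = InvalidOid;
    ListCell   *option;

    foreach(option, stmt->options)
    {
        DefElem    *defel = (DefElem *) lfirst(option);

        if (strcmp(defel->defname, "oid") == 0)
            dboid = (Oid) defGetInt64(defel);
        /* ... existing option handling ... */
    }

    if (OidIsValid(dboid) && dboid < FirstNormalObjectId && !IsBinaryUpgrade)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("OIDs below %u are reserved for system objects",
                        FirstNormalObjectId)));

    /*
     * Fall back to normal OID assignment; either way, the unique index
     * on pg_database.oid rejects duplicates.
     */
    if (!OidIsValid(dboid))
        dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId,
                                   Anum_pg_database_oid);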
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Tue, Aug 17, 2021 at 12:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Actually though ... I've not read the patch, but what does it do about > the fact that the postgres and template0 DBs do not have stable OIDs? > I cannot imagine any way to force those to match across PG versions > that would not be an unsustainable crock. Well, it's interesting that you mention that, because there's a comment in the patch that probably has to do with this: + /* + * Make sure that pg_upgrade does not change database OID. Don't care + * about "postgres" database, backend will assign it fixed OID anyway. + * ("template1" has fixed OID too but the value 1 should not collide with + * any other OID so backend pays no attention to it.) + */ I wasn't able to properly understand that comment, and to be honest I'm not sure I precisely understand your concern either. I don't quite see why the template0 database matters. I think that database isn't going to be dumped, or restored, so as far as pg_upgrade is concerned it might as well not exist in either cluster, and I don't see why pg_upgrade can't therefore just ignore it completely. But template1 and postgres are another matter. If I understand correctly, those databases are going to be created in the new cluster by initdb, but then pg_upgrade is going to populate them with data - including relation files - from the old cluster. And, yeah, I don't see how we could make those database OIDs match, which is not great. To be honest, what I'd be inclined to do about that is just nail down those OIDs for future releases. In fact, I'd probably go so far as to hardcode that in such a way that even if you drop those databases and recreate them, they get recreated with the same hard-coded OID. Now that doesn't do anything to create stability when people upgrade from an old release to a current one, but I don't really see that as an enormous problem. The only hard requirement for this feature is if we use the database OID for some kind of encryption or integrity checking or checksum type feature. Then, you want to avoid having the database OID change when you upgrade, so that the encryption or integrity check or checksum in question does not have to be recomputed for every page as part of pg_upgrade. But, that only matters if you're going between two releases that support that feature, which will not be the case if you're upgrading from some old release. Apart from that kind of feature, it still seems like a nice-to-have to keep database OIDs the same, but if those cases end up as exceptions, oh well. Does that seem reasonable, or am I missing something big? -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes: > I wasn't able to properly understand that comment, and to be honest > I'm not sure I precisely understand your concern either. I don't quite > see why the template0 database matters. I think that database isn't > going to be dumped, or restored, so as far as pg_upgrade is concerned > it might as well not exist in either cluster, and I don't see why > pg_upgrade can't therefore just ignore it completely. But template1 > and postgres are another matter. If I understand correctly, those > databases are going to be created in the new cluster by initdb, but > then pg_upgrade is going to populate them with data - including > relation files - from the old cluster. Right. If pg_upgrade explicitly ignores template0 then its OID need not be stable ... at least, not unless there's a chance it could conflict with some other database OID, which would become a live possibility if we let users get at "WITH OID = n". (Having said that, I'm not sure that pg_upgrade special-cases template0, or that it should do so.) > To be honest, what I'd be inclined to do about that is just nail down > those OIDs for future releases. Yeah, I was thinking along similar lines. > In fact, I'd probably go so far as to > hardcode that in such a way that even if you drop those databases and > recreate them, they get recreated with the same hard-coded OID. Less sure that this is a good idea, though. In particular, I do not think that you can make it work in the face of alter database template1 rename to oops; create database template1; > The only hard requirement for this feature is if we > use the database OID for some kind of encryption or integrity checking > or checksum type feature. It's fairly unclear to me why that is so important as to justify the amount of klugery that this line of thought seems to be bringing. regards, tom lane
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Tom Lane
I wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> The only hard requirement for this feature is if we >> use the database OID for some kind of encryption or integrity checking >> or checksum type feature. > It's fairly unclear to me why that is so important as to justify the > amount of klugery that this line of thought seems to be bringing. And, not to put too fine a point on it, how will you possibly do that without entirely breaking CREATE DATABASE? regards, tom lane
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Tue, Aug 17, 2021 at 11:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Aug 17, 2021 at 12:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Actually though ... I've not read the patch, but what does it do about > > the fact that the postgres and template0 DBs do not have stable OIDs? > > I cannot imagine any way to force those to match across PG versions > > that would not be an unsustainable crock. > > Well, it's interesting that you mention that, because there's a > comment in the patch that probably has to do with this: > > + /* > + * Make sure that pg_upgrade does not change database OID. Don't care > + * about "postgres" database, backend will assign it fixed OID anyway. > + * ("template1" has fixed OID too but the value 1 should not collide with > + * any other OID so backend pays no attention to it.) > + */ > In the original patch, the author intended to avoid dumping the postgres DB OID like below: + if (dopt->binary_upgrade && dbCatId.oid != PostgresDbOid) Since the postgres OID is not hardcoded/fixed, I removed the check. My bad, I missed updating the comment. Sorry for the confusion. Regards, Shruthi KC EnterpriseDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Tue, Aug 17, 2021 at 11:56:30AM -0400, Robert Haas wrote: > On Wed, Aug 11, 2021 at 3:41 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > > I have fixed all the issues and now the patch is working as expected. > > Hi, > > I'm changing the subject line since the patch does something which was > discussed on that thread but isn't really related to the old email > subject. In general, I think this patch is uncontroversial and in > reasonably good shape. However, there's one part that I'm not too sure > about. If Tom Lane happens to be paying attention to this thread, I > think his feedback would be particularly useful, since he knows a lot > about the inner workings of pg_dump. Opinions from anybody else would > be great, too. Anyway, here's the hunk that worries me: What is the value of preserving db/ts/relfilenode OIDs? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Tue, Aug 17, 2021 at 1:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Right. If pg_upgrade explicitly ignores template0 then its OID > need not be stable ... at least, not unless there's a chance it > could conflict with some other database OID, which would become > a live possibility if we let users get at "WITH OID = n". Well, that might be a good reason not to let them do that, then, at least for n<64k. > > In fact, I'd probably go so far as to > > hardcode that in such a way that even if you drop those databases and > > recreate them, they get recreated with the same hard-coded OID. > > Less sure that this is a good idea, though. In particular, I do not > think that you can make it work in the face of > alter database template1 rename to oops; > create database template1; That is a really good point. If we can't categorically force the OID of those databases to have a particular, fixed value, and based on this example that seems to be impossible, then there's always a possibility that we might find a value in the old cluster that doesn't happen to match what is present in the new cluster. Seen from that angle, the problem is really with databases that are pre-existent in the new cluster but whose contents still need to be dumped. Maybe we could (optionally? conditionally?) drop those databases from the new cluster and then recreate them with the OID that we want them to have. > > The only hard requirement for this feature is if we > > use the database OID for some kind of encryption or integrity checking > > or checksum type feature. > > It's fairly unclear to me why that is so important as to justify the > amount of klugery that this line of thought seems to be bringing. Well, I think it would make sense to figure out how small we can make the kludge first, and then decide whether it's larger than we can tolerate. From my point of view, I completely understand why people to whom those kinds of features are important want to include all the fields that make up a buffer tag in the checksum or other integrity check. Right now, if somebody copies a page from one place to another, or if the operating system fumbles things and switches some pages around, we have no direct way of detecting that anything bad has happened. This is not the only problem that would need to be solved in order to fix that, but it's one of them, and I don't particularly see why it's not a valid goal. It's not as if a 16-bit checksum that is computed in exactly the same way for every page in the cluster is such state-of-the-art technology that only fools question its surpassing excellence. -- Robert Haas EDB: http://www.enterprisedb.com
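As a rough sketch of that idea, a verifier could fold the rest of the buffer tag into the existing page checksum; the mixing function below is arbitrary and purely illustrative, not a proposed on-disk format:

    /*
     * Sketch only: fold the whole buffer tag into the page checksum so that
     * a page copied to another database, relation, or block fails
     * verification.  pg_checksum_page() already mixes in blkno; the extra
     * XOR here is purely illustrative.
     */
    static uint16
    checksum_page_with_tag(char *page, Oid dboid, Oid relfilenode,
                           BlockNumber blkno)
    {
        uint32      seed;

        seed = dboid;
        seed = seed * 0x85ebca6b + relfilenode;   /* arbitrary odd multiplier */
        seed = seed * 0x85ebca6b + blkno;

        return (uint16) (pg_checksum_page(page, blkno) ^ (seed & 0xffff));
    }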
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
> The rest of this email will be detailed review comments on the patch > as presented, and thus probably only interesting to someone actually > working on the patch. Feel free to skip if that's not you. > > - I suggest splitting the patch into one portion that deals with > database OID and another portion that deals with tablespace OID and > relfilenode OID, or maybe splitting it all the way into three separate > patches, one for each. This could allow the uncontroversial parts to > get committed first while we're wondering what to do about the problem > described above. Thanks Robert for your comments. I have split the patch into two portions. One that handles DB OID and the other that handles tablespace OID and relfilenode OID. > - There are two places in the patch, one in dumpDatabase() and one in > generate_old_dump() where blank lines are removed with no other > changes. Please avoid whitespace-only hunks. These changes are avoided. > - If possible, please try to pgindent the new code. It's pretty good > what you did, but e.g. the declaration of > binary_upgrade_next_pg_tablespace_oid probably has less whitespace > than pgindent is going to want. Taken care of in the latest patches. > - The comments in dumpDatabase() claim that "postgres" and "template1" > are handled specially in some way, but there seems to be no code that > matches those comments. The comment is removed. > - heap_create()'s logic around setting create_storage looks slightly > redundant. I'm not positive what would be better, but ... suppose you > just took the part that's currently gated by if (!IsBinaryUpgrade) and > did it unconditionally. Then put if (IsBinaryUpgrade) around the else > clause, but delete the last bit from there that sets create_storage. > Maybe we'd still want a comment saying that it's intentional that > create_storage = true even though it will be overwritten later, but > then, I think, we wouldn't need to set create_storage in two different > places. Maybe I'm wrong. > > - If we're not going to do that, then I think you should swap the if > and else clauses and reverse the sense of the test. In createdb(), > CreateTableSpace(), and a bunch of existing places, we do if > (IsBinaryUpgrade) { ... } else { ... } so I don't think it makes sense > for this one to instead do if (!IsBinaryUpgrade) { ... } else { ... }. I have avoided the redundant code and removed the comment, as it does not make sense now that we are setting create_storage conditionally. (In the original patch, create_storage was set to TRUE by default for the binary upgrade case, which was wrong and was hitting an assert in the following flow.) > - I'm not sure that I'd bother renaming > binary_upgrade_set_pg_class_oids_and_relfilenodes(). It's such a long > name, and a relfilenode is kind of an OID, so the current name isn't > even really wrong. I'd probably drop the header comment too, since it > seems rather obvious. But both of these things are judgement calls. I agree. I have retained the old function name. > - Inside that function, there is a comment that says "Indexes cannot > have toast tables, so we need not make this probe in the index code > path." However, you have moved the code from someplace where it didn't > happen for indexes to someplace where it happens for both tables and > indexes. Therefore the comment, which was true when the code was where > it was before, is now false. So you need to update it. The comment is updated. 
> - It is not clear to me why pg_upgrade's Makefile needs to be changed > to include -DFRONTEND in CPPFLAGS. All of the .c files in this > directory include postgres_fe.h rather than postgres.h, and that file > has #define FRONTEND 1. Moreover, there are no actual code changes in > this directory, so why should the Makefile need any change? The Makefile change is removed. > - A couple of comment changes - and the commit message - mention data > encryption, but that's not a feature that this patch implements, nor > are we committed to adding it in the immediate future (or ever, > really). So I think those places should be revised to say that we do > this because we want the filenames to match between the old and new > clusters, and leave the reasons why that might be a good thing up to > the reader's imagination. Taken care of. Regards, Shruthi KC EnterpriseDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Fri, Aug 20, 2021 at 1:36 PM Shruthi Gowda <gowdashru@gmail.com> wrote: > Thanks Robert for your comments. > I have split the patch into two portions. One that handles DB OID and > the other that > handles tablespace OID and relfilenode OID. It's pretty clear from the discussion, I think, that the database OID one is going to need rework to be considered. Regarding the other one: - The comment in binary_upgrade_set_pg_class_oids() is still not accurate. You removed the sentence which says "Indexes cannot have toast tables, so we need not make this probe in the index code path" but the immediately preceding sentence is still inaccurate in at least two ways. First, it only talks about tables, but the code now applies to indexes. Second, it only talks about OIDs, but now also deals with relfilenodes. It's really important to fully update every comment that might be affected by your changes! - The SQL query in that function isn't completely correct. There is a left join from pg_class to pg_index whose ON clause includes "c.reltoastrelid = i.indrelid AND i.indisvalid." The reason it's like that is that it is possible, in corner cases, for a TOAST table to have multiple TOAST indexes. I forget exactly how that happens, but I think it might be like if a REINDEX CONCURRENTLY on the TOAST table fails midway through, or something of that sort. Now if that happens, the LEFT JOIN you added is going to cause the output to contain multiple rows, because you didn't replicate the i.indisvalid condition into that ON clause. And then it will fail. Apparently we don't have a pg_upgrade test case for this scenario; we probably should. Actually what I think would be even better than putting i.indisvalid into that ON clause would be to join off of i.indrelid rather than c.reltoastrelid. - The code that decodes the various columns of this query does so in a slightly different order than the query itself. It would be better to make it match. Perhaps put relkind first in both cases. I might also think about trying to make the column naming a bit more consistent, e.g. relkind, relfilenode, toast_oid, toast_relfilenode, toast_index_oid, toast_index_relfilenode. - In heap_create(), the wording of the error messages is not quite consistent. You have "relfilenode value not set when in binary upgrade mode", "toast relfilenode value not set when in binary upgrade mode", and "pg_class index relfilenode value not set when in binary upgrade mode". Why does the last one mention pg_class when the other two don't? - The code in heap_create() now has no comments whatsoever, which is a shame, because it's actually kind of a tricky bit of logic. Someone might wonder why we override the relfilenode inside that function instead of doing it at the same places where we absorb binary_upgrade_next_{heap,index,toast}_pg_class_oid and then passing down the relfilenode. I think the answer is that passing down the relfilenode from the caller would result in storage not actually being created, whereas in this case we want it to be created but just with the value we specify, and the reason we want that is because we need later DDL that happens after these statements but before the old cluster's relations are moved over to execute successfully, which it won't if the storage is altogether absent. However, that raises the question of whether this patch has even got the basic design right. 
Maybe we ought to actually be absorbing the relfilenode setting at the same places where we're doing so for the OID, and then passing an additional parameter to heap_create() like bool suppress_storage or something like that. Maybe, taking it even further, we ought to be changing the signatures of binary_upgrade_next_heap_pg_class_oid and friends to be two-argument functions, and pass down the OID and the relfilenode in the same call, rather than calling two separate functions. I'm not so much concerned about the cost of calling two functions as the potential for confusion. I'm not honestly sure that either of these changes are the right thing to do, but I am pretty strongly inclined to do at least the first part - trying to absorb reloid and relfilenode in the same places. If we're not going to do that we certainly need to explain why we're doing it the way we are in the comments. It's not really this patch's fault, but it would sure be nice if we had some better testing for this area. Suppose this patch somehow changed nothing from the present behavior. How would we know? Or suppose it managed to somehow set all the relfilenodes in the new cluster to random values rather than the intended one? There's no automated testing that would catch any of that, and it's not obvious how it could be added to test.sh. I suppose what we really need to do at some point is rewrite that as a TAP test, but that seems like a separate project from this patch. -- Robert Haas EDB: http://www.enterprisedb.com
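A sketch of the two-argument variant suggested at the end; the function name and the binary_upgrade_next_heap_relfilenode global are hypothetical:

    /*
     * Sketch only: a two-argument binary-upgrade setter that absorbs the
     * next heap's pg_class OID and relfilenode in one call.  The name and
     * the binary_upgrade_next_heap_relfilenode global are hypothetical.
     */
    Datum
    binary_upgrade_set_next_heap_oid_and_relfilenode(PG_FUNCTION_ARGS)
    {
        CHECK_IS_BINARY_UPGRADE;
        binary_upgrade_next_heap_pg_class_oid = PG_GETARG_OID(0);
        binary_upgrade_next_heap_relfilenode = PG_GETARG_OID(1);

        PG_RETURN_VOID();
    }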
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Stephen Frost
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Aug 20, 2021 at 1:36 PM Shruthi Gowda <gowdashru@gmail.com> wrote: > > Thanks Robert for your comments. > > I have split the patch into two portions. One that handles DB OID and > > the other that > > handles tablespace OID and relfilenode OID. > > It's pretty clear from the discussion, I think, that the database OID > one is going to need rework to be considered. Regarding that ... I have to wonder just what promises we feel we've made when it comes to what a user is expected to be able to do with the new cluster *before* pg_upgrade is run on it. For my part, I sure feel like it's "nothing", in which case it seems like we can do things that we can't do with a running system, like literally just DROP and recreate with the correct OID of any databases we need to, or even push that back to the user to do that at initdb time with some kind of error thrown by pg_upgrade during the --check phase. "Initial databases have non-standard OIDs, recreate destination cluster with initdb --with-oid=12341" or something along those lines. Also open to the idea of simply forcing 'template1' to always being OID=1 even if it's dropped/recreated and then just dropping/recreating the template0 and postgres databases if they've got different OIDs than what the old cluster did- after all, they should be getting entirely re-populated as part of the pg_upgrade process itself. Thanks, Stephen
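A sketch of the drop-and-recreate idea; the WITH OID syntax is the proposal from upthread, not an existing feature, and identifier quoting is omitted for brevity:

    /*
     * Sketch only: during pg_upgrade, recreate a database that initdb
     * already created in the new cluster so it keeps the OID it had in the
     * old cluster.  "CREATE DATABASE ... WITH OID = n" is the syntax
     * proposed upthread, not an existing feature.
     */
    static void
    recreate_database_with_oid(PGconn *conn, const char *datname, Oid old_oid)
    {
        /*
         * DROP/CREATE DATABASE cannot run inside a transaction block, so
         * issue them as two separate commands.
         */
        PQclear(executeQueryOrDie(conn, "DROP DATABASE %s", datname));
        PQclear(executeQueryOrDie(conn, "CREATE DATABASE %s WITH OID = %u",
                                  datname, old_oid));
    }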
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Mon, Aug 23, 2021 at 04:57:31PM -0400, Robert Haas wrote: > On Fri, Aug 20, 2021 at 1:36 PM Shruthi Gowda <gowdashru@gmail.com> wrote: > > Thanks Robert for your comments. > > I have split the patch into two portions. One that handles DB OID and > > the other that > > handles tablespace OID and relfilenode OID. > > It's pretty clear from the discussion, I think, that the database OID > one is going to need rework to be considered. I assume this patch is not going to be applied until there is an actual use case for preserving these values. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Tue, Aug 24, 2021 at 5:59 AM Bruce Momjian <bruce@momjian.us> wrote: > > On Mon, Aug 23, 2021 at 04:57:31PM -0400, Robert Haas wrote: > > On Fri, Aug 20, 2021 at 1:36 PM Shruthi Gowda <gowdashru@gmail.com> wrote: > > > Thanks Robert for your comments. > > > I have split the patch into two portions. One that handles DB OID and > > > the other that > > > handles tablespace OID and relfilenode OID. > > > > It's pretty clear from the discussion, I think, that the database OID > > one is going to need rework to be considered. > > I assume this patch is not going to be applied until there is an actual > use case for preserving these values. JFI, I added an entry to the commitfest for this patch: https://commitfest.postgresql.org/34/3296/
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Mon, Aug 23, 2021 at 5:12 PM Stephen Frost <sfrost@snowman.net> wrote: > Regarding that ... I have to wonder just what promises we feel we've > made when it comes to what a user is expected to be able to do with the > new cluster *before* pg_upgrade is run on it. For my part, I sure feel > like it's "nothing", in which case it seems like we can do things that > we can't do with a running system, like literally just DROP and recreate > with the correct OID of any databases we need to, or even push that back > to the user to do that at initdb time with some kind of error thrown by > pg_upgrade during the --check phase. "Initial databases have > non-standard OIDs, recreate destination cluster with initdb > --with-oid=12341" or something along those lines. Yeah, possibly. Honestly, I find it weird that pg_upgrade expects the new cluster to already exist. It seems like it would be more sensible if it created the cluster itself. That's not entirely trivial, because for example you have to create it with the correct locale settings and stuff. But if you require the cluster to exist already, then you run into the kinds of questions that you're asking here, and whether the answer is "nothing" as you propose here or something more than that, it's clearly not "whatever you want" nor anything close to that. -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Mon, Aug 23, 2021 at 8:29 PM Bruce Momjian <bruce@momjian.us> wrote: > I assume this patch is not going to be applied until there is an actual > use case for preserving these values. My interpretation of the preceding discussion was that several people thought this change was a good idea regardless of whether anything ever happens with TDE, so I wasn't seeing a reason to wait. Personally, I've always thought that it was quite odd that pg_upgrade didn't preserve the relfilenode values, so I'm in favor of the change. I bet we could even make some simplifications to that code if we got all of this sorted out, which seems like it would be nice. I think it was also mentioned that this might be nice for pgBackRest, which apparently permits incremental backups across major version upgrades but likes filenames to match. That being said, if you or somebody else thinks that this is a bad idea or that the reasons offered up until now are insufficient, feel free to make that argument. I just work here... -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Aug 23, 2021 at 8:29 PM Bruce Momjian <bruce@momjian.us> wrote: >> I assume this patch is not going to be applied until there is an actual >> use case for preserving these values. > ... > That being said, if you or somebody else thinks that this is a bad > idea or that the reasons offered up until now are insufficient, feel > free to make that argument. I just work here... Per upthread discussion, it seems impractical to fully guarantee that database OIDs match, which seems to mean that the whole premise collapses. Like Bruce, I want to see a plausible use case justifying any partial-guarantee scenario before we add more complication (= bugs) to pg_upgrade. regards, tom lane
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Tue, Aug 24, 2021 at 12:04:00PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Mon, Aug 23, 2021 at 8:29 PM Bruce Momjian <bruce@momjian.us> wrote: > >> I assume this patch is not going to be applied until there is an actual > >> use case for preserving these values. > > > ... > > > That being said, if you or somebody else thinks that this is a bad > > idea or that the reasons offered up until now are insufficient, feel > > free to make that argument. I just work here... > > Per upthread discussion, it seems impractical to fully guarantee > that database OIDs match, which seems to mean that the whole premise > collapses. Like Bruce, I want to see a plausible use case justifying > any partial-guarantee scenario before we add more complication (= bugs) > to pg_upgrade. Yes, pg_upgrade is already complex enough, so why add more complexity for some cosmetic value. (I think "cosmetic" flew out the window with pg_upgrade long ago. ;-) ) I know that pgBackRest has asked for stable relfilenodes to make incremental file system backups after pg_upgrade smaller, but if we want to make relfilenodes stable, we had better understand that that is _why_ we are adding this complexity. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Tue, Aug 24, 2021 at 11:28:37AM -0400, Robert Haas wrote: > On Mon, Aug 23, 2021 at 8:29 PM Bruce Momjian <bruce@momjian.us> wrote: > > I assume this patch is not going to be applied until there is an actual > > use case for preserving these values. > > My interpretation of the preceding discussion was that several people > thought this change was a good idea regardless of whether anything > ever happens with TDE, so I wasn't seeing a reason to wait. > Personally, I've always thought that it was quite odd that pg_upgrade > didn't preserve the relfilenode values, so I'm in favor of the change. > I bet we could even make some simplifications to that code if we got > all of this sorted out, which seems like it would be nice. Yes, if this ends up being a cleanup with no added complexity, that would be nice, but I had not seen how that was possible in the past. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Tue, Aug 24, 2021 at 11:24:21AM -0400, Robert Haas wrote: > On Mon, Aug 23, 2021 at 5:12 PM Stephen Frost <sfrost@snowman.net> wrote: > > Regarding that ... I have to wonder just what promises we feel we've > > made when it comes to what a user is expected to be able to do with the > > new cluster *before* pg_upgrade is run on it. For my part, I sure feel > > like it's "nothing", in which case it seems like we can do things that > > we can't do with a running system, like literally just DROP and recreate > > with the correct OID of any databases we need to, or even push that back > > to the user to do that at initdb time with some kind of error thrown by > > pg_upgrade during the --check phase. "Initial databases have > > non-standard OIDs, recreate destination cluster with initdb > > --with-oid=12341" or something along those lines. > > Yeah, possibly. Honestly, I find it weird that pg_upgrade expects the > new cluster to already exist. It seems like it would be more sensible > if it created the cluster itself. That's not entirely trivial, because > for example you have to create it with the correct locale settings and > stuff. But if you require the cluster to exist already, then you run > into the kinds of questions that you're asking here, and whether the > answer is "nothing" as you propose here or something more than that, > it's clearly not "whatever you want" nor anything close to that. Yes, it is a trade-off. If we had pg_upgrade create the new cluster, the pg_upgrade instructions would be simpler, but pg_upgrade would be more complex since it has to adjust _everything_ properly so pg_upgrade works --- I never got to that point, but I am willing to explore what would be required. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Tue, Aug 24, 2021 at 12:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Per upthread discussion, it seems impractical to fully guarantee > that database OIDs match, which seems to mean that the whole premise > collapses. Like Bruce, I want to see a plausible use case justifying > any partial-guarantee scenario before we add more complication (= bugs) > to pg_upgrade. I think you might be overlooking the emails from Stephen and I where we suggested how that could be made to work? -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Stephen Frost
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Aug 23, 2021 at 5:12 PM Stephen Frost <sfrost@snowman.net> wrote: > > Regarding that ... I have to wonder just what promises we feel we've > > made when it comes to what a user is expected to be able to do with the > > new cluster *before* pg_upgrade is run on it. For my part, I sure feel > > like it's "nothing", in which case it seems like we can do things that > > we can't do with a running system, like literally just DROP and recreate > > with the correct OID of any databases we need to, or even push that back > > to the user to do that at initdb time with some kind of error thrown by > > pg_upgrade during the --check phase. "Initial databases have > > non-standard OIDs, recreate destination cluster with initdb > > --with-oid=12341" or something along those lines. > > Yeah, possibly. Honestly, I find it weird that pg_upgrade expects the > new cluster to already exist. It seems like it would be more sensible > if it created the cluster itself. That's not entirely trivial, because > for example you have to create it with the correct locale settings and > stuff. But if you require the cluster to exist already, then you run > into the kinds of questions that you're asking here, and whether the > answer is "nothing" as you propose here or something more than that, > it's clearly not "whatever you want" nor anything close to that. Yeah, I'd had a similar thought and also tend to agree that it'd make more sense for pg_upgrade to set up the new cluster too, and doing so in a way that makes sure that it matches the old cluster as that's rather important. Having the user do it also implies that there is some freedom for the user to mess around with the new cluster before running pg_upgrade, it seems to me anyway, and that's certainly not something that we've built anything into pg_upgrade to deal with cleanly.. It isn't like initdb takes all *that* long to run either, and reducing the number of steps that the user has to take to perform an upgrade sure seems like a good thing to do. Anyhow, just wanted to throw that out there as another way we might approach this. Thanks, Stephen
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Tue, Aug 24, 2021 at 12:43 PM Bruce Momjian <bruce@momjian.us> wrote: > Yes, it is a trade-off. If we had pg_upgrade create the new cluster, > the pg_upgrade instructions would be simpler, but pg_upgrade would be > more complex since it has to adjust _everything_ properly so pg_upgrade > works --- I never got to that point, but I am willing to explore what > would be required. It's probably a topic for another thread, rather than this one, but I think that would be very cool. -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Stephen Frost
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Tue, Aug 24, 2021 at 12:43 PM Bruce Momjian <bruce@momjian.us> wrote: > > Yes, it is a trade-off. If we had pg_upgrade create the new cluster, > > the pg_upgrade instructions would be simpler, but pg_upgrade would be > > more complex since it has to adjust _everything_ properly so pg_upgrade > > works --- I never got to that point, but I am willing to explore what > > would be required. > > It's probably a topic for another thread, rather than this one, but I > think that would be very cool. Yes, definite +1 on this. Thanks, Stephen
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Tue, Aug 24, 2021 at 12:43:20PM -0400, Bruce Momjian wrote: > Yes, it is a trade-off. If we had pg_upgrade create the new cluster, > the pg_upgrade instructions would be simpler, but pg_upgrade would be > more complex since it has to adjust _everything_ properly so pg_upgrade > works --- I never got to that point, but I am willing to explore what > would be required. One other issue --- the more that pg_upgrade preserves, the more likely pg_upgrade will break when some internal changes happen in Postgres. Therefore, if you want pg_upgrade to preserve something, you have to have a good reason --- even code simplicity might not be a sufficient reason. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Tue, Aug 24, 2021 at 2:16 PM Bruce Momjian <bruce@momjian.us> wrote: > One other issue --- the more that pg_upgrade preserves, the more likely > pg_upgrade will break when some internal changes happen in Postgres. > Therefore, if you want pg_upgrade to preserve something, you have to > have a good reason --- even code simplicity might not be a sufficient > reason. While I accept that as a general principle, I don't think it's really applicable in this case. pg_upgrade already knows all about relfilenodes; it has a source file called relfilenode.c. I don't see that a pg_upgrade that preserves relfilenodes is any more or less likely to break in the future than a pg_upgrade that renumbers all the files so that the relation OID and the relfilenode are equal. You've got about the same amount of reliance on the on-disk layout either way. -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Tue, Aug 24, 2021 at 02:34:26PM -0400, Robert Haas wrote: > On Tue, Aug 24, 2021 at 2:16 PM Bruce Momjian <bruce@momjian.us> wrote: > > One other issue --- the more that pg_upgrade preserves, the more likely > > pg_upgrade will break when some internal changes happen in Postgres. > > Therefore, if you want pg_upgrade to preserve something, you have to > > have a good reason --- even code simplicity might not be a sufficient > > reason. > > While I accept that as a general principle, I don't think it's really > applicable in this case. pg_upgrade already knows all about > relfilenodes; it has a source file called relfilenode.c. I don't see > that a pg_upgrade that preserves relfilenodes is any more or less > likely to break in the future than a pg_upgrade that renumbers all the > files so that the relation OID and the relfilenode are equal. You've > got about the same amount of reliance on the on-disk layout either > way. I was making more of a general statement that preservation can be problematic and its impact must be researched. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Tue, Aug 17, 2021 at 2:50 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Less sure that this is a good idea, though. In particular, I do not > > think that you can make it work in the face of > > alter database template1 rename to oops; > > create database template1; > > That is a really good point. If we can't categorically force the OID > of those databases to have a particular, fixed value, and based on > this example that seems to be impossible, then there's always a > possibility that we might find a value in the old cluster that doesn't > happen to match what is present in the new cluster. Seen from that > angle, the problem is really with databases that are pre-existent in > the new cluster but whose contents still need to be dumped. Maybe we > could (optionally? conditionally?) drop those databases from the new > cluster and then recreate them with the OID that we want them to have.

Actually, we do that already. create_new_objects() runs pg_restore with --create for most databases, but with --clean --create for template1 and postgres. This means that template1 and postgres will always be recreated in the new cluster, and other databases are assumed not to exist in the new cluster and the upgrade will fail if they unexpectedly do. And the reason why pg_upgrade does that is that it wants to "propagate [the] database-level properties" of postgres and template1. So suppose we just make the database OID one of the database-level properties that we want to propagate. That should mostly just work, but where can things go wrong?

The only real failure mode is we try to create a database in the new cluster and find out that the OID is already in use. If the new OID that collides is >64k, then the user has messed with the new cluster before doing that. And since pg_upgrade is pretty clearly already assuming that you shouldn't do that, it's fine to also make that assumption in this case. We can disregard such cases as user error.

If the new OID that collides is <64k, then it must be colliding with template0, template1, or postgres in the new cluster, because those are the only databases that can have such OIDs since, currently, we don't allow users to specify an OID for a new database. And the problem cannot be with template1, because we hard-code its OID to 1. If there is a database with OID 1 in either cluster, it must be template1, and if there is a database with OID 1 in both clusters, it must be template1 in both cases, and we'll just drop and recreate it with OID 1 and everything is fine. So we need only consider template0 and postgres, which are created with system-generated OIDs. And, it would be no issue if either of those databases had the same OID in the old and new cluster, so the only possible OID collision is one where the same system-generated OID was assigned to one of those databases in the old cluster and to the other in the new cluster.

First consider the case where template0 has OID, say, 13000, in the old cluster, and postgres has that OID in the new cluster. No problem occurs, because template0 isn't transferred anyway. The reverse direction is a problem, though. If postgres had been assigned OID 13000 in the old cluster and, by sheer chance, template0 had that OID in the new cluster, then the upgrade would fail, because it wouldn't be able to recreate the postgres database with the correct OID.

But that doesn't seem very difficult to fix. I think all we need to do is have initdb assign a fixed OID to template0 at creation time. Then, in any new release to which someone might be trying to upgrade, the system-generated OID assigned to postgres in the old release can't match the fixed OID assigned to template0 in the new release, so the one problem case is ruled out. We do need, however, to make sure that the assign-my-database-a-fixed-OID syntax is either entirely restricted to initdb & pg_upgrade or at least that OIDs < 64k can only be assigned in one of those modes. Otherwise, some creative person could manufacture new problem cases by setting up the source database so that the OID of one of their databases matches the fixed OID we gave to template0 or template1, or the system-generated OID for postgres in the new cluster.

In short, as far as I can see, all we need to do to preserve database OIDs across pg_upgrade is:

1. Add a new syntax for creating a database with a given OID, and use it in pg_dump --binary-upgrade.
2. Don't let users use it at least for OIDs <64k, or maybe just don't let them use it at all.
3. But let initdb use it, and have initdb set the initial OID for template0 to a fixed value < 10000. If the user changes it later, no problem; the cluster into which they are upgrading won't contain any databases with high-numbered OIDs.

Anyone see a flaw in that analysis? -- Robert Haas EDB: http://www.enterprisedb.com
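To make steps 2 and 3 above concrete, here is a minimal sketch of the gate such a syntax would need, assuming the check lives in createdb(). FirstNormalObjectId (16384), IsBinaryUpgrade, and IsBootstrapProcessingMode() are existing backend symbols; check_new_db_oid() and its dboid argument are hypothetical names for illustration, not from any posted patch:

    #include "postgres.h"
    #include "access/transam.h"     /* FirstNormalObjectId */
    #include "miscadmin.h"          /* IsBinaryUpgrade, IsBootstrapProcessingMode() */

    /*
     * Sketch only: reject a user-supplied database OID below the normal
     * range unless we are running under initdb (bootstrap) or pg_upgrade
     * (binary-upgrade mode).
     */
    static void
    check_new_db_oid(Oid dboid)
    {
        if (OidIsValid(dboid) &&
            dboid < FirstNormalObjectId &&
            !IsBinaryUpgrade &&
            !IsBootstrapProcessingMode())
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("database OIDs below %u are reserved",
                            FirstNormalObjectId)));
    }

Whether plain CREATE DATABASE exposes the OID clause at all, as debated below, changes only the condition, not the shape of the check.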
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote: > Anyone see a flaw in that analysis? I am still waiting to hear the purpose of this preservation. As long as you don't apply the patch, I guess I will just stop asking. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Thu, Aug 26, 2021 at 11:24 AM Bruce Momjian <bruce@momjian.us> wrote: > On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote: > > Anyone see a flaw in that analysis? > > I am still waiting to hear the purpose of this preservation. As long as > you don't apply the patch, I guess I will just stop asking. You make it sound like I didn't answer that question the last time you asked it, but I did.[1] I went back to the previous thread and found that, in fact, there's at least one email *from you* appearing to endorse that concept for reasons unrelated to TDE[2] and another where you appear to agree that it would be useful for TDE to do it.[3] Stephen Frost also wrote up his discussion during the Unconference and some of his reasons for liking the idea.[4] If you've changed your mind about this being a good idea, or if you no longer think it's useful without TDE, that's fine. Everyone is entitled to change their opinion. But then please say that straight out. It baffles me why you're now acting as if it hasn't been discussed when it clearly has been, and both you and I were participants in that discussion. [1] https://www.postgresql.org/message-id/CA+Tgmob7msyh3VRaY87USr22UakvvSyy4zBaQw2AO2CfoUD3rA@mail.gmail.com [2] https://www.postgresql.org/message-id/20210601140949.GC22012@momjian.us [3] https://www.postgresql.org/message-id/20210527210023.GJ5646@momjian.us [4] https://www.postgresql.org/message-id/20210531201652.GY20766@tamriel.snowman.net -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Stephen Frost
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote: > > Anyone see a flaw in that analysis? > > I am still waiting to hear the purpose of this preservation. As long as > you don't apply the patch, I guess I will just stop asking. I'm a bit confused why this question keeps coming up as we've discussed multiple reasons (incremental backups, possible use for TDE which would make this required, general improved sanity when working with pg_upgrade is frankly a benefit in its own right too...). If the additional code was a huge burden or even a moderate one then that might be an argument against, but it hardly sounds like it will be given Robert's thorough analysis so far and the (admittedly not complete, but not that far from it based on the DB OID review) proposed patch. Thanks, Stephen
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Stephen Frost
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Tue, Aug 17, 2021 at 2:50 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > Less sure that this is a good idea, though. In particular, I do not > > > think that you can make it work in the face of > > > alter database template1 rename to oops; > > > create database template1; > > > > That is a really good point. If we can't categorically force the OID > > of those databases to have a particular, fixed value, and based on > > this example that seems to be impossible, then there's always a > > possibility that we might find a value in the old cluster that doesn't > > happen to match what is present in the new cluster. Seen from that > > angle, the problem is really with databases that are pre-existent in > > the new cluster but whose contents still need to be dumped. Maybe we > > could (optionally? conditionally?) drop those databases from the new > > cluster and then recreate them with the OID that we want them to have. > > Actually, we do that already. create_new_objects() runs pg_restore > with --create for most databases, but with --clean --create for > template1 and postgres. This means that template1 and postgres will > always be recreated in the new cluster, and other databases are > assumed not to exist in the new cluster and the upgrade will fail if > they unexpectedly do. And the reason why pg_upgrade does that is that > it wants to "propagate [the] database-level properties" of postgres > and template1. So suppose we just make the database OID one of the > database-level properties that we want to propagate. That should > mostly just work, but where can things go wrong? > > The only real failure mode is we try to create a database in the new > cluster and find out that the OID is already in use. If the new OID > that collides >64k, then the user has messed with the new cluster > before doing that. And since pg_upgrade is pretty clearly already > assuming that you shouldn't do that, it's fine to also make that > assumption in this case. We can disregard such cases as user error. > > If the new OID that collides is <64k, then it must be colliding with > template0, template1, or postgres in the new cluster, because those > are the only databases that can have such OIDs since, currently, we > don't allow users to specify an OID for a new database. And the > problem cannot be with template1, because we hard-code its OID to 1. > If there is a database with OID 1 in either cluster, it must be > template1, and if there is a database with OID 1 in both clusters, it > must be template1 in both cases, and we'll just drop and recreate it > with OID 1 and everything is fine. So we need only consider template0 > and postgres, which are created with system-generated OIDs. And, it > would be no issue if either of those databases had the same OID in the > old and new cluster, so the only possible OID collision is one where > the same system-generated OID was assigned to one of those databases > in the old cluster and to the other in the new cluster. > > First consider the case where template0 has OID, say, 13000, in the > old cluster, and postgres has that OID in the new cluster. No problem > occurs, because template0 isn't transferred anyway. The reverse > direction is a problem, though. If postgres had been assigned OID > 13000 in the old cluster and, by sheer chance, template0 had that OID > in the new cluster, then the upgrade would fail, because it wouldn't > be able to recreate the postgres database with the correct OID. 
> > But that doesn't seem very difficult to fix. I think all we need to do > is have initdb assign a fixed OID to template0 at creation time. Then, > in any new release to which someone might be trying to upgrade, the > system-generated OID assigned to postgres in the old release can't > match the fixed OID assigned to template0 in the new release, so the > one problem case is ruled out. We do need, however, to make sure that > the assign-my-database-a-fixed-OID syntax is either entirely > restricted to initdb & pg_upgrade or at least that OIDS < 64k can only > be assigned in one of those modes. Otherwise, some creative person > could manufacture new problem cases by setting up the source database > so that the OID of one of their databases matches the fixed OID we > gave to template0 or template1, or the system-generated OID for > postgres in the new cluster. > > In short, as far as I can see, all we need to do to preserve database > OIDs across pg_upgrade is: > > 1. Add a new syntax for creating a database with a given OID, and use > it in pg_dump --binary-upgrade. > 2. Don't let users use it at least for OIDs <64k, or maybe just don't > let them use it at all. > 3. But let initdb use it, and have initdb set the initial OID for > template0 to a fixed value < 10000. If the user changes it later, no > problem; the cluster into which they are upgrading won't contain any > databases with high-numbered OIDs. > > Anyone see a flaw in that analysis? This looks like a pretty good analysis to me. As it relates to the question about allowing users to specify an OID, I'd be inclined to allow it but only for OIDs >64k. We've certainly reserved things in the past and I don't see any issue with having that reservation here, but if we're going to build the capability to specify the OID into CREATE DATABASE then it seems a bit odd to disallow users from using it, as long as we're preventing them from causing problems with it. Are there issues that you see with allowing users to specify the OID even with the >64k restriction..? I can't think of one offhand but perhaps I'm missing something. Thanks, Stephen
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Thu, Aug 26, 2021 at 11:35:01AM -0400, Robert Haas wrote: > On Thu, Aug 26, 2021 at 11:24 AM Bruce Momjian <bruce@momjian.us> wrote: > > On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote: > > > Anyone see a flaw in that analysis? > > > > I am still waiting to hear the purpose of this preservation. As long as > > you don't apply the patch, I guess I will just stop asking. > > You make it sound like I didn't answer that question the last time you > asked it, but I did.[1] I went back to the previous thread and found > that, in fact, there's at least one email *from you* appearing to > endorse that concept for reasons unrelated to TDE[2] and another where > you appear to agree that it would be useful for TDE to do it.[3] > Stephen Frost also wrote up his discussion during the Unconference and > some of his reasons for liking the idea.[4] > > If you've changed your mind about this being a good idea, or if you no > longer think it's useful without TDE, that's fine. Everyone is > entitled to change their opinion. But then please say that straight > out. It baffles me why you're now acting as if it hasn't been > discussed when it clearly has been, and both you and I were > participants in that discussion. > > [1] https://www.postgresql.org/message-id/CA+Tgmob7msyh3VRaY87USr22UakvvSyy4zBaQw2AO2CfoUD3rA@mail.gmail.com > [2] https://www.postgresql.org/message-id/20210601140949.GC22012@momjian.us > [3] https://www.postgresql.org/message-id/20210527210023.GJ5646@momjian.us > [4] https://www.postgresql.org/message-id/20210531201652.GY20766@tamriel.snowman.net Yes, it would help incremental backup of pgBackRest, as reported by the developers. However, I have seen no discussion if this is useful enough reason to add the complexity to preserve this. The TODO list shows "Desirability" as the first item to be discussed, so I expected that to be discussed first. Also, with TDE not progressing (and my approach not even needing this), I have not seen a full discussion if this item is desirable based on its complexity. What I did see is this patch appear with no context of why it is useful given our current plans, except for pgBackRest, which I think I mentioned. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Thu, Aug 26, 2021 at 11:36:51AM -0400, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote: > > > Anyone see a flaw in that analysis? > > > > I am still waiting to hear the purpose of this preservation. As long as > > you don't apply the patch, I guess I will just stop asking. > > I'm a bit confused why this question keeps coming up as we've discussed > multiple reasons (incremental backups, possible use for TDE which would I have not seen much explanation on pgBackRest, except me mentioning it. Is this something really useful? As far as TDE, I haven't seen any concrete plan for that, so why add this code for that reason? > make this required, general improved sanity when working with pg_upgrade > is frankly a benefit in its own right too...). If the additional code How? I am not aware of any advantage except cosmetic. > was a huge burden or even a moderate one then that might be an argument > against, but it hardly sounds like it will be given Robert's thorough > analysis so far and the (admittedly not complete, but not that far from > it based on the DB OID review) proposed patch. I am fine to add it if it is minor, but I want to see the calculus of its value vs complexity, which I have not seen spelled out. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Thu, Aug 26, 2021 at 11:39 AM Stephen Frost <sfrost@snowman.net> wrote: > This looks like a pretty good analysis to me. As it relates to the > question about allowing users to specify an OID, I'd be inclined to > allow it but only for OIDs >64k. We've certainly reserved things in the > past and I don't see any issue with having that reservation here, but if > we're going to build the capability to specify the OID into CREATE > DATABASE then it seems a bit odd to disallow users from using it, as > long as we're preventing them from causing problems with it. > > Are there issues that you see with allowing users to specify the OID > even with the >64k restriction..? I can't think of one offhand but > perhaps I'm missing something. So I actually should have said 16k here, not 64k, as somebody already pointed out to me off-list. Whee! I don't know of a reason not to let people do that, other than that it seems like an attractive nuisance. People will do it and it will fail because they chose a duplicate OID, or they'll complain that a regular dump and restore didn't preserve their database OIDs, or maybe they'll expect that they can copy a database from one cluster to another because they gave it the same OID! That said, I don't see a great harm in it. It just seems to me like exposing knobs to users that don't seem to have any legitimate use may be borrowing trouble. -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Stephen Frost
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Thu, Aug 26, 2021 at 11:36:51AM -0400, Stephen Frost wrote: > > * Bruce Momjian (bruce@momjian.us) wrote: > > > On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote: > > > > Anyone see a flaw in that analysis? > > > > > > I am still waiting to hear the purpose of this preservation. As long as > > > you don't apply the patch, I guess I will just stop asking. > > > > I'm a bit confused why this question keeps coming up as we've discussed > > multiple reasons (incremental backups, possible use for TDE which would > > I have not seen much explaination on pgBackRest, except me mentioning > it. Is this something really useful? Being able to quickly perform a backup on a newly upgraded cluster would certainly be valuable and that's definitely not possible today due to all of the filenames changing. > As far as TDE, I haven't seen any concrete plan for that, so why add > this code for that reason? That this would help with TDE (of which there seems little doubt...) is an additional benefit to this. Specifically, taking the existing work that's already been done to allow block-by-block encryption and adjusting it for AES-XTS and then using the db-dir+relfileno+block number as the IV, just like many disk encryption systems do, avoids the concerns that were brought up about using LSN for the IV with CTR and it's certainly not difficult to do, but it does depend on this change. This was all discussed previously and it sure looks like a sensible approach to use that mirrors what many other systems already do successfully. > > make this required, general improved sanity when working with pg_upgrade > > is frankly a benefit in its own right too...). If the additional code > > How? I am not aware of any advantage except cosmetic. Having to resort to matching up inode numbers between the two clusters after a pg_upgrade to figure out what files are actually the same underneath is a pain that goes beyond just cosmetics imv. Removing that additional level that admins, and developers for that matter, have to go through would be a nice improvement on its own. > > was a huge burden or even a moderate one then that might be an argument > > against, but it hardly sounds like it will be given Robert's thorough > > analysis so far and the (admittedly not complete, but not that far from > > it based on the DB OID review) proposed patch. > > I am find to add it if it is minor, but I want to see the calculus of > its value vs complexity, which I have not seen spelled out. I feel that this, along with the prior discussions, spells it out sufficiently given the patch's complexity looks to be reasonably minor and very similar to the existing things that pg_upgrade already does. Had pg_upgrade done this in the first place, I don't think there would have been nearly this amount of discussion about it. Thanks, Stephen
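As a rough sketch of the IV construction described here (illustrative only: the field order, the use of host byte order, and the names below are assumptions, not from any posted patch), the 16-byte tweak that AES-XTS expects could be assembled like this:

    #include <stdint.h>
    #include <string.h>

    /*
     * Illustrative only: pack the database directory OID, relfilenode,
     * and block number into the 16-byte AES-XTS tweak.  The
     * (database, relfilenode, block) triple is unique, so every block
     * gets a distinct, reproducible tweak without storing anything
     * extra per page.
     */
    static void
    build_xts_tweak(uint8_t tweak[16],
                    uint32_t db_oid, uint32_t relfilenode, uint32_t block_num)
    {
        memset(tweak, 0, 16);       /* zero the unused trailing bytes */
        memcpy(tweak, &db_oid, sizeof(db_oid));
        memcpy(tweak + 4, &relfilenode, sizeof(relfilenode));
        memcpy(tweak + 8, &block_num, sizeof(block_num));
    }

If pg_upgrade renumbered databases or relfilenodes, blocks carried over from the old cluster would decrypt with the wrong tweak, which is why this scheme depends on preserving those values.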
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Thu, Aug 26, 2021 at 11:48 AM Bruce Momjian <bruce@momjian.us> wrote: > I am find to add it if it is minor, but I want to see the calculus of > its value vs complexity, which I have not seen spelled out. I don't think it's going to be all that complicated, but we're going to have to wait until we have something closer to a final patch before we can really evaluate that. I am honestly a little puzzled about why you think complexity is such a big issue for this patch in particular. I feel we do probably several hundred things every release cycle that are more complicated than this, so it doesn't seem like this is particularly extraordinary or needs a lot of extra scrutiny. I do think there is some risk that there are messy cases we can't handle cleanly, but if that becomes an issue then I'll abandon the effort until a solution can be found. I'm not trying to relentlessly drive something through that is a bad idea on principle. I agree with all Stephen's comments, too. -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Stephen Frost
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Aug 26, 2021 at 11:39 AM Stephen Frost <sfrost@snowman.net> wrote: > > This looks like a pretty good analysis to me. As it relates to the > > question about allowing users to specify an OID, I'd be inclined to > > allow it but only for OIDs >64k. We've certainly reserved things in the > > past and I don't see any issue with having that reservation here, but if > > we're going to build the capability to specify the OID into CREATE > > DATABASE then it seems a bit odd to disallow users from using it, as > > long as we're preventing them from causing problems with it. > > > > Are there issues that you see with allowing users to specify the OID > > even with the >64k restriction..? I can't think of one offhand but > > perhaps I'm missing something. > > So I actually should have said 16k here, not 64k, as somebody already > pointed out to me off-list. Whee! Hah, yes, of course. > I don't know of a reason not to let people do that, other than that it > seems like an attractive nuisance. People will do it and it will fail > because they chose a duplicate OID, or they'll complain that a regular > dump and restore didn't preserve their database OIDs, or maybe they'll > expect that they can copy a database from one cluster to another > because they gave it the same OID! That said, I don't see a great harm > in it. It just seems to me like exposing knobs to users that don't > seem to have any legitimate use may be borrowing trouble. We're going to have to gate this somehow to allow the OIDs under 16k to be used, so it seems like what you're suggesting is that we have that gate in place but then allow any OID to be used if you've crossed that gate? That is, if we do something like: SELECT pg_catalog.binary_upgrade_allow_setting_db_oid(); CREATE DATABASE blah WITH OID 1234; for pg_upgrade, well, users who are interested may well figure out how to do that themselves if they decide they want to set the OID, whereas if it 'just works' provided they don't try to use an OID too low then maybe they won't try to bypass the restriction against using system OIDs..? Ok, I'll give you that this is a stretch and I'm on the fence about if it's worthwhile or not to include and document and if, as you say, it's inviting trouble to allow users to set it. Users do seem to have a knack for finding things even when they aren't documented and then we get to deal with those complaints too. :) Perhaps others have some stronger feelings one way or another. Thanks, Stephen
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Thu, Aug 26, 2021 at 12:34:56PM -0400, Stephen Frost wrote: > * Bruce Momjian (bruce@momjian.us) wrote: > > On Thu, Aug 26, 2021 at 11:36:51AM -0400, Stephen Frost wrote: > > > * Bruce Momjian (bruce@momjian.us) wrote: > > > > On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote: > > > > > Anyone see a flaw in that analysis? > > > > > > > > I am still waiting to hear the purpose of this preservation. As long as > > > > you don't apply the patch, I guess I will just stop asking. > > > > > > I'm a bit confused why this question keeps coming up as we've discussed > > > multiple reasons (incremental backups, possible use for TDE which would > > > > I have not seen much explaination on pgBackRest, except me mentioning > > it. Is this something really useful? > > Being able to quickly perform a backup on a newly upgraded cluster would > certainly be valuable and that's definitely not possible today due to > all of the filenames changing. You mean incremental backup, right? I was told this by the pgBackRest developers during PGCon, but I have not heard that stated publicly, so I hate to go just on what I heard rather than seeing that stated publicly. > > As far as TDE, I haven't seen any concrete plan for that, so why add > > this code for that reason? > > That this would help with TDE (of which there seems little doubt...) is > an additional benefit to this. Specifically, taking the existing work > that's already been done to allow block-by-block encryption and > adjusting it for AES-XTS and then using the db-dir+relfileno+block > number as the IV, just like many disk encryption systems do, avoids the > concerns that were brought up about using LSN for the IV with CTR and > it's certainly not difficult to do, but it does depend on this change. > This was all discussed previously and it sure looks like a sensible > approach to use that mirrors what many other systems already do > successfully. Well, I would think we would not add this for TDE until we were sure someone was working on adding TDE. > > > make this required, general improved sanity when working with pg_upgrade > > > is frankly a benefit in its own right too...). If the additional code > > > > How? I am not aware of any advantage except cosmetic. > > Having to resort to matching up inode numbers between the two clusters > after a pg_upgrade to figure out what files are actually the same > underneath is a pain that goes beyond just cosmetics imv. Removing that > additional level that admins, and developers for that matter, have to go > through would be a nice improvement on its own. OK, I was just not aware anyone did that, since I have never heard anyone complain about it before. > > > was a huge burden or even a moderate one then that might be an argument > > > against, but it hardly sounds like it will be given Robert's thorough > > > analysis so far and the (admittedly not complete, but not that far from > > > it based on the DB OID review) proposed patch. > > > > I am find to add it if it is minor, but I want to see the calculus of > > its value vs complexity, which I have not seen spelled out. > > I feel that this, along with the prior discussions, spells it out > sufficiently given the patch's complexity looks to be reasonably minor > and very similar to the existing things that pg_upgrade already does. > Had pg_upgrade done this in the first place, I don't think there would > have been nearly this amount of discussion about it. 
Well, there is a reason pg_upgrade didn't initially do this --- because it adds complexity, and potentially makes future changes to pg_upgrade necessary if the server behavior changes. I am not saying this change is wrong, but I think the reasons need to be stated in this thread, rather than just moving forward. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Thu, Aug 26, 2021 at 12:37:19PM -0400, Robert Haas wrote: > On Thu, Aug 26, 2021 at 11:48 AM Bruce Momjian <bruce@momjian.us> wrote: > > I am find to add it if it is minor, but I want to see the calculus of > > its value vs complexity, which I have not seen spelled out. > > I don't think it's going to be all that complicated, but we're going > to have to wait until we have something closer to a final patch before > we can really evaluate that. I am honestly a little puzzled about why > you think complexity is such a big issue for this patch in particular. > I feel we do probably several hundred things every release cycle that > are more complicated than this, so it doesn't seem like this is > particularly extraordinary or needs a lot of extra scrutiny. I do > think there is some risk that there are messy cases we can't handle > cleanly, but if that becomes an issue then I'll abandon the effort > until a solution can be found. I'm not trying to relentlessly drive > something through that is a bad idea on principle. > > I agree with all Stephen's comments, too. I just don't want to add requirements/complexity to pg_upgrade without clearly stated reasons because future database changes will need to honor this new preservation behavior. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Stephen Frost
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Thu, Aug 26, 2021 at 12:34:56PM -0400, Stephen Frost wrote: > > * Bruce Momjian (bruce@momjian.us) wrote: > > > On Thu, Aug 26, 2021 at 11:36:51AM -0400, Stephen Frost wrote: > > > > * Bruce Momjian (bruce@momjian.us) wrote: > > > > > On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote: > > > > > > Anyone see a flaw in that analysis? > > > > > > > > > > I am still waiting to hear the purpose of this preservation. As long as > > > > > you don't apply the patch, I guess I will just stop asking. > > > > > > > > I'm a bit confused why this question keeps coming up as we've discussed > > > > multiple reasons (incremental backups, possible use for TDE which would > > > > > > I have not seen much explaination on pgBackRest, except me mentioning > > > it. Is this something really useful? > > > > Being able to quickly perform a backup on a newly upgraded cluster would > > certainly be valuable and that's definitely not possible today due to > > all of the filenames changing. > > You mean incremental backup, right? I was told this by the pgBackRest > developers during PGCon, but I have not heard that stated publicly, so I > hate to go just on what I heard rather than seeing that stated publicly. Yes, we're talking about either incremental (or perhaps differential) backup where only the files which are actually different would be backed up. Just like with PG, I can't provide any complete guarantees that we'd be able to actually make this possible after a major version with pgBackRest with this change, but it definitely isn't possible *without* this change. I can't see any reason why we wouldn't be able to do a checksum-based incremental backup though (which would be *much* faster than a regular backup) once this change is made and have that be a reliable and trustworthy backup. I'd want to think about it more and discuss it with David in some detail before saying if we could maybe perform a timestamp-based incremental backup (without checksum'ing the files, as we do in normal situations), but that would really just be a bonus. > > > As far as TDE, I haven't seen any concrete plan for that, so why add > > > this code for that reason? > > > > That this would help with TDE (of which there seems little doubt...) is > > an additional benefit to this. Specifically, taking the existing work > > that's already been done to allow block-by-block encryption and > > adjusting it for AES-XTS and then using the db-dir+relfileno+block > > number as the IV, just like many disk encryption systems do, avoids the > > concerns that were brought up about using LSN for the IV with CTR and > > it's certainly not difficult to do, but it does depend on this change. > > This was all discussed previously and it sure looks like a sensible > > approach to use that mirrors what many other systems already do > > successfully. > > Well, I would think we would not add this for TDE until we were sure > someone was working on adding TDE. That this would help with TDE is what I'd consider an added bonus. > > > > make this required, general improved sanity when working with pg_upgrade > > > > is frankly a benefit in its own right too...). If the additional code > > > > > > How? I am not aware of any advantage except cosmetic. > > > > Having to resort to matching up inode numbers between the two clusters > > after a pg_upgrade to figure out what files are actually the same > > underneath is a pain that goes beyond just cosmetics imv. 
Removing that > > additional level that admins, and developers for that matter, have to go > > through would be a nice improvement on its own. > > OK, I was just not aware anyone did that, since I have never hard anyone > complain about it before. I've certainly done it and I'd be kind of surprised if others haven't, but I've also played a lot with pg_dump in various modes, so perhaps that's not a great representation. I've definitely had to explain to clients why there's a whole different set of filenames after a pg_upgrade and why that is the case for an 'in place' upgrade before too. > > > > was a huge burden or even a moderate one then that might be an argument > > > > against, but it hardly sounds like it will be given Robert's thorough > > > > analysis so far and the (admittedly not complete, but not that far from > > > > it based on the DB OID review) proposed patch. > > > > > > I am find to add it if it is minor, but I want to see the calculus of > > > its value vs complexity, which I have not seen spelled out. > > > > I feel that this, along with the prior discussions, spells it out > > sufficiently given the patch's complexity looks to be reasonably minor > > and very similar to the existing things that pg_upgrade already does. > > Had pg_upgrade done this in the first place, I don't think there would > > have been nearly this amount of discussion about it. > > Well, there is a reason pg_upgrade didn't initially do this --- because > it adds complexity, and potentially makes future changes to pg_upgrade > necessary if the server behavior changes. I have a very hard time seeing what changes might happen in the server in this space that wouldn't have an impact on pg_upgrade, with or without this. > I am not saying this change is wrong, but I think the reasons need to be > stated in this thread, rather than just moving forward. Ok, they've been stated and it seems to at least Robert and myself that this is worthwhile to at least continue through to a concluded patch, after which we can contemplate that patch's complexity against these reasons. Thanks, Stephen
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Thu, Aug 26, 2021 at 01:03:54PM -0400, Stephen Frost wrote: > Yes, we're talking about either incremental (or perhaps differential) > backup where only the files which are actually different would be backed > up. Just like with PG, I can't provide any complete guarantees that > we'd be able to actually make this possible after a major version with > pgBackRest with this change, but it definitely isn't possible *without* > this change. I can't see any reason why we wouldn't be able to do a > checksum-based incremental backup though (which would be *much* faster > than a regular backup) once this change is made and have that be a > reliable and trustworthy backup. I'd want to think about it more and > discuss it with David in some detail before saying if we could maybe > perform a timestamp-based incremental backup (without checksum'ing the > files, as we do in normal situations), but that would really just be a > bonus. Well, it would be nice to know exactly how it would help pgBackRest if that is one of the reasons we are adding this feature. > > > > As far as TDE, I haven't seen any concrete plan for that, so why add > > > > this code for that reason? > > > > > > That this would help with TDE (of which there seems little doubt...) is > > > an additional benefit to this. Specifically, taking the existing work > > > that's already been done to allow block-by-block encryption and > > > adjusting it for AES-XTS and then using the db-dir+relfileno+block > > > number as the IV, just like many disk encryption systems do, avoids the > > > concerns that were brought up about using LSN for the IV with CTR and > > > it's certainly not difficult to do, but it does depend on this change. > > > This was all discussed previously and it sure looks like a sensible > > > approach to use that mirrors what many other systems already do > > > successfully. > > > > Well, I would think we would not add this for TDE until we were sure > > someone was working on adding TDE. > > That this would help with TDE is what I'd consider an added bonus. Not if we have no plans to implement TDE, which was my point. Why not wait to see if we are actually going to implement TDE rather than adding it now? It is just so obvious, why do I have to state this? > > > > > make this required, general improved sanity when working with pg_upgrade > > > > > is frankly a benefit in its own right too...). If the additional code > > > > > > > > How? I am not aware of any advantage except cosmetic. > > > > > > Having to resort to matching up inode numbers between the two clusters > > > after a pg_upgrade to figure out what files are actually the same > > > underneath is a pain that goes beyond just cosmetics imv. Removing that > > > additional level that admins, and developers for that matter, have to go > > > through would be a nice improvement on its own. > > > > OK, I was just not aware anyone did that, since I have never hard anyone > > complain about it before. > > I've certainly done it and I'd be kind of surprised if others haven't, > but I've also played a lot with pg_dump in various modes, so perhaps > that's not a great representation. I've definitely had to explain to > clients why there's a whole different set of filenames after a > pg_upgrade and why that is the case for an 'in place' upgrade before > too. Uh, so I guess I am right that few people have mentioned this in the past. Why did users care about the file names? 
> > > > > was a huge burden or even a moderate one then that might be an argument > > > > > against, but it hardly sounds like it will be given Robert's thorough > > > > > analysis so far and the (admittedly not complete, but not that far from > > > > > it based on the DB OID review) proposed patch. > > > > > > > > I am find to add it if it is minor, but I want to see the calculus of > > > > its value vs complexity, which I have not seen spelled out. > > > > > > I feel that this, along with the prior discussions, spells it out > > > sufficiently given the patch's complexity looks to be reasonably minor > > > and very similar to the existing things that pg_upgrade already does. > > > Had pg_upgrade done this in the first place, I don't think there would > > > have been nearly this amount of discussion about it. > > > > Well, there is a reason pg_upgrade didn't initially do this --- because > > it adds complexity, and potentially makes future changes to pg_upgrade > > necessary if the server behavior changes. > > I have a very hard time seeing what changes might happen in the server > in this space that wouldn't have an impact on pg_upgrade, with or > without this. I don't know, but I have to ask since I can't know the future, so any "preservation" has to be studied. > > I am not saying this change is wrong, but I think the reasons need to be > > stated in this thread, rather than just moving forward. > > Ok, they've been stated and it seems to at least Robert and myself that > this is worthwhile to at least continue through to a concluded patch, > after which we can contemplate that patch's complexity against these > reasons. OK, that works for me. What bothers me is that the Desirability of this change has not been clearly stated in this thread. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Thu, Aug 26, 2021 at 12:51 PM Bruce Momjian <bruce@momjian.us> wrote: > I just don't want to add requirements/complexity to pg_upgrade without > clearly stated reasons because future database changes will need to > honor this new preservation behavior. Well, I agree that it's good to have reasons clearly stated and I hope that at this point you agree that they have been. Whether you agree with them is another question, but I hope you at least agree that they have been stated. As far as the other part of your concern, what I think makes this change pretty safe is that we are preserving more things rather than fewer. I can imagine some server behavior depending on something being the same between the old and the new clusters, but it is harder to imagine a dependency on something not being preserved. For example, we know that the OIDs of pg_type rows have to be the same in the old and new cluster because arrays are stored on disk with the type OIDs included. Therefore those need to be preserved. If in the future we changed things so that arrays - and other container types - did not include the type OIDs in the on-disk representation, then perhaps it would no longer be necessary to preserve the OIDs of pg_type rows across a pg_upgrade. However, it would not be harmful to do so. It just might not be required. So I think this proposed change is in the safe direction. If relfilenodes were currently preserved and we wanted to make them not be preserved, then I think you would be quite right to say "whoa, whoa, that could be a problem." Indeed it could. If anyone then in the future wanted to introduce a dependency on them staying the same, they would have a problem. However, nothing in the server itself can care about relfilenodes - or anything else - being *different* across a pg_upgrade. The whole point of pg_upgrade is to make it feel like you have the same database after you run it as you did before you ran it, even though under the hood a lot of surgery has been done. Barring bugs, you can never be sad about there being too LITTLE difference between the post-upgrade database and the pre-upgrade database. -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Stephen Frost
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Thu, Aug 26, 2021 at 01:03:54PM -0400, Stephen Frost wrote: > > Yes, we're talking about either incremental (or perhaps differential) > > backup where only the files which are actually different would be backed > > up. Just like with PG, I can't provide any complete guarantees that > > we'd be able to actually make this possible after a major version with > > pgBackRest with this change, but it definitely isn't possible *without* > > this change. I can't see any reason why we wouldn't be able to do a > > checksum-based incremental backup though (which would be *much* faster > > than a regular backup) once this change is made and have that be a > > reliable and trustworthy backup. I'd want to think about it more and > > discuss it with David in some detail before saying if we could maybe > > perform a timestamp-based incremental backup (without checksum'ing the > > files, as we do in normal situations), but that would really just be a > > bonus. > > Well, it would be nice to know exactly how it would help pgBackRest if > that is one of the reasons we are adding this feature. pgBackRest keeps a manifest for every file in the PG data directory that is backed up and we identify that file by the filename. Further, we calculate a checksum for every file. If the filenames didn't change then we'd be able to compare the file in the new cluster against the file and checksum in the manifest in order to be able to perform the incremental/differential backup. We don't store the inodes in the manifest though, and we don't have any concept of looking at multiple data directories at the same time or anything like that (which would also mean that the old data directory would have to be kept around for that to even work, which seems like a good bit of additional complication and risk that someone might start up the old cluster by accident..). That's how it'd be very helpful to pgBackRest for the filenames to be preserved across pg_upgrade's. > > > > > As far as TDE, I haven't seen any concrete plan for that, so why add > > > > > this code for that reason? > > > > > > > > That this would help with TDE (of which there seems little doubt...) is > > > > an additional benefit to this. Specifically, taking the existing work > > > > that's already been done to allow block-by-block encryption and > > > > adjusting it for AES-XTS and then using the db-dir+relfileno+block > > > > number as the IV, just like many disk encryption systems do, avoids the > > > > concerns that were brought up about using LSN for the IV with CTR and > > > > it's certainly not difficult to do, but it does depend on this change. > > > > This was all discussed previously and it sure looks like a sensible > > > > approach to use that mirrors what many other systems already do > > > > successfully. > > > > > > Well, I would think we would not add this for TDE until we were sure > > > someone was working on adding TDE. > > > > That this would help with TDE is what I'd consider an added bonus. > > Not if we have no plans to implement TDE, which was my point. Why not > wait to see if we are actually going to implement TDE rather than adding > it now. It is just so obvious, why do I have to state this? There's been multiple years of effort put into implementing TDE and I'm sure hopeful that it continues as I'm trying to put effort into moving it forward myself. 
I'm a bit baffled by the idea that we're just suddenly going to stop putting effort into TDE as it is brought up time and time again by clients that I've talked to as one of the few reasons they haven't moved to PG yet- I can't believe that hasn't been experienced by folks at other organizations too, I mean, there's people maintaining forks of PG specifically for TDE ... Seems like maybe we were both seeing something as obvious to the other that wasn't actually the case. > > > > > > make this required, general improved sanity when working with pg_upgrade > > > > > > is frankly a benefit in its own right too...). If the additional code > > > > > > > > > > How? I am not aware of any advantage except cosmetic. > > > > > > > > Having to resort to matching up inode numbers between the two clusters > > > > after a pg_upgrade to figure out what files are actually the same > > > > underneath is a pain that goes beyond just cosmetics imv. Removing that > > > > additional level that admins, and developers for that matter, have to go > > > > through would be a nice improvement on its own. > > > > > > OK, I was just not aware anyone did that, since I have never hard anyone > > > complain about it before. > > > > I've certainly done it and I'd be kind of surprised if others haven't, > > but I've also played a lot with pg_dump in various modes, so perhaps > > that's not a great representation. I've definitely had to explain to > > clients why there's a whole different set of filenames after a > > pg_upgrade and why that is the case for an 'in place' upgrade before > > too. > > Uh, so I guess I am right that few people have mentioned this in the > past. Why were users caring about the file names? This is a bit baffling to me. Users and admins certainly care about what files their data is stored in and knowing how to find them. Covering the data directory structure is a commonly asked for part of the training that I regularly do for clients. > > > > > > was a huge burden or even a moderate one then that might be an argument > > > > > > against, but it hardly sounds like it will be given Robert's thorough > > > > > > analysis so far and the (admittedly not complete, but not that far from > > > > > > it based on the DB OID review) proposed patch. > > > > > > > > > > I am find to add it if it is minor, but I want to see the calculus of > > > > > its value vs complexity, which I have not seen spelled out. > > > > > > > > I feel that this, along with the prior discussions, spells it out > > > > sufficiently given the patch's complexity looks to be reasonably minor > > > > and very similar to the existing things that pg_upgrade already does. > > > > Had pg_upgrade done this in the first place, I don't think there would > > > > have been nearly this amount of discussion about it. > > > > > > Well, there is a reason pg_upgrade didn't initially do this --- because > > > it adds complexity, and potentially makes future changes to pg_upgrade > > > necessary if the server behavior changes. > > > > I have a very hard time seeing what changes might happen in the server > > in this space that wouldn't have an impact on pg_upgrade, with or > > without this. > > I don't know, but I have to ask since I can't know the future, so any > "preseration" has to be studied. 
We can gain, perhaps, some insight looking into the past and that seems to indicate that this is certainly a very stable part of the server code in the first place, which would imply that it's unlikely that there'll be much need to adjust this code in the future. > > > I am not saying this change is wrong, but I think the reasons need to be > > > stated in this thread, rather than just moving forward. > > > > Ok, they've been stated and it seems to at least Robert and myself that > > this is worthwhile to at least continue through to a concluded patch, > > after which we can contemplate that patch's complexity against these > > reasons. > > OK, that works for me. What bothers me is that the Desirability of this > changes has not be clearly stated in this thread. I hope that this email and the many many prior ones have gotten across the desirability of the change. Thanks, Stephen
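To spell out the pgBackRest point mechanically, here is a toy sketch of the manifest comparison described above (not pgBackRest source; the struct and helper names are hypothetical):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* One entry of a (hypothetical) backup manifest: the file name and
     * the checksum recorded when the file was last backed up. */
    typedef struct ManifestEntry
    {
        const char *name;           /* e.g. "base/16384/16385" */
        const char *checksum;
    } ManifestEntry;

    /* Linear lookup by file name; a real tool would use an index. */
    static const ManifestEntry *
    find_in_manifest(const ManifestEntry *manifest, size_t n, const char *name)
    {
        for (size_t i = 0; i < n; i++)
            if (strcmp(manifest[i].name, name) == 0)
                return &manifest[i];
        return NULL;
    }

    /* Incremental rule: copy a file only if it is new or its contents
     * changed since the manifest was written. */
    static bool
    needs_backup(const ManifestEntry *manifest, size_t n,
                 const char *name, const char *current_checksum)
    {
        const ManifestEntry *prior = find_in_manifest(manifest, n, name);

        return prior == NULL ||
               strcmp(prior->checksum, current_checksum) != 0;
    }

When pg_upgrade renumbers every relfilenode, every lookup by name misses and the "incremental" backup degenerates into a full one; with preserved filenames, unchanged files match their recorded checksums and are skipped.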
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Thu, Aug 26, 2021 at 01:20:38PM -0400, Robert Haas wrote: > So I think this proposed change is in the safe direction. If > relfilenodes were currently preserved and we wanted to make them not > be preserved, then I think you would be quite right to say "whoa, > whoa, that could be a problem." Indeed it could. If anyone then in the > future wanted to introduce a dependency on them staying the same, they > would have a problem. However, nothing in the server itself can care > about relfilenodes - or anything else - being *different* across a > pg_upgrade. The whole point of pg_upgrade is to make it feel like you > have the same database after you run it as you did before you ran it, > even though under the hood a lot of surgery has been done. Barring > bugs, you can never be sad about there being too LITTLE difference > between the post-upgrade database and the pre-upgrade database. Yes, this makes sense, and it is good we have stated the possible benefits now: * pgBackRest * pg_upgrade diagnostics * TDE (maybe) We can eventually evaluate the value of this based on those items. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Bruce Momjian
On Thu, Aug 26, 2021 at 01:24:46PM -0400, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Thu, Aug 26, 2021 at 01:03:54PM -0400, Stephen Frost wrote: > > > Yes, we're talking about either incremental (or perhaps differential) > > > backup where only the files which are actually different would be backed > > > up. Just like with PG, I can't provide any complete guarantees that > > > we'd be able to actually make this possible after a major version upgrade > > > with pgBackRest with this change, but it definitely isn't possible *without* > > > this change. I can't see any reason why we wouldn't be able to do a > > > checksum-based incremental backup though (which would be *much* faster > > > than a regular backup) once this change is made and have that be a > > > reliable and trustworthy backup. I'd want to think about it more and > > > discuss it with David in some detail before saying if we could maybe > > > perform a timestamp-based incremental backup (without checksum'ing the > > > files, as we do in normal situations), but that would really just be a > > > bonus. > > > > Well, it would be nice to know exactly how it would help pgBackRest if > > that is one of the reasons we are adding this feature. > > pgBackRest keeps a manifest for every file in the PG data directory that > is backed up and we identify that file by the filename. Further, we > calculate a checksum for every file. If the filenames didn't change > then we'd be able to compare the file in the new cluster against the > file and checksum in the manifest in order to be able to perform the > incremental/differential backup. We don't store the inodes in the > manifest though, and we don't have any concept of looking at multiple > data directories at the same time or anything like that (which would > also mean that the old data directory would have to be kept around for > that to even work, which seems like a good bit of additional > complication and risk that someone might start up the old cluster by > accident...). > > That's how it'd be very helpful to pgBackRest for the filenames to be > preserved across pg_upgrades. OK, that is clear. > > > > > > As far as TDE, I haven't seen any concrete plan for that, so why add > > > > > > this code for that reason? > > > > > > > > > > That this would help with TDE (of which there seems little doubt...) is > > > > > an additional benefit to this. Specifically, taking the existing work > > > > > that's already been done to allow block-by-block encryption and > > > > > adjusting it for AES-XTS and then using the db-dir+relfileno+block > > > > > number as the IV, just like many disk encryption systems do, avoids the > > > > > concerns that were brought up about using LSN for the IV with CTR and > > > > > it's certainly not difficult to do, but it does depend on this change. > > > > > This was all discussed previously and it sure looks like a sensible > > > > > approach to use that mirrors what many other systems already do > > > > > successfully. > > > > > > > > Well, I would think we would not add this for TDE until we were sure > > > > someone was working on adding TDE. > > > > > > That this would help with TDE is what I'd consider an added bonus. > > > > Not if we have no plans to implement TDE, which was my point. Why not > > wait to see if we are actually going to implement TDE rather than adding > > it now? It is just so obvious, why do I have to state this? 
> > There's been multiple years of effort put into implementing TDE and I'm > sure hopeful that it continues as I'm trying to put effort into moving > it forward myself. I'm a bit baffled by the idea that we're just Well, this is the first time I am hearing this publicly. > suddenly going to stop putting effort into TDE as it is brought up time > and time again by clients that I've talked to as one of the few reasons > they haven't moved to PG yet - I can't believe that hasn't been > experienced by folks at other organizations too, I mean, there are people > maintaining forks of PG specifically for TDE ... Agreed. > > > I've certainly done it and I'd be kind of surprised if others haven't, > > > but I've also played a lot with pg_dump in various modes, so perhaps > > > that's not a great representation. I've definitely had to explain to > > > clients why there's a whole different set of filenames after a > > > pg_upgrade and why that is the case for an 'in place' upgrade before > > > too. > > > > Uh, so I guess I am right that few people have mentioned this in the > > past. Why were users caring about the file names? > > This is a bit baffling to me. Users and admins certainly care about > what files their data is stored in and knowing how to find them. > Covering the data directory structure is a commonly asked-for part of > the training that I regularly do for clients. I just never thought people cared about the file names, since I have never heard a complaint about how pg_upgrade works in all these years. > > > I have a very hard time seeing what changes might happen in the server > > > in this space that wouldn't have an impact on pg_upgrade, with or > > > without this. > > > > I don't know, but I have to ask since I can't know the future, so any > > "preservation" has to be studied. > > We can gain, perhaps, some insight looking into the past, and that seems > to indicate that this is certainly a very stable part of the server code, > which would imply that it's unlikely that there'll > be much need to adjust this code in the future. Good, I had to ask. > > > > I am not saying this change is wrong, but I think the reasons need to be > > > > stated in this thread, rather than just moving forward. > > > > > > Ok, they've been stated and it seems to at least Robert and myself that > > > this is worthwhile to at least continue through to a concluded patch, > > > after which we can contemplate that patch's complexity against these > > > reasons. > > > > OK, that works for me. What bothers me is that the desirability of this > > change has not been clearly stated in this thread. > > I hope that this email and the many, many prior ones have gotten across > the desirability of the change. Yes, I think we are in a better position now to evaluate this. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Hi, community, It looks like we are still considering AES-CBC, AES-XTS, and AES-GCM(-SIV). I want to raise something we haven't thought about. For AES-CBC, the IV should not be predictable. I think LSN or HASH(LSN, block number or something) is predictable. There are many CVEs related to AES-CBC with a predictable IV. https://cwe.mitre.org/data/definitions/329.html For AES-XTS, using the block number or any fixed value as the tweak still has weaknesses similar to IV reuse (in CBC, not GCM): an attacker can decrypt a block if they know an earlier plaintext of that block. In LUKS/BitLocker/hardware-based solutions, the physical location is not available to the user: the filesystem runs in kernel space, and they do not encrypt while the filesystem is allocating a data block. But in PostgreSQL, an attacker can capture an encrypted all-zero page in `mdextend`, and with that, the attacker can decode the ciphertext of all data later written to that block. For AES-GCM, a predictable IV is fine. I think we can decrypt and re-encrypt the user data in pg_upgrade; this would allow us to use relfilenode OID + block number as the nonce.
On 2021/9/5 at 10:51 PM, Sasasu wrote: > > For AES-GCM, a predictable IV is fine. I think we can decrypt and > re-encrypt the user data in pg_upgrade; this would allow us to use > relfilenode OID + block number as the nonce. I mean relfilenode OID + block number + some counter for the heap table IV.
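A concrete sketch of that nonce layout may help; the helper below is purely illustrative (the name, the stand-in typedefs, and the 4/4/4 field split are assumptions, not from any posted patch):

    #include <stdint.h>
    #include <string.h>

    typedef uint32_t Oid;           /* stand-ins for the server typedefs */
    typedef uint32_t BlockNumber;

    /* Pack a 12-byte AES-GCM nonce: relfilenode | block number | counter */
    static void
    compose_gcm_nonce(uint8_t nonce[12], Oid relfilenode,
                      BlockNumber blkno, uint32_t counter)
    {
        memcpy(nonce, &relfilenode, sizeof(Oid));
        memcpy(nonce + 4, &blkno, sizeof(BlockNumber));
        memcpy(nonce + 8, &counter, sizeof(uint32_t));
    }

The counter field is what makes this workable for GCM: it has to advance on every re-encryption of the same block so that a (key, nonce) pair is never reused.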
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Tue, Aug 24, 2021 at 2:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > It's pretty clear from the discussion, I think, that the database OID > one is going to need rework to be considered. > > Regarding the other one: > > - The comment in binary_upgrade_set_pg_class_oids() is still not > accurate. You removed the sentence which says "Indexes cannot have > toast tables, so we need not make this probe in the index code path" > but the immediately preceding sentence is still inaccurate in at least > two ways. First, it only talks about tables, but the code now applies > to indexes. Second, it only talks about OIDs, but now also deals with > relfilenodes. It's really important to fully update every comment that > might be affected by your changes! The comment is updated. > - The SQL query in that function isn't completely correct. There is a > left join from pg_class to pg_index whose ON clause includes > "c.reltoastrelid = i.indrelid AND i.indisvalid." The reason it's > like that is because it is possible, in corner cases, for a TOAST > table to have multiple TOAST indexes. I forget exactly how that > happens, but I think it might be like if a REINDEX CONCURRENTLY on the > TOAST table fails midway through, or something of that sort. Now if > that happens, the LEFT JOIN you added is going to cause the output to > contain multiple rows, because you didn't replicate the i.indisvalid > condition into that ON clause. And then it will fail. Apparently we > don't have a pg_upgrade test case for this scenario; we probably > should. Actually what I think would be even better than putting > i.indisvalid into that ON clause would be to join off of i.indrelid > rather than c.reltoastrelid. The SQL query will not result in duplicate rows because the first join filters out any duplicate rows via the 'i.indisvalid' condition in its ON clause. The result of the first join is further left joined with pg_class, and pg_class will not have duplicate rows for a given OID. > - The code that decodes the various columns of this query does so in a > slightly different order than the query itself. It would be better to > make it match. Perhaps put relkind first in both cases. I might also > think about trying to make the column naming a bit more consistent, > e.g. relkind, relfilenode, toast_oid, toast_relfilenode, > toast_index_oid, toast_index_relfilenode. Fixed. > - In heap_create(), the wording of the error messages is not quite > consistent. You have "relfilenode value not set when in binary upgrade > mode", "toast relfilenode value not set when in binary upgrade mode", > and "pg_class index relfilenode value not set when in binary upgrade > mode". Why does the last one mention pg_class when the other two > don't? The error message is made consistent. This code chunk is moved to a different place as part of another review comment fix. > - The code in heap_create() now has no comments whatsoever, which is a > shame, because it's actually kind of a tricky bit of logic. Someone > might wonder why we override the relfilenode inside that function > instead of doing it at the same places where we absorb > binary_upgrade_next_{heap,index,toast}_pg_class_oid and then passing > down the relfilenode. 
I think the answer is that passing down the > relfilenode from the caller would result in storage not actually being > created, whereas in this case we want it to be created but just with > the value we specify, and the reason we want that is because we need > later DDL that happens after these statements but before the old > cluster's relations are moved over to execute successfully, which it > won't if the storage is altogether absent. > However, that raises the question of whether this patch has even got > the basic design right. Maybe we ought to actually be absorbing the > relfilenode setting at the same places where we're doing so for the > OID, and then passing an additional parameter to heap_create() like > bool suppress_storage or something like that. Maybe, taking it even > further, we ought to be changing the signatures of > binary_upgrade_next_heap_pg_class_oid and friends to be two-argument > functions, and pass down the OID and the relfilenode in the same call, > rather than calling two separate functions. I'm not so much concerned > about the cost of calling two functions as the potential for > confusion. I'm not honestly sure that either of these changes are the > right thing to do, but I am pretty strongly inclined to do at least > the first part - trying to absorb reloid and relfilenode in the same > places. If we're not going to do that we certainly need to explain why > we're doing it the way we are in the comments. As per your suggestion, reloid and relfilenode are absorbed in the same place. An additional parameter called 'suppress_storage' is passed to heap_create() which indicates whether or not to create the storage when the caller passed a valid relfilenode. I did not make the changes to set the oid and relfilenode in the same call. I feel the uniformity w.r.t. the other function signatures in pg_upgrade_support.c will be lost because currently each function sets only one attribute. Also, renaming the applicable function names to represent that they set both oid and relfilenode will make the function name even longer. We may opt to not include the relfilenode in the function name and instead use a generic name like binary_upgrade_set_next_xxx_pg_class_oid(), but then we will end up with some functions that set two attributes and some functions that set one attribute. > It's not really this patch's fault, but it would sure be nice if we > had some better testing for this area. Suppose this patch somehow > changed nothing from the present behavior. How would we know? Or > suppose it managed to somehow set all the relfilenodes in the new > cluster to random values rather than the intended one? There's no > automated testing that would catch any of that, and it's not obvious > how it could be added to test.sh. I suppose what we really need to do > at some point is rewrite that as a TAP test, but that seems like a > separate project from this patch. I have manually verified the table, index, TOAST table, and TOAST index relfilenodes and the DB OID in the old and new clusters, and it is working as expected. I have also attached the patch to preserve the DB OID. As discussed, template0 will be created with a fixed OID during initdb. I am using OID 2 for template0. Even though OID 2 is already in use for the 'pg_am' catalog, I see no harm in using it for the template0 DB because an OID doesn't have to be unique across the database - it only has to be unique within a particular catalog table. Kindly let me know if I am missing something. 
Apparently, if we did decide to pick an unused OID for template0, then I see a challenge in removing that OID from the unused OID list. I could not come up with a feasible solution for handling it. Regards, Shruthi KC EnterpriseDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Wed, Sep 22, 2021 at 3:07 PM Shruthi Gowda <gowdashru@gmail.com> wrote: > > - The comment in binary_upgrade_set_pg_class_oids() is still not > > accurate. You removed the sentence which says "Indexes cannot have > > toast tables, so we need not make this probe in the index code path" > > but the immediately preceding sentence is still inaccurate in at least > > two ways. First, it only talks about tables, but the code now applies > > to indexes. Second, it only talks about OIDs, but now also deals with > > relfilenodes. It's really important to fully update every comment that > > might be affected by your changes! > > The comment is updated. Looks good. > The SQL query will not result in duplicate rows because the first join > filters out any duplicate rows via the 'i.indisvalid' condition in its > ON clause. The result of the first join is further left joined with pg_class, and > pg_class will not have duplicate rows for a given OID. Oh, you're right. My mistake. > As per your suggestion, reloid and relfilenode are absorbed in the same place. > An additional parameter called 'suppress_storage' is passed to heap_create() > which indicates whether or not to create the storage when the caller > passed a valid relfilenode. I find it confusing to have both suppress_storage and create_storage with one basically as the negation of the other. To avoid that sort of thing I generally have a policy that variables and options should say whether something should happen, rather than whether it should be prevented from happening. Sometimes there are good reasons - such as strong existing precedent - to deviate from this practice but I think it's good to follow when possible. So my proposal is to always have create_storage and never suppress_storage, and if some function needs to adjust the value of create_storage that was passed to it then OK. > I did not make the changes to set the oid and relfilenode in the same call. > I feel the uniformity w.r.t. the other function signatures in > pg_upgrade_support.c will be lost because currently each function sets > only one attribute. > Also, renaming the applicable function names to represent that they > set both oid and relfilenode will make the function name even longer. > We may opt to not include the relfilenode in the function name and instead > use a generic name like binary_upgrade_set_next_xxx_pg_class_oid(), but > then we will end up with some functions that set two attributes and some > functions that set one attribute. OK. > I have also attached the patch to preserve the DB OID. As discussed, > template0 will be created with a fixed OID during initdb. I am using > OID 2 for template0. Even though OID 2 is already in use for the > 'pg_am' catalog, I see no harm in using it for the template0 DB because an OID > doesn't have to be unique across the database - it only has to be unique > within a particular catalog table. Kindly let me know if I am missing > something. > Apparently, if we did decide to pick an unused OID for template0, then > I see a challenge in removing that OID from the unused OID list. I > could not come up with a feasible solution for handling it. You are correct that there is no intrinsic reason why the same OID can't be used in various different catalogs. 
It already has: my $FirstGenbkiObjectId = Catalog::FindDefinedSymbol('access/transam.h', '..', 'FirstGenbkiObjectId'); push @{$oids}, $FirstGenbkiObjectId; Presumably it could be easily adapted to push the value of some other defined symbol into @{$oids} also, thus making that OID in effect used. -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Sep 5, 2021 at 10:51:42PM +0800, Sasasu wrote: > Hi, community, > > It looks like we are still considering AES-CBC, AES-XTS, and AES-GCM(-SIV). > I want to raise something we haven't thought about. > > For AES-CBC, the IV should not be predictable. I think LSN or HASH(LSN, > block number or something) is predictable. There are many CVEs related to > AES-CBC with a predictable IV. The LSN would change every time the page is modified, so while the LSN could be predicted, it would not be reused. However, there is currently no work being done on page-level encryption of Postgres. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote: > On Sun, Sep 5, 2021 at 10:51:42PM +0800, Sasasu wrote: > > Hi, community, > > > > It looks like we are still considering AES-CBC, AES-XTS, and AES-GCM(-SIV). > > I want to raise something we haven't thought about. > > > > For AES-CBC, the IV should not be predictable. I think LSN or HASH(LSN, > > block number or something) is predictable. There are many CVEs related to > > AES-CBC with a predictable IV. > The LSN would change every time the page is modified, so while the LSN > could be predicted, it would not be reused. However, there is currently > no work being done on page-level encryption of Postgres. We are still working on our TDE patch. Right now the focus is on refactoring temporary file access to make the TDE patch itself smaller. Reconsidering encryption mode choices given concerns expressed is next. Currently a viable option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an issue with predictable IV and isn't totally broken in case of IV reuse. -- Ants Aasma Senior Database Engineer www.cybertec-postgresql.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Fri, Sep 24, 2021 at 12:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Sep 22, 2021 at 3:07 PM Shruthi Gowda <gowdashru@gmail.com> wrote: > > > - The comment in binary_upgrade_set_pg_class_oids() is still not > > > accurate. You removed the sentence which says "Indexes cannot have > > > toast tables, so we need not make this probe in the index code path" > > > but the immediately preceding sentence is still inaccurate in at least > > > two ways. First, it only talks about tables, but the code now applies > > > to indexes. Second, it only talks about OIDs, but now also deals with > > > relfilenodes. It's really important to fully update every comment that > > > might be affected by your changes! > > > > The comment is updated. > > Looks good. > > > The SQL query will not result in duplicate rows because the first join > > filters out any duplicate rows via the 'i.indisvalid' condition in its > > ON clause. The result of the first join is further left joined with pg_class, and > > pg_class will not have duplicate rows for a given OID. > > Oh, you're right. My mistake. > > > As per your suggestion, reloid and relfilenode are absorbed in the same place. > > An additional parameter called 'suppress_storage' is passed to heap_create() > > which indicates whether or not to create the storage when the caller > > passed a valid relfilenode. > > I find it confusing to have both suppress_storage and create_storage > with one basically as the negation of the other. To avoid that sort of > thing I generally have a policy that variables and options should say > whether something should happen, rather than whether it should be > prevented from happening. Sometimes there are good reasons - such as > strong existing precedent - to deviate from this practice but I think > it's good to follow when possible. So my proposal is to always have > create_storage and never suppress_storage, and if some function needs > to adjust the value of create_storage that was passed to it then OK. Sure, I agree. In the latest patch, only 'create_storage' is used. > > I did not make the changes to set the oid and relfilenode in the same call. > > I feel the uniformity w.r.t. the other function signatures in > > pg_upgrade_support.c will be lost because currently each function sets > > only one attribute. > > Also, renaming the applicable function names to represent that they > > set both oid and relfilenode will make the function name even longer. > > We may opt to not include the relfilenode in the function name and instead > > use a generic name like binary_upgrade_set_next_xxx_pg_class_oid(), but > > then > > we will end up with some functions that set two attributes and some > > functions that set one attribute. > > OK. > > > I have also attached the patch to preserve the DB OID. As discussed, > > template0 will be created with a fixed OID during initdb. I am using > > OID 2 for template0. Even though OID 2 is already in use for the > > 'pg_am' catalog, I see no harm in using it for the template0 DB because an OID > > doesn't have to be unique across the database - it only has to be unique > > within a particular catalog table. Kindly let me know if I am missing > > something. > > Apparently, if we did decide to pick an unused OID for template0, then > > I see a challenge in removing that OID from the unused OID list. I > > could not come up with a feasible solution for handling it. > > You are correct that there is no intrinsic reason why the same OID > can't be used in various different catalogs. 
We have a policy of not > doing that, though; I'm not clear on the reason. Maybe it'd be OK to > deviate from that policy here, but another option would be to simply > change the unused_oids script (and maybe some of the others). It > already has: > > my $FirstGenbkiObjectId = > Catalog::FindDefinedSymbol('access/transam.h', '..', 'FirstGenbkiObjectId'); > push @{$oids}, $FirstGenbkiObjectId; > > Presumably it could be easily adapted to push the value of some other > defined symbol into @{$oids} also, thus making that OID in effect > used. Thanks for the inputs, Robert. In the v4 patch, an unused OID (i.e., 4) is fixed for template0 and the same is removed from the unused OID list. In addition to the review comment fixes, I have removed some code that is no longer needed/doesn't make sense since we preserve the OIDs. Regards, Shruthi KC EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote: > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote: > > On Sun, Sep 5, 2021 at 10:51:42PM +0800, Sasasu wrote: > > Hi, community, > > > > It looks like we are still considering AES-CBC, AES-XTS, and AES-GCM(-SIV). > > I want to raise something we haven't thought about. > > > > For AES-CBC, the IV should not be predictable. I think LSN or HASH(LSN, > > block number or something) is predictable. There are many CVEs related to > > AES-CBC with a predictable IV. > > The LSN would change every time the page is modified, so while the LSN > could be predicted, it would not be reused. However, there is currently > no work being done on page-level encryption of Postgres. > > > We are still working on our TDE patch. Right now the focus is on refactoring > temporary file access to make the TDE patch itself smaller. Reconsidering > encryption mode choices given concerns expressed is next. Currently a viable > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an > issue with predictable IV and isn't totally broken in case of IV reuse. Sounds great, thanks! -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Ants Aasma (ants@cybertec.at) wrote: > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote: > > On Sun, Sep 5, 2021 at 10:51:42PM +0800, Sasasu wrote: > > > It looks like we are still considering AES-CBC, AES-XTS, and > > AES-GCM(-SIV). > > > I want to raise something we haven't thought about. > > > > > > For AES-CBC, the IV should not be predictable. I think LSN or HASH(LSN, > > > block number or something) is predictable. There are many CVEs related to > > > AES-CBC with a predictable IV. > > > > The LSN would change every time the page is modified, so while the LSN > > could be predicted, it would not be reused. However, there is currently > > no work being done on page-level encryption of Postgres. > > > > We are still working on our TDE patch. Right now the focus is on > refactoring temporary file access to make the TDE patch itself smaller. > Reconsidering encryption mode choices given concerns expressed is next. > Currently a viable option seems to be AES-XTS with LSN added into the IV. > XTS doesn't have an issue with predictable IV and isn't totally broken in > case of IV reuse. Probably worth a distinct thread to discuss this, just to be clear. I do want to point out, as I think I did when we discussed this but want to be sure it's also captured here- I don't think that temporary file access should be forced to be block-oriented when it's naturally (in very many cases) sequential. To that point, I'm thinking that we need a temp file access API through which various systems work that's sequential and therefore relatively similar to the existing glibc, et al, APIs, but by going through our own internal API (which more consistently works with the glibc APIs and provides better error reporting in the event of issues, etc) we can then extend it to work as an encrypted stream instead. Happy to discuss in more detail if you'd like but wanted to just bring up this particular point, in case it got lost. Thanks! Stephen
On Mon, Oct 4, 2021 at 10:00 PM Stephen Frost <sfrost@snowman.net> wrote: > I do want to point out, as I think I did when we discussed this but want > to be sure it's also captured here- I don't think that temporary file > access should be forced to be block-oriented when it's naturally (in > very many cases) sequential. To that point, I'm thinking that we need a > temp file access API through which various systems work that's > sequential and therefore relatively similar to the existing glibc, et > al, APIs, but by going through our own internal API (which more > consistently works with the glibc APIs and provides better error > reporting in the event of issues, etc) we can then extend it to work as > an encrypted stream instead. Regarding this, would it use block-oriented access on the backend? I agree that we need a better API layer through which all filesystem access is routed. One of the notable weaknesses of the Cybertec patch is that it has too large a code footprint, -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Oct 5, 2021 at 1:24 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Oct 4, 2021 at 10:00 PM Stephen Frost <sfrost@snowman.net> wrote: > > I do want to point out, as I think I did when we discussed this but want > > to be sure it's also captured here- I don't think that temporary file > > access should be forced to be block-oriented when it's naturally (in > > very many cases) sequential. To that point, I'm thinking that we need a > > temp file access API through which various systems work that's > > sequential and therefore relatively similar to the existing glibc, et > > al, APIs, but by going through our own internal API (which more > > consistently works with the glibc APIs and provides better error > > reporting in the event of issues, etc) we can then extend it to work as > > an encrypted stream instead. > > Regarding this, would it use block-oriented access on the backend? > > I agree that we need a better API layer through which all filesystem > access is routed. One of the notable weaknesses of the Cybertec patch > is that it has too large a code footprint, (sent too soon) ...precisely because PostgreSQL doesn't have such a layer. But I think ultimately we do want to encrypt and decrypt in blocks, so if we create such a layer, it should expose byte-oriented APIs but combine the actual I/Os somehow. That's also good for cutting down the number of system calls, which is a benefit unto itself. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Oct 5, 2021 at 1:24 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Oct 4, 2021 at 10:00 PM Stephen Frost <sfrost@snowman.net> wrote: > > > I do want to point out, as I think I did when we discussed this but want > > > to be sure it's also captured here- I don't think that temporary file > > > access should be forced to be block-oriented when it's naturally (in > > > very many cases) sequential. To that point, I'm thinking that we need a > > > temp file access API through which various systems work that's > > > sequential and therefore relatively similar to the existing glibc, et > > > al, APIs, but by going through our own internal API (which more > > > consistently works with the glibc APIs and provides better error > > > reporting in the event of issues, etc) we can then extend it to work as > > > an encrypted stream instead. > > > > Regarding this, would it use block-oriented access on the backend? > > > > I agree that we need a better API layer through which all filesystem > > access is routed. One of the notable weaknesses of the Cybertec patch > > is that it has too large a code footprint, > > (sent too soon) > > ...precisely because PostgreSQL doesn't have such a layer. I'm just trying to make our changes to buffile.c less invasive. Or do you mean that this module should be reworked regardless of the encryption? > But I think ultimately we do want to encrypt and decrypt in blocks, so > if we create such a layer, it should expose byte-oriented APIs but > combine the actual I/Os somehow. That's also good for cutting down the > number of system calls, which is a benefit unto itself. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote: > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote: > We are still working on our TDE patch. Right now the focus is on refactoring > temporary file access to make the TDE patch itself smaller. Reconsidering > encryption mode choices given concerns expressed is next. Currently a viable > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an > issue with predictable IV and isn't totally broken in case of IV reuse. Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous 16-byte blocks affect later blocks, meaning that hint bit changes would also affect later blocks. I think this means we would need to write WAL full page images for hint bit changes to avoid torn pages. Right now hint bit (single bit) changes can be lost without causing torn pages. This was another of the advantages of using a stream cipher like CTR. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, Oct 5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote: > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote: > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote: > > We are still working on our TDE patch. Right now the focus is on refactoring > > temporary file access to make the TDE patch itself smaller. Reconsidering > > encryption mode choices given concerns expressed is next. Currently a viable > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an > > issue with predictable IV and isn't totally broken in case of IV reuse. > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous > 16-byte blocks affect later blocks, meaning that hint bit changes would > also affect later blocks. I think this means we would need to write WAL > full page images for hint bit changes to avoid torn pages. Right now > hint bit (single bit) changes can be lost without causing torn pages. > This was another of the advantages of using a stream cipher like CTR. Another problem caused by block mode ciphers is that to use the LSN as part of the nonce, the LSN must not be encrypted, but you then have to find a 16-byte block in the page that you don't need to encrypt. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, Oct 5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote: > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote: > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote: > > We are still working on our TDE patch. Right now the focus is on refactoring > > temporary file access to make the TDE patch itself smaller. Reconsidering > > encryption mode choices given concerns expressed is next. Currently a viable > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an > > issue with predictable IV and isn't totally broken in case of IV reuse. > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous > 16-byte blocks affect later blocks, meaning that hint bit changes would > also affect later blocks. I think this means we would need to write WAL > full page images for hint bit changes to avoid torn pages. Right now > hint bit (single bit) changes can be lost without causing torn pages. > This was another of the advantages of using a stream cipher like CTR. The above text isn't very clear. What I am saying is that currently torn pages can be tolerated by hint bit writes because only a single byte is changing. If we use a block cipher like AES-XTS, later 16-byte encrypted blocks would be changed by hint bit changes, meaning torn pages could not be tolerated. This means we would have to use full page writes for hint bit changes, perhaps making this feature have unacceptable performance overhead. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, Oct 5, 2021 at 1:55 PM Antonin Houska <ah@cybertec.at> wrote: > I'm just trying to make our changes to buffile.c less invasive. Or do you mean > that this module should be reworked regardless the encryption? I wasn't thinking of buffile.c specifically. I think improving that might be a really good idea, although I'm not 100% sure I know what that would look like. I was thinking that it's unfortunate that there are so many different ways that I/O happens overall. Like, there are direct write() and pg_pwrite() calls in various places, for example. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Oct 5, 2021 at 4:29 PM Bruce Momjian <bruce@momjian.us> wrote: > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote: > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote: > > We are still working on our TDE patch. Right now the focus is on refactoring > > temporary file access to make the TDE patch itself smaller. Reconsidering > > encryption mode choices given concerns expressed is next. Currently a viable > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an > > issue with predictable IV and isn't totally broken in case of IV reuse. > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous > 16-byte blocks affect later blocks, meaning that hint bit changes would > also affect later blocks. I think this means we would need to write WAL > full page images for hint bit changes to avoid torn pages. Right now > hint bit (single bit) changes can be lost without causing torn pages. > This was another of the advantages of using a stream cipher like CTR. This seems wrong to me. CTR requires that you not reuse the IV. If you re-encrypt the page with a different IV, torn pages are a problem. If you re-encrypt it with the same IV, then it's not secure any more. -- Robert Haas EDB: http://www.enterprisedb.com
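The hazard Robert describes is easy to demonstrate. The toy program below (generic OpenSSL EVP usage, not code from any patch) encrypts two different plaintexts under the same key and IV in CTR mode; XORing the two ciphertexts cancels the keystream and yields the XOR of the plaintexts, so the printed bytes are all zero:

    #include <openssl/evp.h>
    #include <stdio.h>

    /* One-shot CTR encryption; error checks elided for brevity */
    static void
    ctr_encrypt(const unsigned char *key, const unsigned char *iv,
                const unsigned char *in, unsigned char *out, int len)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int outl;

        EVP_EncryptInit_ex(ctx, EVP_aes_128_ctr(), NULL, key, iv);
        EVP_EncryptUpdate(ctx, out, &outl, in, len);
        EVP_CIPHER_CTX_free(ctx);
    }

    int
    main(void)
    {
        unsigned char key[16] = "0123456789abcdef";
        unsigned char iv[16] = {0};         /* the bug: IV reused */
        unsigned char p1[16] = "hint bits off..";
        unsigned char p2[16] = "hint bits on...";
        unsigned char c1[16], c2[16];

        ctr_encrypt(key, iv, p1, c1, 16);
        ctr_encrypt(key, iv, p2, c2, 16);

        /* c1 ^ c2 == p1 ^ p2, so this prints all zeros: no key is
         * needed to learn how the two plaintexts differ. */
        for (int i = 0; i < 16; i++)
            printf("%02x", c1[i] ^ c2[i] ^ p1[i] ^ p2[i]);
        printf("\n");
        return 0;
    }

Compile with -lcrypto.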
On Wed, Oct 6, 2021 at 9:35 AM Bruce Momjian <bruce@momjian.us> wrote: > The above text isn't very clear. What I am saying is that currently > torn pages can be tolerated by hint bit writes because only a single > byte is changing. If we use a block cipher like AES-XTS, later 16-byte > encrypted blocks would be changed by hint bit changes, meaning torn > pages could not be tolerated. This means we would have to use full page > writes for hint bit changes, perhaps making this feature have > unacceptable performance overhead. Actually, I think this would have *less* performance overhead than your patch. If you enable checksums or set wal_log_hints=on, then you might incur some write-ahead log records that would otherwise be avoided, and those records will include full page images. This can happen once per page per checkpoint cycle. However, if the first modification to a particular page within a given checkpoint cycle is a regular WAL-logged operation rather than a hint bit change, then the extra WAL record and full-page image are not needed so the overhead is zero. Also, if the first modification is a hint bit change, and then the page is evicted, prompting a full page write, but a regular WAL-logged operation occurs later within the same checkpoint, the later operation no longer needs a full page write. So you still paid the cost of an extra WAL record, but you didn't pay the cost of an extra full page write. In other words, enabling checksums or turning wal_log_hints=on has a relatively low cost except when you have pages that incur only hint-type changes, and no regular changes, within the course of a single checkpoint cycle. On the other hand, in order to avoid IV reuse, your patch needed to bump the page LSN for every change, or at least for every eviction. That means you could potentially incur the overhead of an extra full page write multiple times per checkpoint cycle, even if there were non-hint changes to that page in the same checkpoint cycle. Now you could say, well, let's not bump the page LSN for every hint-type change, and then your patch would have lower overhead than an approach based on XTS, but I think that also loses a ton of security, because now you're reusing IVs with an encryption system that is documented not to tolerate the reuse of IVs. I'm not here to try to pretend that encryption is going to be cheap. I just don't believe this particular argument about why AES-XTS should be more expensive. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Tue, Oct 5, 2021 at 1:24 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Oct 4, 2021 at 10:00 PM Stephen Frost <sfrost@snowman.net> wrote: > > > I do want to point out, as I think I did when we discussed this but want > > > to be sure it's also captured here- I don't think that temporary file > > > access should be forced to be block-oriented when it's naturally (in > > > very many cases) sequential. To that point, I'm thinking that we need a > > > temp file access API through which various systems work that's > > > sequential and therefore relatively similar to the existing glibc, et > > > al, APIs, but by going through our own internal API (which more > > > consistently works with the glibc APIs and provides better error > > > reporting in the event of issues, etc) we can then extend it to work as > > > an encrypted stream instead. > > > > Regarding this, would it use block-oriented access on the backend? > > > > I agree that we need a better API layer through which all filesystem > > access is routed. One of the notable weaknesses of the Cybertec patch > > is that it has too large a code footprint, > > (sent too soon) > > ...precisely because PostgreSQL doesn't have such a layer. > > But I think ultimately we do want to encrypt and decrypt in blocks, so > if we create such a layer, it should expose byte-oriented APIs but > combine the actual I/Os somehow. That's also good for cutting down the > number of system calls, which is a benefit unto itself. I have to say that this seems to be moving the goalposts quite far down the road from just developing a layer to allow for sequential reading and writing to files that allows us to get away from bare write() calls. While I agree that we want to encrypt/decrypt in blocks when we're working with our own blocks, I don't know that it's really necessary to do for these kinds of use cases. I appreciate the goal of reducing the number of syscalls though. Part of my concern here is that a patch which changes all of our existing sequential access using write() and friends to work in a block manner instead ends up probably being just as big and invasive as those parts of the TDE patch which did the same, and it isn't actually necessary as there are stream ciphers which we could certainly use for, well, stream-based access patterns. No, that doesn't improve the situation around the number of syscalls, but it also doesn't make that any worse than it is today. Perhaps this is all too meta and we need to work through some specific ideas around just what this would look like. In particular, thinking about what this API would look like and how it would be used by reorderbuffer.c, which builds up changes in memory and then does a bare write() call, seems like a main use-case to consider. The gist there being "can we come up with an API to do all these things that doesn't require entirely rewriting ReorderBufferSerializeChange()?" Seems like it'd be easier to achieve that by having something that looks very close to how write() looks, but just happens to have the option to run the data through a stream cipher and maybe does better error handling for us. Making that layer also do block-based access to the files underneath seems like a much larger effort that, sure, may make some things better too but if we could do that with the same API then it could also be done later if someone's interested in that. Thanks, Stephen
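To sketch the shape being described here (every name below is invented for illustration; nothing like it exists in the tree), the API could stay as close to write() as possible while hiding the cipher and the error handling behind an opaque handle:

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical sequential temp-file stream; plain or encrypted is
     * decided once, at open time. */
    typedef struct TempFileStream TempFileStream;

    extern TempFileStream *tfs_open(const char *path, int flags,
                                    bool encrypt);

    /* These would ereport(ERROR) internally on failure, so callers
     * don't have to check return codes after every call. */
    extern void tfs_write(TempFileStream *tfs, const void *buf, size_t len);
    extern void tfs_read(TempFileStream *tfs, void *buf, size_t len);
    extern void tfs_close(TempFileStream *tfs);

Under that shape, ReorderBufferSerializeChange() would swap its bare write() for tfs_write() with the same buffer and length, and whether a stream cipher sits behind the handle stays invisible to the caller.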
On Wed, Oct 6, 2021 at 11:22 AM Stephen Frost <sfrost@snowman.net> wrote: > Seems like it'd be easier to achieve that by having something that looks > very close to how write() looks, but just happens to have the option to > run the data through a stream cipher and maybe does better error > handling for us. Making that layer also do block-based access to the > files underneath seems like a much larger effort that, sure, may make > some things better too but if we could do that with the same API then it > could also be done later if someone's interested in that. Yeah, it's possible that is the best option, but I'm not really convinced. I think the places that are doing I/O in small chunks are pretty questionable. Like look at this code from pgstat.c, with block comments omitted for brevity: rc = fwrite(&format_id, sizeof(format_id), 1, fpout); (void) rc; /* we'll check for error with ferror */ rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout); (void) rc; /* we'll check for error with ferror */ rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout); (void) rc; /* we'll check for error with ferror */ rc = fwrite(&walStats, sizeof(walStats), 1, fpout); (void) rc; /* we'll check for error with ferror */ rc = fwrite(slruStats, sizeof(slruStats), 1, fpout); (void) rc; /* we'll check for error with ferror */ I don't know exactly what the best way to write this code is, but I'm fairly sure this isn't it. I suppose that whoever wrote this code chose to use fwrite() rather than write() to get buffering, but that had the effect of delaying the error checking by an amount that I would consider unacceptable in new code -- we do all the fwrite() calls to generate the entire file and then only check ferror() once at the very end! If we did our own buffering, we could do this a lot better. And if we used that same buffering layer everywhere, it might not be too hard to make it use a block cipher rather than a stream cipher. Now I don't intrinsically have strong feelings about whether block ciphers or stream ciphers are better, but I think it's going to be easier to analyze the security of the system and to maintain it across future developments in cryptography if we can use the same kind of cipher everywhere. If we use block ciphers for some things and stream ciphers for other things, it is more complicated. Perhaps that is unavoidable and I just shouldn't worry about it. It may work out that we'll end up needing to do that anyway for one reason or another. But all things being equal, I think it's nice if we make all the places where we do I/O look more like each other, not specifically because of TDE but because that's just better in general. For example, Andres is working on async I/O. Maybe this particular piece of code is moot in terms of that project because I think Andres is hoping to get the shared memory stats collector patch committed. But, say that doesn't happen. The more all of the I/O looks the same, the easier it will be to make all of it use whatever async I/O infrastructure he creates. The more every module does things in its own way, the harder it is. And batching work together into reasonable-sized blocks is probably necessary for async I/O too. I just can't look at code like that shown above and think anything other than "blech". -- Robert Haas EDB: http://www.enterprisedb.com
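As an illustration of the buffering layer being suggested (hypothetical names, deliberately simplified, not proposed patch code), the key property is that each call latches the first failure instead of deferring every check to one ferror() at the very end:

    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    typedef struct BufferedWriter
    {
        int     fd;
        char    buf[8192];
        size_t  used;
        int     save_errno;     /* first error seen, 0 if none */
    } BufferedWriter;

    static void
    bw_flush(BufferedWriter *bw)
    {
        if (bw->save_errno == 0 && bw->used > 0 &&
            write(bw->fd, bw->buf, bw->used) != (ssize_t) bw->used)
            bw->save_errno = errno ? errno : EIO;
        bw->used = 0;
    }

    static void
    bw_write(BufferedWriter *bw, const void *data, size_t len)
    {
        if (bw->save_errno != 0)
            return;             /* already failed; caller sees it later */
        if (bw->used + len > sizeof(bw->buf))
            bw_flush(bw);
        if (len > sizeof(bw->buf))
        {
            /* oversized request: bypass the buffer entirely */
            if (bw->save_errno == 0 &&
                write(bw->fd, data, len) != (ssize_t) len)
                bw->save_errno = errno ? errno : EIO;
            return;
        }
        memcpy(bw->buf + bw->used, data, len);
        bw->used += len;
    }

A caller still does a final bw_flush() and checks save_errno once, but unlike the fwrite()/ferror() pattern above, nothing is silently attempted after the first failure, and bw_flush() is also the natural seam where a block cipher could encrypt full buffers before they hit the kernel.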
On Wed, Oct 6, 2021 at 11:01:25AM -0400, Robert Haas wrote: > On Tue, Oct 5, 2021 at 4:29 PM Bruce Momjian <bruce@momjian.us> wrote: > > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote: > > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote: > > > We are still working on our TDE patch. Right now the focus is on refactoring > > > temporary file access to make the TDE patch itself smaller. Reconsidering > > > encryption mode choices given concerns expressed is next. Currently a viable > > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an > > > issue with predictable IV and isn't totally broken in case of IV reuse. > > > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous > > 16-byte blocks affect later blocks, meaning that hint bit changes would > > also affect later blocks. I think this means we would need to write WAL > > full page images for hint bit changes to avoid torn pages. Right now > > hint bit (single bit) changes can be lost without causing torn pages. > > This was another of the advantages of using a stream cipher like CTR. > > This seems wrong to me. CTR requires that you not reuse the IV. If you > re-encrypt the page with a different IV, torn pages are a problem. If > you re-encrypt it with the same IV, then it's not secure any more. We were not changing the IV for hint bit changes, meaning the hint bit changes were visible if you compared the blocks. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Wed, Oct 6, 2021 at 12:54:49PM -0400, Bruce Momjian wrote: > On Wed, Oct 6, 2021 at 11:01:25AM -0400, Robert Haas wrote: > > On Tue, Oct 5, 2021 at 4:29 PM Bruce Momjian <bruce@momjian.us> wrote: > > > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote: > > > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote: > > > > We are still working on our TDE patch. Right now the focus is on refactoring > > > > temporary file access to make the TDE patch itself smaller. Reconsidering > > > > encryption mode choices given concerns expressed is next. Currently a viable > > > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an > > > > issue with predictable IV and isn't totally broken in case of IV reuse. > > > > > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous > > > 16-byte blocks affect later blocks, meaning that hint bit changes would > > > also affect later blocks. I think this means we would need to write WAL > > > full page images for hint bit changes to avoid torn pages. Right now > > > hint bit (single bit) changes can be lost without causing torn pages. > > > This was another of the advantages of using a stream cipher like CTR. > > > > This seems wrong to me. CTR requires that you not reuse the IV. If you > > re-encrypt the page with a different IV, torn pages are a problem. If > > you re-encrypt it with the same IV, then it's not secure any more. > > We were not changing the IV for hint bit changes, meaning the hint bit > changes were visible if you compared the blocks. Oops, I was wrong above, and my patch docs prove it: Hint Bits - - - - - For hint bit changes, the LSN normally doesn't change, which is a problem. By enabling wal_log_hints, you get full page writes to the WAL after the first hint bit change of the checkpoint. This is useful for two reasons. First, it generates a new LSN, which is needed for the IV to be secure. Second, full page images protect against torn pages, which is an even bigger requirement for encryption because the new LSN is re-encrypting the entire page, not just the hint bit changes. You can safely lose the hint bit changes, but you need to use the same LSN to decrypt the entire page, so a torn page with an LSN change cannot be decrypted. To prevent this, wal_log_hints guarantees that the pre-hint-bit version (and previous LSN version) of the page is restored. However, if a hint-bit-modified page is written to the file system during a checkpoint, and there is a later hint bit change switching the same page from clean to dirty during the same checkpoint, we need a new LSN, and wal_log_hints doesn't give us a new LSN here. The fix for this is to update the page LSN by writing a dummy WAL record via xloginsert.c::LSNForEncryption() in such cases. Seems my performance concerns were unfounded. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Wed, Oct 6, 2021 at 11:17:59AM -0400, Robert Haas wrote: > If you enable checksums or set wal_log_hints=on, then you might incur > some write-ahead log records that would otherwise be avoided, and > those records will include full page images. This can happen once per > page per checkpoint cycle. However, if the first modification to a > particular page within a given checkpoint cycle is a regular > WAL-logged operation rather than a hint bit change, then the extra WAL > record and full-page image are not needed so the overhead is zero. > Also, if the first modification is a hint bit change, and then the > page is evicted, prompting a full page write, but a regular WAL-logged > operation occurs later within the same checkpoint, the later operation > no longer needs a full page write. So you still paid the cost of an > extra WAL record, but you didn't pay the cost of an extra full page > write. In other words, enabling checksums or turning wal_log_hints=on > has a relatively low cost except when you have pages that incur only > hint-type changes, and no regular changes, within the course of a > single checkpoint cycle. > > On the other hand, in order to avoid IV reuse, your patch needed to > bump the page LSN for every change, or at least for every eviction. > That means you could potentially incur the overhead of an extra full > page write multiple times per checkpoint cycle, even if there were > non-hint changes to that page in the same checkpoint cycle. Now you > could say, well, let's not bump the page LSN for every hint-type > change, and then your patch would have lower overhead than an approach > based on XTS, but I think that also loses a ton of security, because > now you're reusing IVs with an encryption system that is documented > not to tolerate the reuse of IVs. > > I'm not here to try to pretend that encryption is going to be cheap. I > just don't believe this particular argument about why AES-XTS should > be more expensive. OK, good to know. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, Oct 5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote: > > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote: > > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote: > > > We are still working on our TDE patch. Right now the focus is on refactoring > > > temporary file access to make the TDE patch itself smaller. Reconsidering > > > encryption mode choices given concerns expressed is next. Currently a viable > > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an > > > issue with predictable IV and isn't totally broken in case of IV reuse. > > > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous > > 16-byte blocks affect later blocks, meaning that hint bit changes would > > also affect later blocks. I think this means we would need to write WAL > > full page images for hint bit changes to avoid torn pages. Right now > > hint bit (single bit) changes can be lost without causing torn pages. > > This was another of the advantages of using a stream cipher like CTR. > > Another problem caused by block mode ciphers is that to use the LSN as > part of the nonce, the LSN must not be encrypted, but you then have to > find a 16-byte block in the page that you don't need to encrypt. With AES-XTS, we don't need to use the LSN as part of the nonce though, so I don't think this argument is actually valid..? As discussed previously regarding AES-XTS, the general idea was to use the path to the file and the filename itself plus the block number as the IV, and that works fine for XTS because it's ok to reuse it (unlike with CTR). Thanks, Stephen
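As a sketch of what that could look like with OpenSSL (the helper is hypothetical, the path is packed into the tweak naively for brevity where a real design would presumably hash it, and error handling is minimal):

    #include <openssl/evp.h>
    #include <stdint.h>
    #include <string.h>

    #define BLCKSZ 8192

    /* Encrypt one page with AES-XTS; key is 32 bytes (two AES-128 keys),
     * and the 16-byte tweak is derived from the relation path plus the
     * block number. */
    static int
    xts_encrypt_page(const unsigned char key[32], const char *relpath,
                     uint32_t blkno, const unsigned char *page,
                     unsigned char *out)
    {
        unsigned char tweak[16] = {0};
        EVP_CIPHER_CTX *ctx;
        int outl, ok;

        strncpy((char *) tweak, relpath, 12);   /* naive: truncates */
        memcpy(tweak + 12, &blkno, sizeof(blkno));

        ctx = EVP_CIPHER_CTX_new();
        ok = EVP_EncryptInit_ex(ctx, EVP_aes_128_xts(), NULL, key, tweak) &&
             EVP_EncryptUpdate(ctx, out, &outl, page, BLCKSZ);
        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }

Because XTS tolerates a predictable tweak, nothing here needs the LSN, which is the point being made above; deriving the tweak from the stored location is the same approach LUKS-style systems use.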
On Wed, Oct 6, 2021 at 03:17:00PM -0400, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Tue, Oct 5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote: > > > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote: > > > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote: > > > > We are still working on our TDE patch. Right now the focus is on refactoring > > > > temporary file access to make the TDE patch itself smaller. Reconsidering > > > > encryption mode choices given concerns expressed is next. Currently a viable > > > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an ----------------------------------------------------- > > > > issue with predictable IV and isn't totally broken in case of IV reuse. > > > > > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous > > > 16-byte blocks affect later blocks, meaning that hint bit changes would > > > also affect later blocks. I think this means we would need to write WAL > > > full page images for hint bit changes to avoid torn pages. Right now > > > hint bit (single bit) changes can be lost without causing torn pages. > > > This was another of the advantages of using a stream cipher like CTR. > > > > Another problem caused by block mode ciphers is that to use the LSN as > > part of the nonce, the LSN must not be encrypted, but you then have to > > find a 16-byte block in the page that you don't need to encrypt. > > With AES-XTS, we don't need to use the LSN as part of the nonce though, > so I don't think this argument is actually valid..? As discussed > previously regarding AES-XTS, the general idea was to use the path to > the file and the filename itself plus the block number as the IV, and > that works fine for XTS because it's ok to reuse it (unlike with CTR). Yes, I would prefer we don't use the LSN. I only mentioned it since Ants Aasma mentioned LSN use above. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Mon, Oct 4, 2021 at 12:44 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> Thanks for the inputs, Robert. In the v4 patch, an unused OID (i.e., 4)
> is fixed for the template0 and the same is removed from the unused oid
> list.
>
> In addition to the review comment fixes, I have removed some code that
> is no longer needed/doesn't make sense since we preserve the OIDs.

This is not a full review, but I'm wondering about this bit of code:

-    if (!RELKIND_HAS_STORAGE(relkind) || OidIsValid(relfilenode))
+    if (!RELKIND_HAS_STORAGE(relkind) || (OidIsValid(relfilenode) && !create_storage))
         create_storage = false;
     else
     {
         create_storage = true;
-        relfilenode = relid;
+
+        /*
+         * Create the storage with oid same as relid if relfilenode is
+         * unspecified by the caller
+         */
+        if (!OidIsValid(relfilenode))
+            relfilenode = relid;
     }

This seems hard to understand, and I wonder if perhaps it can be simplified. If !RELKIND_HAS_STORAGE(relkind), then we're going to set create_storage to false if it was previously true, and otherwise just do nothing. Otherwise, if !create_storage, we'll enter the create_storage = false branch, which effectively does nothing. Otherwise, if !OidIsValid(relfilenode), we'll set relfilenode = relid. So couldn't we express that like this?

    if (!RELKIND_HAS_STORAGE(relkind))
        create_storage = false;
    else if (create_storage && !OidIsValid(relfilenode))
        relfilenode = relid;

If so, that seems more clear.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Thu, Oct 7, 2021 at 2:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Oct 4, 2021 at 12:44 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > Thanks for the inputs, Robert. In the v4 patch, an unused OID (i.e., 4)
> > is fixed for the template0 and the same is removed from the unused oid
> > list.
> >
> > In addition to the review comment fixes, I have removed some code that
> > is no longer needed/doesn't make sense since we preserve the OIDs.
>
> This is not a full review, but I'm wondering about this bit of code:
>
> -    if (!RELKIND_HAS_STORAGE(relkind) || OidIsValid(relfilenode))
> +    if (!RELKIND_HAS_STORAGE(relkind) || (OidIsValid(relfilenode) && !create_storage))
>          create_storage = false;
>      else
>      {
>          create_storage = true;
> -        relfilenode = relid;
> +
> +        /*
> +         * Create the storage with oid same as relid if relfilenode is
> +         * unspecified by the caller
> +         */
> +        if (!OidIsValid(relfilenode))
> +            relfilenode = relid;
>      }
>
> This seems hard to understand, and I wonder if perhaps it can be
> simplified. If !RELKIND_HAS_STORAGE(relkind), then we're going to set
> create_storage to false if it was previously true, and otherwise just
> do nothing. Otherwise, if !create_storage, we'll enter the
> create_storage = false branch which effectively does nothing.
> Otherwise, if !OidIsValid(relfilenode), we'll set relfilenode = relid.
> So couldn't we express that like this?
>
>     if (!RELKIND_HAS_STORAGE(relkind))
>         create_storage = false;
>     else if (create_storage && !OidIsValid(relfilenode))
>         relfilenode = relid;
>
> If so, that seems more clear.

The 'create_storage' flag says whether or not to create the storage when a valid relfilenode is passed. The 'create_storage' flag alone cannot make the storage-creation decision in heap_create(). Only the binary-upgrade flow sets 'create_storage' to true and expects the storage to be created with the specified relfilenode. Every other caller/flow passes false for 'create_storage', and we still need to create storage in heap_create() if the relkind has storage. That's why I have explicitly set 'create_storage = true' in the else flow and initialize relfilenode on an as-needed basis.

Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com
Bruce Momjian <bruce@momjian.us> wrote: > On Tue, Oct 5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote: > > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote: > > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote: > > > We are still working on our TDE patch. Right now the focus is on refactoring > > > temporary file access to make the TDE patch itself smaller. Reconsidering > > > encryption mode choices given concerns expressed is next. Currently a viable > > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an > > > issue with predictable IV and isn't totally broken in case of IV reuse. > > > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous > > 16-byte blocks affect later blocks, meaning that hint bit changes would > > also affect later blocks. I think this means we would need to write WAL > > full page images for hint bit changes to avoid torn pages. Right now > > hint bit (single bit) changes can be lost without causing torn pages. > > This was another of the advantages of using a stream cipher like CTR. > > The above text isn't very clear. What I am saying is that currently > torn pages can be tolerated by hint bit writes because only a single > byte is changing. If we use a block cipher like AES-XTS, later 16-byte > encrypted blocks would be changed by hint bit changes, meaning torn > pages could not be tolerated. This means we would have to use full page > writes for hint bit changes, perhaps making this feature have > unacceptable performance overhead. IIRC, in the XTS scheme, a change of a single byte in the 16-byte block causes the whole encrypted block to be different after the next encryption; however, the following blocks are not affected. CBC (cipher-block chaining) is the mode where the change in one block does affect the encryption of the following block. I'm not sure if this fact is important from the hint bit perspective though. It would be an important difference if there was a guarantee that the 16-byte blocks are consistent even on a torn page - does e.g. proper alignment of pages guarantee that? Nevertheless, the absence of the chaining may be a reason to prefer CBC to XTS anyway. -- Antonin Houska Web: https://www.cybertec-postgresql.com
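Antonin's description of XTS block independence is easy to check experimentally. The standalone demo below (an illustration assuming OpenSSL 1.1.0+, not code from any patch) flips one bit in the second 16-byte block of a 64-byte buffer and shows that only the corresponding ciphertext block changes; under CBC, every later block would change as well.

    /* Demonstrates XTS block independence: expect "same, differs, same, same". */
    #include <stdio.h>
    #include <string.h>
    #include <openssl/evp.h>

    static void
    xts_encrypt(const unsigned char key[64], const unsigned char tweak[16],
                const unsigned char *pt, unsigned char *ct, int len)
    {
        int             outlen;
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

        EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak);
        EVP_EncryptUpdate(ctx, ct, &outlen, pt, len);   /* XTS: one update call */
        EVP_CIPHER_CTX_free(ctx);
    }

    int
    main(void)
    {
        unsigned char key[64], tweak[16] = {0};
        unsigned char pt1[64] = {0}, pt2[64] = {0};
        unsigned char ct1[64], ct2[64];
        int           b;

        for (b = 0; b < 64; b++)
            key[b] = (unsigned char) b;   /* demo key; the two halves must differ */

        pt2[16] ^= 0x01;                  /* flip one bit in the second block */
        xts_encrypt(key, tweak, pt1, ct1, 64);
        xts_encrypt(key, tweak, pt2, ct2, 64);

        for (b = 0; b < 4; b++)
            printf("block %d: %s\n", b,
                   memcmp(ct1 + b * 16, ct2 + b * 16, 16) ? "differs" : "same");
        return 0;
    }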
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Thu, Oct 7, 2021 at 3:24 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > Every other > caller/flow passes false for 'create_storage' and we still need to > create storage in heap_create() if relkind has storage. That seems surprising. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Wed, Oct 6, 2021 at 03:17:00PM -0400, Stephen Frost wrote: > > * Bruce Momjian (bruce@momjian.us) wrote: > > > On Tue, Oct 5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote: > > > > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote: > > > > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote: > > > > > We are still working on our TDE patch. Right now the focus is on refactoring > > > > > temporary file access to make the TDE patch itself smaller. Reconsidering > > > > > encryption mode choices given concerns expressed is next. Currently a viable > > > > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an > ----------------------------------------------------- > > > > > issue with predictable IV and isn't totally broken in case of IV reuse. > > > > > > > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous > > > > 16-byte blocks affect later blocks, meaning that hint bit changes would > > > > also affect later blocks. I think this means we would need to write WAL > > > > full page images for hint bit changes to avoid torn pages. Right now > > > > hint bit (single bit) changes can be lost without causing torn pages. > > > > This was another of the advantages of using a stream cipher like CTR. > > > > > > Another problem caused by block mode ciphers is that to use the LSN as > > > part of the nonce, the LSN must not be encrypted, but you then have to > > > find a 16-byte block in the page that you don't need to encrypt. > > > > With AES-XTS, we don't need to use the LSN as part of the nonce though, > > so I don't think this argument is actually valid..? As discussed > > previously regarding AES-XTS, the general idea was to use the path to > > the file and the filename itself plus the block number as the IV, and > > that works fine for XTS because it's ok to reuse it (unlike with CTR). > > Yes, I would prefer we don't use the LSN. I only mentioned it since > Ants Aasma mentioned LSN use above. Ohhh, apologies for missing that, makes more sense now. Thanks! Stephen
On Wed, Oct 6, 2021 at 3:17 PM Stephen Frost <sfrost@snowman.net> wrote: > With AES-XTS, we don't need to use the LSN as part of the nonce though, > so I don't think this argument is actually valid..? As discussed > previously regarding AES-XTS, the general idea was to use the path to > the file and the filename itself plus the block number as the IV, and > that works fine for XTS because it's ok to reuse it (unlike with CTR). However, there's also the option of storing a nonce in each page, as suggested by the subject of this thread. I think that's probably a pretty workable approach, as demonstrated by the patch that started this thread. We'd need to think a bit carefully about whether any of the compile-time calculations the patch moves to runtime are expensive enough to matter and whether any such impacts can be mitigated, but I think there is a good chance that such issues are manageable. I'm a little concerned by the email from "Sasasu" saying that even in XTS reusing the IV is not cryptographically weak. I don't know enough about these different encryption modes to know if he's right, but if he is then perhaps we need to consider his suggestion of using AES-GCM. Or, uh, something else. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Oct 7, 2021 at 10:28:55AM -0400, Robert Haas wrote: > However, there's also the option of storing a nonce in each page, as > suggested by the subject of this thread. I think that's probably a > pretty workable approach, as demonstrated by the patch that started > this thread. We'd need to think a bit carefully about whether any of > the compile-time calculations the patch moves to runtime are expensive > enough to matter and whether any such impacts can be mitigated, but I > think there is a good chance that such issues are manageable. > > I'm a little concerned by the email from "Sasasu" saying that even in > XTS reusing the IV is not cryptographically weak. I don't know enough > about these different encryption modes to know if he's right, but if > he is then perhaps we need to consider his suggestion of using > AES-GCM. Or, uh, something else. I continue to be concerned that a page format change will decrease the desirability of this feature by making migration complex and increasing its code complexity. I am unclear if it is necessary. I think the big question is whether XTS with db/relfilenode/blocknumber is sufficient as an IV without a nonce that changes for updates. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Oct 6, 2021 at 3:17 PM Stephen Frost <sfrost@snowman.net> wrote: > > With AES-XTS, we don't need to use the LSN as part of the nonce though, > > so I don't think this argument is actually valid..? As discussed > > previously regarding AES-XTS, the general idea was to use the path to > > the file and the filename itself plus the block number as the IV, and > > that works fine for XTS because it's ok to reuse it (unlike with CTR). > > However, there's also the option of storing a nonce in each page, as > suggested by the subject of this thread. I think that's probably a > pretty workable approach, as demonstrated by the patch that started > this thread. We'd need to think a bit carefully about whether any of > the compile-time calculations the patch moves to runtime are expensive > enough to matter and whether any such impacts can be mitigated, but I > think there is a good chance that such issues are manageable. I agree with this in general, though I would think we'd use that for GCM or another authenticated encryption mode (perhaps GCM-SIV with the LSN as the IV) at some point off in the future. Alternatively, we could use that technique to just provide a better per-page checksum than what we have today. Maybe we could figure out how to leverage that to move to 64bit transaction IDs with some page-level epoch. Definitely a lot of possibilities. Ultimately though, regarding TDE at least, I would think we'd rather start with something that's block level and doesn't require a page format change. > I'm a little concerned by the email from "Sasasu" saying that even in > XTS reusing the IV is not cryptographically weak. I don't know enough > about these different encryption modes to know if he's right, but if > he is then perhaps we need to consider his suggestion of using > AES-GCM. Or, uh, something else. Think you meant 'strong' above (or maybe omit the 'not', either way the opposite of the double-negative that seems to be what was written). As I understand it, XTS isn't great for dealing with someone who has ongoing access to watch writes over time, just in general, but that wasn't what it is generally used to address (and isn't what we would be looking for it to address either). Perhaps there's other modes which don't require that we change the page format to support them besides XTS (in particular, as our pages are multiples of 16 bytes, it's possible we don't really need XTS since there aren't any partial blocks and could simply use XEX instead..) Thanks, Stephen
On Thu, Oct 7, 2021 at 10:27:15AM +0200, Antonin Houska wrote: > Bruce Momjian <bruce@momjian.us> wrote: > > The above text isn't very clear. What I am saying is that currently > > torn pages can be tolerated by hint bit writes because only a single > > byte is changing. If we use a block cipher like AES-XTS, later 16-byte > > encrypted blocks would be changed by hint bit changes, meaning torn > > pages could not be tolerated. This means we would have to use full page > > writes for hint bit changes, perhaps making this feature have > > unacceptable performance overhead. > > IIRC, in the XTS scheme, a change of a single byte in the 16-byte block causes > the whole encrypted block to be different after the next encryption, however > the following blocks are not affected. CBC (cipher-block chaining) is the mode > where the change in one block does affect the encryption of the following > block. Oh, good point. I was not aware of that. It means XTS does not feed the previous block as part of the nonce to the next block. > I'm not sure if this fact is important from the hint bit perspective > though. It would be an important difference if there was a guarantee that the > 16-byte blocks are consitent even on torn page - does e.g. proper alignment of > pages guarantee that? Nevertheless, the absence of the chaining may be a > reason to prefer CBC to XTS anyway. Uh, technically most drives use 512-byte sectors, but I don't know if there is any guarantee that 512-byte sectors will not be torn --- I have a feeling there isn't. I think we get away with the hint bit case because you can't tear a single bit. ;-) However, my patch created a full page write for hint bit changes. If we don't use the LSN, those full page writes will only happen once per checkpoint, which seems acceptable, at least to Robert. Interesting on the CBC idea which would force the rest of the page to change --- not sure if that is valuable. I know stream ciphers can be diff'ed to see data because they are xor'ing the data --- I don't remember if block ciphers have similar weaknesses. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
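The "diff" weakness Bruce mentions comes from CTR XORing a keystream with the data: if the same key and IV are ever used twice, XORing the two ciphertexts cancels the keystream and yields the XOR of the two plaintexts. A small illustrative demo (mine, not from any patch) assuming OpenSSL:

    /* CTR IV reuse: ct1 ^ ct2 equals pt1 ^ pt2, so the keystream cancels out. */
    #include <stdio.h>
    #include <openssl/evp.h>

    static void
    ctr_encrypt(const unsigned char key[32], const unsigned char iv[16],
                const unsigned char *pt, unsigned char *ct, int len)
    {
        int             outlen;
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

        EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv);
        EVP_EncryptUpdate(ctx, ct, &outlen, pt, len);
        EVP_CIPHER_CTX_free(ctx);
    }

    int
    main(void)
    {
        unsigned char key[32] = "0123456789abcdef0123456789abcde";
        unsigned char iv[16] = {0};             /* the reused IV */
        unsigned char pt1[16] = "attack at dawn!";
        unsigned char pt2[16] = "defend at dusk!";
        unsigned char ct1[16], ct2[16];
        int           i, leaks = 1;

        ctr_encrypt(key, iv, pt1, ct1, 16);
        ctr_encrypt(key, iv, pt2, ct2, 16);
        for (i = 0; i < 16; i++)
            if ((ct1[i] ^ ct2[i]) != (pt1[i] ^ pt2[i]))
                leaks = 0;
        printf("ct1^ct2 == pt1^pt2: %s\n", leaks ? "yes" : "no");   /* yes */
        return 0;
    }

A block cipher in XTS mode does not leak plaintext this way under tweak reuse; at worst, an observer comparing successive writes of the same block learns which 16-byte blocks changed.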
Hi, On October 7, 2021 8:54:54 AM PDT, Bruce Momjian <bruce@momjian.us> wrote: >Uh, technically most drives use 512-byte sectors, but I don't know if >there is any guarantee that 512-byte sectors will not be torn --- I have >a feeling there isn't. I think we get away with the hint bit case >because you can't tear a single bit. ;-) We rely on it today, e.g. for the control file. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Oct 7, 2021 at 11:45 AM Bruce Momjian <bruce@momjian.us> wrote: > I continue to be concerned that a page format change will decrease the > desirability of this feature by making migration complex and increasing > its code complexity. I am unclear if it is necessary. > > I think the big question is whether XTS with db/relfilenode/blocknumber > is sufficient as an IV without a nonce that changes for updates. Those are fair concerns. I think I agree with everything you say here. There was some discussion earlier (not sure if it was on this thread) about integrity verification. And I don't think that there's any way we can do that without storing some kind of integrity verifier in each page. And if we're doing that anyway to support that feature, then there's no problem if it also includes the IV. I had read Stephen's previous comments to indicate that he thought we should go this way, and it sounded cool to me, too. However, it does make migrations somewhat more complex, because you would then have to actually dump-and-reload, rather than, perhaps, just encrypting all the existing pages while the cluster was offline. Personally, I'm not that fussed about that problem, but I'm also rarely the one who has to help people migrate to new releases, so I may not be as sympathetic to those problems there as I should be. If we don't care about the integrity verification features, then as you say the next question is whether it's acceptable to use a predictable nonce that is computed from values that can be known without looking at the block contents. If so, we can forget about $SUBJECT and save ourselves some engineering work. If not, then I think we need to do $SUBJECT anyway. And so far I am not really convinced that we know which of those two things is the case. I don't, anyway. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Oct 7, 2021 at 12:26 PM Andres Freund <andres@anarazel.de> wrote: > We rely on it today, e.g. for the control file. I think that's the only place, though. We can't rely on it for data files because base backups don't go through shared buffers, so reads and writes can get torn in memory and not just on sector boundaries. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Oct 7, 2021 at 12:29:04PM -0400, Robert Haas wrote: > On Thu, Oct 7, 2021 at 11:45 AM Bruce Momjian <bruce@momjian.us> wrote: > > I continue to be concerned that a page format change will decrease the > > desirability of this feature by making migration complex and increasing > > its code complexity. I am unclear if it is necessary. > > > > I think the big question is whether XTS with db/relfilenode/blocknumber > > is sufficient as an IV without a nonce that changes for updates. > > Those are fair concerns. I think I agree with everything you say here. > > There was some discussion earlier (not sure if it was on this thread) > about integrity verification. And I don't think that there's any way > we can do that without storing some kind of integrity verifier in each > page. And if we're doing that anyway to support that feature, then > there's no problem if it also includes the IV. I had read Stephen's Agreed. > previous comments to indicate that he thought we should go this way, > and it sounded cool to me, too. However, it does make migrations Uh, what has not been publicly stated yet is that there was a meeting, prompted by Stephen, with him, Cybertec staff, and myself on September 16 at the Cybertec office in Vienna to discuss this. After vigorous discussion, it was agreed that a simplified version of this feature would be implemented that would not have tamper detection (beyond encrypted checksums) and would use XTS so that the LSN would not need to be used. > If we don't care about the integrity verification features, then as > you say the next question is whether it's acceptable to use a > predictable nonce that is computed from values that can be known > without looking at the block contents. If so, we can forget about > $SUBJECT and save ourselves some engineering work. If not, then I Yes, that is now the question. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Thu, Oct 7, 2021 at 09:26:26AM -0700, Andres Freund wrote: > Hi, > > On October 7, 2021 8:54:54 AM PDT, Bruce Momjian <bruce@momjian.us> wrote: > > >Uh, technically most drives use 512-byte sectors, but I don't know if > >there is any guarantee that 512-byte sectors will not be torn --- I have > >a feeling there isn't. I think we get away with the hint bit case > >because you can't tear a single bit. ;-) > > We rely on it today, e.g. for the control file. OK, good to know, and we can be sure the 16-byte blocks will never cross 512-byte sector boundaries. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Thu, Oct 7, 2021 at 12:32:16PM -0400, Robert Haas wrote: > On Thu, Oct 7, 2021 at 12:26 PM Andres Freund <andres@anarazel.de> wrote: > > We rely on it today, e.g. for the control file. > > I think that's the only place, though. We can't rely on it for data > files because base backups don't go through shared buffers, so reads > and writes can get torn in memory and not just on sector boundaries. Uh, do backups get torn and later used? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Oct 7, 2021 at 11:45 AM Bruce Momjian <bruce@momjian.us> wrote: > > I continue to be concerned that a page format change will decrease the > > desirability of this feature by making migration complex and increasing > > its code complexity. I am unclear if it is necessary. > > > > I think the big question is whether XTS with db/relfilenode/blocknumber > > is sufficient as an IV without a nonce that changes for updates. > > Those are fair concerns. I think I agree with everything you say here. > > There was some discussion earlier (not sure if it was on this thread) > about integrity verification. And I don't think that there's any way > we can do that without storing some kind of integrity verifier in each > page. And if we're doing that anyway to support that feature, then > there's no problem if it also includes the IV. I had read Stephen's > previous comments to indicate that he thought we should go this way, > and it sounded cool to me, too. However, it does make migrations > somewhat more complex, because you would then have to actually > dump-and-reload, rather than, perhaps, just encrypting all the > existing pages while the cluster was offline. Personally, I'm not that > fussed about that problem, but I'm also rarely the one who has to help > people migrate to new releases, so I may not be as sympathetic to > those problems there as I should be. Yes, for integrity verification (also known as 'authenticated encryption') we'd definitely need to store a larger nonce value. In the very, very long term, I think it'd be great to have that, and the patch proposed on this thread seems really cool as a way to get us there. > If we don't care about the integrity verification features, then as > you say the next question is whether it's acceptable to use a > predictable nonce that is computing from values that can be known > without looking at the block contents. If so, we can forget about > $SUBJECT and save ourselves some engineering work. If not, then I > think we need to do $SUBJECT anyway. And so far I am not really > convinced that we know which of those two things is the case. I don't, > anyway. Having TDE, even without authenticated encryption, is certainly valuable. Reducing the amount of engineering required to get there is worthwhile. Implementing TDE w/ XTS or similar, provided we do agree that we can do so with an IV that we don't need to find additional space for, would avoid that page-level format change. I agree we should do some research to make sure we at least have a reasonable answer to that question. I've spent a bit of time on that and haven't gotten to a sure answer one way or the other as yet, but will continue to look. Thanks, Stephen
On Thu, Oct 7, 2021 at 12:56:22PM -0400, Bruce Momjian wrote: > On Thu, Oct 7, 2021 at 12:32:16PM -0400, Robert Haas wrote: > > On Thu, Oct 7, 2021 at 12:26 PM Andres Freund <andres@anarazel.de> wrote: > > > We rely on it today, e.g. for the control file. > > > > I think that's the only place, though. We can't rely on it for data > > files because base backups don't go through shared buffers, so reads > > and writes can get torn in memory and not just on sector boundaries. > > Uh, do backups get torn and later used? Are you saying a base backup could read a page from the file system and see a partial write, even though the write is written as 8k? I had not thought about that. I think this whole discussion is about whether we need full page images for hint bit changes. I think we do if we use the LSN for the nonce (in the old patch), and probably need it for hint bit changes when using block cipher modes (XTS) if we feel basebackup could read only part of a 16-byte page change. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Thu, Oct 7, 2021 at 12:56 PM Bruce Momjian <bruce@momjian.us> wrote: > Uh, do backups get torn and later used? Yep. That's why base backup mode forces full_page_writes on temporarily even if it's off in general. Crazy, right? -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 6 Oct 2021 at 23:08, Bruce Momjian <bruce@momjian.us> wrote:
> Yes, I would prefer we don't use the LSN. I only mentioned it since
> Ants Aasma mentioned LSN use above.
Is there a particular reason why you would prefer not to use LSN? I suggested it because in my view having a variable tweak is still better than not having it even if we deem the risks of XTS tweak reuse not important for our use case. The comment was made under the assumption that requiring wal_log_hints for encryption is acceptable.
Ants Aasma Senior Database Engineer www.cybertec-postgresql.com
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Oct 7, 2021 at 12:26 PM Andres Freund <andres@anarazel.de> wrote: > > We rely on it today, e.g. for the control file. > > I think that's the only place, though. We can't rely on it for data > files because base backups don't go through shared buffers, so reads > and writes can get torn in memory and not just on sector boundaries. There was a recent discussion with Munro, as I recall, that actually points out how we probably shouldn't be relying on that even for the control file and proposed having multiple control files (something which I generally agree with as a good idea), particularly due to SSD technology. Thanks, Stephen
On Thu, Oct 7, 2021 at 09:38:45PM +0300, Ants Aasma wrote: > On Wed, 6 Oct 2021 at 23:08, Bruce Momjian <bruce@momjian.us> wrote: > > Yes, I would prefer we don't use the LSN. I only mentioned it since > Ants Aasma mentioned LSN use above. > > > Is there a particular reason why you would prefer not to use LSN? I suggested > it because in my view having a variable tweak is still better than not having > it even if we deem the risks of XTS tweak reuse not important for our use case. > The comment was made under the assumption that requiring wal_log_hints for > encryption is acceptable. Well, using the LSN means we have to store the LSN unencrypted, and that means we have to carve out a 16-byte block on the page that is not encrypted. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Thu, Oct 7, 2021 at 1:09 PM Bruce Momjian <bruce@momjian.us> wrote: > Are you saying a base backup could read a page from the file system and > see a partial write, even though the write is written as 8k? I had not > thought about that. Yes; see my other response. > I think this whole discussion is about whether we need full page images > for hint bit changes. I think we do if we use the LSN for the nonce (in > the old patch), and probably need it for hint bit changes when using > block cipher modes (XTS) if we feel basebackup could read only part of a > 16-byte page change. I think all the encryption modes that we're still considering have the (very desirable) property that changing a single bit of the unencrypted page perturbs the entire output. But that just means that encrypted clusters will have to run in the same mode as clusters with checksums, or clusters with wal_log_hints=on, features which the community has already accepted as having reasonable overhead. I have in the past expressed skepticism about whether that overhead is really small enough to be considered acceptable, but if I recall correctly, the test results posted to the list suggest that you need a working set just a little bit larger than shared_buffers to make it really sting. And that's not a super-common thing to do. Anyway, if people aren't screaming about the overhead of that system now, they're not likely to complain about applying it to some new situation either. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Oct 7, 2021 at 12:56 PM Bruce Momjian <bruce@momjian.us> wrote: > > Uh, do backups get torn and later used? > > Yep. That's why base backup mode forces full_page_writes on > temporarily even if it's off in general. Right, so this shouldn't be an issue as any such torn pages will have an FPI in the WAL that will be replayed as part of restoring that backup. Thanks, Stephen
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Thu, Oct 7, 2021 at 09:38:45PM +0300, Ants Aasma wrote: > > On Wed, 6 Oct 2021 at 23:08, Bruce Momjian <bruce@momjian.us> wrote: > > > > Yes, I would prefer we don't use the LSN. I only mentioned it since > > Ants Aasma mentioned LSN use above. > > > > > > Is there a particular reason why you would prefer not to use LSN? I suggested > > it because in my view having a variable tweak is still better than not having > > it even if we deem the risks of XTS tweak reuse not important for our use case. > > The comment was made under the assumption that requiring wal_log_hints for > > encryption is acceptable. > > Well, using the LSN means we have to store the LSN unencrypted, and that > means we have to carve out a 16-byte block on the page that is not > encrypted. With XTS this isn't actually the case though, is it..? Part of the point of XTS is that the last block doesn't have to be a full 16 bytes. What you're saying is true for XEX, but that's also why XEX isn't used for FDE in a lot of cases, because disk sectors aren't typically divisible by 16. https://en.wikipedia.org/wiki/Disk_encryption_theory Assuming that's correct, and I don't see any reason to doubt it, then perhaps it would make sense to have the LSN be unencrypted and include it in the tweak as that would limit the risk from re-use of the same tweak over time. Thanks, Stephen
On Thu, Oct 7, 2021 at 02:44:43PM -0400, Robert Haas wrote: > > I think this whole discussion is about whether we need full page images > > for hint bit changes. I think we do if we use the LSN for the nonce (in > > the old patch), and probably need it for hint bit changes when using > > block cipher modes (XTS) if we feel basebackup could read only part of a > > 16-byte page change. > > I think all the encryption modes that we're still considering have the > (very desirable) property that changing a single bit of the > unencrypted page perturbs the entire output. But that just means that Well, XTS perturbs the 16-byte block, while CBC changes the rest of the page. > encrypted clusters will have to run in the same mode as clusters with > checksums, or clusters with wal_log_hints=on, features which the > community has already accepted as having reasonable overhead. I have > in the past expressed skepticism about whether that overhead is really > small enough to be considered acceptable, but if I recall correctly, > the test results posted to the list suggest that you need a working > set just a little bit larger than shared_buffers to make it really > sting. And that's not a super-common thing to do. Anyway, if people > aren't screaming about the overhead of that system now, they're not > likely to complain about applying it to some new situation either. Yes, agreed, good conclusions. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Oct 7, 2021 at 1:09 PM Bruce Momjian <bruce@momjian.us> wrote: > > Are you saying a base backup could read a page from the file system and > > see a partial write, even though the write is written as 8k? I had not > > thought about that. > > Yes; see my other response. Yes, that is something that has been seen before. > > I think this whole discussion is about whether we need full page images > > for hint bit changes. I think we do if we use the LSN for the nonce (in > > the old patch), and probably need it for hint bit changes when using > > block cipher modes (XTS) if we feel basebackup could read only part of a > > 16-byte page change. > > I think all the encryption modes that we're still considering have the > (very desirable) property that changing a single bit of the > unencrypted page perturbs the entire output. But that just means that > encrypted clusters will have to run in the same mode as clusters with > checksums, or clusters with wal_log_hints=on, features which the > community has already accepted as having reasonable overhead. I have > in the past expressed skepticism about whether that overhead is really > small enough to be considered acceptable, but if I recall correctly, > the test results posted to the list suggest that you need a working > set just a little bit larger than shared_buffers to make it really > sting. And that's not a super-common thing to do. Anyway, if people > aren't screaming about the overhead of that system now, they're not > likely to complain about applying it to some new situation either. Agreed. Thanks, Stephen
On Thu, 7 Oct 2021 at 21:52, Stephen Frost <sfrost@snowman.net> wrote:
> With XTS this isn't actually the case though, is it..? Part of the
> point of XTS is that the last block doesn't have to be a full 16 bytes.
> What you're saying is true for XEX, but that's also why XEX isn't used
> for FDE in a lot of cases, because disk sectors aren't typically
> divisible by 16.
>
> https://en.wikipedia.org/wiki/Disk_encryption_theory
>
> Assuming that's correct, and I don't see any reason to doubt it, then
> perhaps it would make sense to have the LSN be unencrypted and include
> it in the tweak as that would limit the risk from re-use of the same
> tweak over time.
Right, my thought was to leave the first 8 bytes of pages, the LSN, unencrypted and include the value in the tweak. Just tested that OpenSSL aes-256-xts handles non-multiple-of-16 messages just fine.
Ants Aasma Senior Database Engineer www.cybertec-postgresql.com
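A test along those lines might look like the sketch below (a reconstruction, not Ants' actual test program). AES-XTS uses ciphertext stealing, so any message of at least 16 bytes encrypts to exactly its own length; that matters here because an 8192-byte page minus an 8-byte unencrypted LSN leaves 8184 bytes, which is 511 full blocks plus a trailing 8 bytes.

    /* Sketch: AES-256-XTS on a 20-byte (non-multiple-of-16) message. */
    #include <stdio.h>
    #include <openssl/evp.h>

    int
    main(void)
    {
        unsigned char   key[64];
        unsigned char   tweak[16] = {0};
        unsigned char   pt[20] = "twenty byte message";
        unsigned char   ct[20 + EVP_MAX_BLOCK_LENGTH];
        int             outlen = 0, i;
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

        for (i = 0; i < 64; i++)
            key[i] = (unsigned char) i;   /* demo key; the two halves must differ */

        EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak);
        /* XTS wants the whole message in a single update call */
        if (EVP_EncryptUpdate(ctx, ct, &outlen, pt, sizeof(pt)))
            printf("encrypted %d bytes from a %d-byte input\n",
                   outlen, (int) sizeof(pt));   /* prints: 20 and 20 */
        EVP_CIPHER_CTX_free(ctx);
        return 0;
    }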
On Thu, Oct 7, 2021 at 12:57 PM Stephen Frost <sfrost@snowman.net> wrote: > Yes, for integrity verification (also known as 'authenticated > encryption') we'd definitely need to store a larger nonce value. In the > very, very long term, I think it'd be great to have that, and the patch > proposed on this thread seems really cool as a way to get us there. OK. I'm not sure why that has to be relegated to the very, very long term, but I'm really very happy to hear that you think the approach is cool. > Having TDE, even without authenticated encryption, is certainly > valuable. Reducing the amount of engineering required to get there is > worthwhile. Implementing TDE w/ XTS or similar, provided we do agree > that we can do so with an IV that we don't need to find additional space > for, would avoid that page-level format change. I agree we should do > some research to make sure we at least have a reasonable answer to that > question. I've spent a bit of time on that and haven't gotten to a sure > answer one way or the other as yet, but will continue to look. I mean, I feel like this meeting that Bruce was talking about was perhaps making decisions in the wrong order. We have to decide which encryption mode is secure enough for our needs FIRST, and then AFTER that we can decide whether we need to store a nonce in the page. Now if it turns out that we can do either with or without a nonce in the page, then I'm just as happy as anyone else to start with the method that works without a nonce in the page, because like you say, that's less work. But unless we've established that such a method is actually going to survive scrutiny by smart cryptographers, we can't really decide that storing the nonce is off the table. And it doesn't seem like we've established that yet. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Oct 7, 2021 at 02:52:07PM -0400, Stephen Frost wrote: > > > Is there a particular reason why you would prefer not to use LSN? I suggested > > > it because in my view having a variable tweak is still better than not having > > > it even if we deem the risks of XTS tweak reuse not important for our use case. > > > The comment was made under the assumption that requiring wal_log_hints for > > > encryption is acceptable. > > > > Well, using the LSN means we have to store the LSN unencrypted, and that > > means we have to carve out a 16-byte block on the page that is not > > encrypted. > > With XTS this isn't actually the case though, is it..? Part of the > point of XTS is that the last block doesn't have to be a full 16 bytes. > What you're saying is true for XEX, but that's also why XEX isn't used > for FDE in a lot of cases, because disk sectors aren't typically > divisible by 16. Oh, I was not aware of that XTS feature. Nice. > https://en.wikipedia.org/wiki/Disk_encryption_theory > > Assuming that's correct, and I don't see any reason to doubt it, then > perhaps it would make sense to have the LSN be unencrypted and include > it in the tweak as that would limit the risk from re-use of the same > tweak over time. Yes, seems like a plan. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Thu, Oct 7, 2021 at 09:59:31PM +0300, Ants Aasma wrote: > On Thu, 7 Oct 2021 at 21:52, Stephen Frost <sfrost@snowman.net> wrote: > > With XTS this isn't actually the case though, is it..? Part of the > point of XTS is that the last block doesn't have to be a full 16 bytes. > What you're saying is true for XEX, but that's also why XEX isn't used > for FDE in a lot of cases, because disk sectors aren't typically > divisible by 16. > > https://en.wikipedia.org/wiki/Disk_encryption_theory > > Assuming that's correct, and I don't see any reason to doubt it, then > perhaps it would make sense to have the LSN be unencrypted and include > it in the tweak as that would limit the risk from re-use of the same > tweak over time. > > > Right, my thought was to leave the first 8 bytes of pages, the LSN, unencrypted > and include the value in the tweak. Just tested that OpenSSL aes-256-xts > handles non multiple-of-16 messages just fine. Great. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
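Putting the two ideas together, a hypothetical tweak layout for the LSN-plus-block-number scheme could be as simple as the following (the field layout and the make_lsn_tweak name are assumptions for illustration, not settled design):

    #include <stdint.h>
    #include <string.h>

    /*
     * Sketch: the page LSN (pd_lsn, the first 8 bytes of every page header)
     * stays unencrypted and is combined with the block number, so the tweak
     * changes whenever the page is WAL-logged again instead of staying fixed
     * for the life of the page.
     */
    static void
    make_lsn_tweak(unsigned char tweak[16], uint64_t page_lsn, uint32_t blkno)
    {
        memset(tweak, 0, 16);
        memcpy(tweak, &page_lsn, sizeof(page_lsn));    /* bytes 0-7  */
        memcpy(tweak + 8, &blkno, sizeof(blkno));      /* bytes 8-11 */
        /* bytes 12-15 left zero; could carry e.g. a fork number */
    }

Encryption would then cover bytes 8 through 8191 of the page, and the ciphertext stealing just demonstrated is what makes that 8184-byte, non-multiple-of-16 message workable.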
On Thu, Oct 7, 2021 at 2:52 PM Stephen Frost <sfrost@snowman.net> wrote: > Assuming that's correct, and I don't see any reason to doubt it, then > perhaps it would make sense to have the LSN be unencrypted and include > it in the tweak as that would limit the risk from re-use of the same > tweak over time. Talking about things like "limiting the risk" makes me super-nervous. Maybe we're all on the same page here, but just to make my assumptions explicit: I think we have to approach this feature with the idea in mind that there are going to be very smart people actively attacking any TDE implementation we ship. I expect that if you are lucky enough to get your hands on a PostgreSQL cluster's data files and they happen to be encrypted, your best option for handling that situation is not going to be attacking the encryption, but rather something like calling the person who has the password and pretending to be someone to whom they ought to disclose it. However, I also believe that PostgreSQL is a sufficiently high-profile project that security researchers will find it a tempting target. And if they manage to write a shell script or tool that breaks our encryption without too much difficulty, it will generate a ton of negative PR for the project. This will be especially true if the problem can't be fixed without re-engineering the whole thing, because we're not realistically going to be able to re-engineer the whole thing in a minor release, and thus will be saddled with the defective implementation for many years. Now none of that is to say that we shouldn't limit risk - I mean less risk is always better than more. But we need to be sure this is not like a 90% thing, where we're pretty sure it works. We can get by with that for a lot of things, but I think here we had better try extra-hard to make sure that we don't have any exposures. We probably will anyway, but at least if they're just bugs and not architectural deficiencies, we can hope to be able to patch them as they are discovered. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Oct 7, 2021 at 12:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Oct 7, 2021 at 2:52 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Assuming that's correct, and I don't see any reason to doubt it, then
> > perhaps it would make sense to have the LSN be unencrypted and include
> > it in the tweak as that would limit the risk from re-use of the same
> > tweak over time.
>
> Talking about things like "limiting the risk" makes me super-nervous.
> Maybe we're all on the same page here, but just to make my assumptions
> explicit: I think we have to approach this feature with the idea in
> mind that there are going to be very smart people actively attacking
> any TDE implementation we ship. I expect that if you are lucky enough
> to get your hands on a PostgreSQL cluster's data files and they happen
> to be encrypted, your best option for handling that situation is not
> going to be attacking the encryption, but rather something like
> calling the person who has the password and pretending to be someone
> to whom they ought to disclose it. However, I also believe that
> PostgreSQL is a sufficiently high-profile project that security
> researchers will find it a tempting target. And if they manage to
> write a shell script or tool that breaks our encryption without too
> much difficulty, it will generate a ton of negative PR for the
> project. This will be especially true if the problem can't be fixed
> without re-engineering the whole thing, because we're not
> realistically going to be able to re-engineer the whole thing in a
> minor release, and thus will be saddled with the defective
> implementation for many years.
>
> Now none of that is to say that we shouldn't limit risk - I mean less
> risk is always better than more. But we need to be sure this is not
> like a 90% thing, where we're pretty sure it works. We can get by with
> that for a lot of things, but I think here we had better try
> extra-hard to make sure that we don't have any exposures. We probably
> will anyway, but at least if they're just bugs and not architectural
> deficiencies, we can hope to be able to patch them as they are
> discovered.
Not at all knowledgeable on security topics (bravely using terms and recommendations), can we approach decisions like AES-XTS vs AES-GCM (which in turn decides whether we need to store a nonce or not) based on which compliance it can achieve or not? Like can using AES-XTS make it FIPS 140-2 compliant or not?
On Thu, Oct 7, 2021 at 3:31 PM Ashwin Agrawal <ashwinstar@gmail.com> wrote: > Not at all knowledgeable on security topics (bravely using terms and recommendations), can we approach decisions like AES-XTS vs AES-GCM (which in turn decides whether we need to store a nonce or not) based on which compliance it can achieve or not? Like can using AES-XTS make it FIPS 140-2 compliant or not? To the best of my knowledge, the encryption mode doesn't have much to do with whether such compliance can be achieved. The encryption algorithm could matter, but I assume everyone still thinks AES is acceptable. (We should assume that will eventually change.) The encryption mode is, at least as I understand it, more of an internal thing that you have to get right to avoid having people break your encryption and write papers about how they did it. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Oct 7, 2021 at 2:52 PM Stephen Frost <sfrost@snowman.net> wrote: > > Assuming that's correct, and I don't see any reason to doubt it, then > > perhaps it would make sense to have the LSN be unencrypted and include > > it in the tweak as that would limit the risk from re-use of the same > > tweak over time. > > Talking about things like "limiting the risk" makes me super-nervous. All of this is about limiting risks. :) > Maybe we're all on the same page here, but just to make my assumptions > explicit: I think we have to approach this feature with the idea in > mind that there are going to be very smart people actively attacking > any TDE implementation we ship. I expect that if you are lucky enough > to get your hands on a PostgreSQL cluster's data files and they happen > to be encrypted, your best option for handling that situation is not > going to be attacking the encryption, but rather something like > calling the person who has the password and pretending to be someone > to whom they ought to disclose it. However, I also believe that > PostgreSQL is a sufficiently high-profile project that security > researchers will find it a tempting target. And if they manage to > write a shell script or tool that breaks our encryption without too > much difficulty, it will generate a ton of negative PR for the > project. This will be especially true if the problem can't be fixed > without re-engineering the whole thing, because we're not > realistically going to be able to re-engineer the whole thing in a > minor release, and thus will be saddled with the defective > implementation for many years. While I certainly also appreciate that we want to get this as right as we possibly can from the start, I strongly suspect we'll have one of two reactions- either we'll be more-or-less ignored and it'll be crickets from the security folks, or we're going to get beat up by them for $reasons, almost regardless of what we actually do. Best bet to limit the risk ( ;) ) of the latter happening would be to try our best to do what existing solutions already do- such as by using XTS. There's things we can do to limit the risk of known-plaintext attacks, like simply not encrypting empty pages, or about possible known-IV risks, like using the LSN as part of the IV/tweak. Will we get everything? Probably not, but I don't think that we're going to really go wrong by using XTS as it's quite popularly used today and it's explicitly used for cases where you haven't got a place to store the extra nonce that you would need for AEAD encryption schemes. > Now none of that is to say that we shouldn't limit risk - I mean less > risk is always better than more. But we need to be sure this is not > like a 90% thing, where we're pretty sure it works. We can get by with > that for a lot of things, but I think here we had better try > extra-hard to make sure that we don't have any exposures. We probably > will anyway, but at least if they're just bugs and not architectural > deficiencies, we can hope to be able to patch them as they are > discovered. As long as we're clear that this initial version of TDE is with XTS then I really don't think we'll end up with anyone showing up and saying we screwed up by not generating a per-page nonce to store with it- the point of XTS is that you don't need that. Thanks, Stephen
On Thu, Oct 7, 2021 at 03:38:58PM -0400, Stephen Frost wrote: > > Now none of that is to say that we shouldn't limit risk - I mean less > > risk is always better than more. But we need to be sure this is not > > like a 90% thing, where we're pretty sure it works. We can get by with > > that for a lot of things, but I think here we had better try > > extra-hard to make sure that we don't have any exposures. We probably > > will anyway, but at least if they're just bugs and not architectural > > deficiencies, we can hope to be able to patch them as they are > > discovered. > > As long as we're clear that this initial version of TDE is with XTS then > I really don't think we'll end up with anyone showing up and saying we > screwed up by not generating a per-page nonce to store with it- the point > of XTS is that you don't need that. I am sold. ;-) -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Thu, Oct 7, 2021 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote: > While I certainly also appreciate that we want to get this as right as > we possibly can from the start, I strongly suspect we'll have one of two > reactions- either we'll be more-or-less ignored and it'll be crickets > from the security folks, or we're going to get beat up by them for > $reasons, almost regardless of what we actually do. Best bet to > limit the risk ( ;) ) of the latter happening would be to try our best > to do what existing solutions already do- such as by using XTS. > There's things we can do to limit the risk of known-plaintext attacks, > like simply not encrypting empty pages, or about possible known-IV > risks, like using the LSN as part of the IV/tweak. Will we get > everything? Probably not, but I don't think that we're going to really > go wrong by using XTS as it's quite popularly used today and it's > explicitly used for cases where you haven't got a place to store the > extra nonce that you would need for AEAD encryption schemes. I agree that using a popular approach is a good way to go. If we do what other people do, then hopefully our stuff won't be significantly more broken than their stuff, and whatever is can be fixed. > As long as we're clear that this initial version of TDE is with XTS then > I really don't think we'll end up with anyone showing up and saying we > screwed up by not generating a per-page nonce to store with it- the point > of XTS is that you don't need that. I agree that we shouldn't really catch flack for any weaknesses of the underlying algorithm: if XTS turns out to be insecure even when used properly, and we use it properly, the resulting weakness is somebody else's fault. On the other hand, if we use it improperly, that's our fault, so we need to be really sure that we understand what guarantees we need to provide from our end, and that we are providing them. Like if we pick an encryption mode that requires nonces to be unique, we will be at fault if they aren't; if it requires nonces to be unpredictable, we will be at fault if they aren't; and so on. So that's what is making me nervous here ... it doesn't seem likely we have complete unanimity about whether XTS is the right thing, though that does seem to be the majority position certainly, and it is not really clear to me that any of us can speak with authority about what the requirements are around the nonces in particular. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Oct 7, 2021 at 3:31 PM Ashwin Agrawal <ashwinstar@gmail.com> wrote: > > Not at all knowledgeable on security topics (bravely using terms and recommendation), can we approach decisions likeAES-XTS vs AES-GCM (which in turn decides whether we need to store nonce or not) based on which compliance it can achieveor not. Like can using AES-XTS make it FIPS 140-2 compliant or not? > > To the best of my knowledge, the encryption mode doesn't have much to > do with whether such compliance can be achieved. The encryption > algorithm could matter, but I assume everyone still thinks AES is > acceptable. (We should assume that will eventually change.) The > encryption mode is, at least as I understand, more of an internal > thing that you have to get right to avoid having people break your > encryption and write papers about how they did it. The issue regarding FIPS 140-2 specifically is actually about the encryption used (AES-XTS is approved) *and* about the actual library which is doing the encryption, which isn't really anything to do with us but rather is OpenSSL (or perhaps NSS if we can get that finished and included), or maybe some third party that implements one of those APIs that you decide to use (of which there's a few, some of which have FIPS 140-2 certification). So, can you have a FIPS 140-2 compliant system with AES-XTS? Yes, as it's approved: https://csrc.nist.gov/csrc/media/projects/cryptographic-module-validation-program/documents/fips140-2/fips1402ig.pdf Will your system be FIPS 140-2 certified? That's a big "it depends" and will involve you actually taking your fully built system through a testing lab to get it certified. I certainly don't think we can make any promises that taking it through such a test would be successful the first time around, or even ever. First step though would be to get something implemented so that $someone can try and can provide feedback. Thanks, Stephen
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Oct 7, 2021 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote: > > While I certainly also appreciate that we want to get this as right as > > we possibly can from the start, I strongly suspect we'll have one of two > > reactions- either we'll be more-or-less ignored and it'll be crickets > > from the security folks, or we're going to get beat up by them for > > $reasons, almost regardless of what we actually do. Best bet to > > limit the risk ( ;) ) of the latter happening would be to try our best > > to do what existing solutions already do- such as by using XTS. > > There's things we can do to limit the risk of known-plaintext attacks, > > like simply not encrypting empty pages, or about possible known-IV > > risks, like using the LSN as part of the IV/tweak. Will we get > > everything? Probably not, but I don't think that we're going to really > > go wrong by using XTS as it's quite popularly used today and it's > > explicitly used for cases where you haven't got a place to store the > > extra nonce that you would need for AEAD encryption schemes. > > I agree that using a popular approach is a good way to go. If we do > what other people do, then hopefully our stuff won't be significantly > more broken than their stuff, and whatever is can be fixed. Right. > > As long as we're clear that this initial version of TDE is with XTS then > > I really don't think we'll end up with anyone showing up and saying we > > screwed up by not generating a per-page nonce to store with it- the point > > of XTS is that you don't need that. > > I agree that we shouldn't really catch flack for any weaknesses of the > underlying algorithm: if XTS turns out to be insecure even when used > properly, and we use it properly, the resulting weakness is somebody > else's fault. On the other hand, if we use it improperly, that's our > fault, so we need to be really sure that we understand what guarantees > we need to provide from our end, and that we are providing them. Like > if we pick an encryption mode that requires nonces to be unique, we > will be at fault if they aren't; if it requires nonces to be > unpredictable, we will be at fault if they aren't; and so on. Sure, I get that. Would be awesome if all these things were clearly documented somewhere but I've never been able to find it quite as explicitly laid out as one would like. > So that's what is making me nervous here ... it doesn't seem likely we > have complete unanimity about whether XTS is the right thing, though > that does seem to be the majority position certainly, and it is not > really clear to me that any of us can speak with authority about what > the requirements are around the nonces in particular. The authorities to look at, in my view anyway, are NIST publications. Following a bit more digging, I came across something which makes sense to me as intuitive but explains it in a way that might help everyone understand a bit better what's going on here: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-38G.pdf specifically: Appendix C: Tweaks Quoting a couple of paragraphs from that appendix: """ In general, if there is information that is available and statically associated with a plaintext, it is recommended to use that information as a tweak for the plaintext. Ideally, the non-secret tweak associated with a plaintext is associated only with that plaintext. Extensive tweaking means that fewer plaintexts are encrypted under any given tweak.
This corresponds, in the security model that is described in [1], to fewer queries to the target instance of the encryption. """ The gist of this being- the more diverse the tweaking being used, the better. That's where I was going with my "limit the risk" comment. If we can make the tweak vary more for a given encryption invocation, that's going to be better, pretty much by definition, and as explained in publications by NIST. That isn't to say that using the same tweak for the same block over and over "breaks" the encryption (unlike with CTR/GCM, where IV reuse leads directly to plaintext being recoverable), but it does mean that an observer who can see the block writes over time could see what parts are changing (and which aren't) and may be able to derive insight from that. Now, as I mentioned before, that particular case isn't something that XTS is particularly good at and that's generally accepted, yet lots of folks use XTS anyway because the concern isn't "someone has root access on the box and is watching all block writes" but rather "laptop was stolen" where the attacker doesn't get to see multiple writes where the same key+tweak has been used, and the latter is really the attack vector we're looking to address with XTS too. Thanks, Stephen
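To make the EVP mechanics concrete, here is a minimal sketch of encrypting one 8kB page with AES-256-XTS through OpenSSL. This is an illustration only, not code from any posted patch; the helper name and signature are hypothetical. The 64-byte key is the pair of AES-256 keys that XTS requires, and the 16-byte tweak is passed where other modes take their IV:

    #include <openssl/evp.h>

    #define PAGE_SIZE 8192

    /*
     * Hypothetical helper: encrypt one page with AES-256-XTS.  The caller
     * supplies the 64-byte XTS key (two AES-256 keys) and the 16-byte
     * per-page tweak; returns 1 on success, 0 on failure.
     */
    static int
    encrypt_page_xts(const unsigned char key[64],
                     const unsigned char tweak[16],
                     const unsigned char *plain,
                     unsigned char *cipher)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int         len,
                    ok;

        if (ctx == NULL)
            return 0;

        /* XTS takes the tweak through the IV parameter */
        ok = EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak) &&
             EVP_EncryptUpdate(ctx, cipher, &len, plain, PAGE_SIZE) &&
             EVP_EncryptFinal_ex(ctx, cipher + len, &len);

        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }

The more that tweak argument varies across pages and across writes, the fewer plaintexts are ever encrypted under any one key+tweak pair, which is exactly the property the NIST appendix quoted above recommends maximizing.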
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Oct 7, 2021 at 12:57 PM Stephen Frost <sfrost@snowman.net> wrote: > > Yes, for integrity verification (also known as 'authenticated > > encryption') we'd definitely need to store a larger nonce value. In the > > very, very long term, I think it'd be great to have that, and the patch > > proposed on this thread seems really cool as a way to get us there. > > OK. I'm not sure why that has to be relegated to the very, very long > term, but I'm really very happy to hear that you think the approach is > cool. Folks are shy about a page format change and I get that. > > Having TDE, even without authenticated encryption, is certainly > > valuable. Reducing the amount of engineering required to get there is > > worthwhile. Implementing TDE w/ XTS or similar, provided we do agree > > that we can do so with an IV that we don't need to find additional space > > for, would avoid that page-level format change. I agree we should do > > some research to make sure we at least have a reasonable answer to that > > question. I've spent a bit of time on that and haven't gotten to a sure > > answer one way or the other as yet, but will continue to look. > > I mean, I feel like this meeting that Bruce was talking about was > perhaps making decisions in the wrong order. We have to decide which > encryption mode is secure enough for our needs FIRST, and then AFTER > that we can decide whether we need to store a nonce in the page. Now > if it turns out that we can do either with or without a nonce in the > page, then I'm just as happy as anyone else to start with the method > that works without a nonce in the page, because like you say, that's > less work. But unless we've established that such a method is actually > going to survive scrutiny by smart cryptographers, we can't really > decide that storing the nonce is off the table. And it doesn't seem > like we've established that yet. Part of the meeting was specifically about "why are we doing this?" and there were a few different answers- first and foremost was "because people are asking for it", from which followed that, yes, in many cases it's to satisfy an audit or similar requirement which any of the proposed methods would address. There was further discussion that we could address *more* cases by providing something better, but the page format changes were weighed against that and the general consensus was that we should attack the simpler problem first and, potentially, gain a solution for 90% of the folks asking for it, and then later see if there's enough interest and desire to attack the remaining 10%. As such, it's just not so simple as "what is 'secure enough'" because it depends on who you're talking to. Based on the collective discussion at the meeting, XTS is 'secure enough' for the needs of probably 90% of those asking, while the other 10% want better (an AEAD method such as GCM or GCM-SIV). Therefore, what should we do? Spend all of the extra resources and engineering effort to address the 10% and maybe not get anything because of the level of difficulty, or go the simpler route first and get the 90%? Through that lens, the choice seemed reasonably clear, at least to me, hence why I agreed that we should work on an XTS-based approach first. (Admittedly, the overall discussion wasn't quite as specific as XTS vs. GCM-SIV, but the gist was "page format change" vs.
"no page format change" and that seems to equate, based on this subsequent discussion, to the choice between XTS and GCM/GCM-SIV.) Thanks! Stephen
Stephen Frost <sfrost@snowman.net> wrote: > Greetings, > > * Robert Haas (robertmhaas@gmail.com) wrote: > > On Thu, Oct 7, 2021 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote: > > > While I certainly also appreciate that we want to get this as right as > > > we possibly can from the start, I strongly suspect we'll have one of two > > > reactions- either we'll be more-or-less ignored and it'll be crickets > > > from the security folks, or we're going to get beat up by them for > > > $reasons, almost regardless of what we actually do. Best bet to > > > limit the risk ( ;) ) of the latter happening would be to try our best > > > to do what existing solutions already do- such as by using XTS. > > > There's things we can do to limit the risk of known-plaintext attacks, > > > like simply not encrypting empty pages, or about possible known-IV > > > risks, like using the LSN as part of the IV/tweak. Will we get > > > everything? Probably not, but I don't think that we're going to really > > > go wrong by using XTS as it's quite popularly used today and it's > > > explicitly used for cases where you haven't got a place to store the > > > extra nonce that you would need for AEAD encryption schemes. > > > > I agree that using a popular approach is a good way to go. If we do > > what other people do, then hopefully our stuff won't be significantly > > more broken than their stuff, and whatever is can be fixed. > > Right. > > > > As long as we're clear that this initial version of TDE is with XTS then > > > I really don't think we'll end up with anyone showing up and saying we > > > screwed up by not generating a per-page nonce to store with it- the point > > > of XTS is that you don't need that. > > > > I agree that we shouldn't really catch flack for any weaknesses of the > > underlying algorithm: if XTS turns out to be secure even when used > > properly, and we use it properly, the resulting weakness is somebody > > else's fault. On the other hand, if we use it improperly, that's our > > fault, so we need to be really sure that we understand what guarantees > > we need to provide from our end, and that we are providing them. Like > > if we pick an encryption mode that requires nonces to be unique, we > > will be at fault if they aren't; if it requires nonces to be > > unpredictable, we will be at fault if they aren't; and so on. > > Sure, I get that. Would be awesome if all these things were clearly > documented somewhere but I've never been able to find it quite as > explicitly laid out as one would like. > > > So that's what is making me nervous here ... it doesn't seem likely we > > have complete unanimity about whether XTS is the right thing, though > > that does seem to be the majority position certainly, and it is not > > really clear to me that any of us can speak with authority about what > > the requirements are around the nonces in particular. > > The authority to look at, in my view anyway, are NIST publications. > Following a bit more digging, I came across something which makes sense > to me as intuitive but explains it in a way that might help everyone > understand a bit better what's going on here: > > https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-38G.pdf > > specifically: Appendix C: Tweaks > > Quoting a couple of paragraphs from that appendix: > > """ > In general, if there is information that is available and statically > associated with a plaintext, it is recommended to use that information > as a tweak for the plaintext. 
Ideally, the non-secret tweak associated > with a plaintext is associated only with that plaintext. > > Extensive tweaking means that fewer plaintexts are encrypted under any > given tweak. This corresponds, in the security model that is described > in [1], to fewer queries to the target instance of the encryption. > """ > > The gist of this being- the more diverse the tweaking being used, the > better. That's where I was going with my "limit the risk" comment. If > we can make the tweak vary more for a given encryption invocation, > that's going to be better, pretty much by definition, and as explained > in publications by NIST. > > That isn't to say that using the same tweak for the same block over and > over "breaks" the encryption (unlike with CTR/GCM, where IV reuse leads > directly to plaintext being recoverable), but it does mean that an > observer who can see the block writes over time could see what parts are > changing (and which aren't) and may be able to derive insight from that. This reminds me of Joe Conway's response to my email earlier: https://www.postgresql.org/message-id/50335f56-041b-1a1f-59ea-b5f7bf917352%40joeconway.com In the document he recommended https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf specifically, in the Appendix C I read: """ For the CBC and CFB modes, the IVs must be unpredictable. In particular, for any given plaintext, it must not be possible to predict the IV that will be associated to the plaintext in advance of the generation of the IV. There are two recommended methods for generating unpredictable IVs. The first method is to apply the forward cipher function, under the same key that is used for the encryption of the plaintext, to a nonce. The nonce must be a data block that is unique to each execution of the encryption operation. For example, the nonce may be a counter, as described in Appendix B, or a message number. The second method is to generate a random data block using a FIPS- approved random number generator. """ This is about modes that include CBC, while the document you refer to seems to deal with some other modes. So if we want to be confident that we use the XTS mode correctly, more research is probably needed. > Now, as I mentioned before, that particular case isn't something that > XTS is particularly good at and that's generally accepted, yet lots of > folks use XTS anyway because the concern isn't "someone has root access > on the box and is watching all block writes" but rather "laptop was > stolen" where the attacker doesn't get to see multiple writes where the > same key+tweak has been used, and the latter is really the attack vector > we're looking to address with XTS too. I've heard a few times that a database running in a cloud is also a valid use case for TDE. In that case I think it should be expected that "someone has root access on the box and is watching all block writes". -- Antonin Houska Web: https://www.cybertec-postgresql.com
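For readers who want to see what the two recommended methods look like in practice, here is a short sketch against OpenSSL. It is illustration only; the helper names are hypothetical and nothing here comes from a posted patch:

    #include <openssl/evp.h>
    #include <openssl/rand.h>
    #include <string.h>

    /*
     * Method 1 from SP 800-38A Appendix C: derive the IV by applying the
     * forward cipher function, under the encryption key, to a nonce that
     * is unique per encryption operation (here, a counter).
     */
    static int
    iv_from_nonce(const unsigned char key[16], unsigned long long counter,
                  unsigned char iv[16])
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        unsigned char nonce[16] = {0};
        int         len,
                    ok;

        memcpy(nonce, &counter, sizeof(counter));   /* unique nonce block */
        ok = ctx != NULL &&
             EVP_EncryptInit_ex(ctx, EVP_aes_128_ecb(), NULL, key, NULL) &&
             EVP_EncryptUpdate(ctx, iv, &len, nonce, 16); /* one AES block */
        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }

    /* Method 2: a random block from an approved random number generator. */
    static int
    iv_from_rng(unsigned char iv[16])
    {
        return RAND_bytes(iv, 16) == 1;
    }

The connection to XTS is that its tweak is likewise run through the forward cipher before being used, which is the point made in the reply that follows.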
Greetings, * Antonin Houska (ah@cybertec.at) wrote: > Stephen Frost <sfrost@snowman.net> wrote: > > * Robert Haas (robertmhaas@gmail.com) wrote: > > > On Thu, Oct 7, 2021 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote: > > > > While I certainly also appreciate that we want to get this as right as > > > > we possibly can from the start, I strongly suspect we'll have one of two > > > > reactions- either we'll be more-or-less ignored and it'll be crickets > > > > from the security folks, or we're going to get beat up by them for > > > > $reasons, almost regardless of what we actually do. Best bet to > > > > limit the risk ( ;) ) of the latter happening would be to try our best > > > > to do what existing solutions already do- such as by using XTS. > > > > There's things we can do to limit the risk of known-plaintext attacks, > > > > like simply not encrypting empty pages, or about possible known-IV > > > > risks, like using the LSN as part of the IV/tweak. Will we get > > > > everything? Probably not, but I don't think that we're going to really > > > > go wrong by using XTS as it's quite popularly used today and it's > > > > explicitly used for cases where you haven't got a place to store the > > > > extra nonce that you would need for AEAD encryption schemes. > > > > > > I agree that using a popular approach is a good way to go. If we do > > > what other people do, then hopefully our stuff won't be significantly > > > more broken than their stuff, and whatever is can be fixed. > > > > Right. > > > > > > As long as we're clear that this initial version of TDE is with XTS then > > > > I really don't think we'll end up with anyone showing up and saying we > > > > screwed up by not generating a per-page nonce to store with it- the point > > > > of XTS is that you don't need that. > > > > > > I agree that we shouldn't really catch flack for any weaknesses of the > > > underlying algorithm: if XTS turns out to be secure even when used > > > properly, and we use it properly, the resulting weakness is somebody > > > else's fault. On the other hand, if we use it improperly, that's our > > > fault, so we need to be really sure that we understand what guarantees > > > we need to provide from our end, and that we are providing them. Like > > > if we pick an encryption mode that requires nonces to be unique, we > > > will be at fault if they aren't; if it requires nonces to be > > > unpredictable, we will be at fault if they aren't; and so on. > > > > Sure, I get that. Would be awesome if all these things were clearly > > documented somewhere but I've never been able to find it quite as > > explicitly laid out as one would like. > > > > > So that's what is making me nervous here ... it doesn't seem likely we > > > have complete unanimity about whether XTS is the right thing, though > > > that does seem to be the majority position certainly, and it is not > > > really clear to me that any of us can speak with authority about what > > > the requirements are around the nonces in particular. > > > > The authority to look at, in my view anyway, are NIST publications. 
> > Following a bit more digging, I came across something which makes sense > > to me as intuitive but explains it in a way that might help everyone > > understand a bit better what's going on here: > > > > https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-38G.pdf > > > > specifically: Appendix C: Tweaks > > > > Quoting a couple of paragraphs from that appendix: > > > > """ > > In general, if there is information that is available and statically > > associated with a plaintext, it is recommended to use that information > > as a tweak for the plaintext. Ideally, the non-secret tweak associated > > with a plaintext is associated only with that plaintext. > > > > Extensive tweaking means that fewer plaintexts are encrypted under any > > given tweak. This corresponds, in the security model that is described > > in [1], to fewer queries to the target instance of the encryption. > > """ > > > > The gist of this being- the more diverse the tweaking being used, the > > better. That's where I was going with my "limit the risk" comment. If > > we can make the tweak vary more for a given encryption invokation, > > that's going to be better, pretty much by definition, and as explained > > in publications by NIST. > > > > That isn't to say that using the same tweak for the same block over and > > over "breaks" the encryption (unlike with CTR/GCM, where IV reuse leads > > directly to plaintext being recoverable), but it does mean that an > > observer who can see the block writes over time could see what parts are > > changing (and which aren't) and may be able to derive insight from that. > > This reminds me of Joe Conway's response to me email earlier: > > https://www.postgresql.org/message-id/50335f56-041b-1a1f-59ea-b5f7bf917352%40joeconway.com > > In the document he recommended > > https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf > > specifically, in the Appendix C I read: > > """ > For the CBC and CFB modes, the IVs must be unpredictable. In particular, for > any given plaintext, it must not be possible to predict the IV that will be > associated to the plaintext in advance of the generation of the IV. > > There are two recommended methods for generating unpredictable IVs. The first > method is to apply the forward cipher function, under the same key that is > used for the encryption of the plaintext, to a nonce. The nonce must be a > data block that is unique to each execution of the encryption operation. For > example, the nonce may be a counter, as described in Appendix B, or a message > number. The second method is to generate a random data block using a FIPS- > approved random number generator. > """ > > This is about modes that include CBC, while the documend you refer to seems to > deal with some other modes. So if we want to be confident that we use the XTS > mode correctly, more research is probably needed. What I think is missing from this discussion is the fact that, with XTS (and XEX, on which XTS is built), the IV *is* run through a forward cipher function, just as suggested above needs to be done for CBC. I don't see any reason to doubt that OpenSSL is correctly doing that. This article shows this pretty clearly: https://en.wikipedia.org/wiki/Disk_encryption_theory I don't think that changes the fact that, if we're able to, we should be varying the tweak/IV as often as we can, and including the LSN seems like a good way to do just that. 
Now, all that said, I'm all for looking at what others do to inform us as to the right way to go about things and the above article lists a number of users of XTS which we could go look at: XTS is supported by BestCrypt, Botan, NetBSD's cgd,[13] dm-crypt, FreeOTFE, TrueCrypt, VeraCrypt,[14] DiskCryptor, FreeBSD's geli, OpenBSD softraid disk encryption software, OpenSSL, Mac OS X Lion's FileVault 2, Windows 10's BitLocker[15] and wolfCrypt. > > Now, as I mentioned before, that particular case isn't something that > > XTS is particularly good at and that's generally accepted, yet lots of > > folks use XTS anyway because the concern isn't "someone has root access > > on the box and is watching all block writes" but rather "laptop was > > stolen" where the attacker doesn't get to see multiple writes where the > > same key+tweak has been used, and the latter is really the attack vector > > we're looking to address with XTS too. > > I've heard a few times that database running in a cloud is also a valid use > case for the TDE. In that case I think it should be expected that "someone has > root access on the box and is watching all block writes". Except that it isn't. If you're using someone else's computer, they're going to be able to look into shared buffers at tons of unencrypted data, including the keys to decrypt everything. That doesn't mean we shouldn't try to be good about using a different IV to make it harder on someone who has somehow gotten access to watch the writes go by, but TDE isn't a solution to protect someone from their cloud provider gaining access to their data. Thanks, Stephen
On 2021/10/6 23:01, Robert Haas wrote: > This seems wrong to me. CTR requires that you not reuse the IV. If you > re-encrypt the page with a different IV, torn pages are a problem. If > you re-encrypt it with the same IV, then it's not secure any more. For CBC, a predictable IV enables a "dictionary attack", and for CBC and GCM, reusing an IV enables a "known plaintext attack" (a short demo of the CTR/GCM case appears after this message). XTS works like CBC but adds a tweak step. The tweak step does not add randomness, which means XTS is still subject to a "known plaintext attack", for the same reason as CBC. Many earlier mails in this thread explain this clearly; I am just repeating. :> On 2021/10/7 22:28, Robert Haas wrote: > I'm a little concerned by the email from "Sasasu" saying that even in > XTS reusing the IV is not cryptographically weak. I don't know enough > about these different encryption modes to know if he's right, but if > he is then perhaps we need to consider his suggestion of using > AES-GCM. Or, uh, something else. A cryptographic algorithm always lists the attack methods in its scope; if the algorithm can defend against those attacks, the algorithm is good. If software using the algorithm is attacked by a method not on that list, then the software is using the algorithm incorrectly, or should not be using this algorithm at all. On 2021/10/8 03:38, Stephen Frost wrote: > I strongly suspect we'll have one of two > reactions- either we'll be more-or-less ignored and it'll be crickets > from the security folks, or we're going to get beat up by them for > $reasons, almost regardless of what we actually do. Best bet to > limit the risk (;) ) of the latter happening would be to try our best > to do what existing solutions already do- such as by using XTS. If you use an existing, well-regarded algorithm outside its design scope, cryptographers will laugh at you. On 2021/10/9 02:34, Stephen Frost wrote: > Greetings, > > * Antonin Houska (ah@cybertec.at) wrote: >> Stephen Frost <sfrost@snowman.net> wrote: >>> * Robert Haas (robertmhaas@gmail.com) wrote: >>>> On Thu, Oct 7, 2021 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote: >>>>> While I certainly also appreciate that we want to get this as right as >>>>> we possibly can from the start, I strongly suspect we'll have one of two >>>>> reactions- either we'll be more-or-less ignored and it'll be crickets >>>>> from the security folks, or we're going to get beat up by them for >>>>> $reasons, almost regardless of what we actually do. Best bet to >>>>> limit the risk ( ;) ) of the latter happening would be to try our best >>>>> to do what existing solutions already do- such as by using XTS. >>>>> There's things we can do to limit the risk of known-plaintext attacks, >>>>> like simply not encrypting empty pages, or about possible known-IV >>>>> risks, like using the LSN as part of the IV/tweak. Will we get >>>>> everything? Probably not, but I don't think that we're going to really >>>>> go wrong by using XTS as it's quite popularly used today and it's >>>>> explicitly used for cases where you haven't got a place to store the >>>>> extra nonce that you would need for AEAD encryption schemes. >>>> >>>> I agree that using a popular approach is a good way to go. If we do >>>> what other people do, then hopefully our stuff won't be significantly >>>> more broken than their stuff, and whatever is can be fixed. >>> >>> Right. 
>>> >>>>> As long as we're clear that this initial version of TDE is with XTS then >>>>> I really don't think we'll end up with anyone showing up and saying we >>>>> screwed up by not generating a per-page nonce to store with it- the point >>>>> of XTS is that you don't need that. >>>> >>>> I agree that we shouldn't really catch flack for any weaknesses of the >>>> underlying algorithm: if XTS turns out to be secure even when used >>>> properly, and we use it properly, the resulting weakness is somebody >>>> else's fault. On the other hand, if we use it improperly, that's our >>>> fault, so we need to be really sure that we understand what guarantees >>>> we need to provide from our end, and that we are providing them. Like >>>> if we pick an encryption mode that requires nonces to be unique, we >>>> will be at fault if they aren't; if it requires nonces to be >>>> unpredictable, we will be at fault if they aren't; and so on. >>> >>> Sure, I get that. Would be awesome if all these things were clearly >>> documented somewhere but I've never been able to find it quite as >>> explicitly laid out as one would like. >>> >>>> So that's what is making me nervous here ... it doesn't seem likely we >>>> have complete unanimity about whether XTS is the right thing, though >>>> that does seem to be the majority position certainly, and it is not >>>> really clear to me that any of us can speak with authority about what >>>> the requirements are around the nonces in particular. >>> >>> The authority to look at, in my view anyway, are NIST publications. >>> Following a bit more digging, I came across something which makes sense >>> to me as intuitive but explains it in a way that might help everyone >>> understand a bit better what's going on here: >>> >>> https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-38G.pdf >>> >>> specifically: Appendix C: Tweaks >>> >>> Quoting a couple of paragraphs from that appendix: >>> >>> """ >>> In general, if there is information that is available and statically >>> associated with a plaintext, it is recommended to use that information >>> as a tweak for the plaintext. Ideally, the non-secret tweak associated >>> with a plaintext is associated only with that plaintext. >>> >>> Extensive tweaking means that fewer plaintexts are encrypted under any >>> given tweak. This corresponds, in the security model that is described >>> in [1], to fewer queries to the target instance of the encryption. >>> """ >>> >>> The gist of this being- the more diverse the tweaking being used, the >>> better. That's where I was going with my "limit the risk" comment. If >>> we can make the tweak vary more for a given encryption invokation, >>> that's going to be better, pretty much by definition, and as explained >>> in publications by NIST. >>> >>> That isn't to say that using the same tweak for the same block over and >>> over "breaks" the encryption (unlike with CTR/GCM, where IV reuse leads >>> directly to plaintext being recoverable), but it does mean that an >>> observer who can see the block writes over time could see what parts are >>> changing (and which aren't) and may be able to derive insight from that. 
>> >> This reminds me of Joe Conway's response to me email earlier: >> >> https://www.postgresql.org/message-id/50335f56-041b-1a1f-59ea-b5f7bf917352%40joeconway.com >> >> In the document he recommended >> >> https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf >> >> specifically, in the Appendix C I read: >> >> """ >> For the CBC and CFB modes, the IVs must be unpredictable. In particular, for >> any given plaintext, it must not be possible to predict the IV that will be >> associated to the plaintext in advance of the generation of the IV. >> >> There are two recommended methods for generating unpredictable IVs. The first >> method is to apply the forward cipher function, under the same key that is >> used for the encryption of the plaintext, to a nonce. The nonce must be a >> data block that is unique to each execution of the encryption operation. For >> example, the nonce may be a counter, as described in Appendix B, or a message >> number. The second method is to generate a random data block using a FIPS- >> approved random number generator. >> """ >> >> This is about modes that include CBC, while the documend you refer to seems to >> deal with some other modes. So if we want to be confident that we use the XTS >> mode correctly, more research is probably needed. > > What I think is missing from this discussion is the fact that, with XTS > (and XEX, on which XTS is built), the IV *is* run through a forward > cipher function, just as suggested above needs to be done for CBC. I > don't see any reason to doubt that OpenSSL is correctly doing that. > > This article shows this pretty clearly: > > https://en.wikipedia.org/wiki/Disk_encryption_theory > > I don't think that changes the fact that, if we're able to, we should be > varying the tweak/IV as often as we can, and including the LSN seems > like a good way to do just that. > > Now, all that said, I'm all for looking at what others do to inform us > as to the right way to go about things and the above article lists a > number of users of XTS which we could go look at: > > XTS is supported by BestCrypt, Botan, NetBSD's cgd,[13] dm-crypt, > FreeOTFE, TrueCrypt, VeraCrypt,[14] DiskCryptor, FreeBSD's geli, OpenBSD > softraid disk encryption software, OpenSSL, Mac OS X Lion's FileVault 2, > Windows 10's BitLocker[15] and wolfCrypt. > >>> Now, as I mentioned before, that particular case isn't something that >>> XTS is particularly good at and that's generally accepted, yet lots of >>> folks use XTS anyway because the concern isn't "someone has root access >>> on the box and is watching all block writes" but rather "laptop was >>> stolen" where the attacker doesn't get to see multiple writes where the >>> same key+tweak has been used, and the latter is really the attack vector >>> we're looking to address with XTS too. >> >> I've heard a few times that database running in a cloud is also a valid use >> case for the TDE. In that case I think it should be expected that "someone has >> root access on the box and is watching all block writes". > > Except that it isn't. If you're using someone else's computer, they're > going to be able to look into shared buffers at tons of unencrypted > data, including the keys to decrypt everything. That doesn't mean we > shouldn't try to be good about using a different IV to make it harder on > someone who has somehow gotten access to watch the writes go by, but TDE > isn't a solution to protect someone from their cloud provider gaining > access to their data. > > Thanks, > > Stephen >
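Sasasu's point above about IV reuse in CTR-based modes (GCM encrypts with CTR internally) can be demonstrated in a few lines: with the same key and IV the keystream repeats, so XORing the two ciphertexts cancels it and leaks the XOR of the plaintexts. A self-contained demo, purely illustrative:

    #include <openssl/evp.h>
    #include <stdio.h>

    /* Encrypt len bytes with AES-128-CTR under the given key and IV. */
    static void
    ctr_encrypt(const unsigned char *key, const unsigned char *iv,
                const unsigned char *in, unsigned char *out, int len)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int         outl;

        EVP_EncryptInit_ex(ctx, EVP_aes_128_ctr(), NULL, key, iv);
        EVP_EncryptUpdate(ctx, out, &outl, in, len);
        EVP_CIPHER_CTX_free(ctx);
    }

    int
    main(void)
    {
        /* 16 key bytes; the literal's NUL does not fit and is dropped */
        unsigned char key[16] = "0123456789abcdef";
        unsigned char iv[16] = {0};     /* the same IV, reused: the bug */
        unsigned char p1[16] = "attack at dawn!";
        unsigned char p2[16] = "retreat at noon";
        unsigned char c1[16],
                    c2[16];

        ctr_encrypt(key, iv, p1, c1, 16);
        ctr_encrypt(key, iv, p2, c2, 16);

        /* c1 ^ c2 == p1 ^ p2, so knowing p2 reveals p1 byte for byte */
        for (int i = 0; i < 16; i++)
            printf("%02x", c1[i] ^ c2[i] ^ p2[i]);  /* prints p1 in hex */
        printf("\n");
        return 0;
    }

XTS has no equivalent catastrophic failure on tweak reuse; the cost there is the block-level traffic analysis discussed earlier in the thread.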
On Thu, Oct 7, 2021 at 11:32:07PM -0400, Stephen Frost wrote: > Part of the meeting was specifically about "why are we doing this?" and > there were a few different answers- first and foremost was "because > people are asking for it", from which followed that, yes, in many cases > it's to satisfy an audit or similar requirement which any of the > proposed methods would address. There was further discussion that we Yes, Cybertec's experience with their TDE patch's adoption supported this. > could address *more* cases by providing something better, but the page > format changes were weighed against that and the general consensus was > that we should attack the simpler problem first and, potentially, gain > a solution for 90% of the folks asking for it, and then later see if > there's enough interest and desire to attack the remaining 10%. It is more than just the page format --- it would also be the added code, possible performance impact, and later code maintenance to allow for a more complex page format or two different page formats. As an example, I think the online checksum patch failed because it wasn't happy with that 90% and went for the extra 10% of restartability, but once you saw the 100% solution, the patch was too big and was rejected. > As such, it's just not so simple as "what is 'secure enough'" because it > depends on who you're talking to. Based on the collective discussion at > the meeting, XTS is 'secure enough' for the needs of probably 90% of > those asking, while the other 10% want better (an AEAD method such as > GCM or GCM-SIV). Therefore, what should we do? Spend all of the extra > resources and engineering effort to address the 10% and maybe not get > anything because of the level of difficulty, or go the simpler route > first and get the 90%? Through that lens, the choice seemed reasonably > clear, at least to me, hence why I agreed that we should work on an XTS > based approach first. Yes, that was the conclusion. I think it helped to have the discussion verbally with everyone hearing every word, rather than via email where people jump into the discussion not hearing earlier points. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Thu, Oct 7, 2021 at 11:32:07PM -0400, Stephen Frost wrote: > > Part of the meeting was specifically about "why are we doing this?" and > > there were a few different answers- first and foremost was "because > > people are asking for it", from which followed that, yes, in many cases > > it's to satisfy an audit or similar requirement which any of the > > proposed methods would address. There was further discussion that we > > Yes, Cybertec's experience with their TDE patch's adoption supported > this. > > > could address *more* cases by providing something better, but the page > > format changes were weighed against that and the general consensus was > > that we should attack the simpler problem first and, potentially, gain > > a solution for 90% of the folks asking for it, and then later see if > > there's enough interest and desire to attack the remaining 10%. > > It is more than just the page format --- it would also be the added > code, possible performance impact, and later code maintenance to allow > for are a more complex or two different page formats. Yes, there is more to it than just the page format, I agree. I'm still of the mind that it's something we're going to get to eventually, if for no other reason than that our current page format is certainly not perfect and it'd be pretty awesome if we could make improvements to it (independently of TDE or anything else discussed currently). > As an example, I think the online checksum patch failed because it > wasn't happy with that 90% and went for the extra 10% of restartability, > but once you saw the 100% solution, the patch was too big and was > rejected. I'm, at least, still hopeful that we get the online checksum patch done. I'm not sure that I agree that this was 'the' reason it didn't make it in, but I don't think it'd be helpful to tangent this thread to discussing some other patch. > > As such, it's just not so simple as "what is 'secure enough'" because it > > depends on who you're talking to. Based on the collective discussion at > > the meeting, XTS is 'secure enough' for the needs of probably 90% of > > those asking, while the other 10% want better (an AEAD method such as > > GCM or GCM-SIV). Therefore, what should we do? Spend all of the extra > > resources and engineering effort to address the 10% and maybe not get > > anything because of the level of difficulty, or go the simpler route > > first and get the 90%? Through that lense, the choice seemed reasonably > > clear, at least to me, hence why I agreed that we should work on an XTS > > based approach first. > > Yes, that was the conclusion. I think it helped to have the discussion > verbally with everyone hearing every word, rather than via email where > people jump into the discussion not hearing earlier points. Yes, agreed. Certainly am hopeful that we are able to have more of those in the (relatively) near future too! Thanks! Stephen
On Fri, Oct 8, 2021 at 02:34:20PM -0400, Stephen Frost wrote: > What I think is missing from this discussion is the fact that, with XTS > (and XEX, on which XTS is built), the IV *is* run through a forward > cipher function, just as suggested above needs to be done for CBC. I > don't see any reason to doubt that OpenSSL is correctly doing that. > > This article shows this pretty clearly: > > https://en.wikipedia.org/wiki/Disk_encryption_theory > > I don't think that changes the fact that, if we're able to, we should be > varying the tweak/IV as often as we can, and including the LSN seems > like a good way to do just that. Keep in mind that in our existing code (not my patch), the LSN is zero for unlogged relations, a fixed value for some GiST index pages, and unchanged for some hint bit changes. Therefore, while we can include the LSN in the IV because it _might_ help, we can't rely on it. We probably need to have a discussion about whether LSN and checksum should be encrypted on the page. I think we are currently leaning to no encryption for LSN because we can use it as part of the nonce (where it is variable) and encrypting the checksum for rudimentary integrity checking. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Mon, Oct 11, 2021 at 01:01:08PM -0400, Stephen Frost wrote: > > It is more than just the page format --- it would also be the added > > code, possible performance impact, and later code maintenance to allow > > for are a more complex or two different page formats. > > Yes, there is more to it than just the page format, I agree. I'm still > of the mind that it's something we're going to get to eventually, if for > no other reason than that our current page format is certainly not > perfect and it'd be pretty awesome if we could make improvements to it > (independently of TDE or anything else discussed currently). Yes, 100% agree on that. The good part is that TDE would not be paying the cost for that. ;-) -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Fri, Oct 8, 2021 at 02:34:20PM -0400, Stephen Frost wrote: > > What I think is missing from this discussion is the fact that, with XTS > > (and XEX, on which XTS is built), the IV *is* run through a forward > > cipher function, just as suggested above needs to be done for CBC. I > > don't see any reason to doubt that OpenSSL is correctly doing that. > > > > This article shows this pretty clearly: > > > > https://en.wikipedia.org/wiki/Disk_encryption_theory > > > > I don't think that changes the fact that, if we're able to, we should be > > varying the tweak/IV as often as we can, and including the LSN seems > > like a good way to do just that. > > Keep in mind that in our existing code (not my patch), the LSN is zero > for unlogged relations, a fixed value for some GiST index pages, and > unchanged for some hint bit changes. Therefore, while we can include > the LSN in the IV because it _might_ help, we can't rely on it. Regarding unlogged LSNs at least, I would think that we'd want to actually use GetFakeLSNForUnloggedRel() instead of just having it zero'd out. The fixed value for GiST index pages is just during the index build process, as I recall, and that's perhaps less of a concern. Part of the point of using XTS is to avoid the issue of the LSN not being changed when hint bits are, or more generally not being unique in various cases. > We probably need to have a discussion about whether LSN and checksum > should be encrypted on the page. I think we are currently leaning to no > encryption for LSN because we can use it as part of the nonce (where > it is variable) and encrypting the checksum for rudimentary integrity > checking. Yes, that's the direction that I was thinking also and specifically with XTS as the encryption algorithm to allow us to exclude the LSN but keep everything else, and to address the concern around the nonce/tweak/etc being the same sometimes across multiple writes. Another thing to consider is if we want to encrypt zero'd page. There was a point brought up that if we do then we are encrypting a fair bit of very predictable bytes and that's not great (though there's a fair bit about our pages that someone could quite possibly predict anyway based on table structures and such...). I would think that if it's easy enough to not encrypt zero'd pages that we should avoid doing so. Don't recall offhand which way zero'd pages were being handled already but thought it made sense to mention that as part of this discussion. Thanks, Stephen
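As a sketch of what folding the LSN into the tweak could look like: a 16-byte XTS tweak built from the page LSN, the relfilenode, and the block number. The field choices here are hypothetical, not something the thread has settled on:

    #include <stdint.h>
    #include <string.h>

    /*
     * Hypothetical tweak layout: 8 bytes of LSN plus 4 bytes each of
     * relfilenode and block number, 16 bytes total, matching the XTS
     * tweak size.  The LSN would be left unencrypted on the page so a
     * reader can reconstruct the tweak before decrypting.
     */
    typedef struct PageTweak
    {
        uint64_t    lsn;            /* varies across most page writes */
        uint32_t    relfilenode;    /* which relation */
        uint32_t    blocknum;       /* which block within the relation */
    } PageTweak;

    static void
    make_tweak(uint64_t lsn, uint32_t relfilenode, uint32_t blocknum,
               unsigned char tweak[16])
    {
        PageTweak   t;

        t.lsn = lsn;
        t.relfilenode = relfilenode;
        t.blocknum = blocknum;
        memcpy(tweak, &t, sizeof(t));   /* 8 + 4 + 4 = 16 bytes */
    }

The relfilenode and block number make the tweak unique per block at any instant; the LSN makes it vary across rewrites of the same block, which is the "extensive tweaking" the NIST text recommends.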
On Mon, Oct 11, 2021 at 01:30:38PM -0400, Stephen Frost wrote: > Greetings, > > > Keep in mind that in our existing code (not my patch), the LSN is zero > > for unlogged relations, a fixed value for some GiST index pages, and > > unchanged for some hint bit changes. Therefore, while we can include > > the LSN in the IV because it _might_ help, we can't rely on it. > > Regarding unlogged LSNs at least, I would think that we'd want to > actually use GetFakeLSNForUnloggedRel() instead of just having it zero'd > out. The fixed value for GiST index pages is just during the index Good idea. For my patch I had to use a WAL-logged dummy LSN, but for our use, re-using a fake LSN after a crash seems fine, so we can just use the existing GetFakeLSNForUnloggedRel(). However, we might need to use the part of my patch that removes the assumption that unlogged relations always have zero LSNs, because right now fake LSNs are only used for GiST indexes --- I would have to research that more. > Yes, that's the direction that I was thinking also and specifically with > XTS as the encryption algorithm to allow us to exclude the LSN but keep > everything else, and to address the concern around the nonce/tweak/etc > being the same sometimes across multiple writes. Another thing to > consider is if we want to encrypt zero'd page. There was a point > brought up that if we do then we are encrypting a fair bit of very > predictable bytes and that's not great (though there's a fair bit about > our pages that someone could quite possibly predict anyway based on > table structures and such...). I would think that if it's easy enough > to not encrypt zero'd pages that we should avoid doing so. Don't recall > offhand which way zero'd pages were being handled already but thought it > made sense to mention that as part of this discussion. Yeah, I wanted to mention that. I don't see any security difference between fully-zero pages, pages with headers and no tuples, and pages with headers and only a few tuples. If any of those are insecure, they all are. Therefore, I don't see any reason to treat them differently. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
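For context, GetFakeLSNForUnloggedRel() is essentially a shared, monotonically increasing counter. A much-simplified, single-process sketch of the idea (the real function keeps the counter in shared memory under a spinlock):

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    /* the counter starts above 0, since 0 is the invalid LSN */
    static XLogRecPtr unloggedLSN = 1;

    static XLogRecPtr
    get_fake_lsn_sketch(void)
    {
        /* each call hands out a unique, ever-increasing fake LSN */
        return unloggedLSN++;
    }

Uniqueness is the property that matters for the tweak; whether such a counter can collide with real WAL positions is raised later in the thread.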
On Mon, 11 Oct 2021 at 22:15, Bruce Momjian <bruce@momjian.us> wrote:
> > Yes, that's the direction that I was thinking also and specifically with
> > XTS as the encryption algorithm to allow us to exclude the LSN but keep
> > everything else, and to address the concern around the nonce/tweak/etc
> > being the same sometimes across multiple writes. Another thing to
> > consider is if we want to encrypt zero'd page. There was a point
> > brought up that if we do then we are encrypting a fair bit of very
> > predictable bytes and that's not great (though there's a fair bit about
> > our pages that someone could quite possibly predict anyway based on
> > table structures and such...). I would think that if it's easy enough
> > to not encrypt zero'd pages that we should avoid doing so. Don't recall
> > offhand which way zero'd pages were being handled already but thought it
> > made sense to mention that as part of this discussion.
>
> Yeah, I wanted to mention that. I don't see any security difference
> between fully-zero pages, pages with headers and no tuples, and pages
> with headers and only a few tuples. If any of those are insecure, they
> all are. Therefore, I don't see any reason to treat them differently.
We had to special case zero pages and not encrypt them because as far as I can tell, there is no atomic way to extend a file and initialize it to Enc(zero) in the same step.
Ants Aasma Senior Database Engineer www.cybertec-postgresql.com
On Tue, Oct 12, 2021 at 08:40:17AM +0300, Ants Aasma wrote: > On Mon, 11 Oct 2021 at 22:15, Bruce Momjian <bruce@momjian.us> wrote: > > > Yes, that's the direction that I was thinking also and specifically with > > XTS as the encryption algorithm to allow us to exclude the LSN but keep > > everything else, and to address the concern around the nonce/tweak/etc > > being the same sometimes across multiple writes. Another thing to > > consider is if we want to encrypt zero'd page. There was a point > > brought up that if we do then we are encrypting a fair bit of very > > predictable bytes and that's not great (though there's a fair bit about > > our pages that someone could quite possibly predict anyway based on > > table structures and such...). I would think that if it's easy enough > > to not encrypt zero'd pages that we should avoid doing so. Don't recall > > offhand which way zero'd pages were being handled already but thought it > > made sense to mention that as part of this discussion. > > Yeah, I wanted to mention that. I don't see any security difference > between fully-zero pages, pages with headers and no tuples, and pages > with headers and only a few tuples. If any of those are insecure, they > all are. Therefore, I don't see any reason to treat them differently. > > > We had to special case zero pages and not encrypt them because as far as I can > tell, there is no atomic way to extend a file and initialize it to Enc(zero) in > the same step. Oh, good point. Yeah, we will need to handle that. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, Oct 12, 2021 at 08:40:17AM +0300, Ants Aasma wrote: > > On Mon, 11 Oct 2021 at 22:15, Bruce Momjian <bruce@momjian.us> wrote: > > > > > Yes, that's the direction that I was thinking also and specifically with > > > XTS as the encryption algorithm to allow us to exclude the LSN but keep > > > everything else, and to address the concern around the nonce/tweak/etc > > > being the same sometimes across multiple writes. Another thing to > > > consider is if we want to encrypt zero'd page. There was a point > > > brought up that if we do then we are encrypting a fair bit of very > > > predictable bytes and that's not great (though there's a fair bit about > > > our pages that someone could quite possibly predict anyway based on > > > table structures and such...). I would think that if it's easy enough > > > to not encrypt zero'd pages that we should avoid doing so. Don't recall > > > offhand which way zero'd pages were being handled already but thought it > > > made sense to mention that as part of this discussion. > > > > Yeah, I wanted to mention that. I don't see any security difference > > between fully-zero pages, pages with headers and no tuples, and pages > > with headers and only a few tuples. If any of those are insecure, they > > all are. Therefore, I don't see any reason to treat them differently. > > > > > > We had to special case zero pages and not encrypt them because as far as I can > > tell, there is no atomic way to extend a file and initialize it to Enc(zero) in > > the same step. > > Oh, good point. Yeah, we will need to handle that. Not sure what's meant here by 'handle that', but I don't see any particular reason to avoid doing exactly the same for zero pages with TDE in core..? I don't think there's any reason we need to make things complicated to ensure that we encrypt entirely empty pages. Thanks, Stephen
On Tue, Oct 12, 2021 at 08:25:52AM -0400, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Tue, Oct 12, 2021 at 08:40:17AM +0300, Ants Aasma wrote: > > > On Mon, 11 Oct 2021 at 22:15, Bruce Momjian <bruce@momjian.us> wrote: > > > > > > > Yes, that's the direction that I was thinking also and specifically with > > > > XTS as the encryption algorithm to allow us to exclude the LSN but keep > > > > everything else, and to address the concern around the nonce/tweak/etc > > > > being the same sometimes across multiple writes. Another thing to > > > > consider is if we want to encrypt zero'd page. There was a point > > > > brought up that if we do then we are encrypting a fair bit of very > > > > predictable bytes and that's not great (though there's a fair bit about > > > > our pages that someone could quite possibly predict anyway based on > > > > table structures and such...). I would think that if it's easy enough > > > > to not encrypt zero'd pages that we should avoid doing so. Don't recall > > > > offhand which way zero'd pages were being handled already but thought it > > > > made sense to mention that as part of this discussion. > > > > > > Yeah, I wanted to mention that. I don't see any security difference > > > between fully-zero pages, pages with headers and no tuples, and pages > > > with headers and only a few tuples. If any of those are insecure, they > > > all are. Therefore, I don't see any reason to treat them differently. > > > > > > > > > We had to special case zero pages and not encrypt them because as far as I can > > > tell, there is no atomic way to extend a file and initialize it to Enc(zero) in > > > the same step. > > > > Oh, good point. Yeah, we will need to handle that. > > Not sure what's meant here by 'handle that', but I don't see any > particular reason to avoid doing exactly the same for zero pages with > TDE in core..? I don't think there's any reason we need to make things > complicated to ensure that we encrypt entirely empty pages. I thought he was saying that when you extend a file, you might have to extend it with all zeros, rather than being able to extend it with an actual encrypted page of zeros. For example, I think when a page is corrupt in storage, it reads back as a fully zero page, and we would need to handle that. Are you saying we already have logic to handle that so we don't need to change anything? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, Oct 12, 2021 at 08:25:52AM -0400, Stephen Frost wrote: > > * Bruce Momjian (bruce@momjian.us) wrote: > > > On Tue, Oct 12, 2021 at 08:40:17AM +0300, Ants Aasma wrote: > > > > On Mon, 11 Oct 2021 at 22:15, Bruce Momjian <bruce@momjian.us> wrote: > > > > > > > > > Yes, that's the direction that I was thinking also and specifically with > > > > > XTS as the encryption algorithm to allow us to exclude the LSN but keep > > > > > everything else, and to address the concern around the nonce/tweak/etc > > > > > being the same sometimes across multiple writes. Another thing to > > > > > consider is if we want to encrypt zero'd page. There was a point > > > > > brought up that if we do then we are encrypting a fair bit of very > > > > > predictable bytes and that's not great (though there's a fair bit about > > > > > our pages that someone could quite possibly predict anyway based on > > > > > table structures and such...). I would think that if it's easy enough > > > > > to not encrypt zero'd pages that we should avoid doing so. Don't recall > > > > > offhand which way zero'd pages were being handled already but thought it > > > > > made sense to mention that as part of this discussion. > > > > > > > > Yeah, I wanted to mention that. I don't see any security difference > > > > between fully-zero pages, pages with headers and no tuples, and pages > > > > with headers and only a few tuples. If any of those are insecure, they > > > > all are. Therefore, I don't see any reason to treat them differently. > > > > > > > > > > > > We had to special case zero pages and not encrypt them because as far as I can > > > > tell, there is no atomic way to extend a file and initialize it to Enc(zero) in > > > > the same step. > > > > > > Oh, good point. Yeah, we will need to handle that. > > > > Not sure what's meant here by 'handle that', but I don't see any > > particular reason to avoid doing exactly the same for zero pages with > > TDE in core..? I don't think there's any reason we need to make things > > complicated to ensure that we encrypt entirely empty pages. > > I thought he was saying that when you extend a file, you might have to > extend it with all zeros, rather than being able to extend it with > an actual encrypted page of zeros. For example, I think when a page is > corrupt in storage, it reads back as a fully zero page, and we would > need to handle that. Are you saying we already have logic to handle > that so we don't need to change anything? When we extend a file, it gets extended with all zeros. PG already handles that case, PG w/ TDE would need to also recognize that case (which is what Ants was saying their patch does) and handle it. In other words, we just need to realize when a page is all zeros and not try to decrypt it when we're reading it. Ants' patch does that and my recollection is that it wasn't very complicated to do, and that seems much simpler than trying to figure out a way to ensure we do encrypt a zero'd page as part of extending a file. Thanks, Stephen
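A sketch of the read-side test being described, with a hypothetical helper name; the idea is simply to run this on the raw buffer before any decryption call:

    #include <stdbool.h>
    #include <stddef.h>

    /* true if every byte of the on-disk page image is zero */
    static bool
    page_is_all_zero(const unsigned char *page, size_t len)
    {
        for (size_t i = 0; i < len; i++)
        {
            if (page[i] != 0)
                return false;
        }
        return true;
    }

    /*
     * Usage sketch on read:
     *
     *     if (page_is_all_zero(buf, BLCKSZ))
     *         ... freshly extended page, nothing to decrypt ...
     *     else
     *         decrypt_page(buf, ...);
     */

Here decrypt_page() is a stand-in for whatever the actual decryption entry point ends up being.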
On Tue, Oct 12, 2021 at 08:49:28AM -0400, Stephen Frost wrote: > * Bruce Momjian (bruce@momjian.us) wrote: > > I thought he was saying that when you extend a file, you might have to > > extend it with all zeros, rather than being able to extend it with > > an actual encrypted page of zeros. For example, I think when a page is > > corrupt in storage, it reads back as a fully zero page, and we would > > need to handle that. Are you saying we already have logic to handle > > that so we don't need to change anything? > > When we extend a file, it gets extended with all zeros. PG already > handles that case, PG w/ TDE would need to also recognize that case > (which is what Ants was saying their patch does) and handle it. In > other words, we just need to realize when a page is all zeros and not > try to decrypt it when we're reading it. Ants' patch does that and my > recollection is that it wasn't very complicated to do, and that seems > much simpler than trying to figure out a way to ensure we do encrypt a > zero'd page as part of extending a file. Well, how do you detect an all-zero page vs a page that encrypted to all zeros? I am thinking a zero LSN (which is not encrypted) would be the only sure way, but we then have to make sure unlogged relations always get a fake LSN. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Thu, Oct 7, 2021 at 11:05 PM Stephen Frost <sfrost@snowman.net> wrote: > Sure, I get that. Would be awesome if all these things were clearly > documented somewhere but I've never been able to find it quite as > explicitly laid out as one would like. :-( > specifically: Appendix C: Tweaks > > Quoting a couple of paragraphs from that appendix: > > """ > In general, if there is information that is available and statically > associated with a plaintext, it is recommended to use that information > as a tweak for the plaintext. Ideally, the non-secret tweak associated > with a plaintext is associated only with that plaintext. > > Extensive tweaking means that fewer plaintexts are encrypted under any > given tweak. This corresponds, in the security model that is described > in [1], to fewer queries to the target instance of the encryption. > """ > > The gist of this being- the more diverse the tweaking being used, the > better. That's where I was going with my "limit the risk" comment. If > we can make the tweak vary more for a given encryption invokation, > that's going to be better, pretty much by definition, and as explained > in publications by NIST. I mean I don't have anything against that appendix, but I think we need to understand - with confidence - what the expectations are specifically around XTS, and that appendix seems much more general than that. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Oct 11, 2021 at 1:30 PM Stephen Frost <sfrost@snowman.net> wrote: > Regarding unlogged LSNs at least, I would think that we'd want to > actually use GetFakeLSNForUnloggedRel() instead of just having it zero'd > out. The fixed value for GiST index pages is just during the index > build process, as I recall, and that's perhaps less of a concern. Part > of the point of using XTS is to avoid the issue of the LSN not being > changed when hint bits are, or more generally not being unique in > various cases. I don't believe there's anything to prevent the fake-LSN counter from overtaking the real end-of-WAL, and if that should happen, then the buffer manager would get confused. Maybe that can be fixed by doing some sort of surgery on the buffer manager, but it doesn't seem to be a trivial or ignorable problem. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Oct 11, 2021 at 1:30 PM Stephen Frost <sfrost@snowman.net> wrote: > > Regarding unlogged LSNs at least, I would think that we'd want to > > actually use GetFakeLSNForUnloggedRel() instead of just having it zero'd > > out. The fixed value for GiST index pages is just during the index > > build process, as I recall, and that's perhaps less of a concern. Part > > of the point of using XTS is to avoid the issue of the LSN not being > > changed when hint bits are, or more generally not being unique in > > various cases. > > I don't believe there's anything to prevent the fake-LSN counter from > overtaking the real end-of-WAL, and if that should happen, then the > buffer manager would get confused. Maybe that can be fixed by doing > some sort of surgery on the buffer manager, but it doesn't seem to be > a trivial or ignorable problem. Using fake LSNs isn't new... how is this not a concern already then? Also wondering why the buffer manager would care about the LSN on pages which are not BM_PERMANENT...? I'll admit that I might certainly be missing something here. Thanks, Stephen
On Tue, Oct 12, 2021 at 10:39 AM Stephen Frost <sfrost@snowman.net> wrote: > Using fake LSNs isn't new.. how is this not a concern already then? > > Also wondering why the buffer manager would care about the LSN on pages > which are not BM_PERMANENT..? > > I'll admit that I might certainly be missing something here. Oh, FlushBuffer has a guard against this case in it. I hadn't realized that. Sorry for the noise. -- Robert Haas EDB: http://www.enterprisedb.com
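[For reference, the fake-LSN facility being discussed is just a shared counter. Below is a simplified standalone sketch of the idea behind GetFakeLSNForUnloggedRel(); the real implementation lives in xlog.c and uses PostgreSQL's shared-memory atomics, not C11 atomics.]

#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Counter handing out unique fake LSNs for unlogged relations. */
static _Atomic uint64_t unlogged_lsn = 1;

static XLogRecPtr
get_fake_lsn_for_unlogged_rel(void)
{
    /* Each caller gets a distinct, monotonically increasing value. */
    return atomic_fetch_add(&unlogged_lsn, 1);
}

Because this counter advances independently of real WAL, a fake LSN can numerically exceed the true end of WAL, which is why the guard Robert mentions matters: FlushBuffer only forces an XLogFlush up to the page LSN for BM_PERMANENT buffers.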
On Tue, 12 Oct 2021 at 16:14, Bruce Momjian <bruce@momjian.us> wrote:
> Well, how do you detect an all-zero page vs a page that encrypted to all
> zeros?
Page encrypting to all zeros is for all practical purposes impossible to hit. Basically, an attacker would have to be able to arbitrarily set the whole contents of the page, and all they would achieve is that this page gets ignored.
Ants Aasma Senior Database Engineer www.cybertec-postgresql.com
On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote: > On Tue, 12 Oct 2021 at 16:14, Bruce Momjian <bruce@momjian.us> wrote: > > Well, how do you detect an all-zero page vs a page that encrypted to all > zeros? > > Page encrypting to all zeros is for all practical purposes impossible to hit. > Basically an attacker would have to be able to arbitrarily set the whole > contents of the page and they would then achieve that this page gets ignored. Uh, how do we know that valid data can't produce an encrypted all-zero page? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Wed, 13 Oct 2021 at 00:25, Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
> > On Tue, 12 Oct 2021 at 16:14, Bruce Momjian <bruce@momjian.us> wrote:
> > >
> > > Well, how do you detect an all-zero page vs a page that encrypted to all
> > > zeros?
> >
> > Page encrypting to all zeros is for all practical purposes impossible to hit.
> > Basically an attacker would have to be able to arbitrarily set the whole
> > contents of the page and they would then achieve that this page gets ignored.
>
> Uh, how do we know that valid data can't produce an encrypted all-zero
> page?
Because the chances of that happening by accident are equivalent to making a series of commits to postgres and ending up with the same git commit hash 400 times in a row.
--
Ants Aasma Senior Database Engineer www.cybertec-postgresql.com
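[For concreteness: assuming the cipher's output is indistinguishable from uniform random bytes, the probability that one particular 8192-byte page encrypts to all zeros is

\( \left(2^{-8}\right)^{8192} = 2^{-65536} \approx 10^{-19728} \)

which matches the 1e+19728 count of possible pages that Bruce cites below.]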
Greetings,
On Tue, Oct 12, 2021 at 17:49 Ants Aasma <ants@cybertec.at> wrote:
> On Wed, 13 Oct 2021 at 00:25, Bruce Momjian <bruce@momjian.us> wrote:
> > On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
> > > On Tue, 12 Oct 2021 at 16:14, Bruce Momjian <bruce@momjian.us> wrote:
> > > >
> > > > Well, how do you detect an all-zero page vs a page that encrypted to all
> > > > zeros?
> > >
> > > Page encrypting to all zeros is for all practical purposes impossible to hit.
> > > Basically an attacker would have to be able to arbitrarily set the whole
> > > contents of the page and they would then achieve that this page gets ignored.
> >
> > Uh, how do we know that valid data can't produce an encrypted all-zero
> > page?
>
> Because the chances of that happening by accident are equivalent to making a
> series of commits to postgres and ending up with the same git commit hash 400
> times in a row.
And to then have a valid checksum … seems next to impossible.
Thanks,
Stephen
On Wed, Oct 13, 2021 at 12:48:51AM +0300, Ants Aasma wrote: > On Wed, 13 Oct 2021 at 00:25, Bruce Momjian <bruce@momjian.us> wrote: > > On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote: > > Page encrypting to all zeros is for all practical purposes impossible to > hit. > > Basically an attacker would have to be able to arbitrarily set the whole > > contents of the page and they would then achieve that this page gets > ignored. > > Uh, how do we know that valid data can't produce an encrypted all-zero > page? > > > Because the chances of that happening by accident are equivalent to making a > series of commits to postgres and ending up with the same git commit hash 400 > times in a row. Yes, 256^8192 is 1e+19728, but why not just assume a page LSN=0 is an empty page, and if not, an error? Seems easier than checking if each page contains all zeros every time. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Wed, 13 Oct 2021 at 02:20, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Oct 13, 2021 at 12:48:51AM +0300, Ants Aasma wrote:
> > On Wed, 13 Oct 2021 at 00:25, Bruce Momjian <bruce@momjian.us> wrote:
> > > On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
> > > > Page encrypting to all zeros is for all practical purposes impossible to hit.
> > > > Basically an attacker would have to be able to arbitrarily set the whole
> > > > contents of the page and they would then achieve that this page gets ignored.
> > >
> > > Uh, how do we know that valid data can't produce an encrypted all-zero
> > > page?
> >
> > Because the chances of that happening by accident are equivalent to making a
> > series of commits to postgres and ending up with the same git commit hash 400
> > times in a row.
>
> Yes, 256^8192 is 1e+19728, but why not just assume a page LSN=0 is an
> empty page, and if not, an error? Seems easier than checking if each
> page contains all zeros every time.
We already check it anyway, see PageIsVerifiedExtended().
Ants Aasma Senior Database Engineer www.cybertec-postgresql.com
Greetings, * Ants Aasma (ants@cybertec.at) wrote: > On Wed, 13 Oct 2021 at 02:20, Bruce Momjian <bruce@momjian.us> wrote: > > On Wed, Oct 13, 2021 at 12:48:51AM +0300, Ants Aasma wrote: > > > On Wed, 13 Oct 2021 at 00:25, Bruce Momjian <bruce@momjian.us> wrote: > > > > > > On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote: > > > > Page encrypting to all zeros is for all practical purposes > > impossible to > > > hit. > > > > Basically an attacker would have to be able to arbitrarily set the > > whole > > > > contents of the page and they would then achieve that this page > > gets > > > ignored. > > > > > > Uh, how do we know that valid data can't produce an encrypted > > all-zero > > > page? > > > > > > > > > Because the chances of that happening by accident are equivalent to > > making a > > > series of commits to postgres and ending up with the same git commit > > hash 400 > > > times in a row. > > > > Yes, 256^8192 is 1e+19728, but why not just assume a page LSN=0 is an > > empty page, and if not, an error? Seems easier than checking if each > > page contains all zeros every time. > > > > We already check it anyway, see PageIsVerifiedExtended(). Right- we check the LSN along with the rest of the page there. Thanks, Stephen
On Wed, Oct 13, 2021 at 09:16:37AM -0400, Stephen Frost wrote: > Greetings, > > * Ants Aasma (ants@cybertec.at) wrote: > > On Wed, 13 Oct 2021 at 02:20, Bruce Momjian <bruce@momjian.us> wrote: > > > On Wed, Oct 13, 2021 at 12:48:51AM +0300, Ants Aasma wrote: > > > > On Wed, 13 Oct 2021 at 00:25, Bruce Momjian <bruce@momjian.us> wrote: > > > > > > > > On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote: > > > > > Page encrypting to all zeros is for all practical purposes > > > impossible to > > > > hit. > > > > > Basically an attacker would have to be able to arbitrarily set the > > > whole > > > > > contents of the page and they would then achieve that this page > > > gets > > > > ignored. > > > > > > > > Uh, how do we know that valid data can't produce an encrypted > > > all-zero > > > > page? > > > > > > > > > > > > Because the chances of that happening by accident are equivalent to > > > making a > > > > series of commits to postgres and ending up with the same git commit > > > hash 400 > > > > times in a row. > > > > > > Yes, 256^8192 is 1e+19728, but why not just assume a page LSN=0 is an > > > empty page, and if not, an error? Seems easier than checking if each > > > page contains all zeros every time. > > > > > > > We already check it anyway, see PageIsVerifiedExtended(). > > Right- we check the LSN along with the rest of the page there. Very good. I have not looked at the Cybertec patch recently. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
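[As a concrete illustration of the all-zeros test being discussed: PostgreSQL's PageIsVerifiedExtended() in bufpage.c performs an equivalent word-at-a-time scan, so the following is a simplified standalone sketch of the idea rather than the server code.]

#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192             /* PostgreSQL's default page size */

/*
 * Return true if every byte of the page is zero, in which case the
 * page was extended but never written and decryption can be skipped.
 * Assumes the buffer is suitably aligned for word-sized access.
 */
static bool
page_is_all_zeros(const char *page)
{
    const size_t *words = (const size_t *) page;
    size_t      n = BLCKSZ / sizeof(size_t);

    for (size_t i = 0; i < n; i++)
    {
        if (words[i] != 0)
            return false;
    }
    return true;
}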
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Thu, Oct 7, 2021 at 7:33 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Oct 7, 2021 at 3:24 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > > Every other > > caller/flow passes false for 'create_storage' and we still need to > > create storage in heap_create() if relkind has storage. > > That seems surprising. I have revised the patch w.r.t the way 'create_storage' is interpreted in heap_create() along with some minor changes to preserve the DBOID patch. Regards, Shruthi KC EnterpriseDB: http://www.enterprisedb.com
Sasasu <i@sasa.su> wrote: > On 2021/10/6 23:01, Robert Haas wrote: > > This seems wrong to me. CTR requires that you not reuse the IV. If you > > re-encrypt the page with a different IV, torn pages are a problem. If > > you re-encrypt it with the same IV, then it's not secure any more. > for CBC if the IV is predictable will cause "dictionary attack". The following sounds like IV *uniqueness* is needed to defend against "known plaintext attack" ... > and for CBC and GCM reuse IV will cause "known plaintext attack". ... but here you seem to say that *randomness* is also necessary: > XTS works like CBC but adds a tweak step. the tweak step does not add > randomness. It means XTS still has "known plaintext attack", (I suppose you mean "XTS with incorrect (e.g. non-random) IV", rather than XTS as such.) > due to the same reason from CBC. According to the Appendix C of https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf CBC requires *unpredictability* of the IV, but that does not necessarily mean randomness: the unpredictable IV can be obtained by applying the forward cipher function to a unique value. Can you please try to explain once again what you consider a requirement (uniqueness, randomness, etc.) on the IV for the XTS mode? Thanks. -- Antonin Houska Web: https://www.cybertec-postgresql.com
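[The NIST construction Antonin refers to is easy to sketch: derive the CBC IV by applying the forward cipher function - here AES-ECB under a dedicated key - to a unique counter, turning uniqueness into unpredictability. An assumption-laden illustration, not code from any patch in this thread:]

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

/*
 * Produce an unpredictable CBC IV from a unique 64-bit counter by
 * encrypting it under a separate AES key (ctx is assumed to be set up
 * with EVP_aes_256_ecb() and a dedicated IV-generation key).  Per NIST
 * SP 800-38A Appendix C, CIPH_K(unique value) yields an unpredictable IV.
 */
static void
make_cbc_iv(EVP_CIPHER_CTX *ctx, uint64_t counter, unsigned char iv[16])
{
    unsigned char block[16] = {0};
    int     outl = 0;

    memcpy(block, &counter, sizeof(counter));
    EVP_EncryptUpdate(ctx, iv, &outl, block, sizeof(block));
}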
On Tue, Oct 12, 2021 at 10:26:54AM -0400, Robert Haas wrote: > > specifically: Appendix C: Tweaks > > > > Quoting a couple of paragraphs from that appendix: > > > > """ > > In general, if there is information that is available and statically > > associated with a plaintext, it is recommended to use that information > > as a tweak for the plaintext. Ideally, the non-secret tweak associated > > with a plaintext is associated only with that plaintext. > > > > Extensive tweaking means that fewer plaintexts are encrypted under any > > given tweak. This corresponds, in the security model that is described > > in [1], to fewer queries to the target instance of the encryption. > > """ > > > > The gist of this being- the more diverse the tweaking being used, the > > better. That's where I was going with my "limit the risk" comment. If > > we can make the tweak vary more for a given encryption invocation, > > that's going to be better, pretty much by definition, and as explained > > in publications by NIST. > > I mean I don't have anything against that appendix, but I think we > need to understand - with confidence - what the expectations are > specifically around XTS, and that appendix seems much more general > than that. Since there has not been activity on this thread for one month, I have updated the Postgres TDE wiki to include the conclusions and discussions from this thread: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Sadhuprasad Patro
On Tue, Oct 26, 2021 at 6:55 PM Shruthi Gowda <gowdashru@gmail.com> wrote: > > > I have revised the patch w.r.t the way 'create_storage' is interpreted > in heap_create() along with some minor changes to preserve the DBOID > patch. > Hi Shruthi, I am reviewing the attached patches and providing a few comments here below for patch "v5-0002-Preserve-database-OIDs-in-pg_upgrade.patch" 1. --- a/doc/src/sgml/ref/create_database.sgml +++ b/doc/src/sgml/ref/create_database.sgml @@ -31,7 +31,8 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable> - [ IS_TEMPLATE [=] <replaceable class="parameter">istemplate</replaceable> ] ] + [ IS_TEMPLATE [=] <replaceable class="parameter">istemplate</replaceable> ] + [ OID [=] <replaceable class="parameter">db_oid</replaceable> ] ] Replace "db_oid" with 'oid'. Below in the listitem, we have mentioned 'oid'. 2. --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c + if ((dboid < FirstNormalObjectId) && + (strcmp(dbname, "template0") != 0) && + (!IsBinaryUpgrade)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE)), + errmsg("Invalid value for option \"%s\"", defel->defname), + errhint("The specified OID %u is less than the minimum OID for user objects %u.", + dboid, FirstNormalObjectId)); + } Are we sure that 'IsBinaryUpgrade' will be set properly, before the createdb function is called? Can we recheck once ? 3. @@ -504,11 +525,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) */ pg_database_rel = table_open(DatabaseRelationId, RowExclusiveLock); - do + /* Select an OID for the new database if is not explicitly configured. */ + if (!OidIsValid(dboid)) { - dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId, - Anum_pg_database_oid); - } while (check_db_file_conflict(dboid)); I think we need to do 'check_db_file_conflict' for the USER given OID also.. right? It may already be present. 4. --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c /* + * Create template0 database with oid Template0ObjectId i.e, 4 + */ + Better to mention here, why OID 4 is reserved for template0 database?. 5. + /* + * Create template0 database with oid Template0ObjectId i.e, 4 + */ + static const char *const template0_setup[] = { + "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID " + CppAsString2(Template0ObjectId) ";\n\n", Can we write something like, 'OID = CppAsString2(Template0ObjectId)'? mention "=". 6. + + /* + * We use the OID of postgres to determine datlastsysoid + */ + "UPDATE pg_database SET datlastsysoid = " + " (SELECT oid FROM pg_database " + " WHERE datname = 'postgres');\n\n", + Make the above comment a single line comment. 7. There are some spelling mistakes in the comments as below, please correct the same + /* + * Make sure that binary upgrade propogate the database OID to the new =====> correct spelling + * cluster + */ +/* OID 4 is reserved for Templete0 database */ ====> Correct spelling +#define Template0ObjectId 4 I am reviewing another patch "v5-0001-Preserve-relfilenode-and-tablespace-OID-in-pg_upg" as well and will provide the comments soon if any... Thanks & Regards SadhuPrasad EnterpriseDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Sun, Dec 5, 2021 at 11:44 PM Sadhuprasad Patro <b.sadhu@gmail.com> wrote: > 1. > --- a/doc/src/sgml/ref/create_database.sgml > +++ b/doc/src/sgml/ref/create_database.sgml > @@ -31,7 +31,8 @@ CREATE DATABASE <replaceable > class="parameter">name</replaceable> > - [ IS_TEMPLATE [=] <replaceable > class="parameter">istemplate</replaceable> ] ] > + [ IS_TEMPLATE [=] <replaceable > class="parameter">istemplate</replaceable> ] > + [ OID [=] <replaceable > class="parameter">db_oid</replaceable> ] ] > > Replace "db_oid" with 'oid'. Below in the listitem, we have mentioned 'oid'. I agree that the listitem and the synopsis need to be consistent, but it could be made consistent either by changing that one to db_oid or this one to oid. > 2. > --- a/src/backend/commands/dbcommands.c > +++ b/src/backend/commands/dbcommands.c > + if ((dboid < FirstNormalObjectId) && > + (strcmp(dbname, "template0") != 0) && > + (!IsBinaryUpgrade)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE)), > + errmsg("Invalid value for option \"%s\"", defel->defname), > + errhint("The specified OID %u is less than the minimum OID for user > objects %u.", > + dboid, FirstNormalObjectId)); > + } > > Are we sure that 'IsBinaryUpgrade' will be set properly, before the > createdb function is called? Can we recheck once ? How could it be set incorrectly, and how could we recheck this? > 3. > @@ -504,11 +525,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) > */ > pg_database_rel = table_open(DatabaseRelationId, RowExclusiveLock); > > - do > + /* Select an OID for the new database if is not explicitly configured. */ > + if (!OidIsValid(dboid)) > { > - dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId, > - Anum_pg_database_oid); > - } while (check_db_file_conflict(dboid)); > > I think we need to do 'check_db_file_conflict' for the USER given OID > also.. right? It may already be present. Hopefully, if that happens, we straight up fail later on. > 4. > --- a/src/bin/initdb/initdb.c > +++ b/src/bin/initdb/initdb.c > > /* > + * Create template0 database with oid Template0ObjectId i.e, 4 > + */ > + > > Better to mention here, why OID 4 is reserved for template0 database?. I'm not sure how we would give a reason for selecting an arbitrary constant? We could perhaps explain why we use a fixed OID. But there's no reason it has to be 4, I think. > 5. > + /* > + * Create template0 database with oid Template0ObjectId i.e, 4 > + */ > + static const char *const template0_setup[] = { > + "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID " > + CppAsString2(Template0ObjectId) ";\n\n", > > Can we write something like, 'OID = CppAsString2(Template0ObjectId)'? > mention "=". That seems like a good idea, because it would be more consistent. > 6. > + > + /* > + * We use the OID of postgres to determine datlastsysoid > + */ > + "UPDATE pg_database SET datlastsysoid = " > + " (SELECT oid FROM pg_database " > + " WHERE datname = 'postgres');\n\n", > + > > Make the above comment a single line comment. I think what Shruthi did is more correct. It doesn't have to be done as a single-line comment just because it can fit on one line. And Shruthi didn't write this comment anyway, it's only moved slightly from where it was before. > 7. 
> There are some spelling mistakes in the comments as below, please > correct the same > + /* > + * Make sure that binary upgrade propogate the database OID to the > new =====> correct spelling > + * cluster > + */ > > +/* OID 4 is reserved for Templete0 database */ > ====> Correct spelling > +#define Template0ObjectId 4 Yes, those would be good to fix. -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Mon, Dec 6, 2021 at 10:14 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote: > > On Tue, Oct 26, 2021 at 6:55 PM Shruthi Gowda <gowdashru@gmail.com> wrote: > > > > > > I have revised the patch w.r.t the way 'create_storage' is interpreted > > in heap_create() along with some minor changes to preserve the DBOID > > patch. > > > > Hi Shruthi, > > I am reviewing the attached patches and providing a few comments here > below for patch "v5-0002-Preserve-database-OIDs-in-pg_upgrade.patch" > > 1. > --- a/doc/src/sgml/ref/create_database.sgml > +++ b/doc/src/sgml/ref/create_database.sgml > @@ -31,7 +31,8 @@ CREATE DATABASE <replaceable > class="parameter">name</replaceable> > - [ IS_TEMPLATE [=] <replaceable > class="parameter">istemplate</replaceable> ] ] > + [ IS_TEMPLATE [=] <replaceable > class="parameter">istemplate</replaceable> ] > + [ OID [=] <replaceable > class="parameter">db_oid</replaceable> ] ] > > Replace "db_oid" with 'oid'. Below in the listitem, we have mentioned 'oid'. Replaced "db_oid" with "oid" > > 2. > --- a/src/backend/commands/dbcommands.c > +++ b/src/backend/commands/dbcommands.c > + if ((dboid < FirstNormalObjectId) && > + (strcmp(dbname, "template0") != 0) && > + (!IsBinaryUpgrade)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE)), > + errmsg("Invalid value for option \"%s\"", defel->defname), > + errhint("The specified OID %u is less than the minimum OID for user > objects %u.", > + dboid, FirstNormalObjectId)); > + } > > Are we sure that 'IsBinaryUpgrade' will be set properly, before the > createdb function is called? Can we recheck once ? I believe 'IsBinaryUpgrade' will be set to true when pg_upgrade is invoked. pg_ugrade internally does pg_dump and pg_restore for every database in the cluster. > 3. > @@ -504,11 +525,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) > */ > pg_database_rel = table_open(DatabaseRelationId, RowExclusiveLock); > > - do > + /* Select an OID for the new database if is not explicitly configured. */ > + if (!OidIsValid(dboid)) > { > - dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId, > - Anum_pg_database_oid); > - } while (check_db_file_conflict(dboid)); > > I think we need to do 'check_db_file_conflict' for the USER given OID > also.. right? It may already be present. If a datafile with user-specified OID exists, the create database fails with the below error. postgres=# create database d2 oid 16452; ERROR: could not create directory "base/16452": File exists > 4. > --- a/src/bin/initdb/initdb.c > +++ b/src/bin/initdb/initdb.c > > /* > + * Create template0 database with oid Template0ObjectId i.e, 4 > + */ > + > > Better to mention here, why OID 4 is reserved for template0 database?. The comment is updated to explain why template0 oid is fixed. > 5. > + /* > + * Create template0 database with oid Template0ObjectId i.e, 4 > + */ > + static const char *const template0_setup[] = { > + "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID " > + CppAsString2(Template0ObjectId) ";\n\n", > > Can we write something like, 'OID = CppAsString2(Template0ObjectId)'? > mention "=". Fixed > 6. > + > + /* > + * We use the OID of postgres to determine datlastsysoid > + */ > + "UPDATE pg_database SET datlastsysoid = " > + " (SELECT oid FROM pg_database " > + " WHERE datname = 'postgres');\n\n", > + > > Make the above comment a single line comment. As Robert confirmed, this part of the code is moved from a different place. > 7. 
> There are some spelling mistakes in the comments as below, please > correct the same > + /* > + * Make sure that binary upgrade propogate the database OID to the > new =====> correct spelling > + * cluster > + */ > > +/* OID 4 is reserved for Templete0 database */ > ====> Correct spelling > +#define Template0ObjectId 4 > Fixed. > I am reviewing another patch > "v5-0001-Preserve-relfilenode-and-tablespace-OID-in-pg_upg" as well > and will provide the comments soon if any... Thanks. I have rebased relfilenode oid preserve patch. You may use the rebased patch for review. Thanks & Regards Shruthi K C EnterpriseDB: http://www.enterprisedb.com
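[Piecing together the hunks quoted above, the reworked flow in createdb() is roughly the following sketch - reconstructed from the quoted diff fragments, not the committed code:]

    /*
     * Use the user-specified OID if one was given (validated earlier
     * against FirstNormalObjectId unless in binary-upgrade mode);
     * otherwise pick a fresh OID with no conflicting data directory.
     */
    if (!OidIsValid(dboid))
    {
        do
        {
            dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId,
                                       Anum_pg_database_oid);
        } while (check_db_file_conflict(dboid));
    }

    /*
     * As discussed above, a user-specified OID that collides with an
     * existing database fails later with a "duplicate key value" error.
     */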
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Mon, Dec 6, 2021 at 11:25 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Sun, Dec 5, 2021 at 11:44 PM Sadhuprasad Patro <b.sadhu@gmail.com> wrote: > > 3. > > @@ -504,11 +525,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) > > */ > > pg_database_rel = table_open(DatabaseRelationId, RowExclusiveLock); > > > > - do > > + /* Select an OID for the new database if is not explicitly configured. */ > > + if (!OidIsValid(dboid)) > > { > > - dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId, > > - Anum_pg_database_oid); > > - } while (check_db_file_conflict(dboid)); > > > > I think we need to do 'check_db_file_conflict' for the USER given OID > > also.. right? It may already be present. > > Hopefully, if that happens, we straight up fail later on. That's right. If a database with user-specified OID exists, the createdb fails with a "duplicate key value" error. If just a data directory with user-specified OID exists, MakePGDirectory() fails to create the directory and the cleanup callback createdb_failure_callback() removes the directory that was not created by 'createdb()' function. The subsequent create database call with the same OID will succeed. Should we handle the case where a data directory exists and the corresponding DB with that oid does not exist? I presume this situation doesn't arise unless the user tries to create directories in the data path. Any thoughts? Thanks & Regards Shruthi KC EnterpriseDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Mon, Dec 13, 2021 at 9:40 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > > I am reviewing another patch > > "v5-0001-Preserve-relfilenode-and-tablespace-OID-in-pg_upg" as well > > and will provide the comments soon if any...

I spent much of today reviewing 0001. Here's an updated version, so far only lightly tested. Please check whether I've broken anything. Here are the changes:

- I adjusted the function header comment for heap_create. Your proposed comment seemed like it was pretty detailed but not 100% correct. It also made one of the lines kind of long because you didn't wrap the text in the surrounding style. I decided to make it simpler and shorter instead of longer still and 100% correct.

- I removed a one-line comment that said /* Override the toast relfilenode */ because it preceded an error check, not a line of code that would have done what the comment claimed.

- I removed a one-line comment that said /* Override the relfilenode */ because the following code would only sometimes override the relfilenode. The code didn't seem complex enough to justify a longer and more accurate comment, so I just took it out.

- I changed a test for (relkind == RELKIND_RELATION || relkind == RELKIND_SEQUENCE || relkind == RELKIND_MATVIEW) to use RELKIND_HAS_STORAGE(). It's true that not all of the storage types that RELKIND_HAS_STORAGE() tests are possible here, but that's not a reason to avoid using the macro. If somebody adds a new relkind with storage in the future, they might miss the need to manually update this place, but they will not likely miss the need to update RELKIND_HAS_STORAGE() since, if they did, their code probably wouldn't work at all.

- I changed the way that you were passing create_storage down to heap_create. I think I said before that you should EITHER fix it so that we set create_storage = true only when the relation actually has storage OR ELSE have heap_create() itself override the value to false when there is no storage. You did both. There are times when it's reasonable to ensure the same thing in multiple places, but this doesn't seem to be one of them. So I took that out. I chose to retain the code in heap_create() that overrides the value to false, added a comment explaining that it does that, and then adjusted the callers to ignore the storage type. I then added comments, and in one place an assertion, to make it clearer what is happening.

- In pg_dump.c, I adjusted the comment that says "Not every relation has storage." and the test that immediately follows, to ignore the relfilenode when relkind says it's a partitioned table. Really, partitioned tables should never have had relfilenodes, but as it turns out, they used to have them.

Let me know your thoughts.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Tue, Dec 14, 2021 at 2:35 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Dec 13, 2021 at 9:40 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > > > I am reviewing another patch > > > "v5-0001-Preserve-relfilenode-and-tablespace-OID-in-pg_upg" as well > > > and will provide the comments soon if any... > > I spent much of today reviewing 0001. Here's an updated version, so > far only lightly tested. Please check whether I've broken anything. > Here are the changes: Thanks, Robert for the updated version. I reviewed the changes and it looks fine. I also tested the patch. The patch works as expected. > - I adjusted the function header comment for heap_create. Your > proposed comment seemed like it was pretty detailed but not 100% > correct. It also made one of the lines kind of long because you didn't > wrap the text in the surrounding style. I decided to make it simpler > and shorter instead of longer still and 100% correct. The comment update looks fine. However, I still feel it would be good to mention on which (rare) circumstance a valid relfilenode can get passed. > - I removed a one-line comment that said /* Override the toast > relfilenode */ because it preceded an error check, not a line of code > that would have done what the comment claimed. > > - I removed a one-line comment that said /* Override the relfilenode > */ because the following code would only sometimes override the > relfilenode. The code didn't seem complex enough to justify a a longer > and more accurate comment, so I just took it out. Fine > - I changed a test for (relkind == RELKIND_RELATION || relkind == > RELKIND_SEQUENCE || relkind == RELKIND_MATVIEW) to use > RELKIND_HAS_STORAGE(). It's true that not all of the storage types > that RELKIND_HAS_STORAGE() tests are possible here, but that's not a > reason to avoiding using the macro. If somebody adds a new relkind > with storage in the future, they might miss the need to manually > update this place, but they will not likely miss the need to update > RELKIND_HAS_STORAGE() since, if they did, their code probably wouldn't > work at all. I agree. > - I changed the way that you were passing create_storage down to > heap_create. I think I said before that you should EITHER fix it so > that we set create_storage = true only when the relation actually has > storage OR ELSE have heap_create() itself override the value to false > when there is no storage. You did both. There are times when it's > reasonable to ensure the same thing in multiple places, but this > doesn't seem to be one of them. So I took that out. I chose to retain > the code in heap_create() that overrides the value to false, added a > comment explaining that it does that, and then adjusted the callers to > ignore the storage type. I then added comments, and in one place an > assertion, to make it clearer what is happening. The changes are fine. Thanks for the fine-tuning. > - In pg_dump.c, I adjusted the comment that says "Not every relation > has storage." and the test that immediately follows, to ignore the > relfilenode when relkind says it's a partitioned table. Really, > partitioned tables should never have had relfilenodes, but as it turns > out, they used to have them. > Fine. Understood. Thanks & Regards, Shruthi KC EnterpriseDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: tushar
On 12/14/21 2:35 AM, Robert Haas wrote: > I spent much of today reviewing 0001. Here's an updated version, so > far only lightly tested. Please check whether I've broken anything. Thanks Robert, I tested from v9.6/12/13/v14 -> v15 (with patch); things are working fine, i.e. table/index relfilenode is preserved, not changing after pg_upgrade. -- regards, tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: tushar
On 12/15/21 12:09 AM, tushar wrote:
> On 12/14/21 2:35 AM, Robert Haas wrote:
> > I spent much of today reviewing 0001. Here's an updated version, so
> > far only lightly tested. Please check whether I've broken anything.
> Thanks Robert, I tested from v9.6/12/13/v14 -> v15 (with patch);
> things are working fine, i.e. table/index relfilenode is preserved,
> not changing after pg_upgrade.

I covered tablespace OIDs testing scenarios and that is also preserved after pg_upgrade.
-- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Mon, Dec 13, 2021 at 8:43 PM Shruthi Gowda <gowdashru@gmail.com> wrote: > > On Mon, Dec 6, 2021 at 11:25 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Sun, Dec 5, 2021 at 11:44 PM Sadhuprasad Patro <b.sadhu@gmail.com> wrote: > > > 3. > > > @@ -504,11 +525,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) > > > */ > > > pg_database_rel = table_open(DatabaseRelationId, RowExclusiveLock); > > > > > > - do > > > + /* Select an OID for the new database if is not explicitly configured. */ > > > + if (!OidIsValid(dboid)) > > > { > > > - dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId, > > > - Anum_pg_database_oid); > > > - } while (check_db_file_conflict(dboid)); > > > > > > I think we need to do 'check_db_file_conflict' for the USER given OID > > > also.. right? It may already be present. > > > > Hopefully, if that happens, we straight up fail later on. > > That's right. If a database with user-specified OID exists, the > createdb fails with a "duplicate key value" error. > If just a data directory with user-specified OID exists, > MakePGDirectory() fails to create the directory and the cleanup > callback createdb_failure_callback() removes the directory that was > not created by 'createdb()' function. > The subsequent create database call with the same OID will succeed. > Should we handle the case where a data directory exists and the > corresponding DB with that oid does not exist? I presume this > situation doesn't arise unless the user tries to create directories in > the data path. Any thoughts? I have updated the DBOID preserve patch to handle this case and generated the latest patch on top of your v7-001-preserve-relfilenode patch. Thanks & Regards Shruthi KC EnterpriseDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Julien Rouhaud
Hi, On Fri, Dec 17, 2021 at 01:03:06PM +0530, Shruthi Gowda wrote: > > I have updated the DBOID preserve patch to handle this case and > generated the latest patch on top of your v7-001-preserve-relfilenode > patch. The cfbot reports that the patch doesn't apply anymore: http://cfbot.cputube.org/patch_36_3296.log === Applying patches on top of PostgreSQL commit ID 5513dc6a304d8bda114004a3b906cc6fde5d6274 === === applying patch ./v7-0002-Preserve-database-OIDs-in-pg_upgrade.patch [...] patching file src/bin/pg_upgrade/info.c Hunk #1 FAILED at 190. Hunk #2 succeeded at 351 (offset 27 lines). 1 out of 2 hunks FAILED -- saving rejects to file src/bin/pg_upgrade/info.c.rej patching file src/bin/pg_upgrade/pg_upgrade.h Hunk #1 FAILED at 145. 1 out of 1 hunk FAILED -- saving rejects to file src/bin/pg_upgrade/pg_upgrade.h.rej patching file src/bin/pg_upgrade/relfilenode.c Hunk #1 FAILED at 193. 1 out of 1 hunk FAILED -- saving rejects to file src/bin/pg_upgrade/relfilenode.c.rej Could you send a rebased version? In the meantime I will switch the cf entry to Waiting on Author.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Sat, Jan 15, 2022 at 11:17 AM Julien Rouhaud <rjuju123@gmail.com> wrote: > > Hi, > > On Fri, Dec 17, 2021 at 01:03:06PM +0530, Shruthi Gowda wrote: > > > > I have updated the DBOID preserve patch to handle this case and > > generated the latest patch on top of your v7-001-preserve-relfilenode > > patch. > > The cfbot reports that the patch doesn't apply anymore: > http://cfbot.cputube.org/patch_36_3296.log > === Applying patches on top of PostgreSQL commit ID 5513dc6a304d8bda114004a3b906cc6fde5d6274 === > === applying patch ./v7-0002-Preserve-database-OIDs-in-pg_upgrade.patch > [...] > patching file src/bin/pg_upgrade/info.c > Hunk #1 FAILED at 190. > Hunk #2 succeeded at 351 (offset 27 lines). > 1 out of 2 hunks FAILED -- saving rejects to file src/bin/pg_upgrade/info.c.rej > patching file src/bin/pg_upgrade/pg_upgrade.h > Hunk #1 FAILED at 145. > 1 out of 1 hunk FAILED -- saving rejects to file src/bin/pg_upgrade/pg_upgrade.h.rej > patching file src/bin/pg_upgrade/relfilenode.c > Hunk #1 FAILED at 193. > 1 out of 1 hunk FAILED -- saving rejects to file src/bin/pg_upgrade/relfilenode.c.rej > > Could you send a rebased version? In the meantime I willl switch the cf entry > to Waiting on Author. I have rebased and generated the patches on top of PostgreSQL commit ID cf925936ecc031355cd56fbd392ec3180517a110. Kindly apply v8-0001-pg_upgrade-Preserve-relfilenodes-and-tablespace-O.patch first and then v8-0002-Preserve-database-OIDs-in-pg_upgrade.patch. Thanks & Regards, Shruthi KC EnterpriseDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Tue, Dec 14, 2021 at 1:21 PM Shruthi Gowda <gowdashru@gmail.com> wrote: > Thanks, Robert for the updated version. I reviewed the changes and it > looks fine. > I also tested the patch. The patch works as expected. Thanks. > > - I adjusted the function header comment for heap_create. Your > > proposed comment seemed like it was pretty detailed but not 100% > > correct. It also made one of the lines kind of long because you didn't > > wrap the text in the surrounding style. I decided to make it simpler > > and shorter instead of longer still and 100% correct. > > The comment update looks fine. However, I still feel it would be good to > mention on which (rare) circumstance a valid relfilenode can get passed. In general, I think it's the job of a function parameter comment to describe what the parameter does, not how the callers actually use it. One problem with describing the latter is that, if someone later adds another caller, there is a pretty good chance that they won't notice that the comment needs to be changed. More fundamentally, the parameter function comments should be like an instruction manual for how to use the function. If you are trying to figure out how to use this function, it is not helpful to know that "most callers like to pass false" for this parameter. What you need to know is what value your new call site should pass, and knowing what "most callers" do or that something is "rare" doesn't really help. If we want to make this comment more detailed, we should approach it from the point of view of explaining how it ought to be set. I've committed the v8-0001 patch you attached. I'll write separately about v8-0002. -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Mon, Jan 17, 2022 at 9:57 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > I have rebased and generated the patches on top of PostgreSQL commit > ID cf925936ecc031355cd56fbd392ec3180517a110. > Kindly apply v8-0001-pg_upgrade-Preserve-relfilenodes-and-tablespace-O.patch > first and then v8-0002-Preserve-database-OIDs-in-pg_upgrade.patch. OK, so looking over 0002, I noticed a few things: 1. datlastsysoid isn't being used for anything any more. That's not a defect in your patch, but I've separately proposed to remove it. 2. I realized that the whole idea here depends on not having initdb create more than one database without a fixed OID. The patch solves that by nailing down the OID of template0, which is a sufficient solution. However, I think nailing down the (initial) OID of postgres as well would be a good idea, just in case somebody in the future decides to add another system-created database. 3. The changes to gram.y don't do anything. Sure, you've added a new "OID" token, but nothing generates that token, so it has no effect. The reason the syntax works is that createdb_opt_name accepts "IDENT", which means any string that's not in the keyword list (see kwlist.h). But that's there already, so you don't need to do anything in this file. 4. I felt that the documentation and comments could be somewhat improved. Here's an updated version in which I've reverted the changes to gram.y and tried to improve the comments and documentation. Could you have a look at implementing (2) above? -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Tue, Jan 18, 2022 at 12:35 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Dec 14, 2021 at 1:21 PM Shruthi Gowda <gowdashru@gmail.com> wrote: > > Thanks, Robert for the updated version. I reviewed the changes and it > > looks fine. > > I also tested the patch. The patch works as expected. > > Thanks. > > > > - I adjusted the function header comment for heap_create. Your > > > proposed comment seemed like it was pretty detailed but not 100% > > > correct. It also made one of the lines kind of long because you didn't > > > wrap the text in the surrounding style. I decided to make it simpler > > > and shorter instead of longer still and 100% correct. > > > > The comment update looks fine. However, I still feel it would be good to > > mention on which (rare) circumstance a valid relfilenode can get passed. > > In general, I think it's the job of a function parameter comment to > describe what the parameter does, not how the callers actually use it. > One problem with describing the latter is that, if someone later adds > another caller, there is a pretty good chance that they won't notice > that the comment needs to be changed. More fundamentally, the > parameter function comments should be like an instruction manual for > how to use the function. If you are trying to figure out how to use > this function, it is not helpful to know that "most callers like to > pass false" for this parameter. What you need to know is what value > your new call site should pass, and knowing what "most callers" do or > that something is "rare" doesn't really help. If we want to make this > comment more detailed, we should approach it from the point of view of > explaining how it ought to be set. It's clear now. Thanks for clarifying. > I've committed the v8-0001 patch you attached. I'll write separately > about v8-0002. Sure. Thank you.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Tue, Jan 18, 2022 at 2:34 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jan 17, 2022 at 9:57 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > > I have rebased and generated the patches on top of PostgreSQL commit > > ID cf925936ecc031355cd56fbd392ec3180517a110. > > Kindly apply v8-0001-pg_upgrade-Preserve-relfilenodes-and-tablespace-O.patch > > first and then v8-0002-Preserve-database-OIDs-in-pg_upgrade.patch. > > OK, so looking over 0002, I noticed a few things: > > 1. datlastsysoid isn't being used for anything any more. That's not a > defect in your patch, but I've separately proposed to remove it. okay > 2. I realized that the whole idea here depends on not having initdb > create more than one database without a fixed OID. The patch solves > that by nailing down the OID of template0, which is a sufficient > solution. However, I think nailing down the (initial) OID of postgres > as well would be a good idea, just in case somebody in the future > decides to add another system-created database. I agree with your thought. In my latest patch, postgres db gets created with a fixed OID. I have chosen an arbitrary number 16000 as postgres OID from the unpinned object OID range 12000 - 16383. > 3. The changes to gram.y don't do anything. Sure, you've added a new > "OID" token, but nothing generates that token, so it has no effect. > The reason the syntax works is that createdb_opt_name accepts "IDENT", > which means any string that's not in the keyword list (see kwlist.h). > But that's there already, so you don't need to do anything in this > file. okay > 4. I felt that the documentation and comments could be somewhat improved. The documentation and comment updates are more accurate with the required details. Thanks. > Here's an updated version in which I've reverted the changes to gram.y > and tried to improve the comments and documentation. Could you have a > look at implementing (2) above? Attached is the patch that implements comment (2). Shruthi KC EnterpriseDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Thu, Jan 20, 2022 at 7:09 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > > Here's an updated version in which I've reverted the changes to gram.y > > and tried to improve the comments and documentation. Could you have a > > look at implementing (2) above? > > Attached is the patch that implements comment (2). This probably needs minor rebasing on account of the fact that I just pushed the patch to remove datlastsysoid. I intended to do that before you posted a new version to save you the trouble, but I was too slow (or you were too fast, however you want to look at it). + errmsg("Invalid value for option \"%s\"", defel->defname), Per the "error message style" section of the documentation, primary error messages neither begin with a capital letter nor end with a period, while errdetail() messages are complete sentences and thus both begin with a capital letter and end with a period. But what I think you should really do here is get rid of the error detail and convey all the information in a primary error message. e.g. "OID %u is a system OID", or maybe better, "OIDs less than %u are reserved for system objects". + errmsg("database oid %u is already used by database %s", + errmsg("data directory exists for database oid %u", dboid)); Usually we write "OID" rather than "oid" in error messages. I think maybe it would be best to change the text slightly too. I suggest: database OID %u is already in use by database \"%s\" data directory already exists for database with OID %u + * it would fail. To avoid that, assign a fixed OID to template0 and + * postgres rather than letting the server choose one. a fixed OID -> fixed OIDs one -> them Or maybe put this comment back the way I had it and just talk about postgres, and then change the comment in make_postgres to say "Assign a fixed OID to postgres, for the same reasons as template0." + /* + * Make sure that binary upgrade propagate the database OID to the new + * cluster + */ This comment doesn't really seem necessary. It's sort of self-explanatory. +# Push the OID that is reserved for template0 database. +my $Template0ObjectId = + Catalog::FindDefinedSymbol('access/transam.h', '..', 'Template0ObjectId'); +push @{$oids}, $Template0ObjectId; Don't you need to do this for PostgresObjectId also? -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Thu, Jan 20, 2022 at 7:57 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 20, 2022 at 7:09 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > > > Here's an updated version in which I've reverted the changes to gram.y > > > and tried to improve the comments and documentation. Could you have a > > > look at implementing (2) above? > > > > Attached is the patch that implements comment (2). > > This probably needs minor rebasing on account of the fact that I just > pushed the patch to remove datlastsysoid. I intended to do that before > you posted a new version to save you the trouble, but I was too slow > (or you were too fast, however you want to look at it). I have rebased the patch. Kindly have a look at it. > + errmsg("Invalid value for option \"%s\"", defel->defname), > > Per the "error message style" section of the documentation, primary > error messages neither begin with a capital letter nor end with a > period, while errdetail() messages are complete sentences and thus > both begin with a capital letter and end with a period. But what I > think you should really do here is get rid of the error detail and > convey all the information in a primary error message. e.g. "OID %u is > a system OID", or maybe better, "OIDs less than %u are reserved for > system objects". Fixed > + errmsg("database oid %u is already used by database %s", > + errmsg("data directory exists for database oid %u", dboid)); > > Usually we write "OID" rather than "oid" in error messages. I think > maybe it would be best to change the text slightly too. I suggest: > > database OID %u is already in use by database \"%s\" > data directory already exists for database with OID %u The second error message will be reported when the data directory with the given OID exists in the data path but the corresponding DB does not exist. This could happen only if the user creates directories in the data folder, which is indeed not a valid usage. I have updated the error message to "data directory with the specified OID %u already exists" because the error message you recommended gives a slightly different meaning. > + * it would fail. To avoid that, assign a fixed OID to template0 and > + * postgres rather than letting the server choose one. > > a fixed OID -> fixed OIDs > one -> them > > Or maybe put this comment back the way I had it and just talk about > postgres, and then change the comment in make_postgres to say "Assign > a fixed OID to postgres, for the same reasons as template0." Fixed > + /* > + * Make sure that binary upgrade propagate the database OID to the new > + * cluster > + */ > > This comment doesn't really seem necessary. It's sort of self-explanatory. Removed > +# Push the OID that is reserved for template0 database. > +my $Template0ObjectId = > + Catalog::FindDefinedSymbol('access/transam.h', '..', 'Template0ObjectId'); > +push @{$oids}, $Template0ObjectId; > > Don't you need to do this for PostgresObjectId also? It is not required for PostgresObjectId. The unused_oids script provides a list of unused oids in the manually-assignable OIDs range (1-9999). Shruthi KC EnterpriseDB: http://www.enterprisedb.com
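[Putting Robert's style points together, the validation might end up looking something like this sketch - one plausible rendering, not necessarily the final patch's wording:]

    if (dboid < FirstNormalObjectId &&
        strcmp(dbname, "template0") != 0 &&
        !IsBinaryUpgrade)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("OIDs less than %u are reserved for system objects",
                        FirstNormalObjectId)));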
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Thu, Jan 20, 2022 at 11:03 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > It is not required for PostgresObjectId. The unused_oids script > provides a list of unused oids in the manually-assignable OIDs range > (1-9999). Well, so ... why are we not treating the OIDs for these two databases the same? If there's a range from which we can assign OIDs without risk of duplication and without needing to update this script, perhaps we ought to assign both of them from that range and leave the script alone. + * that is in use in the old cluster is also used in th new cluster - and th -> the. +preserves the DB, tablespace, relfilenode OIDs so TOAST and other references Insert "and" before "relfilenode". I think this is in pretty good shape now. -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Fri, Jan 21, 2022 at 1:08 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 20, 2022 at 11:03 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > > It is not required for PostgresObjectId. The unused_oids script > > provides a list of unused oids in the manually-assignable OIDs range > > (1-9999). > > Well, so ... why are we not treating the OIDs for these two databases > the same? If there's a range from which we can assign OIDs without > risk of duplication and without needing to update this script, perhaps > we ought to assign both of them from that range and leave the script > alone. From what I see in the code, template0 and postgres are the last things that get created in the initdb phase. The system OIDs that get assigned to these DBs vary from release to release. At present, the system assigned OIDs of template0 and postgres are 13679 and 13680 respectively. I feel it would be safe to assign 16000 and 16001 for template0 and postgres respectively from the unpinned object OID range 12000 - 16383. In the future, even if the initdb unpinned objects reach the range of 16000, issues can only arise if initdb() creates another system-created database for which the system assigns these reserved OIDs (16000, 16001). > + * that is in use in the old cluster is also used in th new cluster - and > > th -> the. Fixed > +preserves the DB, tablespace, relfilenode OIDs so TOAST and other references > > Insert "and" before "relfilenode". Fixed Attached is the latest patch for review. Shruthi KC EnterpriseDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Robert Haas
On Fri, Jan 21, 2022 at 8:40 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > From what I see in the code, template0 and postgres are the last > things that get created in initdb phase. The system OIDs that get > assigned to these DBs vary from release to release. At present, the > system assigned OIDs of template0 and postgres are 13679 and 13680 > respectively. I feel it would be safe to assign 16000 and 16001 for > template0 and postgres respectively from the unpinned object OID range > 12000 - 16383. In the future, even if the initdb unpinned objects > reach the range of 16000 issues can only arise if initdb() creates > another system-created database for which the system assigns these > reserved OIDs (16000, 16001).

It doesn't seem safe to me to rely on that. We don't know what could happen in the future if the number of built-in objects increases. Looking at the lengthy comment on this topic in transam.h, I see that there are three ranges:

1-9999       manually assigned OIDs
10000-11999  OIDs assigned by genbki.pl
12000-16384  OIDs assigned to unpinned objects post-bootstrap

It seems to me that what this comment is saying is that OIDs in the second and third categories are doled out by counters. Therefore, we can't know which of those OIDs will get used, or how many of them will get used, or which objects will get which OIDs. Therefore, I think we should go back to the approach that you were using for template0 and handle both that database and postgres using that method. That is, assign an OID manually, and make sure unused_oids knows that it should be counted as already used.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes: > It seems to me that what this comment is saying is that OIDs in the > second and third categories are doled out by counters. Therefore, we > can't know which of those OIDs will get used, or how many of them will > get used, or which objects will get which OIDs. Therefore, I think we > should go back to the approach that you were using for template0 and > handle both that database and postgres using that method. That is, > assign an OID manually, and make sure unused_oids knows that it should > be counted as already used. Indeed. If you're going to manually assign OIDs to these databases, do it honestly, and put them into the range intended for that purpose. Trying to take short-cuts is just going to cause trouble down the road. regards, tom lane
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From: Shruthi Gowda
On Sat, Jan 22, 2022 at 12:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > It seems to me that what this comment is saying is that OIDs in the > > second and third categories are doled out by counters. Therefore, we > > can't know which of those OIDs will get used, or how many of them will > > get used, or which objects will get which OIDs. Therefore, I think we > > should go back to the approach that you were using for template0 and > > handle both that database and postgres using that method. That is, > > assign an OID manually, and make sure unused_oids knows that it should > > be counted as already used. > > Indeed. If you're going to manually assign OIDs to these databases, > do it honestly, and put them into the range intended for that purpose. > Trying to take short-cuts is just going to cause trouble down the road. Understood. I will rework the patch accordingly. Thanks Regards, Shruthi KC EnterpriseDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Shruthi Gowda
Date:
On Sat, Jan 22, 2022 at 12:17 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jan 21, 2022 at 8:40 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > From what I see in the code, template0 and postgres are the last
> > things that get created in the initdb phase. The system OIDs that get
> > assigned to these DBs vary from release to release. At present, the
> > system-assigned OIDs of template0 and postgres are 13679 and 13680
> > respectively. I feel it would be safe to assign 16000 and 16001 for
> > template0 and postgres respectively from the unpinned object OID range
> > 12000 - 16383. In the future, even if the initdb unpinned objects
> > reach the range of 16000, issues can only arise if initdb creates
> > another system-created database for which the system assigns these
> > reserved OIDs (16000, 16001).
>
> It doesn't seem safe to me to rely on that. We don't know what could
> happen in the future if the number of built-in objects increases.
> Looking at the lengthy comment on this topic in transam.h, I see that
> there are three ranges:
>
> 1-9999       manually assigned OIDs
> 10000-11999  OIDs assigned by genbki.pl
> 12000-16383  OIDs assigned to unpinned objects post-bootstrap
>
> It seems to me that what this comment is saying is that OIDs in the
> second and third categories are doled out by counters. Therefore, we
> can't know which of those OIDs will get used, or how many of them will
> get used, or which objects will get which OIDs. Therefore, I think we
> should go back to the approach that you were using for template0 and
> handle both that database and postgres using that method. That is,
> assign an OID manually, and make sure unused_oids knows that it should
> be counted as already used.

Agree. In the latest patch, the template0 and postgres OIDs are fixed
to unused manually assigned OIDs 4 and 5 respectively. These OIDs are
no longer listed as unused OIDs.

Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com
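For reference, the patch pins the two databases with ordinary manually
assigned OIDs, along these lines (as committed to
src/include/access/transam.h, before the macros were later moved and
renamed):

    /* OIDs of Template0 and Postgres database are fixed */
    #define Template0ObjectId       4
    #define PostgresObjectId        5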
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Bruce Momjian
Date:
On Sat, Jan 22, 2022 at 12:47:35PM +0530, Shruthi Gowda wrote: > On Sat, Jan 22, 2022 at 12:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Robert Haas <robertmhaas@gmail.com> writes: > > > It seems to me that what this comment is saying is that OIDs in the > > > second and third categories are doled out by counters. Therefore, we > > > can't know which of those OIDs will get used, or how many of them will > > > get used, or which objects will get which OIDs. Therefore, I think we > > > should go back to the approach that you were using for template0 and > > > handle both that database and postgres using that method. That is, > > > assign an OID manually, and make sure unused_oids knows that it should > > > be counted as already used. > > > > Indeed. If you're going to manually assign OIDs to these databases, > > do it honestly, and put them into the range intended for that purpose. > > Trying to take short-cuts is just going to cause trouble down the road. > > Understood. I will rework the patch accordingly. Thanks Thanks. To get the rsync update reduction we need to preserve database oids. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Robert Haas
Date:
On Sat, Jan 22, 2022 at 2:20 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > Agree. In the latest patch, the template0 and postgres OIDs are fixed > to unused manually assigned OIDs 4 and 5 respectively. These OIDs are > no more listed as unused OIDs. Thanks. Committed with a few more cosmetic changes. -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Shruthi Gowda
Date:
On Tue, Jan 25, 2022 at 1:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jan 22, 2022 at 2:20 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > Agree. In the latest patch, the template0 and postgres OIDs are fixed
> > to unused manually assigned OIDs 4 and 5 respectively. These OIDs are
> > no more listed as unused OIDs.
>
> Thanks. Committed with a few more cosmetic changes.

Thanks, Robert, for committing this patch.
Stephen Frost <sfrost@snowman.net> wrote: > Perhaps this is all too meta and we need to work through some specific > ideas around just what this would look like. In particular, thinking > about what this API would look like and how it would be used by > reorderbuffer.c, which builds up changes in memory and then does a bare > write() call, seems like a main use-case to consider. The gist there > being "can we come up with an API to do all these things that doesn't > require entirely rewriting ReorderBufferSerializeChange()?" > > Seems like it'd be easier to achieve that by having something that looks > very close to how write() looks, but just happens to have the option to > run the data through a stream cipher and maybe does better error > handling for us. Making that layer also do block-based access to the > files underneath seems like a much larger effort that, sure, may make > some things better too but if we could do that with the same API then it > could also be done later if someone's interested in that. My initial proposal is in this new thread: https://www.postgresql.org/message-id/4987.1644323098%40antos -- Antonin Houska Web: https://www.cybertec-postgresql.com
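A rough sketch of the write()-shaped wrapper being described, with
entirely hypothetical names (nothing here is from a posted patch):

    #include <sys/types.h>
    #include <stddef.h>

    /* Hypothetical handle: a file descriptor plus stream-cipher state. */
    typedef struct EncryptedFile EncryptedFile;

    /*
     * Hypothetical write()/read() look-alikes: same shape as the POSIX
     * calls, but data is optionally run through a stream cipher in
     * transit, and errors are reported centrally instead of being
     * checked at every call site.
     */
    extern ssize_t enc_write(EncryptedFile *file, const void *buf, size_t count);
    extern ssize_t enc_read(EncryptedFile *file, void *buf, size_t count);

The point of keeping the signature this close to write() is that a
caller such as ReorderBufferSerializeChange() could switch over without
restructuring its buffer-assembly logic.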
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Justin Pryzby
Date:
On Tue, Jan 25, 2022 at 10:19:53AM +0530, Shruthi Gowda wrote: > On Tue, Jan 25, 2022 at 1:14 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Sat, Jan 22, 2022 at 2:20 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > > > Agree. In the latest patch, the template0 and postgres OIDs are fixed > > > to unused manually assigned OIDs 4 and 5 respectively. These OIDs are > > > no more listed as unused OIDs. > > > > Thanks. Committed with a few more cosmetic changes. > > Thanks, Robert for committing this patch. If I'm not wrong, this can be closed. https://commitfest.postgresql.org/37/3296/
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Andres Freund
Date:
On 2022-01-24 14:44:10 -0500, Robert Haas wrote: > On Sat, Jan 22, 2022 at 2:20 AM Shruthi Gowda <gowdashru@gmail.com> wrote: > > Agree. In the latest patch, the template0 and postgres OIDs are fixed > > to unused manually assigned OIDs 4 and 5 respectively. These OIDs are > > no more listed as unused OIDs. > > Thanks. Committed with a few more cosmetic changes. I noticed this still has an open CF entry: https://commitfest.postgresql.org/37/3296/ I assume it can be marked as committed? - Andres
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Robert Haas
Date:
On Mon, Mar 21, 2022 at 8:52 PM Andres Freund <andres@anarazel.de> wrote: > I noticed this still has an open CF entry: https://commitfest.postgresql.org/37/3296/ > I assume it can be marked as committed? Yeah, done now. But don't forget that we still need to do something on the "wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:" thread, and I think the ball is in your court there. -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Jan 22, 2022 at 2:20 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
>> Agree. In the latest patch, the template0 and postgres OIDs are fixed
>> to unused manually assigned OIDs 4 and 5 respectively. These OIDs are
>> no more listed as unused OIDs.

> Thanks. Committed with a few more cosmetic changes.

I happened to take a closer look at this patch today, and I'm pretty
unhappy with the way that the assignment of those OIDs was managed.
There are two big problems:

1. IsPinnedObject() will now report that template0 and postgres are
pinned. This seems not to prevent one from dropping them (I guess
dropdb() doesn't consult IsPinnedObject), but it would probably bollix
any pg_shdepend management that should happen for them.

2. The Catalog.pm infrastructure knows nothing about these OIDs. While
the unused_oids script was hack-n-slashed to claim that the OIDs are
used, other scripts won't know about them; for example duplicate_oids
won't report conflicts if someone tries to reuse those OIDs.

The attached draft patch attempts to improve this situation. It
reserves these OIDs, and creates the associated macros, through the
normal BKI infrastructure by adding entries in pg_database.dat. We
have to delete those rows again during initdb, which is slightly ugly
but surely no more so than initdb's other direct manipulations of
pg_database.

There are a few loose ends:

* I'm a bit inclined to simplify IsPinnedObject by just teaching it
that *no* entries of pg_database are pinned, which would correspond to
the evident lack of enforcement in dropdb(). Can anyone see a reason
why we might pin some database in future?

* I had to set up the additional pg_database entries with nonzero
datfrozenxid to avoid an assertion failure during initdb's first
VACUUM. (That VACUUM will overwrite template1's datfrozenxid before
computing the global minimum frozen XID, but not these others; and it
doesn't like finding that the minimum is zero.) This feels klugy. An
alternative is to delete the extra pg_database rows before that VACUUM,
which would mean taking those deletes out of make_template0 and
make_postgres and putting them somewhere seemingly unrelated, so that's
a bit crufty too. Anybody have a preference?

* The new macro names seem ill-chosen. Template0ObjectId is spelled
randomly differently from the longstanding TemplateDbOid, and surely
PostgresObjectId is about as vague a name as could possibly have been
thought of (please name an object that it couldn't apply to). I'm a
little inclined to rename TemplateDbOid to Template1DbOid and use
Template0DbOid and PostgresDbOid for the others, but I didn't pull that
trigger here.

Comments?

			regards, tom lane

diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 520f77971b..d7e5c02f95 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -339,9 +339,11 @@ IsPinnedObject(Oid classId, Oid objectId)
     * robustness.
     */

-   /* template1 is not pinned */
+   /* template1, template0, postgres are not pinned */
    if (classId == DatabaseRelationId &&
-       objectId == TemplateDbOid)
+       (objectId == TemplateDbOid ||
+        objectId == Template0ObjectId ||
+        objectId == PostgresObjectId))
        return false;

    /* the public namespace is not pinned */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 1cb4a5b0d2..04454b3d7c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -59,11 +59,11 @@
 #include "sys/mman.h"
 #endif

-#include "access/transam.h"
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
 #include "catalog/pg_collation_d.h"
+#include "catalog/pg_database_d.h"  /* pgrminclude ignore */
 #include "common/file_perm.h"
 #include "common/file_utils.h"
 #include "common/logging.h"
@@ -1806,14 +1806,24 @@ make_template0(FILE *cmdfd)
     * objects in the old cluster, the problem scenario only exists if the OID
     * that is in use in the old cluster is also used in the new cluster - and
     * the new cluster should be the result of a fresh initdb.)
-    *
-    * We use "STRATEGY = file_copy" here because checkpoints during initdb
-    * are cheap. "STRATEGY = wal_log" would generate more WAL, which would
-    * be a little bit slower and make the new cluster a little bit bigger.
     */
    static const char *const template0_setup[] = {
-       "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID = "
-       CppAsString2(Template0ObjectId)
+       /*
+        * Since pg_database.dat includes a dummy row for template0, we have
+        * to remove that before creating the database for real.
+        */
+       "DELETE FROM pg_database WHERE datname = 'template0';\n\n",
+
+       /*
+        * Clone template1 to make template0.
+        *
+        * We use "STRATEGY = file_copy" here because checkpoints during
+        * initdb are cheap. "STRATEGY = wal_log" would generate more WAL,
+        * which would be a little bit slower and make the new cluster a
+        * little bit bigger.
+        */
+       "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false"
+       " OID = " CppAsString2(Template0ObjectId)
        " STRATEGY = file_copy;\n\n",

        /*
@@ -1836,12 +1846,11 @@ make_template0(FILE *cmdfd)
        "REVOKE CREATE,TEMPORARY ON DATABASE template1 FROM public;\n\n",
        "REVOKE CREATE,TEMPORARY ON DATABASE template0 FROM public;\n\n",

-       "COMMENT ON DATABASE template0 IS 'unmodifiable empty database';\n\n",
-
        /*
-        * Finally vacuum to clean up dead rows in pg_database
+        * Note: postgres.bki filled in a comment for template0, so we need
+        * not do that here.
         */
-       "VACUUM pg_database;\n\n",
+
        NULL
    };

@@ -1858,12 +1867,19 @@ make_postgres(FILE *cmdfd)
    const char *const *line;

    /*
-    * Just as we did for template0, and for the same reasons, assign a fixed
-    * OID to postgres and select the file_copy strategy.
+    * Comments in make_template0() mostly apply here too.
     */
    static const char *const postgres_setup[] = {
-       "CREATE DATABASE postgres OID = " CppAsString2(PostgresObjectId) " STRATEGY = file_copy;\n\n",
-       "COMMENT ON DATABASE postgres IS 'default administrative connection database';\n\n",
+       "DELETE FROM pg_database WHERE datname = 'postgres';\n\n",
+
+       "CREATE DATABASE postgres OID = " CppAsString2(PostgresObjectId)
+       " STRATEGY = file_copy;\n\n",
+
+       /*
+        * Finally vacuum to clean up dead rows in pg_database
+        */
+       "VACUUM pg_database;\n\n",
+
        NULL
    };
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 969e2a7a46..bcb81e02c4 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -2844,10 +2844,11 @@ dumpDatabase(Archive *fout)
    qdatname = pg_strdup(fmtId(datname));

    /*
-    * Prepare the CREATE DATABASE command. We must specify encoding, locale,
-    * and tablespace since those can't be altered later. Other DB properties
-    * are left to the DATABASE PROPERTIES entry, so that they can be applied
-    * after reconnecting to the target DB.
+    * Prepare the CREATE DATABASE command. We must specify OID (if we want
+    * to preserve that), as well as the encoding, locale, and tablespace
+    * since those can't be altered later. Other DB properties are left to
+    * the DATABASE PROPERTIES entry, so that they can be applied after
+    * reconnecting to the target DB.
     */
    if (dopt->binary_upgrade)
    {
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 9a2816de51..338dfca5a0 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -196,10 +196,6 @@ FullTransactionIdAdvance(FullTransactionId *dest)
 #define FirstUnpinnedObjectId  12000
 #define FirstNormalObjectId    16384

-/* OIDs of Template0 and Postgres database are fixed */
-#define Template0ObjectId      4
-#define PostgresObjectId       5
-
 /*
  * VariableCache is a data structure in shared memory that is used to track
  * OID and XID assignment state. For largely historical reasons, there is
diff --git a/src/include/catalog/pg_database.dat b/src/include/catalog/pg_database.dat
index 5feedff7bf..6c2221a4e9 100644
--- a/src/include/catalog/pg_database.dat
+++ b/src/include/catalog/pg_database.dat
@@ -19,4 +19,24 @@
   datminmxid => '1', dattablespace => 'pg_default', datcollate => 'LC_COLLATE',
   datctype => 'LC_CTYPE', daticulocale => 'ICU_LOCALE', datacl => '_null_' },

+# The template0 and postgres entries are included here to reserve their
+# associated OIDs. We show their correct properties for reference, but
+# note that these pg_database rows are removed and replaced by initdb.
+# Unlike the row for template1, their datfrozenxids will not be overwritten
+# during initdb's first VACUUM, so we have to provide normal-looking XIDs.
+
+{ oid => '4', oid_symbol => 'Template0ObjectId',
+  descr => 'unmodifiable empty database',
+  datname => 'template0', encoding => 'ENCODING', datlocprovider => 'LOCALE_PROVIDER', datistemplate => 't',
+  datallowconn => 'f', datconnlimit => '-1', datfrozenxid => '4',
+  datminmxid => '1', dattablespace => 'pg_default', datcollate => 'LC_COLLATE',
+  datctype => 'LC_CTYPE', daticulocale => 'ICU_LOCALE', datacl => '_null_' },
+
+{ oid => '5', oid_symbol => 'PostgresObjectId',
+  descr => 'default administrative connection database',
+  datname => 'postgres', encoding => 'ENCODING', datlocprovider => 'LOCALE_PROVIDER', datistemplate => 'f',
+  datallowconn => 't', datconnlimit => '-1', datfrozenxid => '4',
+  datminmxid => '1', dattablespace => 'pg_default', datcollate => 'LC_COLLATE',
+  datctype => 'LC_CTYPE', daticulocale => 'ICU_LOCALE', datacl => '_null_' },
+
 ]
diff --git a/src/include/catalog/unused_oids b/src/include/catalog/unused_oids
index 61d41e7561..e55bc6fa3c 100755
--- a/src/include/catalog/unused_oids
+++ b/src/include/catalog/unused_oids
@@ -32,15 +32,6 @@ my @input_files = glob("pg_*.h");

 my $oids = Catalog::FindAllOidsFromHeaders(@input_files);

-# Push the template0 and postgres database OIDs.
-my $Template0ObjectId =
-  Catalog::FindDefinedSymbol('access/transam.h', '..', 'Template0ObjectId');
-push @{$oids}, $Template0ObjectId;
-
-my $PostgresObjectId =
-  Catalog::FindDefinedSymbol('access/transam.h', '..', 'PostgresObjectId');
-push @{$oids}, $PostgresObjectId;
-
 # Also push FirstGenbkiObjectId to serve as a terminator for the last gap.
 my $FirstGenbkiObjectId =
   Catalog::FindDefinedSymbol('access/transam.h', '..', 'FirstGenbkiObjectId');
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Robert Haas
Date:
On Wed, Apr 20, 2022 at 2:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The attached draft patch attempts to improve this situation.
> It reserves these OIDs, and creates the associated macros, through
> the normal BKI infrastructure by adding entries in pg_database.dat.
> We have to delete those rows again during initdb, which is slightly
> ugly but surely no more so than initdb's other direct manipulations
> of pg_database.

I'm not sure I really like this approach, but if you're firmly
convinced that it's better than cleaning up the loose ends in some
other way, I'm not going to waste a lot of energy fighting about it. It
doesn't seem horrible.

> There are a few loose ends:
>
> * I'm a bit inclined to simplify IsPinnedObject by just teaching
> it that *no* entries of pg_database are pinned, which would correspond
> to the evident lack of enforcement in dropdb().  Can anyone see a
> reason why we might pin some database in future?

It's kind of curious that we don't pin anything now. There's kind of
nothing to stop you from hosing the system by dropping template0 and/or
template1, or mutating them into some crazy form that doesn't work. But
having said that, if as a matter of policy we don't even pin template0
or template1 or postgres, then it seems unlikely that we would suddenly
decide to pin some new database in the future.

> * I had to set up the additional pg_database entries with nonzero
> datfrozenxid to avoid an assertion failure during initdb's first VACUUM.
> (That VACUUM will overwrite template1's datfrozenxid before computing
> the global minimum frozen XID, but not these others; and it doesn't like
> finding that the minimum is zero.)  This feels klugy.  An alternative is
> to delete the extra pg_database rows before that VACUUM, which would
> mean taking those deletes out of make_template0 and make_postgres and
> putting them somewhere seemingly unrelated, so that's a bit crufty too.
> Anybody have a preference?

Not really. If anything that's an argument against this entire
approach, but I already commented on that point above.

> * The new macro names seem ill-chosen.  Template0ObjectId is spelled
> randomly differently from the longstanding TemplateDbOid, and surely
> PostgresObjectId is about as vague a name as could possibly have
> been thought of (please name an object that it couldn't apply to).
> I'm a little inclined to rename TemplateDbOid to Template1DbOid and
> use Template0DbOid and PostgresDbOid for the others, but I didn't pull
> that trigger here.

Seems totally reasonable. I don't find the current naming horrible or
I'd not have committed it that way, but putting Db in there makes
sense, too.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Apr 20, 2022 at 2:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The attached draft patch attempts to improve this situation.
>> It reserves these OIDs, and creates the associated macros, through
>> the normal BKI infrastructure by adding entries in pg_database.dat.
>> We have to delete those rows again during initdb, which is slightly
>> ugly but surely no more so than initdb's other direct manipulations
>> of pg_database.

> I'm not sure I really like this approach, but if you're firmly
> convinced that it's better than cleaning up the loose ends in some
> other way, I'm not going to waste a lot of energy fighting about it.

Having just had to bury my nose in renumber_oids.pl, I thought of a
different approach we could take to expose these OIDs to Catalog.pm.
That's to invent a new macro that Catalog.pm recognizes, and write
something about like this in pg_database.h:

    DECLARE_OID_DEFINING_MACRO(Template0ObjectId, 4);
    DECLARE_OID_DEFINING_MACRO(PostgresObjectId, 5);

That would result in (a) the OIDs becoming known to Catalog.pm as
reserved, though it wouldn't have any great clue about their semantics;
and (b) this getting emitted into pg_database_d.h:

    #define Template0ObjectId 4
    #define PostgresObjectId 5

Then we'd not need the dummy entries in pg_database.dat, which does
seem cleaner now that I think about it.

A downside is that with no context, Catalog.pm could not provide name
translation services during postgres.bki generation for such OIDs ---
but at least for these entries, we don't need that.

If that seems more plausible to you I'll set about preparing a patch.

			regards, tom lane
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Robert Haas
Date:
On Wed, Apr 20, 2022 at 4:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Having just had to bury my nose in renumber_oids.pl, I thought of a > different approach we could take to expose these OIDs to Catalog.pm. > That's to invent a new macro that Catalog.pm recognizes, and write > something about like this in pg_database.h: > > DECLARE_OID_DEFINING_MACRO(Template0ObjectId, 4); > DECLARE_OID_DEFINING_MACRO(PostgresObjectId, 5); > > If that seems more plausible to you I'll set about preparing a patch. I like it! -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Apr 20, 2022 at 4:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Having just had to bury my nose in renumber_oids.pl, I thought of a
>> different approach we could take to expose these OIDs to Catalog.pm.
>> That's to invent a new macro that Catalog.pm recognizes, and write
>> something about like this in pg_database.h:
>> DECLARE_OID_DEFINING_MACRO(Template0ObjectId, 4);
>> DECLARE_OID_DEFINING_MACRO(PostgresObjectId, 5);

> I like it!

0001 attached is a revised patch that does it that way. This seems
like a clearly better answer.

0002 contains the perhaps-slightly-more-controversial changes of
changing the macro names and explicitly pinning no databases.

			regards, tom lane

diff --git a/src/backend/catalog/Catalog.pm b/src/backend/catalog/Catalog.pm
index 0275795dea..ece0a934f0 100644
--- a/src/backend/catalog/Catalog.pm
+++ b/src/backend/catalog/Catalog.pm
@@ -44,6 +44,8 @@ sub ParseHeader
 	$catalog{columns} = [];
 	$catalog{toasting} = [];
 	$catalog{indexing} = [];
+	$catalog{other_oids} = [];
+	$catalog{foreign_keys} = [];
 	$catalog{client_code} = [];

 	open(my $ifh, '<', $input_file) || die "$input_file: $!";
@@ -118,6 +120,14 @@ sub ParseHeader
 				index_decl => $6
 			  };
 		}
+		elsif (/^DECLARE_OID_DEFINING_MACRO\(\s*(\w+),\s*(\d+)\)/)
+		{
+			push @{ $catalog{other_oids} },
+			  {
+				other_name => $1,
+				other_oid => $2
+			  };
+		}
 		elsif (
 			/^DECLARE_(ARRAY_)?FOREIGN_KEY(_OPT)?\(\s*\(([^)]+)\),\s*(\w+),\s*\(([^)]+)\)\)/
 		  )
@@ -572,6 +582,10 @@ sub FindAllOidsFromHeaders
 		{
 			push @oids, $index->{index_oid};
 		}
+		foreach my $other (@{ $catalog->{other_oids} })
+		{
+			push @oids, $other->{other_oid};
+		}
 	}

 	return \@oids;
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 520f77971b..d7e5c02f95 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -339,9 +339,11 @@ IsPinnedObject(Oid classId, Oid objectId)
 	 * robustness.
 	 */

-	/* template1 is not pinned */
+	/* template1, template0, postgres are not pinned */
 	if (classId == DatabaseRelationId &&
-		objectId == TemplateDbOid)
+		(objectId == TemplateDbOid ||
+		 objectId == Template0ObjectId ||
+		 objectId == PostgresObjectId))
 		return false;

 	/* the public namespace is not pinned */
diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index 2d02b02267..f4ec6d6d40 100644
--- a/src/backend/catalog/genbki.pl
+++ b/src/backend/catalog/genbki.pl
@@ -472,7 +472,7 @@ EOM
 	  $catalog->{rowtype_oid_macro}, $catalog->{rowtype_oid}
 	  if $catalog->{rowtype_oid_macro};

-	# Likewise for macros for toast and index OIDs
+	# Likewise for macros for toast, index, and other OIDs
 	foreach my $toast (@{ $catalog->{toasting} })
 	{
 		printf $def "#define %s %s\n",
@@ -488,6 +488,12 @@ EOM
 		  $index->{index_oid_macro}, $index->{index_oid}
 		  if $index->{index_oid_macro};
 	}
+	foreach my $other (@{ $catalog->{other_oids} })
+	{
+		printf $def "#define %s %s\n",
+		  $other->{other_name}, $other->{other_oid}
+		  if $other->{other_name};
+	}

 	print $def "\n";
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 1cb4a5b0d2..5e2eeefc4c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -59,11 +59,11 @@
 #include "sys/mman.h"
 #endif

-#include "access/transam.h"
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h"	/* pgrminclude ignore */
 #include "catalog/pg_collation_d.h"
+#include "catalog/pg_database_d.h"	/* pgrminclude ignore */
 #include "common/file_perm.h"
 #include "common/file_utils.h"
 #include "common/logging.h"
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d3588607e7..786d592e2b 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -2901,10 +2901,11 @@ dumpDatabase(Archive *fout)
 	qdatname = pg_strdup(fmtId(datname));

 	/*
-	 * Prepare the CREATE DATABASE command. We must specify encoding, locale,
-	 * and tablespace since those can't be altered later. Other DB properties
-	 * are left to the DATABASE PROPERTIES entry, so that they can be applied
-	 * after reconnecting to the target DB.
+	 * Prepare the CREATE DATABASE command. We must specify OID (if we want
+	 * to preserve that), as well as the encoding, locale, and tablespace
+	 * since those can't be altered later. Other DB properties are left to
+	 * the DATABASE PROPERTIES entry, so that they can be applied after
+	 * reconnecting to the target DB.
 	 */
 	if (dopt->binary_upgrade)
 	{
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 9a2816de51..338dfca5a0 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -196,10 +196,6 @@ FullTransactionIdAdvance(FullTransactionId *dest)
 #define FirstUnpinnedObjectId	12000
 #define FirstNormalObjectId	16384

-/* OIDs of Template0 and Postgres database are fixed */
-#define Template0ObjectId	4
-#define PostgresObjectId	5
-
 /*
  * VariableCache is a data structure in shared memory that is used to track
  * OID and XID assignment state. For largely historical reasons, there is
diff --git a/src/include/catalog/genbki.h b/src/include/catalog/genbki.h
index 4ecd76f4be..992b784236 100644
--- a/src/include/catalog/genbki.h
+++ b/src/include/catalog/genbki.h
@@ -84,6 +84,14 @@
 #define DECLARE_UNIQUE_INDEX(name,oid,oidmacro,decl) extern int no_such_variable
 #define DECLARE_UNIQUE_INDEX_PKEY(name,oid,oidmacro,decl) extern int no_such_variable

+/*
+ * These lines inform genbki.pl about manually-assigned OIDs that do not
+ * correspond to any entry in the catalog *.dat files, but should be subject
+ * to uniqueness verification and renumber_oids.pl renumbering.  A C macro
+ * to #define the given name is emitted into the corresponding *_d.h file.
+ */
+#define DECLARE_OID_DEFINING_MACRO(name,oid) extern int no_such_variable
+
 /*
  * These lines are processed by genbki.pl to create a table for use
  * by the pg_get_catalog_foreign_keys() function.  We do not have any
diff --git a/src/include/catalog/pg_database.h b/src/include/catalog/pg_database.h
index e10e91c0af..96be9e9729 100644
--- a/src/include/catalog/pg_database.h
+++ b/src/include/catalog/pg_database.h
@@ -91,4 +91,13 @@ DECLARE_TOAST_WITH_MACRO(pg_database, 4177, 4178, PgDatabaseToastTable, PgDatabaseToastIndex);
 DECLARE_UNIQUE_INDEX(pg_database_datname_index, 2671, DatabaseNameIndexId, on pg_database using btree(datname name_ops));
 DECLARE_UNIQUE_INDEX_PKEY(pg_database_oid_index, 2672, DatabaseOidIndexId, on pg_database using btree(oid oid_ops));

+/*
+ * pg_database.dat contains an entry for template1, but not for the template0
+ * or postgres databases, because those are created later in initdb.
+ * However, we still want to manually assign the OIDs for template0 and
+ * postgres, so declare those here.
+ */
+DECLARE_OID_DEFINING_MACRO(Template0ObjectId, 4);
+DECLARE_OID_DEFINING_MACRO(PostgresObjectId, 5);
+
 #endif							/* PG_DATABASE_H */
diff --git a/src/include/catalog/renumber_oids.pl b/src/include/catalog/renumber_oids.pl
index 7de13da4bd..ba8c69c87e 100755
--- a/src/include/catalog/renumber_oids.pl
+++ b/src/include/catalog/renumber_oids.pl
@@ -170,6 +170,16 @@ foreach my $input_file (@header_files)
 				$changed = 1;
 			}
 		}
+		elsif (/^(DECLARE_OID_DEFINING_MACRO\(\s*\w+,\s*)(\d+)\)/)
+		{
+			if (exists $maphash{$2})
+			{
+				my $repl = $1 . $maphash{$2} . ")";
+				$line =~
+				  s/^DECLARE_OID_DEFINING_MACRO\(\s*\w+,\s*\d+\)/$repl/;
+				$changed = 1;
+			}
+		}
 		elsif ($line =~ m/^CATALOG\((\w+),(\d+),(\w+)\)/)
 		{
 			if (exists $maphash{$2})
diff --git a/src/include/catalog/unused_oids b/src/include/catalog/unused_oids
index 61d41e7561..e55bc6fa3c 100755
--- a/src/include/catalog/unused_oids
+++ b/src/include/catalog/unused_oids
@@ -32,15 +32,6 @@ my @input_files = glob("pg_*.h");

 my $oids = Catalog::FindAllOidsFromHeaders(@input_files);

-# Push the template0 and postgres database OIDs.
-my $Template0ObjectId =
-  Catalog::FindDefinedSymbol('access/transam.h', '..', 'Template0ObjectId');
-push @{$oids}, $Template0ObjectId;
-
-my $PostgresObjectId =
-  Catalog::FindDefinedSymbol('access/transam.h', '..', 'PostgresObjectId');
-push @{$oids}, $PostgresObjectId;
-
 # Also push FirstGenbkiObjectId to serve as a terminator for the last gap.
 my $FirstGenbkiObjectId =
   Catalog::FindDefinedSymbol('access/transam.h', '..', 'FirstGenbkiObjectId');

diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml
index 33955494c6..20894baf18 100644
--- a/doc/src/sgml/bki.sgml
+++ b/doc/src/sgml/bki.sgml
@@ -180,12 +180,13 @@
 [

 # A comment could appear here.
-{ oid => '1', oid_symbol => 'TemplateDbOid',
+{ oid => '1', oid_symbol => 'Template1DbOid',
   descr => 'database\'s default template',
-  datname => 'template1', encoding => 'ENCODING', datistemplate => 't',
+  datname => 'template1', encoding => 'ENCODING',
+  datlocprovider => 'LOCALE_PROVIDER', datistemplate => 't',
   datallowconn => 't', datconnlimit => '-1', datfrozenxid => '0',
   datminmxid => '1', dattablespace => 'pg_default', datcollate => 'LC_COLLATE',
-  datctype => 'LC_CTYPE', datacl => '_null_' },
+  datctype => 'LC_CTYPE', daticulocale => 'ICU_LOCALE', datacl => '_null_' },

 ]
 ]]></programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5eabd32cf6..61cda56c6f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4540,9 +4540,9 @@ BootStrapXLOG(void)
 	checkPoint.nextMulti = FirstMultiXactId;
 	checkPoint.nextMultiOffset = 0;
 	checkPoint.oldestXid = FirstNormalTransactionId;
-	checkPoint.oldestXidDB = TemplateDbOid;
+	checkPoint.oldestXidDB = Template1DbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
-	checkPoint.oldestMultiDB = TemplateDbOid;
+	checkPoint.oldestMultiDB = Template1DbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
 	checkPoint.newestCommitTsXid = InvalidTransactionId;
 	checkPoint.time = (pg_time_t) time(NULL);
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index d7e5c02f95..e784538aae 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -339,18 +339,20 @@ IsPinnedObject(Oid classId, Oid objectId)
 	 * robustness.
 	 */

-	/* template1, template0, postgres are not pinned */
-	if (classId == DatabaseRelationId &&
-		(objectId == TemplateDbOid ||
-		 objectId == Template0ObjectId ||
-		 objectId == PostgresObjectId))
-		return false;
-
 	/* the public namespace is not pinned */
 	if (classId == NamespaceRelationId &&
 		objectId == PG_PUBLIC_NAMESPACE)
 		return false;

+	/*
+	 * Databases are never pinned.  It might seem that it'd be prudent to pin
+	 * at least template0; but we do this intentionally so that template0 and
+	 * template1 can be rebuilt from each other, thus letting them serve as
+	 * mutual backups (as long as you've not modified template1, anyway).
+	 */
+	if (classId == DatabaseRelationId)
+		return false;
+
 	/*
 	 * All other initdb-created objects are pinned.  This is overkill (the
 	 * system doesn't really depend on having every last weird datatype, for
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 9139fe895c..5dbc7379e3 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -908,7 +908,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	 */
 	if (bootstrap)
 	{
-		MyDatabaseId = TemplateDbOid;
+		MyDatabaseId = Template1DbOid;
 		MyDatabaseTableSpace = DEFAULTTABLESPACE_OID;
 	}
 	else if (in_dbname != NULL)
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 5e2eeefc4c..fcef651c2f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1812,8 +1812,8 @@ make_template0(FILE *cmdfd)
 	 * be a little bit slower and make the new cluster a little bit bigger.
 	 */
 	static const char *const template0_setup[] = {
-		"CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID = "
-		CppAsString2(Template0ObjectId)
+		"CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false"
+		" OID = " CppAsString2(Template0DbOid)
 		" STRATEGY = file_copy;\n\n",

 		/*
@@ -1862,7 +1862,8 @@ make_postgres(FILE *cmdfd)
 	 * OID to postgres and select the file_copy strategy.
 	 */
 	static const char *const postgres_setup[] = {
-		"CREATE DATABASE postgres OID = " CppAsString2(PostgresObjectId) " STRATEGY = file_copy;\n\n",
+		"CREATE DATABASE postgres OID = " CppAsString2(PostgresDbOid)
+		" STRATEGY = file_copy;\n\n",
 		"COMMENT ON DATABASE postgres IS 'default administrative connection database';\n\n",
 		NULL
 	};
diff --git a/src/include/catalog/pg_database.dat b/src/include/catalog/pg_database.dat
index 5feedff7bf..05873f74f6 100644
--- a/src/include/catalog/pg_database.dat
+++ b/src/include/catalog/pg_database.dat
@@ -12,7 +12,7 @@

 [

-{ oid => '1', oid_symbol => 'TemplateDbOid',
+{ oid => '1', oid_symbol => 'Template1DbOid',
   descr => 'default template for new databases',
   datname => 'template1', encoding => 'ENCODING', datlocprovider => 'LOCALE_PROVIDER', datistemplate => 't',
   datallowconn => 't', datconnlimit => '-1', datfrozenxid => '0',
diff --git a/src/include/catalog/pg_database.h b/src/include/catalog/pg_database.h
index 96be9e9729..611c95656a 100644
--- a/src/include/catalog/pg_database.h
+++ b/src/include/catalog/pg_database.h
@@ -97,7 +97,7 @@ DECLARE_UNIQUE_INDEX_PKEY(pg_database_oid_index, 2672, DatabaseOidIndexId, on pg_database using btree(oid oid_ops));
  * However, we still want to manually assign the OIDs for template0 and
  * postgres, so declare those here.
  */
-DECLARE_OID_DEFINING_MACRO(Template0ObjectId, 4);
-DECLARE_OID_DEFINING_MACRO(PostgresObjectId, 5);
+DECLARE_OID_DEFINING_MACRO(Template0DbOid, 4);
+DECLARE_OID_DEFINING_MACRO(PostgresDbOid, 5);

 #endif							/* PG_DATABASE_H */
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Robert Haas
Date:
On Thu, Apr 21, 2022 at 1:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Wed, Apr 20, 2022 at 4:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Having just had to bury my nose in renumber_oids.pl, I thought of a > >> different approach we could take to expose these OIDs to Catalog.pm. > >> That's to invent a new macro that Catalog.pm recognizes, and write > >> something about like this in pg_database.h: > >> DECLARE_OID_DEFINING_MACRO(Template0ObjectId, 4); > >> DECLARE_OID_DEFINING_MACRO(PostgresObjectId, 5); > > > I like it! > > 0001 attached is a revised patch that does it that way. This seems > like a clearly better answer. > > 0002 contains the perhaps-slightly-more-controversial changes of > changing the macro names and explicitly pinning no databases. Both patches look good to me. -- Robert Haas EDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Apr 21, 2022 at 1:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> 0001 attached is a revised patch that does it that way. This seems >> like a clearly better answer. >> 0002 contains the perhaps-slightly-more-controversial changes of >> changing the macro names and explicitly pinning no databases. > Both patches look good to me. Pushed, thanks for looking. regards, tom lane
On Thu, Oct 7, 2021 at 11:50 AM Stephen Frost <sfrost@snowman.net> wrote:
> Alternatively, we could use
> that technique to just provide a better per-page checksum than what we
> have today. Maybe we could figure out how to leverage that to move to
> 64bit transaction IDs with some page-level epoch.

I'm interested in assessing the feasibility of a "better page-level
checksums" feature. I have a few questions, and a few observations.

One of my questions is what algorithm(s) we'd want to support. I did a
quick Google search and found that btrfs supports CRC-32C, XXHASH,
SHA256, and BLAKE2B. I don't know that we want to support that many
options (but maybe we do) and I don't think CRC-32C makes any sense
here, for two reasons. First, we've already got a 16-bit checksum, and
a 32-bit checksum doesn't seem like it's gaining enough to be worth the
implementation complexity. Second, we're probably going to have to dole
out per-page space in multiples of MAXALIGN, and that's usually 8. I
think for this purpose we should limit ourselves to algorithms whose
output size is, at minimum, 64 bits, and ideally, a multiple of 64
bits. I'm sure there are plenty of options other than the ones that
btrfs uses; I mentioned them only as a way of jump-starting the
discussion. Note that SHA-256 and BLAKE2B emit enormously wide
checksums (32 and 64 bytes, respectively). That's a lot of space to
consume with a checksum, but your chances of a collision are very small
indeed.

Even if we only offer one new kind of checksum, making space for a
wider checksum makes the page format variable in a way that it
currently isn't. There seem to be about 50 compile-time constants in
the source code whose values are computed based on the block size and
amount of special space in use by some particular AM (yikes!). For
example, for the heap, there's stuff like MaxHeapTuplesPerPage and
MaxHeapTupleSize. If in the future we have some pages that are just
like the ones we have today, and other clusters where we've allowed
space for a checksum, then those constants become run-time variable.
And since they're used in some very low-level functions that are called
a lot, like PageGetHeapFreeSpace(), that seems potentially problematic.
The problem is multiplied if you also think about trying to store an
epoch on each heap page, as per Stephen's proposal above, because now
every page used by any AM whatsoever might or might not have a
checksum, and every heap page might also have or not have an epoch XID.
I think it's going to be somewhat tricky to figure out a scheme here
that avoids making things slow.

Originally I was thinking that things like MaxHeapTuplesPerPage ought
to become macros or static inline functions, but now I have what I
think is a better idea: make them into global variables and initialize
them with the appropriate values for the cluster right after we read
the control file. This doesn't solve the problem if some pages are
different than others, though, and even for the case where every page
in the cluster has the same amount of reserved space, reading a global
variable is not going to be as efficient as referring to a constant
compiled right into the code. I'm hopeful that all of this is solvable
somehow, but it's hairy, for sure.

Another thing I realized is that we would probably end up with
pd_checksum unused when this other feature is activated. If someone
comes up with a clever idea for how to allocate extra space without
needing things to be a multiple of MAXIMUM_ALIGNOF, they could
potentially shrink the space they need elsewhere by 2 bytes and then
use both that space and pd_checksum, but otherwise pd_checksum is
likely to be dead when an enhanced checksum feature is in use. Since
it's also dead when checksums are turned off, that's probably OK. I
suppose another possibility is to allow both to be turned on and off
independently, i.e. let someone have both a Fletcher-16 checksum in
pd_checksum, and also a wider checksum in this other chunk of space,
but I'm not sure whether that's really a useful thing to be able to do.
(Opinions?)

I'm also a little fuzzy on what the command-line interface for
selecting this functionality would look like. The existing option to
initdb is just --data-checksums, which doesn't leave any way to say
what kind of checksums you want.

--
Robert Haas
EDB: http://www.enterprisedb.com
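To make the global-variable idea concrete, here is a minimal sketch;
the names are hypothetical, and the arithmetic mirrors the existing
MaxHeapTuplesPerPage macro with the reserved space subtracted:

    /* Sketch only: cluster-wide page geometry, set once at startup. */
    int     ClusterReservedPageSize;    /* hypothetical: bytes reserved per page */
    int     MaxHeapTuplesPerPage;       /* would become a variable, not a macro */

    static void
    InitPageGeometry(int reserved_page_size)    /* value read from pg_control */
    {
        ClusterReservedPageSize = reserved_page_size;
        MaxHeapTuplesPerPage =
            (int) ((BLCKSZ - SizeOfPageHeaderData - reserved_page_size) /
                   (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData)));
    }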
On Thu, Jun 9, 2022 at 2:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I'm interested in assessing the feasibility of a "better page-level
> checksums" feature. I have a few questions, and a few observations.
> One of my questions is what algorithm(s) we'd want to support. I did a
> quick Google search and found that btrfs supports CRC-32C, XXHASH,
> SHA256, and BLAKE2B. I don't know that we want to support that many
> options (but maybe we do) and I don't think CRC-32C makes any sense
> here, for two reasons. First, we've already got a 16-bit checksum, and
> a 32-bit checksum doesn't seem like it's gaining enough to be worth
> the implementation complexity.

Why not? The only problems that it won't solve are all related to
crypto. Which is perfectly fine, but it seems like there is a
terminology issue here. ISTM that you're really talking about adding a
cryptographic hash function, not a checksum. These are rather different
things.

> Even if we only offer one new kind of checksum, making space for a
> wider checksum makes the page format variable in a way that it
> currently isn't.

I believe that the page special area was designed to be variable-sized,
and even anticipates dynamic resizing of the special area. At least in
index AMs, where it's not that hard to make extra space in the special
area by shifting the tuples back, and then fixing line pointers to
point to the new offsets. So you have a dynamic variable-sized array
that's a little like a second line pointer array (though probably not
added to all that often).

My preference is for an approach that builds on that, or at least
doesn't significantly complicate it. So a cryptographic hash or nonce
can go in the special area proper (structs like BTPageOpaqueData don't
need any changes), but at a page offset before the special area proper
-- not after.

What disadvantages does that approach have, if any, from your point of
view?

--
Peter Geoghegan
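One way to spell out the layout Peter is describing, as a hypothetical
helper (assuming the generic per-page data sits at the start of the
special area, ahead of the AM's opaque struct, whose definition is
unchanged):

    /*
     * Hypothetical: with "n_extra" generic bytes (wide checksum, nonce, ...)
     * stored at the start of the special area, the AM's opaque struct lives
     * just past them.
     */
    static inline BTPageOpaque
    BTPageGetOpaqueSkippingExtra(Page page, Size n_extra)
    {
        return (BTPageOpaque) (PageGetSpecialPointer(page) + MAXALIGN(n_extra));
    }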
On Thu, Jun 9, 2022 at 2:33 PM Peter Geoghegan <pg@bowt.ie> wrote: > My preference is for an approach that builds on that, or at least > doesn't significantly complicate it. So a cryptographic hash or nonce > can go in the special area proper (structs like BTPageOpaqueData don't > need any changes), but at a page offset before the special area proper > -- not after. Minor correction: I meant "before structs like BTPageOpaqueData, earlier in the page and in the special area proper". -- Peter Geoghegan
On Thu, 9 Jun 2022 at 23:13, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Oct 7, 2021 at 11:50 AM Stephen Frost <sfrost@snowman.net> wrote:
> > Alternatively, we could use
> > that technique to just provide a better per-page checksum than what we
> > have today. Maybe we could figure out how to leverage that to move to
> > 64bit transaction IDs with some page-level epoch.
>
> I'm interested in assessing the feasibility of a "better page-level
> checksums" feature. I have a few questions, and a few observations.
> One of my questions is what algorithm(s) we'd want to support. I did a
> quick Google search and found that btrfs supports CRC-32C, XXHASH,
> SHA256, and BLAKE2B. I don't know that we want to support that many
> options (but maybe we do) and I don't think CRC-32C makes any sense
> here, for two reasons. First, we've already got a 16-bit checksum, and
> a 32-bit checksum doesn't seem like it's gaining enough to be worth
> the implementation complexity. Second, we're probably going to have to
> dole out per-page space in multiples of MAXALIGN, and that's usually
> 8.

Why so? We already dole out per-page space in 4-byte increments through
pd_linp, and I see no reason why we can't reserve some line pointers
for per-page metadata if we decide that we need extra per-page
~overhead~ metadata.

> I think for this purpose we should limit ourselves to algorithms
> whose output size is, at minimum, 64 bits, and ideally, a multiple of
> 64 bits. I'm sure there are plenty of options other than the ones that
> btrfs uses; I mentioned them only as a way of jump-starting the
> discussion. Note that SHA-256 and BLAKE2B emit enormously wide
> checksums (32 and 64 bytes, respectively). That's a lot of space to
> consume with a checksum, but your chances of a collision are very
> small indeed.

Isn't the goal of a checksum to find - and where possible, correct -
bit flips and other broken pages? I would suggest not to use
cryptographic hash functions for that, as those are rarely
error-correcting.

> Even if we only offer one new kind of checksum, making space for a
> wider checksum makes the page format variable in a way that it
> currently isn't. There seem to be about 50 compile-time constants in
> the source code whose values are computed based on the block size and
> amount of special space in use by some particular AM (yikes!).

Isn't that expected for most of those places? With the current
bufpage.h description of Page, it seems obvious that all bytes on a
page except those in the "hole" and those in the page header are under
full control of the AM. Of course AMs will pre-calculate limits and
offsets during compilation; that saves recalculation cycles and/or
cache lines with constants to keep in L1.

> For example, for the heap, there's stuff like MaxHeapTuplesPerPage and
> MaxHeapTupleSize. If in the future we have some pages that are just
> like the ones we have today, and other clusters where we've allowed
> space for a checksum, then those constants become run-time variable.
> And since they're used in some very low-level functions that are
> called a lot, like PageGetHeapFreeSpace(), that seems potentially
> problematic. The problem is multiplied if you also think about trying
> to store an epoch on each heap page, as per Stephen's proposal above,
> because now every page used by any AM whatsoever might or might not
> have a checksum, and every heap page might also have or not have an
> epoch XID. I think it's going to be somewhat tricky to figure out a
> scheme here that avoids making things slow.

Can't we add some extra fork that stores this extra per-page
information, and contains this extra metadata in a double-buffered
format, so that the metadata too is written to disk before the actual
page is written, while the old metadata remains available for recovery
purposes? This allows us to maintain the current format with its low
per-page overhead, and only have extra overhead (up to 2x writes for
each page, but the writes for these metadata pages need not be BLCKSZ
in size) for those that opt in to the more computationally expensive
features of larger checksums, nonces, and/or other non-AM per-page
~overhead~ metadata.

> Originally I was thinking
> that things like MaxHeapTuplesPerPage ought to become macros or static
> inline functions, but now I have what I think is a better idea: make
> them into global variables and initialize them with the appropriate
> values for the cluster right after we read the control file. This
> doesn't solve the problem if some pages are different than others,
> though, and even for the case where every page in the cluster has the
> same amount of reserved space, reading a global variable is not going
> to be as efficient as referring to a constant compiled right into the
> code. I'm hopeful that all of this is solvable somehow, but it's
> hairy, for sure.
>
> Another thing I realized is that we would probably end up with
> pd_checksum unused when this other feature is activated. If someone
> comes up with a clever idea for how to allocate extra space without
> needing things to be a multiple of MAXIMUM_ALIGNOF, they could
> potentially shrink the space they need elsewhere by 2 bytes and then
> use both that space and pd_checksum, but otherwise pd_checksum is
> likely to be dead when an enhanced checksum feature is in use. Since
> it's also dead when checksums are turned off, that's probably OK. I
> suppose another possibility is to allow both to be turned on and off
> independently, i.e. let someone have both a Fletcher-16 checksum in
> pd_checksum, and also a wider checksum in this other chunk of space,
> but I'm not sure whether that's really a useful thing to be able to
> do. (Opinions?)

I'd prefer if we didn't change the way pages are presented to AMs.
Currently, it is clear what area is available to you if you write an AM
that uses the bufpage APIs. Changing the page format to have the buffer
manager also touch / reserve space in the special areas seems like a
break of abstraction. Quoting from bufpage.h:

 * AM-generic per-page information is kept in PageHeaderData.
 *
 * AM-specific per-page data (if any) is kept in the area marked "special
 * space"; each AM has an "opaque" structure defined somewhere that is
 * stored as the page trailer.  an access method should always
 * initialize its pages with PageInit and then set its own opaque
 * fields.

I'd rather we keep this contract: AM-generic stuff belongs in
PageHeaderData, with the rest of the page fully available for the AM to
use (including the special area).

Kind regards,

Matthias van de Meent
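A sketch of the reserved-line-pointer idea (entirely hypothetical; no
such mechanism exists today): the first few entries of pd_linp are
treated as opaque metadata rather than tuple references, and item
lookups skip over them.

    #define METADATA_LINE_POINTERS  2   /* hypothetical: 8 bytes of metadata */

    /* Hypothetical accessor for the reserved metadata bytes. */
    static inline char *
    PageGetReservedMetadata(Page page)
    {
        return (char *) &((PageHeader) page)->pd_linp[0];
    }

    /* Hypothetical variant of PageGetItemId() that skips the reserved slots. */
    static inline ItemId
    PageGetItemIdSkippingMetadata(Page page, OffsetNumber offsetNumber)
    {
        return &((PageHeader) page)->pd_linp[offsetNumber - 1 + METADATA_LINE_POINTERS];
    }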
Hello Robert,

> I think for this purpose we should limit ourselves to algorithms
> whose output size is, at minimum, 64 bits, and ideally, a multiple of
> 64 bits. I'm sure there are plenty of options other than the ones that
> btrfs uses; I mentioned them only as a way of jump-starting the
> discussion. Note that SHA-256 and BLAKE2B emit enormously wide
> checksums (32 and 64 bytes, respectively). That's a lot of space to
> consume with a checksum, but your chances of a collision are very
> small indeed.

My 0.02€ about that: you do not have to store the whole hash algorithm
output; you can truncate or reduce (e.g. by XORing parts) the size to
what makes sense for your application and security requirements. ISTM
that 64 bits is more than enough for a page checksum, whatever the
underlying hash algorithm.

Also, ISTM that a checksum algorithm does not really need to be
cryptographically strong, which means that cheaper alternatives are ok,
although good quality should be sought nevertheless.

--
Fabien.
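The reduction Fabien describes takes only a few lines; a self-contained
sketch that folds a 32-byte SHA-256 digest down to a 64-bit value by
XORing its four 8-byte quarters:

    #include <stdint.h>
    #include <string.h>

    static uint64_t
    fold_digest_to_u64(const uint8_t digest[32])
    {
        uint64_t    result = 0;

        for (int i = 0; i < 4; i++)
        {
            uint64_t    chunk;

            /* memcpy avoids alignment assumptions about the digest buffer */
            memcpy(&chunk, digest + i * 8, sizeof(chunk));
            result ^= chunk;
        }
        return result;
    }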
On Fri, Jun 10, 2022 at 5:00 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> Can't we add some extra fork that stores this extra per-page
> information, and contains this extra metadata

+1 for this approach. I have observed some painful corruption cases in
which block storage simply returned a stale version of a range of
blocks. That corruption went undetected only because the checksum is
stored on the page itself.
A special fork for checksums would allow us to better detect failures in SSD firmware, MMU SEUs (single-event upsets), the OS page cache, backup software, and storage. It may seem that this kind of thing never happens, but the probability of such a failure is drastically higher than the probability of a hardware failure going undetected due to a CRC16 collision.
Also, I'm skeptical about correcting detected errors with the information from the checksum. That approach requires a very, very large checksum. It's much easier to obtain a fresh block copy from an HA standby.
Best regards, Andrey Borodin.
On Thu, Jun 9, 2022 at 5:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Why not? The only problems that it won't solve are all related to
> crypto. Which is perfectly fine, but it seems like there is a
> terminology issue here. ISTM that you're really talking about adding a
> cryptographic hash function, not a checksum. These are rather
> different things.

I don't think those are mutually exclusive categories. I shall cite
Wikipedia: "Cryptographic hash ... can also be used as ordinary hash
functions, to index data in hash tables, for fingerprinting, to detect
duplicate data or uniquely identify files, and as checksums to detect
accidental data corruption."[1]

There is also PostgreSQL precedent in the form of the
--manifest-checksums argument to pg_basebackup, whose legal values are
SHA{224,256,384,512}|CRC32C|NONE. The man page for the "shasum" utility
says that the purpose of the command is to "Print or Check SHA
Checksums".

I'm not perfectly attached to the idea of using SHA here, but it seems
to me that's pretty much the standard thing these days. Stephen Frost
and David Steele pushed hard for SHA checksums in backup manifests, and
actually wanted it to be the default. I think that if you're the kind
of person who looks at our existing page checksums and finds them too
weak, I doubt that CRC-32C is going to make you feel any better. You're
probably the sort of person who thinks that checksums should have a lot
of bits, and you're probably not going to be satisfied with the
properties of an algorithm invented in the 1960s. Of course if there's
anyone out there who thinks that our existing 16-bit checksums are a
pile of garbage but would be much happier if CRC-32C is an option, I am
happy to have them show up here and say so, but I find it much more
likely that people who want this kind of feature would advocate for a
more modern algorithm.

> My preference is for an approach that builds on that, or at least
> doesn't significantly complicate it. So a cryptographic hash or nonce
> can go in the special area proper (structs like BTPageOpaqueData don't
> need any changes), but at a page offset before the special area proper
> -- not after.
>
> What disadvantages does that approach have, if any, from your point of view?

I think it would be an extremely good idea to store the extended
checksum at the same offset in every page. Right now, code that wants
to compute checksums, or a tool like pg_checksums that wants to verify
them, can find the checksum without needing to interpret any of the
remaining page contents. Things get sticky if you have to interpret the
page contents to locate the checksum that's going to tell you whether
the page contents are messed up. Perhaps this could be worked around if
you tried hard enough, but I don't see what we get out of it.

I don't think that putting the checksum at the very end of every page
precludes using variable-size special space in the AMs, or even
complicates it much, because if there's a fixed-length block of stuff
at the end of every page, you can easily account for that. There's a
lot less code that cares about the space above pd_special than there is
code that cares about any other portion of the page.

--
Robert Haas
EDB: http://www.enterprisedb.com

[1] https://en.wikipedia.org/wiki/Cryptographic_hash_function
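The fixed-offset property is easy to state in code; a sketch with a
hypothetical reserved-size parameter (in practice this would be a
cluster-wide setting):

    /*
     * Hypothetical: the extended checksum occupies the last "reserved_size"
     * bytes of every page, for every AM, so a verifier needs no knowledge
     * of the page contents to find it.
     */
    static inline char *
    PageGetExtendedChecksumLocation(Page page, Size reserved_size)
    {
        return (char *) page + BLCKSZ - reserved_size;
    }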
On 10.06.22 15:16, Robert Haas wrote: > I'm not perfectly attached to the idea of using SHA here, but it seems > to me that's pretty much the standard thing these days. Stephen Frost > and David Steele pushed hard for SHA checksums in backup manifests, > and actually wanted it to be the default. That seems like a reasonable use in that application, since you might want to verify whether a backup has been (maliciously?) altered rather than just accidentally bit flipped. > I think that if you're the kind of person who looks at our existing > page checksums and finds them too weak, I doubt that CRC-32C is going > to make you feel any better. You're probably the sort of person who > thinks that checksums should have a lot of bits, and you're probably > not going to be satisfied with the properties of an algorithm invented > in the 1960s. Of course if there's anyone out there who thinks that > our existing 16-bit checksums are a pile of garbage but would be much > happier if CRC-32C is an option, I am happy to have them show up here > and say so, but I find it much more likely that people who want this > kind of feature would advocate for a more modern algorithm. I think there ought to be a bit more principled analysis here than just "let's add a lot more bits". There is probably some kind of information to be had about how many CRC bits are useful for a given block size, say. And then there is the question of performance. When data checksums were first added, there was a lot of concern about that. CRC is usually baked directly into hardware, so it's about as cheap as we can hope for. SHA not so much.
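[For concreteness on the hardware point, a sketch of CRC-32C over a block using the SSE4.2 crc32 instruction (x86-64 only, built with -msse4.2). PostgreSQL's real pg_crc32c code does the same thing with runtime dispatch across platforms, so this is purely illustrative:]

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>          /* SSE4.2 intrinsics */

static uint32_t
crc32c_block(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t    crc = 0xFFFFFFFF;

    /* consume 8 bytes per crc32 instruction */
    while (len >= 8)
    {
        uint64_t    chunk;

        memcpy(&chunk, p, 8);
        crc = _mm_crc32_u64(crc, chunk);
        p += 8;
        len -= 8;
    }
    /* tail bytes one at a time */
    while (len-- > 0)
        crc = _mm_crc32_u8((uint32_t) crc, *p++);

    return (uint32_t) crc ^ 0xFFFFFFFF;
}

[On current hardware this runs at several bytes per cycle, which is why CRC-32C tends to be the baseline that more elaborate algorithms get compared against.]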
On Thu, Jun 9, 2022 at 8:00 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Why so? We already dole out per-page space in 4-byte increments > through pd_linp, and I see no reason why we can't reserve some line > pointers for per-page metadata if we decide that we need extra > per-page ~overhead~ metadata. Hmm, that's an interesting approach. I was thinking that putting data after the PageHeaderData struct would be a non-starter because the code that looks up a line pointer by index is currently just multiply-and-add and complicating it seems bad for performance. However, if we treated the space there as overlapping the line pointer array and making some line pointers unusable rather than something inserted prior to the line pointer array, we could avoid that. I still think it would be kind of complicated, though, because we'd have to find every bit of code that loops over the line pointer array or accesses it by index and make sure that it doesn't try to access the low-numbered line pointers. > Isn't the goal of a checksum to find - and where possible, correct - > bit flips and other broken pages? I would suggest not to use > cryptographic hash functions for that, as those are rarely > error-correcting. I wasn't thinking of trying to do error correction, just error detection. See also my earlier reply to Peter Geoghegan. > Isn't that expected for most of those places? With the current > bufpage.h description of Page, it seems obvious that all bytes on a > page except those in the "hole" and those in the page header are under > full control of the AM. Of course AMs will pre-calculate limits and > offsets during compilation, that saves recalculation cycles and/or > cache lines with constants to keep in L1. Yep. > Can't we add some extra fork that stores this extra per-page > information, and contains this extra metadata in a double-buffered > format, so that both before the actual page is written the metadata > too is written to disk, while the old metadata is available too for > recovery purposes. This allows us to maintain the current format with > its low per-page overhead, and only have extra overhead (up to 2x > writes for each page, but the writes for these metadata pages need not > be BLCKSZ in size) for those that opt-in to the more computationally > expensive features of larger checksums, nonces, and/or other non-AM > per-page ~overhead~ metadata. It's not impossible, I'm sure, but it doesn't seem very appealing to me. Those extra reads and writes could be expensive, and there's no place to cleanly integrate them into the code structure. A function like PageIsVerified() -- which is where we currently validate checksums -- only gets the page. It can't go off and read some other page from disk to perform the checksum calculation. I'm not exactly sure what you have in mind when you say that the writes need not be BLCKSZ in size. Technically I guess that's true, but then the checksums have to be crash safe, or they're not much good. If they're not part of the page, how do they get updated in a way that makes them crash safe? I guess it could be done: every time we write a FPW, enlarge the page image by the number of bytes that are stored in this location. When replaying an FPW, update those bytes too. And every time we read or write a page, also read or write those bytes. In essence, we'd be deciding that pages are 8192+n bytes, but the last n bytes are stored in a different file - and, in memory, a different buffer pool. 
I think that would be hugely invasive and unpleasant to make work and I think the performance would be poor, too. > I'd prefer if we didn't change the way pages are presented to AMs. > Currently, it is clear what area is available to you if you write an > AM that uses the bufpage APIs. Changing the page format to have the > buffer manager also touch / reserve space in the special areas seems > like a break of abstraction: Quoting from bufpage.h: > > * AM-generic per-page information is kept in PageHeaderData. > * > * AM-specific per-page data (if any) is kept in the area marked "special > * space"; each AM has an "opaque" structure defined somewhere that is > * stored as the page trailer. an access method should always > * initialize its pages with PageInit and then set its own opaque > * fields. > > I'd rather we keep this contract: am-generic stuff belongs in > PageHeaderData, with the rest of the page fully available for the AM > to use (including the special area). I don't think that changing the contract has to mean that it becomes unclear what the contract is. And you can't improve any system without changing some stuff. But you certainly don't have to like my ideas or anything.... -- Robert Haas EDB: http://www.enterprisedb.com
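[For reference, the lookup Robert describes really is just multiply-and-add today (cf. PageGetItemId in bufpage.h), and the overlapping approach would leave that arithmetic untouched. A rough sketch against PostgreSQL's headers, with N_RESERVED_LP and page_reserved_metadata() invented purely for illustration:]

#include "postgres.h"
#include "storage/bufpage.h"

#define N_RESERVED_LP 2         /* assumption: 2 ItemIds = 8 bytes of metadata */

/* unchanged multiply-and-add, exactly as PageGetItemId works today */
static ItemId
page_get_item_id(Page page, OffsetNumber off)
{
    return &((PageHeader) page)->pd_linp[off - 1];
}

/* metadata occupies the slots where line pointers 1..N_RESERVED_LP would go */
static char *
page_reserved_metadata(Page page)
{
    return (char *) &((PageHeader) page)->pd_linp[0];
}

/*
 * The cost is that every loop over the line pointer array would have to
 * start past the reserved entries, e.g.:
 *
 *     for (off = N_RESERVED_LP + 1; off <= maxoff; off++) ...
 */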
On Fri, Jun 10, 2022 at 9:36 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > I think there ought to be a bit more principled analysis here than just > "let's add a lot more bits". There is probably some kind of information > to be had about how many CRC bits are useful for a given block size, say. > > And then there is the question of performance. When data checksums were > first added, there was a lot of concern about that. CRC is usually > baked directly into hardware, so it's about as cheap as we can hope for. > SHA not so much. That's all pretty fair. I have to admit that SHA checksums sound quite expensive, and also that I'm no expert on what kinds of checksums would be best for this sort of application. Based on the earlier discussions around TDE, I do think that people want tamper-resistant checksums here too -- like maybe something where you can't recompute the checksum without access to some secret. I could propose naive ways to do that, like prepending a fixed chunk of secret bytes to the beginning of every block and then running SHA512 or something over the result, but I'm sure that people with actual knowledge of cryptography have developed much better and more robust ways of doing this sort of thing. I've really been devoting most of my mental energy here to understanding what problems there are at the PostgreSQL level - i.e. when we carve out bytes for a wider checksum, what breaks? The only research that I did to try to understand what algorithms might make sense was a quick Google search, which led me to the list of algorithms that btrfs uses. I figured that was a good starting point because, like a filesystem, we're checksumming fixed-size blocks of data. However, I didn't intend to present the results of that quick look as the definitive answer to the question of what might make sense for PostgreSQL, and would be interested in hearing what you or anyone else thinks about that. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Jun 10, 2022 at 9:36 AM Peter Eisentraut > <peter.eisentraut@enterprisedb.com> wrote: > > I think there ought to be a bit more principled analysis here than just > > "let's add a lot more bits". There is probably some kind of information > > to be had about how many CRC bits are useful for a given block size, say. > > > > And then there is the question of performance. When data checksums were > > first added, there was a lot of concern about that. CRC is usually > > baked directly into hardware, so it's about as cheap as we can hope for. > > SHA not so much. > > That's all pretty fair. I have to admit that SHA checksums sound quite > expensive, and also that I'm no expert on what kinds of checksums > would be best for this sort of application. Based on the earlier > discussions around TDE, I do think that people want tamper-resistant > checksums here too -- like maybe something where you can't recompute > the checksum without access to some secret. I could propose naive ways > to do that, like prepending a fixed chunk of secret bytes to the > beginning of every block and then running SHA512 or something over the > result, but I'm sure that people with actual knowledge of cryptography > have developed much better and more robust ways of doing this sort of > thing. So, it's not quite as simple as use X or use Y, we need to be considering the use case too. In particular, the amount of data that's being hash'd is relevant when it comes to making a decision about what hash or checksum to use. When you're talking about (potentially) 1G segment files, you'll want to use something different (like SHA) vs. when you're talking about an 8K block (not that you couldn't use SHA, but it may very well be overkill for it). In terms of TDE, that's yet a different use-case and you'd want to use AE (authenticated encryption) + AAD (additional authenticated data) and the result of that operation is a block which has some amount of unencrypted data (eg: LSN, potentially used as the IV), some amount of encrypted data (eg: everything else), and then space to store the tag (which can be thought of as, but is *distinct* from, a hash of the encrypted data + the additional unencrypted data, where the latter would include the unencrypted data on the block, like the LSN, plus other information that we want to include like the qualified path+filename of the file as relevant to the PGDATA root). If our goal is cryptographically authenticated and encrypted data pages (which I believe is at least one of our goals) then we're talking about encryption methods like AES GCM which handle production of the tag for us and with that tag we would *not* need to have any independent hash or checksum for the block (though we could, but that should really be included in the *encrypted* section, as hashing unencrypted data and then storing that hash unencrypted could potentially leak information that we'd rather not). Note that NIST has put out information regarding how big a tag is appropriate for how much data is being encrypted with a given authenticated encryption method such as AES GCM. I recall Robert finding similar information for hashing/checksumming of unencrypted data from a similar source and that'd make sense to consider when talking about *just* adding a hash/checksum for unencrypted data blocks.
This is the relevant discussion from NIST on this subject: https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-38d.pdf Note particularly Appendix C: Requirements and Guidelines for Using Short Tags (though, really, the whole thing is good to read..). > I've really been devoting most of my mental energy here to > understanding what problems there are at the PostgreSQL level - i.e. > when we carve out bytes for a wider checksum, what breaks? The only > research that I did to try to understand what algorithms might make > sense was a quick Google search, which led me to the list of > algorithms that btrfs uses. I figured that was a good starting point > because, like a filesystem, we're checksumming fixed-size blocks of > data. However, I didn't intend to present the results of that quick > look as the definitive answer to the question of what might make sense > for PostgreSQL, and would be interested in hearing what you or anyone > else thinks about that. In the thread about checksum/hashes for the backup manifest, I was pretty sure you found some information regarding the amount of data being hashed vs. the size you want the hash/checksum to be and that seems like it'd be particularly relevant for this discussion (as it was for backups, at least as I recall..). Hopefully we can go find that. Thanks, Stephen
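[A minimal sketch of the AE+AAD shape Stephen describes, using OpenSSL's EVP interface with AES-256-GCM. Key and IV management are elided, and this is an illustration of the concept, not code from any patch: the unencrypted page fields (e.g. the LSN) go in as AAD, and the 16-byte tag authenticates both the AAD and the ciphertext:]

#include <openssl/evp.h>

int
encrypt_block_gcm(const unsigned char *key,       /* 32 bytes, managed elsewhere */
                  const unsigned char *iv,        /* 12 bytes, e.g. derived from LSN */
                  const unsigned char *aad, int aad_len,
                  const unsigned char *pt, int pt_len,
                  unsigned char *ct, unsigned char tag[16])
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int         len,
                ok = 0;

    if (ctx == NULL)
        return 0;
    if (EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv) &&
        EVP_EncryptUpdate(ctx, NULL, &len, aad, aad_len) &&    /* AAD only */
        EVP_EncryptUpdate(ctx, ct, &len, pt, pt_len) &&        /* ciphertext */
        EVP_EncryptFinal_ex(ctx, ct + len, &len) &&
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag))
        ok = 1;
    EVP_CIPHER_CTX_free(ctx);
    return ok;
}

[Decryption follows the mirror-image pattern with EVP_CTRL_GCM_SET_TAG before EVP_DecryptFinal_ex, which fails if either the ciphertext or the AAD was tampered with; that is what makes a separate unencrypted checksum redundant in this mode.]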
Greetings, * Andrey Borodin (x4m@double.cloud) wrote: > On Fri, Jun 10, 2022 at 5:00 AM Matthias van de Meent < > boekewurm+postgres@gmail.com> wrote: > > Can't we add some extra fork that stores this extra per-page > > information, and contains this extra metadata > > +1 for this approach. I had observed some painful corruption cases where > block storage simply returned a stale version of a range of blocks. This is > only possible because the checksum is stored on the page itself. > A special fork for checksums would allow us to better detect failures in > SSD firmware, MMU SEUs, the OS page cache, backup software, and storage. > It may seem that this kind of thing never happens. But the probability of > such a failure is drastically bigger than the probability of a hardware > failure going undetected due to a CRC16 collision. This is another possible approach, sure, but it has its own downsides: clearly more IO ends up being involved and then you also have to deal with the fact that the fork's page would certainly end up covering a lot of the pages in the main relation, not to mention the question of what to do when we want to get checksums *on forks*, which we surely will want to have... > Also, I'm skeptical about correcting detected errors with the information > from the checksum. This approach requires a very, very large checksum. It's > much easier to obtain a fresh block copy from an HA standby. Yeah, error correcting checksums are yet another use-case and one that would require a lot more space. Thanks, Stephen
Greetings, * Fabien COELHO (coelho@cri.ensmp.fr) wrote: > >I think for this purpose we should limit ourselves to algorithms > >whose output size is, at minimum, 64 bits, and ideally, a multiple of > >64 bits. I'm sure there are plenty of options other than the ones that > >btrfs uses; I mentioned them only as a way of jump-starting the > >discussion. Note that SHA-256 and BLAKE2B apparently emit enormously > >wide 32-byte checksums. That's a lot of space to consume with a > >checksum, but your chances of a collision are very small indeed. > > My 0.02€ about that: > > You do not have to store the whole hash algorithm output, you can truncate or reduce (e.g. by xoring parts) the size to what makes sense for your application and security requirements. ISTM that 64 bits is more than enough for a page checksum, whatever the underlying hash algorithm. Agreed on this -- but we shouldn't be guessing at what the correct answers are here; there's published information from standards bodies about this sort of thing. > Also, ISTM that a checksum algorithm does not really need to be > cryptographically strong, which means that cheaper alternatives are ok, > although good quality should be sought nevertheless. Right, if we aren't doing encryption then we just need to focus on what is needed for the amount of error detection that we want and we can go look at how much space we need when we're talking about 8K or so worth of data. When we *are* doing encryption, what's interesting is the tag length, and that's a different thing about which there is also published information from standards bodies, and we should be looking at that. While the general "need X amount of space on the page to store the hash/authentication data" problem is the same, the answer to "how much space is needed" will depend on which use case the user requested (well ... probably anyway, maybe we'll get lucky and find that there's a reasonable answer to both which fits in the same amount of space and could possibly leverage that, but let's not try to force that to happen as we'll surely get called out if we go against the guidance from the standards bodies who study this stuff). Thanks, Stephen
On Fri, Jun 10, 2022 at 12:08 PM Stephen Frost <sfrost@snowman.net> wrote: > So, it's not quite as simple as use X or use Y, we need to be > considering the use case too. In particular, the amount of data that's > being hash'd is relevant when it comes to making a decision about what > hash or checksum to use. When you're talking about (potentially) 1G > segment files, you'll want to use something different (like SHA) vs. > when you're talking about an 8K block (not that you couldn't use SHA, > but it may very well be overkill for it). Interesting. I expected you to be cheerleading for SHA like a madman. > In terms of TDE, that's yet a different use-case and you'd want to use > AE (authenticated encryption) + AAD (additional authenticated data) and > the result of that operation is a block which has some amount of > unencrypted data (eg: LSN, potentially used as the IV), some amount of > encrypted data (eg: everything else), and then space to store the tag > (which can be thought of as, but is *distinct* from, a hash of the > encrypted data + the additional unencrypted data, where the latter would > include the unencrypted data on the block, like the LSN, plus other > information that we want to include like the qualified path+filename of > the file as relevant to the PGDATA root). If our goal is > cryptographically authenticated and encrypted data pages (which I believe > is at least one of our goals) then we're talking about encryption > methods like AES GCM which handle production of the tag for us and with > that tag we would *not* need to have any independent hash or checksum for > the block (though we could, but that should really be included in the > *encrypted* section, as hashing unencrypted data and then storing that > hash unencrypted could potentially leak information that we'd rather > not). Yeah, and I feel there was discussion of how much space AES-GCM-SIV would need per page and I can't find that discussion now. Pretty sure it was a pretty meaty number of bytes, and I assume it's also not that cheap to compute. > Note that NIST has put out information regarding how big a tag is > appropriate for how much data is being encrypted with a given > authenticated encryption method such as AES GCM. I recall Robert > finding similar information for hashing/checksumming of unencrypted > data from a similar source and that'd make sense to consider when > talking about *just* adding a hash/checksum for unencrypted data blocks. > > This is the relevant discussion from NIST on this subject: > > https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-38d.pdf > > Note particularly Appendix C: Requirements and Guidelines for Using > Short Tags (though, really, the whole thing is good to read..). I don't see that as very relevant. That's talking about using 32-bit or 64-bit tags for things like VoIP packets where a single compromised packet wouldn't reveal a whole lot. I think we have to take the view that a single compromised disk block is a big deal. > In the thread about checksum/hashes for the backup manifest, I was > pretty sure you found some information regarding the amount of data > being hashed vs. the size you want the hash/checksum to be and that > seems like it'd be particularly relevant for this discussion (as it was > for backups, at least as I recall..). Hopefully we can go find that.
I went back and looked and found that I had written this: https://www.postgresql.org/message-id/CA+TgmoYOKC_8o-AR1jTQs0mOrFx=_Rcy5udor1m-LjyJNiSWPQ@mail.gmail.com I think that gets us a whole lot of nowhere, honestly. I think this email from Andres is more on point: http://postgr.es/m/20200327195624.xthhd4xuwabvd3ou@alap3.anarazel.de I don't really understand all the details of the smhasher pages to which he links, but I understand that they're measuring the quality of the bit-mixing, which does matter. So does speed, because people like their database to be fast even if it's using checksums (or TDE if we had that). And I think the output size is also a relevant consideration, because more output bits means both more chance of detecting errors (assuming good bit-mixing, at least) and also more wasted space that isn't being used to store your actual data. I haven't really seen anything that makes me believe that there's a particularly strong relationship between block size and ideal checksum size. There's some relationship, for sure. For instance, you want the checksum to be small relative to the size of the input, so as not to waste a whole bunch of storage space. I wouldn't propose to hash 32 byte blocks with SHA-256, because my checksums would be as big as the original data. But I don't really think such considerations are relevant here. An 8kB block is big enough that any checksum algorithm in common use today is going to produce output that is well under 1% of the page size, so you're not going to be wasting tons of storage. You might be wasting your time, though. One big knock on the 16-bit checksum approach we're using today is that the chance of an accidental hash collision is noticeably more than 0. Generally, I think we're right to think that's acceptable, because your chances of noticing even a single corrupted block are very high. However, if you're operating tens or hundreds of thousands of PostgreSQL clusters containing terabytes or petabytes of data, it's quite likely that there will be instances of corruption which you fail to detect because the checksum collided. Maybe you care about that. If you do, you probably need at least a 64-bit checksum before the risk of missing even a single instance of corruption due to a checksum collision becomes negligible. Maybe even slightly larger if the amount of data you're managing is truly vast. So I think there's probably a good argument that if you're just concerned about detecting corruption due to bugs, operator error, hardware failure, etc., something like a 512-bit checksum is overkill if the only purpose is to detect random bit flips. I think the time when you need more bits is when you have some goal beyond being really likely to detect a random error - e.g. if you want 100% guaranteed detection of every single-bit error, or if you want error correction, or if you want to foil an adversary who is trying to construct checksums for maliciously modified blocks. But that is also true for the backup manifest case, and we allowed SHA-n as an option there. I feel like there are bound to be some people who want something like SHA-n just because it sounds good, regardless of whether they really need it. We can tell them "no," though. -- Robert Haas EDB: http://www.enterprisedb.com
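[A quick back-of-envelope supporting that last point: assuming corruption yields uniformly random checksum values, with a b-bit checksum and k independent corruption events, P(at least one goes undetected) = 1 - (1 - 2^-b)^k. A small program to evaluate that (log1p/expm1 keep the 64-bit case from rounding to zero):]

#include <math.h>
#include <stdio.h>

int
main(void)
{
    const int   bits[] = {16, 32, 64};
    const double k = 1e6;       /* corruption events across a large fleet */

    for (int i = 0; i < 3; i++)
    {
        double      p_miss = pow(2.0, -bits[i]);
        double      p_any = -expm1(k * log1p(-p_miss));

        printf("%2d-bit checksum: P(>=1 undetected) ~ %.3g\n",
               bits[i], p_any);
    }
    return 0;
}

[With those assumptions this prints roughly 1 for 16 bits, about 2.3e-4 for 32 bits, and about 5.4e-14 for 64 bits, which matches the intuition above that 64 bits is where the fleet-wide risk becomes negligible.]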
Hi hackers, > > Can't we add some extra fork that stores this extra per-page > > information, and contains this extra metadata > > > +1 for this approach. I had observed some painful corruption cases where block storage simply returned a stale version of a range of blocks. This is only possible because the checksum is stored on the page itself. That's very interesting, Andrey. Thanks for sharing. > One of my questions is what algorithm(s) we'd want to support. Should it necessarily be a fixed list? Why not support pluggable algorithms? An extension implementing a checksum algorithm is going to need: - several hooks: check_page_after_reading, calc_checksum_before_writing - register_checksum()/deregister_checksum() - an API to save the checksums to a separate fork By knowing the block number and the hash size the extension knows exactly where to look for the checksum in the fork. -- Best regards, Aleksander Alekseev
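[Purely to make the shape of that proposal concrete, a hypothetical sketch of such an API; none of these hooks or functions exist in PostgreSQL today, and the names simply follow the ones floated above:]

#include "postgres.h"
#include "storage/block.h"

typedef struct ChecksumProvider
{
    const char *name;
    int         checksum_size;  /* bytes stored per page in the fork */

    /* compute the checksum of a page that is about to be written out */
    void        (*calc_checksum_before_writing) (const char *page,
                                                 BlockNumber blkno,
                                                 char *checksum_out);

    /* verify a page that was just read in; false means corruption */
    bool        (*check_page_after_reading) (const char *page,
                                             BlockNumber blkno,
                                             const char *checksum);
} ChecksumProvider;

extern void register_checksum(const ChecksumProvider *provider);
extern void deregister_checksum(const ChecksumProvider *provider);

/*
 * With a fixed checksum_size, locating block blkno's checksum in the
 * separate fork is simple arithmetic:
 *
 *     per_page    = BLCKSZ / checksum_size;
 *     fork_block  = blkno / per_page;
 *     fork_offset = (blkno % per_page) * checksum_size;
 */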
On Mon, Jun 13, 2022 at 9:23 AM Aleksander Alekseev <aleksander@timescale.com> wrote: > Should it necessarily be a fixed list? Why not support pluggable algorithms? > > An extension implementing a checksum algorithm is going to need: > > - several hooks: check_page_after_reading, calc_checksum_before_writing > - register_checksum()/deregister_checksum() > - an API to save the checksums to a separate fork > > By knowing the block number and the hash size the extension knows > exactly where to look for the checksum in the fork. I don't think that a separate fork is a good option for reasons that I articulated previously: I think it will be significantly more complex to implement and add extra I/O. I am not completely opposed to the idea of making the algorithm pluggable but I'm not very excited about it either. Making the algorithm pluggable probably wouldn't be super-hard, but allowing a checksum of arbitrary size rather than one of a short list of fixed sizes might complicate efforts to ensure this doesn't degrade performance. And I'm not sure what the benefit is, either. This isn't like archive modules or custom backup targets where the feature proposes to interact with things outside the server and we don't know what's happening on the other side and so need to offer an interface that can accommodate what the user wants to do. Nor is it like a custom background worker or a custom data type which lives fully inside the database but the desired behavior could be anything. It's not even like column compression where I think that the same small set of strategies is probably fine for everybody but some people think that customizing the behavior by datatype would be a good idea. All it's doing is taking a fixed size block of data and checksumming it. I don't see that as being something where there's a lot of interesting things to experiment with from an extension point of view. -- Robert Haas EDB: http://www.enterprisedb.com
Hi Robert, > I don't think that a separate fork is a good option for reasons that I > articulated previously: I think it will be significantly more complex > to implement and add extra I/O. > > I am not completely opposed to the idea of making the algorithm > pluggable but I'm not very excited about it either. Making the > algorithm pluggable probably wouldn't be super-hard, but allowing a > checksum of arbitrary size rather than one of a short list of fixed > sizes might complicate efforts to ensure this doesn't degrade > performance. And I'm not sure what the benefit is, either. This isn't > like archive modules or custom backup targets where the feature > proposes to interact with things outside the server and we don't know > what's happening on the other side and so need to offer an interface > that can accommodate what the user wants to do. Nor is it like a > custom background worker or a custom data type which lives fully > inside the database but the desired behavior could be anything. It's > not even like column compression where I think that the same small set > of strategies is probably fine for everybody but some people think > that customizing the behavior by datatype would be a good idea. All > it's doing is taking a fixed size block of data and checksumming it. I > don't see that as being something where there's a lot of interesting > things to experiment with from an extension point of view. I see your point. Makes sense. So, to clarify, what we are trying to achieve here is to reduce the probability of an event when a page gets corrupted but the checksum is accidentally the same as it was before the corruption, correct? And we also assume that neither the file system nor the hardware caught this corruption. If that's the case I would say that using something like SHA256 would be overkill, not only because of the consumed disk space but also because SHA256 is expensive. Allowing the user to choose from 16-bit, 32-bit and maybe 64-bit checksums should be enough. I would also suggest that no matter how we do it, if the user chooses 16-bit checksums the performance and the disk consumption should remain as they currently are. Regarding the particular choice of a hash function I would suggest the MurmurHash family [1]. This is basically the industry standard (it's good, it's fast, and relatively simple), and we already have murmurhash32() in the core. We also have hash_bytes_extended() to get 64-bit checksums, but I have no strong opinion on whether this particular hash function should be used for pages or not. I believe some benchmarking is appropriate. There is also a 128-bit version of MurmurHash. Personally I doubt that it may be of value in practice, but it will not hurt to support it either, while we are at it. (Probably not in the MVP, though). And if we are going to choose this path, I see no reason not to support SHA256 as well, for the paranoid users. [1]: https://en.wikipedia.org/wiki/MurmurHash -- Best regards, Aleksander Alekseev
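[For context, the murmurhash32() already in core is the 32-bit MurmurHash3 finalizer -- it mixes a single uint32 -- so covering an 8kB page would mean adopting full MurmurHash3 or looping over the page. The fold below is purely illustrative and is not standard MurmurHash3:]

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* the 32-bit MurmurHash3 finalizer, as in PostgreSQL's murmurhash32() */
static inline uint32_t
murmur_fmix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6b;
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
}

/* illustrative page fold: mix each 32-bit word into a running hash */
static uint32_t
page_hash32(const uint8_t *page, size_t len)
{
    uint32_t    h = 0;

    for (size_t i = 0; i + 4 <= len; i += 4)
    {
        uint32_t    w;

        memcpy(&w, page + i, 4);
        h = murmur_fmix32(h ^ w);
    }
    return h;
}

[Whether something along these lines mixes well enough for 8kB inputs is exactly the kind of question the smhasher-style analysis mentioned earlier would have to answer.]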
On Mon, Jun 13, 2022 at 12:59 PM Aleksander Alekseev <aleksander@timescale.com> wrote: > So, to clarify, what we are trying to achieve here is to reduce the > probability of an event when a page gets corrupted but the checksum is > accidentally the same as it was before the corruption, correct? And we > also assume that neither the file system nor the hardware caught this > corruption. Yeah, I think so, although it also depends on what the filesystem and the hardware would do if they did catch the corruption. If they would have made our read() or write() operation fail, then any checksum feature at the PostgreSQL level is superfluous. If they would have noticed the corruption but not caused a failure, and say just logged something in the system log, a PostgreSQL check could still be useful, because the PostgreSQL user might not be looking at the system log, but will definitely notice if they get an ERROR rather than a query result from PostgreSQL. And if the lower-level systems wouldn't have caught the failure at all, then checksums are useful in that case, too. > If that's the case I would say that using something like SHA256 would > be overkill, not only because of the consumed disk space but also > because SHA256 is expensive. Allowing the user to choose from 16-bit, > 32-bit and maybe 64-bit checksums should be enough. I would also > suggest that no matter how we do it, if the user chooses 16-bit > checksums the performance and the disk consumption should remain as > they currently are. If the user wants 16-bit checksums, the feature we've already got seems good enough -- and, as you say, it doesn't use any extra disk space. This proposal is just about making people happy if they want a bigger checksum. On the topic of which algorithm to use, I'd be inclined to think that it is going to be more useful to offer checksums that are 64 bits or more, since IMHO 32 is not all that much more than 16, and I still think there are going to be alignment issues. Beyond that I don't have anything against your specific suggestions, but I'd like to hear what other people think. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, 10 Jun 2022 at 15:58, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jun 9, 2022 at 8:00 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > Why so? We already dole out per-page space in 4-byte increments > > through pd_linp, and I see no reason why we can't reserve some line > > pointers for per-page metadata if we decide that we need extra > > per-page ~overhead~ metadata. > > Hmm, that's an interesting approach. I was thinking that putting data > after the PageHeaderData struct would be a non-starter because the > code that looks up a line pointer by index is currently just > multiply-and-add and complicating it seems bad for performance. > However, if we treated the space there as overlapping the line pointer > array and making some line pointers unusable rather than something > inserted prior to the line pointer array, we could avoid that. I still > think it would be kind of complicated, though, because we'd have to > find every bit of code that loops over the line pointer array or > accesses it by index and make sure that it doesn't try to access the > low-numbered line pointers. > > > Isn't the goal of a checksum to find - and where possible, correct - > > bit flips and other broken pages? I would suggest not to use > > cryptographic hash functions for that, as those are rarely > > error-correcting. > > I wasn't thinking of trying to do error correction, just error > detection. See also my earlier reply to Peter Geoghegan. The use of CRC in our current page format implies that we can correct (some) bit errors, which is why I presumed that that was a goal of page checksums. I stand corrected. > > Isn't that expected for most of those places? With the current > > bufpage.h description of Page, it seems obvious that all bytes on a > > page except those in the "hole" and those in the page header are under > > full control of the AM. Of course AMs will pre-calculate limits and > > offsets during compilation, that saves recalculation cycles and/or > > cache lines with constants to keep in L1. > > Yep. > > > Can't we add some extra fork that stores this extra per-page > > information, and contains this extra metadata in a double-buffered > > format, so that both before the actual page is written the metadata > > too is written to disk, while the old metadata is available too for > > recovery purposes. This allows us to maintain the current format with > > its low per-page overhead, and only have extra overhead (up to 2x > > writes for each page, but the writes for these metadata pages need not > > be BLCKSZ in size) for those that opt-in to the more computationally > > expensive features of larger checksums, nonces, and/or other non-AM > > per-page ~overhead~ metadata. > > It's not impossible, I'm sure, but it doesn't seem very appealing to > me. Those extra reads and writes could be expensive, and there's no > place to cleanly integrate them into the code structure. A function > like PageIsVerified() -- which is where we currently validate > checksums -- only gets the page. It can't go off and read some other > page from disk to perform the checksum calculation. It could be part of the buffer IO code to provide PageIsVerifiedExtended with a pointer to the block metadata buffer. > I'm not exactly sure what you have in mind when you say that the > writes need not be BLCKSZ in size. 
What I meant was that when the extra metadata is stored separately from the block itself, it could be written directly to the file offset instead of having to track BLCKSZ data for N blocks, so the metadata-write would be << BLCKSZ in length, while the block itself would still be the normal BLCKSZ write. > Technically I guess that's true, > but then the checksums have to be crash safe, or they're not much > good. If they're not part of the page, how do they get updated in a > way that makes them crash safe? I guess it could be done: every time > we write a FPW, enlarge the page image by the number of bytes that are > stored in this location. When replaying an FPW, update those bytes > too. And every time we read or write a page, also read or write those > bytes. In essence, we'd be deciding that pages are 8192+n bytes, but > the last n bytes are stored in a different file - and, in memory, a > different buffer pool. I think that would be hugely invasive and > unpleasant to make work and I think the performance would be poor, > too. I agree that this wouldn't be as performant from a R/W perspective as keeping that metadata inside the block. But on the other hand, that is only for block R/W operations, and not for in-memory block manipulations. > > I'd prefer if we didn't change the way pages are presented to AMs. > > Currently, it is clear what area is available to you if you write an > > AM that uses the bufpage APIs. Changing the page format to have the > > buffer manager also touch / reserve space in the special areas seems > > like a break of abstraction: Quoting from bufpage.h: > > > > * AM-generic per-page information is kept in PageHeaderData. > > * > > * AM-specific per-page data (if any) is kept in the area marked "special > > * space"; each AM has an "opaque" structure defined somewhere that is > > * stored as the page trailer. an access method should always > > * initialize its pages with PageInit and then set its own opaque > > * fields. > > > > I'd rather we keep this contract: am-generic stuff belongs in > > PageHeaderData, with the rest of the page fully available for the AM > > to use (including the special area). > > I don't think that changing the contract has to mean that it becomes > unclear what the contract is. And you can't improve any system without > changing some stuff. But you certainly don't have to like my ideas or > anything.... It's not that I disagree with (or dislike the idea of) increasing the resilience of checksums, I just want to be very careful that we don't trade (potentially significant) runtime performance for features people might not use. This thread seems very related to the 'storing an explicit nonce'-thread, which also wants to reclaim space from a page that is currently used by AMs, while AMs would lose access to certain information on pages and certain optimizations that they could do before. I'm very hesitant to let just any modification to the page format go through because someone needs extra metadata attached to a page. That reminds me, there's one more item to be put on the compatibility checklist: Currently, the FSM code assumes it can use all space on a page (except the page header) for its total of 3 levels of FSM data. Mixing page formats would break how it currently works, as changing the space that is available on a page will change the fanout level of each leaf in the tree, which our current code can't handle.
To change the page format of one page in the FSM would thus either require a rewrite of the whole FSM fork, or extra metadata attached to the relation that details where the format changes. A similar issue exists with the VM fork. That being said, I think that it could be possible to reuse pd_checksum as an extra area indicator between pd_upper and pd_special, so that we'd get [pageheader][pd_linp...] pd_lower [hole] pd_upper [datas] pd_storage_ext [blackbox] pd_special [special area]. This should require limited rework in current AMs, especially if we provide a global MAX_STORAGE_EXT_SIZE that AMs can use to get some upper limit on how much overhead the storage uses per page. Alternatively, we could claim some space on a page using a special line pointer at the start of the page referring to storage data, while having the same limitation on size. One last option is we recognise that there are two storage locations of pages that have different data requirements -- on-disk that requires checksums, and in-memory that requires LSNs. Currently, those fields are both stored on the page in distinct fields, but we could (_could_) update the code to drop LSN when we store the page, and drop the checksum when we load the page (at the cost of redo speed when recovering from an unclean shutdown). That would provide an extra 64 bits on the page without breaking storage, assuming AMs don't already misuse pd_lsn. - Matthias
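[To visualize the pd_checksum-as-indicator layout described above (the field reuse and both accessor macros are hypothetical, not proposed code; definitions assumed from storage/bufpage.h):]

/*
 * Hypothetical sketch: pd_checksum reinterpreted as the offset of a
 * storage-owned area placed between the tuple data and the AM's
 * special space.
 *
 *   +--------------------+ 0
 *   | PageHeaderData     |
 *   | pd_linp[]          |
 *   +--------------------+ pd_lower
 *   | (free space)       |
 *   +--------------------+ pd_upper
 *   | tuple data         |
 *   +--------------------+ pd_storage_ext (reusing pd_checksum)
 *   | storage blob       |
 *   +--------------------+ pd_special
 *   | AM special space   |
 *   +--------------------+ BLCKSZ
 */
#define PageGetStorageExt(page) \
    ((char *) (page) + ((PageHeader) (page))->pd_checksum)

#define PageGetStorageExtSize(page) \
    (((PageHeader) (page))->pd_special - ((PageHeader) (page))->pd_checksum)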
On Fri, Jun 10, 2022 at 6:16 AM Robert Haas <robertmhaas@gmail.com> wrote: > > My preference is for an approach that builds on that, or at least > > doesn't significantly complicate it. So a cryptographic hash or nonce > > can go in the special area proper (structs like BTPageOpaqueData don't > > need any changes), but at a page offset before the special area proper > > -- not after. > > > > What disadvantages does that approach have, if any, from your point of view? > > I think it would be an extremely good idea to store the extended > checksum at the same offset in every page. Right now, code that wants > to compute checksums, or a tool like pg_checksums that wants to verify > them, can find the checksum without needing to interpret any of the > remaining page contents. Things get sticky if you have to interpret > the page contents to locate the checksum that's going to tell you > whether the page contents are messed up. Perhaps this could be worked > around if you tried hard enough, but I don't see what we get out of > it. Is that how the block-level encryption feature from EDB Advanced Server does it? -- Peter Geoghegan
On Mon, Jun 13, 2022 at 02:44:41PM -0700, Peter Geoghegan wrote: > On Fri, Jun 10, 2022 at 6:16 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > My preference is for an approach that builds on that, or at least > > > doesn't significantly complicate it. So a cryptographic hash or nonce > > > can go in the special area proper (structs like BTPageOpaqueData don't > > > need any changes), but at a page offset before the special area proper > > > -- not after. > > > > > > What disadvantages does that approach have, if any, from your point of view? > > > > I think it would be an extremely good idea to store the extended > > checksum at the same offset in every page. Right now, code that wants > > to compute checksums, or a tool like pg_checksums that wants to verify > > them, can find the checksum without needing to interpret any of the > > remaining page contents. Things get sticky if you have to interpret > > the page contents to locate the checksum that's going to tell you > > whether the page contents are messed up. Perhaps this could be worked > > around if you tried hard enough, but I don't see what we get out of > > it. > > Is that how the block-level encryption feature from EDB Advanced Server does it? Uh, EDB Advanced Server doesn't have a block-level encryption feature. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
On Mon, Jun 13, 2022 at 2:54 PM Bruce Momjian <bruce@momjian.us> wrote: > On Mon, Jun 13, 2022 at 02:44:41PM -0700, Peter Geoghegan wrote: > > Is that how the block-level encryption feature from EDB Advanced Server does it? > > Uh, EDB Advanced Server doesn't have a block-level encryption feature. Apparently there is something called "Vormetric Transparent Encryption (VTE) – Transparent block-level encryption with access controls": https://www.enterprisedb.com/blog/enhanced-security-edb-postgres-advanced-server-vormetric-data-security-platform Perhaps there is some kind of confusion around the terminology here? -- Peter Geoghegan
On Mon, Jun 13, 2022 at 03:03:17PM -0700, Peter Geoghegan wrote: > On Mon, Jun 13, 2022 at 2:54 PM Bruce Momjian <bruce@momjian.us> wrote: > > On Mon, Jun 13, 2022 at 02:44:41PM -0700, Peter Geoghegan wrote: > > > Is that how the block-level encryption feature from EDB Advanced Server does it? > > > > Uh, EDB Advanced Server doesn't have a block-level encryption feature. > > Apparently there is something called "Vormetric Transparent Encryption > (VTE) – Transparent block-level encryption with access controls": > > https://www.enterprisedb.com/blog/enhanced-security-edb-postgres-advanced-server-vormetric-data-security-platform > > Perhaps there is some kind of confusion around the terminology here? That is encryption done in a virtual file system independent of Postgres. So, I guess the answer to your question is that this is not how EDB Advanced Server does it. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
On Mon, Jun 13, 2022 at 3:06 PM Bruce Momjian <bruce@momjian.us> wrote: > That is encryption done in a virtual file system independent of > Postgres. So, I guess the answer to your question is that this is not > how EDB Advanced Server does it. Okay, thanks for clearing that up. The term "block based" does appear in the article I linked to, so you can see why I didn't understand it that way initially. Anyway, I can see how it would be useful to be able to know the offset of a nonce or of a hash digest on any given page, without access to a running server. But why shouldn't that be possible with other designs, including designs closer to what I've outlined? A known fixed offset in the special area already assumes that all pages must have a value in the first place, even though that won't be true for the majority of individual Postgres servers. There is implicit information involved in a design like the one Robert has proposed; your backup tool (or whatever) already has to understand to expect something other than no encryption at all, or no checksum at all. Tools like pg_filedump already rely on implicit information about the special area. I'm not against the idea of picking a handful of checksum/encryption schemes, with the understanding that we'll be committing to those particular schemes indefinitely -- it's not reasonable to expect infinite flexibility here (and so I don't). But why should we accept something that seems to me to be totally inflexible, and doesn't compose with other things? -- Peter Geoghegan
On Mon, Jun 13, 2022 at 5:14 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > It's not that I disagree with (or dislike the idea of) increasing the > resilience of checksums, I just want to be very careful that we don't > trade (potentially significant) runtime performance for features > people might not use. This thread seems very related to the 'storing > an explicit nonce'-thread, which also wants to reclaim space from a > page that is currently used by AMs, while AMs would lose access to > certain information on pages and certain optimizations that they could > do before. I'm very hesitant to let just any modification to the page > format go through because someone needs extra metadata attached to a > page. Right. So, to be clear, I think there is an opportunity to store ONE extra blob of data in the page. It might be an extended checksum, or it might be a nonce for cryptographic authentication, but it can't be both. I think this is OK, because in earlier discussions of TDE, it seems that if you're using encryption and also want to verify page integrity, you'll use an encryption system that produces some kind of verifier, and you'll store that into this space in the page instead of using an enhanced-checksum feature. In other words, I'm imagining creating a space at the end of each page for some sort of enhanced security or data integrity feature, and you can either choose not to use one (in which case things work as they do today), or you can choose an extended checksums feature, or maybe in the future you can choose some form of TDE that involves storing a nonce or a page verifier in the page. But you just get one. Now, the logical question to ask is: well, if there's only one opportunity to store an extra blob of data on every page, is this the best way to use it? What if someone comes along with another feature that also wants to store a blob of data on every page, and they can't do it because this proposal got there first? My answer is: well, if that additional feature is something that provides encryption or tamper-resistance or data integrity or security in any form, then it can just be added as a new option for how you use this blob of space, and users who prefer the new thing to the existing options can pick it. If it's something else, then .... what is it, exactly? It seems to me that the kinds of things that require space in *every* page of the cluster are really the things that fall into this category. For example, Stephen mused earlier that maybe while we're at it we could find a way to include an XID epoch in every page. Maybe so, but we wouldn't actually want that in *every* page. We would only want it in the heap pages. And as far as I can see that's pretty generally how things go. There are plenty of projects that might want extra space in each page *for a certain AM* and I don't see any reason why what I propose to do here would rule that out. I think this and that could both be done, and doing this might even make doing that easier by putting in place some useful infrastructure. What I don't think we can get away with is having multiple systems that are each taking a bite out of every page for every AM -- but I think that's OK, because I don't think there's a lot of need for multiple such systems. > That reminds me, there's one more item to be put on the compatibility > checklist: Currently, the FSM code assumes it can use all space on a > page (except the page header) for its total of 3 levels of FSM data. 
> Mixing page formats would break how it currently works, as changing > the space that is available on a page will change the fanout level of > each leaf in the tree, which our current code can't handle. To change > the page format of one page in the FSM would thus either require a > rewrite of the whole FSM fork, or extra metadata attached to the > relation that details where the format changes. A similar issue exists > with the VM fork. I agree with all of this except I think that "mixing page formats" is a thing we can't do. > That being said, I think that it could be possible to reuse > pd_checksum as an extra area indicator between pd_upper and > pd_special, so that we'd get [pageheader][pd_linp...] pd_lower [hole] > pd_upper [datas] pd_storage_ext [blackbox] pd_special [special area]. > This should require limited rework in current AMs, especially if we > provide a global MAX_STORAGE_EXT_SIZE that AMs can use to get some > upper limit on how much overhead the storage uses per page. This is an interesting alternative. It's unclear to me that it makes anything better if the [blackbox] area is before the special area vs. afterward. And either way, if that area is fixed-size across the cluster, you don't really need to use pd_checksum to find it, because you can just know where it is. A possible advantage of this approach is that it might make it simpler to cope with a scenario where some pages in the cluster have this blackbox space and others don't. I wasn't really thinking that on-line page format conversions were likely to be practical, but certainly the chances are better if we've got an explicit pointer to the extra space vs. just knowing where it has to be. > Alternatively, we could claim some space on a page using a special > line pointer at the start of the page referring to storage data, while > having the same limitation on size. That sounds messy. > One last option is we recognise that there are two storage locations > of pages that have different data requirements -- on-disk that > requires checksums, and in-memory that requires LSNs. Currently, those > fields are both stored on the page in distinct fields, but we could > (_could_) update the code to drop LSN when we store the page, and drop > the checksum when we load the page (at the cost of redo speed when > recovering from an unclean shutdown). That would provide an extra 64 > bits on the page without breaking storage, assuming AMs don't already > misuse pd_lsn. It seems wrong to me to say that we don't need the LSN for a page stored on disk. Recovery relies on it. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, 14 Jun 2022 at 14:56, Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jun 13, 2022 at 5:14 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > It's not that I disagree with (or dislike the idea of) increasing the > > resilience of checksums, I just want to be very careful that we don't > > trade (potentially significant) runtime performance for features > > people might not use. This thread seems very related to the 'storing > > an explicit nonce'-thread, which also wants to reclaim space from a > > page that is currently used by AMs, while AMs would lose access to > > certain information on pages and certain optimizations that they could > > do before. I'm very hesitant to let just any modification to the page > > format go through because someone needs extra metadata attached to a > > page. > > Right. So, to be clear, I think there is an opportunity to store ONE > extra blob of data in the page. It might be an extended checksum, or > it might be a nonce for cryptographic authentication, but it can't be > both. I think this is OK, because in earlier discussions of TDE, it > seems that if you're using encryption and also want to verify page > integrity, you'll use an encryption system that produces some kind of > verifier, and you'll store that into this space in the page instead of > using an enhanced-checksum feature. Agreed. > In other words, I'm imagining creating a space at the end of each page > for some sort of enhanced security or data integrity feature, and you > can either choose not to use one (in which case things work as they do > today), or you can choose an extended checksums feature, or maybe in > the future you can choose some form of TDE that involves storing a > nonce or a page verifier in the page. But you just get one. > > Now, the logical question to ask is: well, if there's only one > opportunity to store an extra blob of data on every page, is this the > best way to use it? What if someone comes along with another feature > that also wants to store a blob of data on every page, and they can't > do it because this proposal got there first? My answer is: well, if > that additional feature is something that provides encryption or > tamper-resistance or data integrity or security in any form, then it > can just be added as a new option for how you use this blob of space, > and users who prefer the new thing to the existing options can pick > it. If it's something else, then .... what is it, exactly? It seems to > me that the kinds of things that require space in *every* page of the > cluster are really the things that fall into this category. > > For example, Stephen mused earlier that maybe while we're at it we > could find a way to include an XID epoch in every page. Maybe so, but > we wouldn't actually want that in *every* page. We would only want it > in the heap pages. And as far as I can see that's pretty generally how > things go. There are plenty of projects that might want extra space in > each page *for a certain AM* and I don't see any reason why what I > propose to do here would rule that out. I think this and that could > both be done, and doing this might even make doing that easier by > putting in place some useful infrastructure. What I don't think we can > get away with is having multiple systems that are each taking a bite > out of every page for every AM -- but I think that's OK, because I > don't think there's a lot of need for multiple such systems. 
I agree with the premise of one only needing one such blob on the page, yet I don't think that putting it on the exact end of the page is the best option. PageGetSpecialPointer is much simpler when you can rely on the location of the special area. As special areas can be accessed N times each time a buffer is loaded from disk, and yet the 'storage system extra blob' only twice (once read, once write), I think the special area should have priority when handing out page space. > > That reminds me, there's one more item to be put on the compatibility > > checklist: Currently, the FSM code assumes it can use all space on a > > page (except the page header) for its total of 3 levels of FSM data. > > Mixing page formats would break how it currently works, as changing > > the space that is available on a page will change the fanout level of > > each leaf in the tree, which our current code can't handle. To change > > the page format of one page in the FSM would thus either require a > > rewrite of the whole FSM fork, or extra metadata attached to the > > relation that details where the format changes. A similar issue exists > > with the VM fork. > > I agree with all of this except I think that "mixing page formats" is > a thing we can't do. I'm not sure it's impossible, but I would indeed agree it would not be a trivial issue to solve. > > That being said, I think that it could be possible to reuse > > pd_checksum as an extra area indicator between pd_upper and > > pd_special, so that we'd get [pageheader][pd_linp...] pd_lower [hole] > > pd_upper [datas] pd_storage_ext [blackbox] pd_special [special area]. > > This should require limited rework in current AMs, especially if we > > provide a global MAX_STORAGE_EXT_SIZE that AMs can use to get some > > upper limit on how much overhead the storage uses per page. > > This is an interesting alternative. It's unclear to me that it makes > anything better if the [blackbox] area is before the special area vs. > afterward. The main benefit of this order is that an AM will see its special area at a fixed location if it always uses a fixed-size Opaque struct, i.e. that an AM may still use (Page + BLCKSZ - sizeof(IndexOpaque)) as seen in [0]. There might be little to gain, but alternatively there's also little to lose for the storage system -- page read/write to the FS happens at most once for each time the page is accessed/written to. I'd thus much rather let the IO subsystem pay this cost than the AM, as when you'd offload this cost to the AM that would be a constant overhead for all in-memory operations, while if it were offloaded to the IO it would only be felt once per swapped block, on average. The best point for this layout is that this lets us determine what the data on each page is for without requiring access to shmem variables. Appending or prepending storage-special areas to the pd_special area would confuse AMs about what data is theirs on the page -- making it explicit in the page format would remove this potential for confusion, while allowing this storage-blob area to be dynamically sized. > And either way, if that area is fixed-size across the > cluster, you don't really need to use pd_checksum to find it, because > you can just know where it is. A possible advantage of this approach > is that it might make it simpler to cope with a scenario where some > pages in the cluster have this blackbox space and others don't.
I > wasn't really thinking that on-line page format conversions were > likely to be practical, but certainly the chances are better if we've > got an explicit pointer to the extra space vs. just knowing where it > has to be. > > > Alternatively, we could claim some space on a page using a special > > line pointer at the start of the page referring to storage data, while > > having the same limitation on size. > > That sounds messy. Yep. It isn't my first choice either, but it is something that I did consider - it has the potentially desirable effect of the AM being able to relocate this blob. > > One last option is we recognise that there are two storage locations > > of pages that have different data requirements -- on-disk that > > requires checksums, and in-memory that requires LSNs. Currently, those > > fields are both stored on the page in distinct fields, but we could > > (_could_) update the code to drop LSN when we store the page, and drop > > the checksum when we load the page (at the cost of redo speed when > > recovering from an unclean shutdown). That would provide an extra 64 > > bits on the page without breaking storage, assuming AMs don't already > > misuse pd_lsn. > > It seems wrong to me to say that we don't need the LSN for a page > stored on disk. Recovery relies on it. It's not critical for recovery, "just" very useful; but indeed this too isn't great. - Matthias [0] https://commitfest.postgresql.org/38/3543 [1] https://www.postgresql.org/message-id/CA+TgmoaD8wMN6i1mmuo+4ZNeGE3Hd57ys8uV8UZm7cneqy3W2g@mail.gmail.com
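To make the layout Matthias describes concrete, here is a minimal sketch (not taken from any posted patch) of how a page with a storage-extension blob between pd_upper and pd_special could be navigated, with pd_checksum reused as the blob's offset. The names PageGetStorageExt and "storage ext" are hypothetical, and the header struct is abbreviated:

/*
 * Hypothetical accessors for the proposed layout:
 *   [pageheader][pd_linp...] pd_lower [hole] pd_upper [datas]
 *   pd_storage_ext [blackbox] pd_special [special area]
 * pd_checksum is reused here as the offset of the storage blob.
 */
#include <stddef.h>
#include <stdint.h>

typedef char *Page;
typedef uint16_t LocationIndex;

typedef struct PageHeaderData
{
    /* pd_lsn and other fields omitted for brevity */
    uint16_t      pd_checksum;   /* reused: offset of storage-ext blob, or 0 */
    uint16_t      pd_flags;
    LocationIndex pd_lower;
    LocationIndex pd_upper;
    LocationIndex pd_special;
} PageHeaderData;

/* Start of the storage system's private blob, if the page has one. */
static inline char *
PageGetStorageExt(Page page)
{
    PageHeaderData *hdr = (PageHeaderData *) page;

    return hdr->pd_checksum ? page + hdr->pd_checksum : NULL;
}

/* The AM's special area still starts at pd_special, exactly as today. */
static inline char *
PageGetSpecialPointer(Page page)
{
    return page + ((PageHeaderData *) page)->pd_special;
}

Under this ordering an AM that always allocates a fixed-size opaque struct still finds it at BLCKSZ - sizeof(Opaque), which is the property Matthias wants to preserve.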
On Mon, Jun 13, 2022 at 6:26 PM Peter Geoghegan <pg@bowt.ie> wrote: > Anyway, I can see how it would be useful to be able to know the offset > of a nonce or of a hash digest on any given page, without access to a > running server. But why shouldn't that be possible with other designs, > including designs closer to what I've outlined? I don't know what you mean by this. As far as I'm aware, the only design you've outlined is one where the space wasn't at the same offset on every page. > A known fixed offset in the special area already assumes that all > pages must have a value in the first place, even though that won't be > true for the majority of individual Postgres servers. There is > implicit information involved in a design like the one Robert has > proposed; your backup tool (or whatever) already has to understand to > expect something other than no encryption at all, or no checksum at > all. Tools like pg_filedump already rely on implicit information about > the special area. In general, I was imagining that you'd need to look at the control file to understand how much space had been reserved per page in this particular cluster. I agree that's a bit awkward, especially for pg_filedump. However, pg_filedump and I think also some code internal to PostgreSQL try to figure out what kind of page we've got by looking at the *size* of the special space. It's only good luck that we haven't had a collision there yet, and continuing to rely on that seems like a dead end. Perhaps we should start including a per-AM magic number at the beginning of the special space. > I'm not against the idea of picking a handful of checksum/encryption > schemes, with the understanding that we'll be committing to those > particular schemes indefinitely -- it's not reasonable to expect > infinite flexibility here (and so I don't). But why should we accept > something that seems to me to be totally inflexible, and doesn't > compose with other things? We shouldn't accept something that's totally inflexible, but I don't know why this seems that way to you. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 14, 2022 at 8:48 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jun 13, 2022 at 6:26 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Anyway, I can see how it would be useful to be able to know the offset > > of a nonce or of a hash digest on any given page, without access to a > > running server. But why shouldn't that be possible with other designs, > > including designs closer to what I've outlined? > > I don't know what you mean by this. As far as I'm aware, the only > design you've outlined is one where the space wasn't at the same > offset on every page. I am skeptical of that particular aspect, yes. Though I would define it the other way around (now the true special area struct isn't necessarily at the same offset for a given AM, at least across data directories). My main concern is maintaining the ability to interpret much about the contents of a page without context, and to not make it any harder to grow the special area dynamically -- which is a broader concern. Your patch isn't going to be the last one that wants to do something with the special area. This needs to be carefully considered. I see a huge amount of potential for adding new optimizations that use subsidiary space on the page, presumably implemented via a special area that can grow dynamically. For example, an ad-hoc compression technique for heap pages that temporarily "absorbs" some extra versions in the event of opportunistic pruning running and failing to free enough space. Such a design would operate on similar principles to deduplication in unique indexes, where the goal is to buy time rather than buy space. When we fail to keep the contents of a heap page together today, we often barely fail, so I expect something like this to have an outsized impact on some workloads. > In general, I was imagining that you'd need to look at the control > file to understand how much space had been reserved per page in this > particular cluster. I agree that's a bit awkward, especially for > pg_filedump. However, pg_filedump and I think also some code internal > to PostgreSQL try to figure out what kind of page we've got by looking > at the *size* of the special space. It's only good luck that we > haven't had a collision there yet, and continuing to rely on that > seems like a dead end. Perhaps we should start including a per-AM > magic number at the beginning of the special space. It's true that that approach is just a hack -- we probably can do better. I don't think that it's okay to break it, though. At least not without providing a comparable alternative, that doesn't rely on context from the control file. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Tue, Jun 14, 2022 at 8:48 AM Robert Haas <robertmhaas@gmail.com> wrote: >> However, pg_filedump and I think also some code internal >> to PostgreSQL try to figure out what kind of page we've got by looking >> at the *size* of the special space. It's only good luck that we >> haven't had a collision there yet, and continuing to rely on that >> seems like a dead end. Perhaps we should start including a per-AM >> magic number at the beginning of the special space. It's been some years since I had much to do with pg_filedump, but my recollection is that the size of the special space is only one part of its heuristics, because there already *are* collisions. Moreover, there already are per-AM magic numbers in there that it uses to resolve those cases. They're not at the front though. Nobody has ever wanted to break on-disk compatibility just to make pg_filedump's page-type identification less klugy, so I have to think the above suggestion is a non-starter. regards, tom lane
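For readers unfamiliar with the heuristics Tom is describing: the existing per-AM markers live in the *last* two bytes of the special area, so a tool can already do something like the sketch below. The two page-ID constants are real (from hash.h and spgist_private.h); the rest is simplified for illustration:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASHO_PAGE_ID   0xFF80   /* real constant, src/include/access/hash.h */
#define SPGIST_PAGE_ID  0xFF82   /* real constant, spgist_private.h */

typedef enum { PAGE_UNKNOWN, PAGE_HASH, PAGE_SPGIST } PageKind;

/*
 * Guess the owning AM from the special area alone.  The markers sit at
 * the end of the opaque struct, which is why a magic number at the
 * *front* of the special space would be a new convention.
 */
static PageKind
guess_page_kind(const uint8_t *special, size_t special_size)
{
    uint16_t id;

    if (special_size < sizeof(id))
        return PAGE_UNKNOWN;
    memcpy(&id, special + special_size - sizeof(id), sizeof(id));
    if (id == HASHO_PAGE_ID)
        return PAGE_HASH;
    if (id == SPGIST_PAGE_ID)
        return PAGE_SPGIST;
    /* otherwise fall back to size-based heuristics, as pg_filedump does */
    return PAGE_UNKNOWN;
}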
On Tue, Jun 14, 2022 at 11:08 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > I agree with the premise that one needs only one such blob on the > page, yet I don't think that putting it at the exact end of the page > is the best option. > > PageGetSpecialPointer is much simpler when you can rely on the > location of the special area. As special areas can be accessed N times > each time a buffer is loaded from disk, and yet the 'storage system > extra blob' only twice (once read, once write), I think the special > area should have priority when handing out page space. Hmm, but on the other hand, if you imagine a scenario in which the "storage system extra blob" is actually a nonce for TDE, you need to be able to find it before you've decrypted the rest of the page. If pd_checksum gives you the offset of that data, you need to exclude it from what gets encrypted, which means that you need to encrypt three separate non-contiguous areas of the page whose combined size is unlikely to be a multiple of the encryption algorithm's block size. That kind of sucks (and putting it at the end of the page makes it way better). That said, I certainly agree that finding the special space needs to be fast. The question in my mind is HOW fast it needs to be, and what techniques we might be able to use to dodge the problem. For instance, suppose that, during the startup sequence, we look at the control file, figure out the size of the 'storage system extra blob', and based on that each AM figures out the byte-offset of its special space and caches that in a global variable. Then, instead of PageGetSpecialSpace(page) it does PageGetBtreeSpecialSpace(page) or whatever, where the implementation is ((char*) page) + the_aforementioned_global_variable. Is that going to be too slow? If it is, then I think this whole effort may be in more trouble than I can get it out of, because it's not just the location of the special space that is an issue here, and indeed from what I can see that's not even the most important issue. There's tons of constants that are computed based on the amount of usable space in the page, and I don't have a better idea than turning those constants into global variables that are computed once ... well, perhaps in some cases we could compile multiple copies of hot bits of code, one per possible value of the compile-time constant, but I'm pretty sure we don't want to do that for the entire index AM. There's going to have to be some compromise here. On the one hand you're going to have people who want to be able to do run-time conversions between page formats even at the cost of extra runtime overhead on top of what the basic feature necessarily implies. On the other hand you're going to have people who don't think any overhead at all is acceptable, even if it's purely nominal and only visible on a microbenchmark. Such arguments can easily become holy wars. I think we should take a pragmatic approach: big slowdowns are categorically unacceptable, and every effort must be made to minimize overhead, but if the only permissible amount of overhead is exactly zero, then there's no hope of ever implementing any of these kinds of features. I don't think that's actually what most people want. -- Robert Haas EDB: http://www.enterprisedb.com
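A rough sketch of the startup-time caching Robert describes, assuming a per-page reserved size is recorded in the control file. The names InitBtreeSpecialOffset, reserved_page_size, and PageGetBtreeSpecialSpace are invented for illustration, and BTPageOpaqueData stands in for the real btree opaque struct:

#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192                             /* as configured at build time */
#define MAXALIGN(x) (((x) + 7) & ~((size_t) 7)) /* 8-byte alignment, simplified */

typedef char *Page;
typedef struct BTPageOpaqueData
{
    uint16_t btpo_flags;    /* other fields omitted for brevity */
} BTPageOpaqueData;
typedef BTPageOpaqueData *BTPageOpaque;

static size_t btree_special_offset;     /* computed once, then read-only */

/* Called once at startup, with the reserved size read from the control file. */
void
InitBtreeSpecialOffset(size_t reserved_page_size)
{
    btree_special_offset = BLCKSZ - reserved_page_size
        - MAXALIGN(sizeof(BTPageOpaqueData));
}

/* Replaces a compile-time constant with one global load and an add. */
static inline BTPageOpaque
PageGetBtreeSpecialSpace(Page page)
{
    return (BTPageOpaque) (page + btree_special_offset);
}

Whether that extra indirection is measurable in practice is exactly the question Robert raises.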
On Tue, Jun 14, 2022 at 9:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > It's been some years since I had much to do with pg_filedump, but > my recollection is that the size of the special space is only one > part of its heuristics, because there already *are* collisions. Right, there are collisions even today. The heuristics are kludgey, but they work perfectly in practice. That's not just due to luck -- it's due to people making sure that they continued to work over time. > Moreover, there already are per-AM magic numbers in there that > it uses to resolve those cases. They're not at the front though. > Nobody has ever wanted to break on-disk compatibility just to make > pg_filedump's page-type identification less klugy, so I find it > hard to believe that the above suggestion isn't a non-starter. There is no doubt that it's not worth breaking on-disk compatibility just for pg_filedump. The important principle here is that high-context page formats are bad, and should be avoided whenever possible. Why isn't it possible to avoid it here? We have all the bits we need for it in the page header, and then some. Why should we assume that it'll never be useful to apply encryption selectively, perhaps at the relation level? -- Peter Geoghegan
On Tue, Jun 14, 2022 at 10:43 AM Robert Haas <robertmhaas@gmail.com> wrote: > Hmm, but on the other hand, if you imagine a scenario in which the > "storage system extra blob" is actually a nonce for TDE, you need to > be able to find it before you've decrypted the rest of the page. If > pd_checksum gives you the offset of that data, you need to exclude it > from what gets encrypted, which means that you need to encrypt three > separate non-contiguous areas of the page whose combined size is > unlikely to be a multiple of the encryption algorithm's block size. > That kind of sucks (and putting it at the end of the page makes it way > better). I don't have a great understanding of how that cost will be felt in detail right now, because I don't know enough about the project and the requirements for TDE in general. > That said, I certainly agree that finding the special space needs to > be fast. The question in my mind is HOW fast it needs to be, and what > techniques we might be able to use to dodge the problem. For instance, > suppose that, during the startup sequence, we look at the control > file, figure out the size of the 'storage system extra blob', and > based on that each AM figures out the byte-offset of its special space > and caches that in a global variable. Then, instead of > PageGetSpecialSpace(page) it does PageGetBtreeSpecialSpace(page) or > whatever, where the implementation is ((char*) page) + > the_aforementioned_global_variable. Is that going to be too slow? Who knows? For now the important point is that there is a tension between the requirements of TDE, and the requirements of access methods (especially index access methods). It's possible that this will turn out not to be much of a problem. But the burden of proof is yours. Making a big change to the on-disk format like this (a change that affects every access method) should be held to an exceptionally high standard. There are bound to be tacit or even explicit assumptions made by access methods that you risk breaking here. The reality is that all of the access method code evolved in an environment where the special space size was constant and generic for a given BLCKSZ. I don't have much sympathy for any suggestion that code written 20 years ago should have known not to make these assumptions. I have a lot more sympathy for the idea that it's a general problem with our infrastructure (particularly code in bufpage.c and the delicate assumptions made by its callers) -- a problem that is worth addressing with a broad solution that enables lots of different work. We don't necessarily get another shot at this if we get it wrong now. > There's going to have to be some compromise here. On the one hand > you're going to have people who want to be able to do run-time > conversions between page formats even at the cost of extra runtime > overhead on top of what the basic feature necessarily implies. On the > other hand you're going to have people who don't think any overhead at > all is acceptable, even if it's purely nominal and only visible on a > microbenchmark. Such arguments can easily become holy wars. How many times has a big change of this magnitude to the on-disk format taken place, post-pg_upgrade? I would argue that this would be the first, since it is the moral equivalent of extending the size of the generic page header. For all I know the overhead will be perfectly fine, and everybody wins.
I just want to be adamant that we're making the right trade-offs, and maximizing the benefit from any new cost imposed on access method code. -- Peter Geoghegan
On Tue, Jun 14, 2022 at 1:43 PM Peter Geoghegan <pg@bowt.ie> wrote: > There is no doubt that it's not worth breaking on-disk compatibility > just for pg_filedump. The important principle here is that > high-context page formats are bad, and should be avoided whenever > possible. I agree. > Why isn't it possible to avoid it here? We have all the bits we need > for it in the page header, and then some. Why should we assume that > it'll never be useful to apply encryption selectively, perhaps at the > relation level? We can have anything we want here, but we can't have everything we want at the same time. There are irreducible engineering trade-offs here. If all pages in a given cluster are the same, backends can compute the values of things that are currently compile-time constants upon startup and continue to use them for the lifetime of the backend. If pages can vary, some encrypted or checksummed and others not, then you have to recompute those values for every page. That's bound to have some cost. It is also more flexible. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 14, 2022 at 11:14 AM Robert Haas <robertmhaas@gmail.com> wrote: > We can have anything we want here, but we can't have everything we > want at the same time. There are irreducible engineering trade-offs > here. If all pages in a given cluster are the same, backends can > compute the values of things that are currently compile-time constants > upon startup and continue to use them for the lifetime of the backend. > If pages can vary, some encrypted or checksummed and others not, then > you have to recompute those values for every page. That's bound to > have some cost. It is also more flexible. Maybe not -- it depends on the particulars of the code. For example, it might be okay for the B-Tree code to assume that B-Tree pages have a special area at a known fixed offset, determined at compile time. At the same time, it might very well not be okay for a backup tool to make any such assumption, because it doesn't have the same context. Even within TDE, it might be okay to assume that it's a feature that the user must commit to using for a whole cluster at initdb time. What isn't okay is committing to that assumption now and forever, by leaving the door open to a world in which that assumption no longer holds. Like when you do finally get around to making TDE something that can work at the relation level, for example. Even if there is only a small chance of that ever happening, why wouldn't we be prepared for it, just on general principle? -- Peter Geoghegan
On Tue, Jun 14, 2022 at 2:23 PM Peter Geoghegan <pg@bowt.ie> wrote: > Maybe not -- it depends on the particulars of the code. For example, > it might be okay for the B-Tree code to assume that B-Tree pages have > a special area at a known fixed offset, determined at compile time. At > the same time, it might very well not be okay for a backup tool to > make any such assumption, because it doesn't have the same context. > > Even within TDE, it might be okay to assume that it's a feature that > the user must commit to using for a whole cluster at initdb time. What > isn't okay is committing to that assumption now and forever, by > leaving the door open to a world in which that assumption no longer > holds. Like when you do finally get around to making TDE something > that can work at the relation level, for example. Even if there is > only a small chance of that ever happening, why wouldn't we be > prepared for it, just on general principle? To the extent that we can leave ourselves room to do new things in the future without incurring unreasonable costs in the present, I'm in favor of that, as I believe anyone would be. But as you say, a lot depends on the specifics. Theoretical flexibility that can only be used in practice by really slow code doesn't help anybody. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 14, 2022 at 11:52 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Even within TDE, it might be okay to assume that it's a feature that > > the user must commit to using for a whole cluster at initdb time. What > > isn't okay is committing to that assumption now and forever, by > > leaving the door open to a world in which that assumption no longer > > holds. Like when you do finally get around to making TDE something > > that can work at the relation level, for example. Even if there is > > only a small chance of that ever happening, why wouldn't we be > > prepared for it, just on general principle? > > To the extent that we can leave ourselves room to do new things in the > future without incurring unreasonable costs in the present, I'm in > favor of that, as I believe anyone would be. But as you say, a lot > depends on the specifics. Theoretical flexibility that can only be > used in practice by really slow code doesn't help anybody. A tool like pg_filedump or a backup tool can easily afford this overhead. The only cost that TDE has to pay for this added flexibility is that it has to set one of the PD_* bits in a code path that is already bound to be very expensive. What's so bad about that? Honestly, I'm a bit surprised that you're pushing back on this particular point. A nonce for TDE is just something that code in places like bufpage.h ought to know about. It has to be negotiated at that level, because it will in fact affect a lot of callers to the bufpage.h functions. -- Peter Geoghegan
On Tue, Jun 14, 2022 at 3:01 PM Peter Geoghegan <pg@bowt.ie> wrote: > A tool like pg_filedump or a backup tool can easily afford this > overhead. The only cost that TDE has to pay for this added flexibility > is that it has to set one of the PD_* bits in a code path that is > already bound to be very expensive. What's so bad about that? > > Honestly, I'm a bit surprised that you're pushing back on this > particular point. A nonce for TDE is just something that code in > places like bufpage.h ought to know about. It has to be negotiated at > that level, because it will in fact affect a lot of callers to the > bufpage.h functions. Peter, unless I have missed something, this email is the very first one where you or anyone else have said anything at all about a PD_* bit. Even here, it's not very clear exactly what you are proposing. Therefore I have neither said anything bad about it in the past, nor can I now answer the question as to what is "so bad about it." If you want to make a concrete proposal, I will be happy to tell you what I think about it. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 14, 2022 at 12:13 PM Robert Haas <robertmhaas@gmail.com> wrote: > Peter, unless I have missed something, this email is the very first > one where you or anyone else have said anything at all about a PD_* > bit. Even here, it's not very clear exactly what you are proposing. > Therefore I have neither said anything bad about it in the past, nor > can I now answer the question as to what is "so bad about it." If you > want to make a concrete proposal, I will be happy to tell you what I > think about it. I am proposing that we not commit ourselves to relying on implicit information about what must be true for every page in the cluster. Just having a little additional page-header metadata (in pd_flags) would accomplish that much, and wouldn't in itself impose any real overhead on TDE. It's not like the PageHeaderData.pd_flags bits are already a precious commodity, in the same way as the heap tuple infomask status bits are. We can afford to use some of them for this purpose, and then some. Why wouldn't we do it that way, just on general principle? You may still find it useful to rely on high level context at the level of code that runs on the server, perhaps for performance reasons (though it's unclear how much it matters). In which case the status bit is technically redundant information as far as the code is concerned. That may well be fine. -- Peter Geoghegan
On Tue, Jun 14, 2022 at 3:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > I am proposing that we not commit ourselves to relying on implicit > information about what must be true for every page in the cluster. > Just having a little additional page-header metadata (in pd_flags) > would accomplish that much, and wouldn't in itself impose any real > overhead on TDE. > > It's not like the PageHeaderData.pd_flags bits are already a precious > commodity, in the same way as the heap tuple infomask status bits are. > We can afford to use some of them for this purpose, and then some. > > Why wouldn't we do it that way, just on general principle? > > You may still find it useful to rely on high level context at the > level of code that runs on the server, perhaps for performance reasons > (though it's unclear how much it matters). In which case the status > bit is technically redundant information as far as the code is > concerned. That may well be fine. I still am not clear on precisely what you are proposing here. I do agree that there is significant bit space available in pd_flags and that consuming some of it wouldn't be stupid, but that doesn't add up to a proposal. Maybe the proposal is: figure out how many different configurations there are for this new kind of page space, let's say N, and then reserve ceil(log2(N)) bits from pd_flags to indicate which one we've got. One possible problem with this is that, if the page is actually encrypted, we might want pd_flags to also be encrypted. The existing contents of pd_flags disclose some information about the tuples that are on the page, so having them exposed to prying eyes does not seem appealing. -- Robert Haas EDB: http://www.enterprisedb.com
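Here is one shape the ceil(log2(N)) idea could take, as a sketch. The first three PD_* bits are the ones that exist today in bufpage.h; the mask, shift, and configuration values are hypothetical:

#include <stdint.h>

#define PD_HAS_FREE_LINES   0x0001   /* existing flag bits (bufpage.h) */
#define PD_PAGE_FULL        0x0002
#define PD_ALL_VISIBLE      0x0004

#define PD_EXT_SPACE_MASK   0x0018   /* hypothetical: two reserved bits */
#define PD_EXT_SPACE_SHIFT  3

#define PD_EXT_NONE         0        /* no reserved page space */
#define PD_EXT_CHECKSUM64   1        /* extended checksum */
#define PD_EXT_TDE_VERIFIER 2        /* nonce / page verifier */
                                     /* value 3 left for future use */

/* Which page-space configuration does this page use? */
static inline int
PageGetExtSpaceKind(uint16_t pd_flags)
{
    return (pd_flags & PD_EXT_SPACE_MASK) >> PD_EXT_SPACE_SHIFT;
}

Two bits cover four configurations; of course, if pd_flags itself is encrypted along with the rest of the page, these bits are only readable after decryption, which is the complication Robert notes above.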
On Tue, Jun 14, 2022 at 1:22 PM Robert Haas <robertmhaas@gmail.com> wrote: > I still am not clear on precisely what you are proposing here. I do > agree that there is significant bit space available in pd_flags and > that consuming some of it wouldn't be stupid, but that doesn't add up > to a proposal. Maybe the proposal is: figure out how many different > configurations there are for this new kind of page space, let's say N, > and then reserve ceil(log2(N)) bits from pd_flags to indicate which > one we've got. I'm just making a general point. Why wouldn't we start out with the assumption that we use some pd_flags bit space for this stuff? > One possible problem with this is that, if the page is actually > encrypted, we might want pd_flags to also be encrypted. The existing > contents of pd_flags disclose some information about the tuples that > are on the page, so having them exposed to prying eyes does not seem > appealing. I'm skeptical of the idea that we want to avoid leaving any metadata unencrypted. But I'm not an expert on TDE, and don't want to say too much about it without having done some more research. I would like to see some justification for just encrypting everything on the page without concern for the loss of debuggability, though. What is the underlying theory behind that particular decision? Are there any examples that we can draw from, from other systems or published designs? Let's assume for now that we don't leave pd_flags unencrypted, as you have suggested. We're still discussing new approaches to checksumming in the scope of this work, which of course includes many individual cases that don't involve any encryption. Plus even with encryption there are things like defensive assertions that can be added by using a flag bit for this. -- Peter Geoghegan
On Tue, Jun 14, 2022 at 1:32 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jun 14, 2022 at 1:22 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I still am not clear on precisely what you are proposing here. I do > > agree that there is significant bit space available in pd_flags and > > that consuming some of it wouldn't be stupid, but that doesn't add up > > to a proposal. Maybe the proposal is: figure out how many different > > configurations there are for this new kind of page space, let's say N, > > and then reserve ceil(log2(N)) bits from pd_flags to indicate which > > one we've got. > > I'm just making a general point. Why wouldn't we start out with the > assumption that we use some pd_flags bit space for this stuff? Technically we don't already do that today, with the 16-bit checksums that are stored in PageHeaderData.pd_checksum. But we do something equivalent: low-level tools can still infer that checksums must not be enabled on the page (really the cluster) indirectly in the event of a 0 checksum. A 0 value can reasonably be interpreted as a page from a cluster without checksums (barring page corruption). This is basically reasonable because our implementation of checksums is guaranteed to not generate 0 as a valid checksum value. While pg_filedump does not rely on the 0 checksum convention currently, it doesn't really need to. When the user uses the -k option to verify checksums in passing, pg_filedump can assume that checksums must be enabled ("the user said they must be so expect it" is a reasonable assumption at that point). This also depends on there being only one approach to checksums. -- Peter Geoghegan
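The nonzero-checksum guarantee Peter is leaning on comes from the final reduction step of the page checksum. A condensed version follows; the stand-in block checksum here is a plain FNV-1a loop, whereas the real pg_checksum_block in checksum_impl.h uses a vectorizable FNV-1a-derived scheme, but the tail reduction is the interesting part:

#include <stddef.h>
#include <stdint.h>

/* Stand-in for checksum_impl.h's block checksum (simplified). */
static uint32_t
pg_checksum_block_standin(const char *page, size_t len)
{
    uint32_t hash = 2166136261u;    /* FNV offset basis */

    for (size_t i = 0; i < len; i++)
    {
        hash ^= (uint8_t) page[i];
        hash *= 16777619u;          /* FNV prime */
    }
    return hash;
}

uint16_t
pg_checksum_page_tail(const char *page, size_t len, uint32_t blkno)
{
    uint32_t checksum = pg_checksum_block_standin(page, len);

    /* Mix in the block number to detect transposed pages. */
    checksum ^= blkno;

    /* Reduce to 16 bits with an offset of one, so 0 is never produced. */
    return (uint16_t) ((checksum % 65535) + 1);
}

Because the result is always in 1..65535, a stored checksum of 0 can only mean "never checksummed" (or corruption), which is the inference low-level tools rely on.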
On Tue, Jun 14, 2022 at 4:33 PM Peter Geoghegan <pg@bowt.ie> wrote: > I'm just making a general point. Why wouldn't we start out with the > assumption that we use some pd_flags bit space for this stuff? Well, the reason that wasn't my starting assumption is that I didn't think of the idea. > I'm skeptical of the idea that we want to avoid leaving any metadata > unencrypted. But I'm not an expert on TDE, and don't want to say too > much about it without having done some more research. I would like to > see some justification for just encrypting everything on the page > without concern for the loss of debuggability, though. What is the > underlying theory behind that particular decision? Are there any > examples that we can draw from, from other systems or published > designs? I don't really think there is much controversy about the idea that we should encrypt all of the data rather than only some of it. I mean, that's what side channel attacks are: failure to secure all of the information that an attacker might find useful. Unfortunately, it seems inevitable that any TDE implementation in PostgreSQL is going to leak some information that an attacker might consider useful - e.g. we can't conceal how many files there are, or what they're called, or the lengths of those files. But it seems absolutely clear that our goal ought to be to leak as little information as possible. > Let's assume for now that we don't leave pd_flags unencrypted, as you > have suggested. We're still discussing new approaches to checksumming > in the scope of this work, which of course includes many individual > cases that don't involve any encryption. Plus even with encryption > there are things like defensive assertions that can be added by using > a flag bit for this. True. I don't think we should be too profligate with those bits just in case somebody needs a bunch of them for something important in the future, but it's probably fine to use up one or two. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 14, 2022 at 9:56 PM Peter Geoghegan <pg@bowt.ie> wrote: > Technically we don't already do that today, with the 16-bit checksums > that are stored in PageHeaderData.pd_checksum. But we do something > equivalent: low-level tools can still infer that checksums must not be > enabled on the page (really the cluster) indirectly in the event of a > 0 checksum. A 0 value can reasonably be interpreted as a page from a > cluster without checksums (barring page corruption). This is basically > reasonable because our implementation of checksums is guaranteed to > not generate 0 as a valid checksum value. I don't think that 'pg_checksums -d' zeroes the checksum values on the pages in the cluster. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 14, 2022 at 7:17 PM Robert Haas <robertmhaas@gmail.com> wrote: > But it seems > absolutely clear that our goal ought to be to leak as little > information as possible. But at what cost? Basically I think that this is giving up rather a lot. For example, isn't it possible that we'd have corruption that could be a bug in either the checksum code, or in recovery? I'd feel a lot better about it if there was some sense of both the costs and the benefits. > > Let's assume for now that we don't leave pd_flags unencrypted, as you > > have suggested. We're still discussing new approaches to checksumming > > in the scope of this work, which of course includes many individual > > cases that don't involve any encryption. Plus even with encryption > > there are things like defensive assertions that can be added by using > > a flag bit for this. > > True. I don't think we should be too profligate with those bits just > in case somebody needs a bunch of them for something important in the > future, but it's probably fine to use up one or two. Sure, but how many could possibly be needed for this? I can't see it being more than 2 or 3. Which seems absolutely fine. They *definitely* have no value if nobody ever uses them for anything. -- Peter Geoghegan
On Tue, Jun 14, 2022 at 10:21:16PM -0400, Robert Haas wrote: > On Tue, Jun 14, 2022 at 9:56 PM Peter Geoghegan <pg@bowt.ie> wrote: >> Technically we don't already do that today, with the 16-bit checksums >> that are stored in PageHeaderData.pd_checksum. But we do something >> equivalent: low-level tools can still infer that checksums must not be >> enabled on the page (really the cluster) indirectly in the event of a >> 0 checksum. A 0 value can reasonably be interpreted as a page from a >> cluster without checksums (barring page corruption). This is basically >> reasonable because our implementation of checksums is guaranteed to >> not generate 0 as a valid checksum value. > > I don't think that 'pg_checksums -d' zeroes the checksum values on the > pages in the cluster. Saving the suspense.. pg_checksums --disable only updates the control file to keep the operation cheap. -- Michael
On Tue, Jun 14, 2022 at 7:21 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jun 14, 2022 at 9:56 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Technically we don't already do that today, with the 16-bit checksums > > that are stored in PageHeaderData.pd_checksum. But we do something > > equivalent: low-level tools can still infer that checksums must not be > > enabled on the page (really the cluster) indirectly in the event of a > > 0 checksum. A 0 value can reasonably be interpreted as a page from a > > cluster without checksums (barring page corruption). This is basically > > reasonable because our implementation of checksums is guaranteed to > > not generate 0 as a valid checksum value. > > I don't think that 'pg_checksums -d' zeroes the checksum values on the > pages in the cluster. Obviously there are limitations on when and how we can infer something about the whole cluster based on one single page image -- it all depends on the context. I'm only arguing that we ought to make this kind of analysis as easy as we reasonably can. I just don't see any downside to having a status bit per checksum or encryption algorithm at the page level, and plenty of upside (especially in the event of bugs). This seems like the absolute bare minimum to me, and I'm genuinely surprised that there is even a question about whether or not we should do that much. -- Peter Geoghegan
On 13.06.22 20:20, Robert Haas wrote: > If the user wants 16-bit checksums, the feature we've already got > seems good enough -- and, as you say, it doesn't use any extra disk > space. This proposal is just about making people happy if they want a > bigger checksum. It's hard to get any definite information about what size of checksum is "good enough", since after all it depends on what kinds of errors you expect and what kinds of probabilities you want to accept. But the best I could gather so far is that 16-bit CRCs are good up to about a 16 kB block size. Which leads to the question of whether there is really a lot of interest in catering to larger block sizes. The recent thread about performance impact of different block sizes might renew interest in this. But unless we really want to encourage playing with the block sizes (and if my claim above is correct), then a larger checksum size might not be needed. > On the topic of which algorithm to use, I'd be inclined to think that > it is going to be more useful to offer checksums that are 64 bits or > more, since IMHO 32 is not all that much more than 16, and I still > think there are going to be alignment issues. Beyond that I don't have > anything against your specific suggestions, but I'd like to hear what > other people think. Again, gathering some vague information ... The benefits of doubling the checksum size are exponential rather than linear, so there is no significant benefit in using a 64-bit checksum over a 32-bit one, for supported block sizes (current max is 32 kB).
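For scale, the arithmetic behind "exponential rather than linear": assuming a well-mixed checksum, a random corruption survives undetected with probability about 2^-n, independent of block size. A trivial illustration:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    int widths[] = {16, 32, 64};

    /* Each doubling of the width squares the protection factor. */
    for (int i = 0; i < 3; i++)
        printf("%2d-bit checksum: random corruption undetected ~1 in %.3g\n",
               widths[i], pow(2.0, widths[i]));
    return 0;
}

That prints roughly 1 in 6.55e+04, 1 in 4.29e+09, and 1 in 1.84e+19.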
On Wed, Jun 15, 2022 at 4:54 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > It's hard to get any definite information about what size of checksum is > "good enough", since after all it depends on what kinds of errors you > expect and what kinds of probabilities you want to accept. But the best > I could gather so far is that 16-bit CRCs are good up to about a 16 kB > block size. Not really. There's a lot of misinformation on this topic floating around on this mailing list, and some of that misinformation is my fault. I keep learning more about this topic. However, I'm pretty confident that, on the one hand, there's no hard limit on the size of the data that can be effectively validated via a CRC, and on the other hand, CRC isn't a particularly great algorithm, although it does have certain interesting advantages for certain purposes. For example, according to https://en.wikipedia.org/wiki/Mathematics_of_cyclic_redundancy_checks#Error_detection_strength a CRC is guaranteed to detect all single-bit errors. This property is easy to achieve: for example, a parity bit has this property. According to the same source, a CRC is guaranteed to detect two-bit errors only if the distance between them is less than some limit that gets larger as the CRC gets wider. Imagine that you have a CRC-16 of a message 64k+1 bits in length. Suppose that an error in the first bit changes the result from v to v'. Can we, by flipping a second bit later in the message, change the final result from v' back to v? The calculation only has 64k possible answers, and we have 64k bits we can flip to try to get the desired answer. If every one of those bit flips produces a different answer, then one of those answers must be v -- which means detection of two-bit errors is not guaranteed. If at least two of those bit flips produce the same answer, then consider the messages produced by those two different bit flips. They differ from each other by exactly two bits and yet produced the same CRC, so detection of two-bit errors is still not guaranteed. On the other hand, it's still highly likely. If a message of length 2^16+1 bits contains two bit errors, one of which is in the first bit, the chances that the other one is in exactly the right place to cancel out the first error are about 2^-16. That's not zero, but it's just as good as our chances of detecting a replacement of the entire message with some other message chosen completely at random. I think the reason why discussion of CRCs tends to focus on the types of bit errors that they can detect is that the algorithm was designed when people were doing stuff like sending files over a modem. It's easy to understand how individual bits could get garbled without anybody noticing, while large-scale corruption would be less likely, but the risks are not necessarily the same for a PostgreSQL data file. Lower levels of the stack are probably already using checksums to try to detect errors at the level of the physical medium. I'm sure some stuff slips through the cracks, but in practice we also see failure modes where the filesystem substitutes 8kB of data from an unrelated file, or where a torn write in combination with unreliable fsync results in half of the page contents being from an older version of the page. These kinds of large-scale replacements aren't what CRCs are designed to detect, and the chances that we will detect them are roughly 1-2^-bits, whether we use a CRC or something else. Of course, that partly depends on the algorithm quality.
If an algorithm is more likely to generate some results than others, then its actual error detection rate will not be as good as the number of output bits would suggest. If the result doesn't depend equally on every input bit, then the actual error detection rate will not be as good as the number of output bits would suggest. And CRC-32 is apparently not great by modern standards: https://github.com/rurban/smhasher Compare the results for CRC-32 with, say, Spooky32. Apparently the latter is faster yet produces better output. So maybe we would've been better off if we'd made Spooky32 the default algorithm for backup manifest checksums rather than CRC-32. > The benefits of doubling the checksum size are exponential rather than > linear, so there is no significant benefit of using a 64-bit checksum > over a 32-bit one, for supported block sizes (current max is 32 kB). I'm still unconvinced that the block size is very relevant here. -- Robert Haas EDB: http://www.enterprisedb.com
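Robert's two-bit-error argument is easy to demonstrate empirically. Below is a toy program, scaled down to CRC-8 with the 0x07 polynomial so the search is instant, that flips the first bit of a message longer than 2^8 bits and then looks for a second flip that restores the original CRC; any hit it reports is a two-bit error that this CRC cannot detect:

#include <stdint.h>
#include <stdio.h>

#define NBYTES 64    /* 512 bits, well past CRC-8's guaranteed-detection window */

/* Plain MSB-first CRC-8, polynomial x^8 + x^2 + x + 1 (0x07). */
static uint8_t
crc8(const uint8_t *buf, int len)
{
    uint8_t crc = 0;

    for (int i = 0; i < len; i++)
    {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80) ? (uint8_t) ((crc << 1) ^ 0x07)
                               : (uint8_t) (crc << 1);
    }
    return crc;
}

int
main(void)
{
    uint8_t msg[NBYTES] = "an arbitrary message";
    uint8_t orig = crc8(msg, NBYTES);

    msg[0] ^= 0x80;                                    /* first bit error */
    for (int bit = 1; bit < NBYTES * 8; bit++)
    {
        msg[bit / 8] ^= (uint8_t) (0x80 >> (bit % 8)); /* second bit error */
        if (crc8(msg, NBYTES) == orig)
        {
            printf("undetected two-bit error: bits 0 and %d\n", bit);
            return 0;
        }
        msg[bit / 8] ^= (uint8_t) (0x80 >> (bit % 8)); /* undo and retry */
    }
    printf("every two-bit error was detected\n");
    return 0;
}

Run the same search against a message shorter than the polynomial's guarantee window and it should come up empty, matching the Wikipedia description quoted above.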
On Tue, Jun 14, 2022 at 10:30 PM Peter Geoghegan <pg@bowt.ie> wrote: > Basically I think that this is giving up rather a lot. For example, > isn't it possible that we'd have corruption that could be a bug in > either the checksum code, or in recovery? > > I'd feel a lot better about it if there was some sense of both the > costs and the benefits. I think that, if and when we get TDE, debuggability is likely to be a huge issue. Something will go wrong for someone at some point, and when it does, what they'll have is a supposedly-encrypted page that cannot be decrypted, and it will be totally unclear what has gone wrong. Did the page get corrupted on disk by a random bit flip? Is there a bug in the algorithm? Torn page? As things stand today, when a page gets corrupted, a human being can look at the page and make an educated guess about what has gone wrong and whether PostgreSQL or some other system is to blame, and if it's PostgreSQL, perhaps have some ideas as to where to look for the bug. If the pages are encrypted, that's a lot harder. I think what will happen, depending on the encryption mode, is probably that either (a) the page will decrypt to complete garbage or (b) the page will fail some kind of verification and you won't be able to decrypt it at all. Either way, you won't be able to infer anything about what caused the problem. All you'll know is that something is wrong. That sucks - a lot - and I don't have a lot of good ideas as to what can be done about it. The idea that an encrypted page is unintelligible and that small changes to either the encrypted or unencrypted data should result in large changes to the other is intrinsic to the nature of encryption. It's more or less un-debuggable by design. With extended checksums, I don't think the issues are anywhere near as bad. I'm not deeply opposed to setting a page-level flag but I expect nominal benefits. A human being looking at the page isn't going to have a ton of trouble figuring out whether or not the extended checksum is present unless the page is horribly, horribly garbled, and even if that happens, will debugging that problem really be any worse than debugging a horribly, horribly garbled page today? I don't think so. I likewise expect that pg_filedump could use heuristics to figure out what's going on just by looking at the page, even if no external information is available. You are probably right when you say that there's no need to be so parsimonious with pd_flags space as all that, but I believe that if we did decide to set no bit in pd_flags, whoever maintains pg_filedump these days would not have huge difficulty inventing a suitable heuristic. A page with an extended checksum is basically still an intelligible page, and we shouldn't understate the value of that. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jun 15, 2022 at 1:27 PM Robert Haas <robertmhaas@gmail.com> wrote: > I think what will happen, depending on > the encryption mode, is probably that either (a) the page will decrypt > to complete garbage or (b) the page will fail some kind of > verification and you won't be able to decrypt it at all. Either way, > you won't be able to infer anything about what caused the problem. All > you'll know is that something is wrong. That sucks - a lot - and I > don't have a lot of good ideas as to what can be done about it. The > idea that an encrypted page is unintelligible and that small changes > to either the encrypted or unencrypted data should result in large > changes to the other is intrinsic to the nature of encryption. It's > more or less un-debuggable by design. It's pretty clear that there must be a lot of truth to that. But that doesn't mean that there aren't meaningful gradations beyond that. I think that it's worth doing the following exercise (humor me): Why wouldn't it be okay to just encrypt the tuple space and the line pointer array, leaving both the page header and page special area unencrypted? What kind of user would find that trade-off to be unacceptable, and why? What's the nuance of it? For all I know you're right (about encrypting the whole page, metadata and all). I just want to know why that is. I understand that this whole area is one where in general we may have to live with a certain amount of uncertainty about what really matters. > With extended checksums, I don't think the issues are anywhere near as > bad. I'm not deeply opposed to setting a page-level flag but I expect > nominal benefits. I also expect only a small benefit. But that isn't a particularly important factor in my mind. Let's suppose that it turns out to be significantly more useful than we originally expected, for whatever reason. Assuming all that, what else can be said about it now? Isn't it now *relatively* likely that including that status bit metadata will be *extremely* valuable, and not merely somewhat more valuable? I guess it doesn't matter much now (since you have all but conceded that using a bit for this makes sense), but FWIW that's the main reason why I almost took it for granted that we'd need to use a status bit (or bits) for this. -- Peter Geoghegan
On Wed, Jun 15, 2022 at 5:53 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that it's worth doing the following exercise (humor me): Why > wouldn't it be okay to just encrypt the tuple space and the line > pointer array, leaving both the page header and page special area > unencrypted? What kind of user would find that trade-off to be > unacceptable, and why? What's the nuance of it? Let's consider a continuum where, on the one end, you encrypt the entire disk. Then, consider a solution where you encrypt each individual file, block by block. Next, let's imagine that we don't encrypt some kinds of files at all, if we think the data in them isn't sensitive enough. CLOG, maybe. Perhaps pg_class, because that'd be useful for debugging, and how sensitive can the names of the database tables be? Then, let's adopt your proposal here and leave some parts of each block unencrypted for debuggability. As a further step, we could separately encrypt each tuple, but only the data, leaving the tuple header unencrypted. Then, going further, we could encrypt each individual column value within the tuple separately, rather than encrypting the tuple as a whole. Then, let's additionally decide that we're not going to encrypt all the columns, but just the ones the user says are sensitive. Now I think we've pretty much reached the other end of the continuum, unless someone is going to propose something like encrypting only part of each column, or storing some unencrypted data along with each encrypted column that is somehow dependent on the column contents. I think it is undeniable that every step along that continuum has weakened security in some way. The worst case scenario for an attacker must be that the entire disk is encrypted and they can gain no meaningful information at all without having to break that encryption. As the encryption touches fewer things, it becomes easier and easier to make inferences about the unseen data based on the data that you can see. One can sit around and argue about whether the amount of information that is leaked at any given step is enough for anyone to care, but to some extent that's an opinion question where any position can be defended by someone. I would argue that even leaking the lengths of the files is not great at all. Imagine that the table is scheduled_nuclear_missile_launches. I definitely do not want my adversaries to know even as much as whether that table is zero-length or non-zero-length. In fact I would prefer that they be unable to infer that I have such a table at all. Back in 2019 I constructed a similar example for how access to pg_clog could leak meaningful information: http://postgr.es/m/CA+TgmoZhbeYmRoAccJ1oCN03Jz2Uak18QN4afx4WD7g+j7SVcQ@mail.gmail.com Now, obviously, anyone can debate how realistic such cases are, but they definitely exist. If you can read btpo_prev, btpo_next, btpo_level, and btpo_flags for every page in the btree, you can probably infer some things about the distribution of keys in the table -- especially if you can read all the pages at time T1 and then read them all again later at time T2 (and maybe further times T3..Tn). You can make inferences about which parts of the keyspace are receiving new index insertions and which are not. If that's the index on the current_secret_missions.country_code column, well then that sucks. Your adversary may be able to infer where in the world your secret organization is operating and round up all your agents.
Now, I do realize that if we're ever going to get TDE in PostgreSQL, we will probably have to make some compromises. Actually concealing file lengths would require a redesign of the entire storage system, and so is probably not practical in the short term. Concealing SLRU contents would require significant changes too, some of which I think are things Thomas wants to do anyway, but we might have to punt that goal for the first version of a TDE feature, too. Surely, that weakens security, but if it gets us to a feature that some people can use before the heat death of the universe, there's a reasonable argument that that's better than nothing. Still, conceding that we may not realistically be able to conceal all the information in v1 is different from arguing that concealing it isn't desirable, and I think the latter argument is pretty hard to defend. People who want to break into computers have gotten incredibly good at exploiting incredibly subtle bits of information in order to infer the contents of unseen data. https://en.wikipedia.org/wiki/Spectre_(security_vulnerability) is a good example: somebody figured out that the branch prediction hardware could initiate speculative accesses to RAM that the user doesn't actually have permission to read, and thus a JavaScript program running in your browser can read out the entire contents of RAM by measuring exactly how long mis-predicted code takes to execute. There's got to be at least one chip designer out there somewhere who was involved in the design of that branch prediction system, knew that it didn't perform the permissions checks before accessing RAM, and thought to themselves "that should be ok - what's the worst thing that can happen?". I imagine that (those) chip designer(s) had a really bad day when they found out someone had written a program to use that information leakage to read out the entire contents of RAM ... not even using C, but using JavaScript running inside a browser! That's only an example, but I think it's pretty typical of how these sorts of things go. I believe computer security literature is literally riddled with attacks where the exposure of seemingly-innocent information turned out to be a big problem. I don't think the information exposed in the btree special space is very innocent: it's not the keys themselves, but if you have the contents of every btree special space in the btree there are definitely cases where you can draw inferences from that information. > I also expect only a small benefit. But that isn't a particularly > important factor in my mind. > > Let's suppose that it turns out to be significantly more useful than > we originally expected, for whatever reason. Assuming all that, what > else can be said about it now? Isn't it now *relatively* likely that > including that status bit metadata will be *extremely* valuable, and > not merely somewhat more valuable? This is too hypothetical for me to have an intelligent opinion. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 14, 2022 at 01:42:55PM -0400, Robert Haas wrote: > Hmm, but on the other hand, if you imagine a scenario in which the > "storage system extra blob" is actually a nonce for TDE, you need to > be able to find it before you've decrypted the rest of the page. If > pd_checksum gives you the offset of that data, you need to exclude it > from what gets encrypted, which means that you need to encrypt three > separate non-contiguous areas of the page whose combined size is > unlikely to be a multiple of the encryption algorithm's block size. > That kind of sucks (and putting it at the end of the page makes it way > better). I continue to believe that a nonce is not needed for XTS encryption mode, and that adding a tamper-detection GCM hash is of limited usefulness, since malicious writes can be done to other critical files and can be used to find the cluster or encryption keys. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
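For context on Bruce's point about XTS: in XTS mode the per-block tweak is derived from information the server already has, such as the block number, rather than stored on the page, which is why no extra page space is needed. A minimal sketch using OpenSSL's EVP interface follows; the tweak derivation here is illustrative only, not the scheme from any actual TDE patch, and error handling is omitted:

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

/* Encrypt one page image with AES-256-XTS; returns ciphertext length. */
int
encrypt_block_xts(const uint8_t key[64],          /* XTS uses a double-length key */
                  uint32_t blkno,
                  const uint8_t *plain, int len,  /* XTS requires len >= 16 */
                  uint8_t *cipher)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    uint8_t     tweak[16] = {0};
    int         outlen = 0, tmplen = 0;

    /* Derive the tweak from the block number -- nothing stored on the page. */
    memcpy(tweak, &blkno, sizeof(blkno));

    EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak);
    EVP_EncryptUpdate(ctx, cipher, &outlen, plain, len);
    EVP_EncryptFinal_ex(ctx, cipher + outlen, &tmplen);
    EVP_CIPHER_CTX_free(ctx);

    return outlen + tmplen;
}

The trade-off Bruce is weighing against GCM is visible here: XTS provides confidentiality with no per-page nonce or verifier to store, but it also produces no integrity tag, so it offers no tamper detection.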