Key management with tests

From: Bruce Momjian
I have completed the key management patch with tests created by Stephen
Frost.  Original patch by Masahiko Sawada.  It requires the hex
reorganization patch first.  The key patch is now 2.1MB because of the
tests, so attaching it here seems unwise:

    https://github.com/postgres/postgres/compare/master...bmomjian:hex.diff
    https://github.com/postgres/postgres/compare/master...bmomjian:key.diff

I will add it to the commitfest.  I think we need to figure out how much
of the tests we want to add.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From: Bruce Momjian
On Thu, Dec 31, 2020 at 11:50:47PM -0500, Bruce Momjian wrote:
> I have completed the key management patch with tests created by Stephen
> Frost.  Original patch by Masahiko Sawada.  It requires the hex
> reorganization patch first.  The key patch is now 2.1MB because of the
> tests, so attaching it here seems unwise:
> 
>     https://github.com/postgres/postgres/compare/master...bmomjian:hex.diff
>     https://github.com/postgres/postgres/compare/master...bmomjian:key.diff
> 
> I will add it to the commitfest.  I think we need to figure out how much
> of the tests we want to add.

I am getting regression test errors using OpenSSL 1.1.1d  10 Sep 2019
with zero-length input data (no -p), while Stephen is able to get those
tests to pass.  This needs more research, plus I think we need
higher-level tests.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From: Bruce Momjian
On Fri, Jan  1, 2021 at 01:07:50AM -0500, Bruce Momjian wrote:
> On Thu, Dec 31, 2020 at 11:50:47PM -0500, Bruce Momjian wrote:
> > I have completed the key management patch with tests created by Stephen
> > Frost.  Original patch by Masahiko Sawada.  It requires the hex
> > reorganization patch first.  The key patch is now 2.1MB because of the
> > tests, so attaching it here seems unwise:
> > 
> >     https://github.com/postgres/postgres/compare/master...bmomjian:hex.diff
> >     https://github.com/postgres/postgres/compare/master...bmomjian:key.diff
> > 
> > I will add it to the commitfest.  I think we need to figure out how much
> > of the tests we want to add.
> 
> I am getting regression test errors using OpenSSL 1.1.1d  10 Sep 2019
> with zero-length input data (no -p), while Stephen is able to get those
> tests to pass.  This needs more research, plus I think we need
> higher-level tests.

I have found the cause of the failure, which I added as a C comment:

    /*
     * OpenSSL 1.1.1d and earlier crashes on some zero-length plaintext
     * and ciphertext strings.  It crashes on an encryption call to
     * EVP_EncryptFinal_ex() in GCM mode on zero-length strings if
     * plaintext is NULL, even though plaintext_len is zero.  Setting
     * plaintext to non-NULL allows it to work.  In KW/KWP mode,
     * zero-length strings fail if plaintext_len = 0 and plaintext is
     * non-NULL (the opposite).  OpenSSL 1.1.1e+ is fine with all options.
     */
    else if (cipher == PG_CIPHER_AES_GCM)
    {
        plaintext_len = 0;
        plaintext = pg_malloc0(1);
    }
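
For illustration only, here is a minimal sketch (not code from the
patch) of the OpenSSL EVP call sequence involved, assuming AES-256-GCM,
a 12-byte IV, and a 16-byte tag; encrypt_empty() is a made-up name:

    #include <openssl/evp.h>

    /* Encrypt a zero-length buffer with AES-256-GCM; error checks omitted */
    static void
    encrypt_empty(const unsigned char *key, const unsigned char *iv,
                  unsigned char *tag)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        unsigned char   dummy[1] = {0};     /* non-NULL even for length 0 */
        unsigned char   out[16];
        int             len;

        EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);

        /*
         * On OpenSSL 1.1.1d and earlier, passing a NULL input here with
         * length 0 leads to the crash in EVP_EncryptFinal_ex() described
         * above, hence the pg_malloc0(1) workaround.
         */
        EVP_EncryptUpdate(ctx, out, &len, dummy, 0);
        EVP_EncryptFinal_ex(ctx, out + len, &len);
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
        EVP_CIPHER_CTX_free(ctx);
    }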

All the tests pass now.  The current src/test directory is 19MB, and
adding these tests takes it to 23MB, or a 20% increase.  That seems like
a lot.  It is testing 128-bit and 256-bit keys --- should we do fewer
tests, or just test 256, or use gzip to compress the tests by 50%? 
(Does every platform have gzip?)

My next step is to add the high-level tests.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From: Alvaro Herrera
On 2021-Jan-07, Bruce Momjian wrote:

> All the tests pass now.  The current src/test directory is 19MB, and
> adding these tests takes it to 23MB, or a 20% increase.  That seems like
> a lot.  It is testing 128-bit and 256-bit keys --- should we do fewer
> tests, or just test 256, or use gzip to compress the tests by 50%? 
> (Does every platform have gzip?)

So the tests are about 95% of the patch ... do we really need that many
tests?

-- 
Álvaro Herrera



Re: Key management with tests

From: Bruce Momjian
On Thu, Jan  7, 2021 at 04:08:49PM -0300, Álvaro Herrera wrote:
> On 2021-Jan-07, Bruce Momjian wrote:
> 
> > All the tests pass now.  The current src/test directory is 19MB, and
> > adding these tests takes it to 23MB, or a 20% increase.  That seems like
> > a lot.  It is testing 128-bit and 256-bit keys --- should we do fewer
> > tests, or just test 256, or use gzip to compress the tests by 50%? 
> > (Does every platform have gzip?)
> 
> So the tests are about 95% of the patch ... do we really need that many
> tests?

No, I don't think so.  Stephen imported the entire NIST test suite.  It
was so comprehensive that it detected several OpenSSL bugs for zero-length
strings, which I already reported, but we would never be encrypting
zero-length strings, so there wasn't a lot of value in it.

Anyway, I think we need to figure out how to trim.  The first part would
be to figure out whether we need 128 _and_ 256-bit tests, and then see
what items are really useful.  Stephen, do you have any ideas on that?
We currently have 10296 tests, and I think we could get away with 100.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From: Bruce Momjian
On Thu, Jan  7, 2021 at 10:02:14AM -0500, Bruce Momjian wrote:
> My next step is to add the high-level tests.

Here is the high-level script, and the log output.  I used the
pg_upgrade test.sh as a model.

It uses "CFE DEBUG" lines that are already in the code to compare the
initdb encryption with a later initdb decryption and with pg_ctl
decryption.  It was easier than I thought.

What it does not do is test the file descriptor passing from
/dev/tty, or the sample scripts.  This seems acceptable to me since I
test them and they rarely change.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From: Stephen Frost
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Thu, Jan  7, 2021 at 04:08:49PM -0300, Álvaro Herrera wrote:
> > On 2021-Jan-07, Bruce Momjian wrote:
> >
> > > All the tests pass now.  The current src/test directory is 19MB, and
> > > adding these tests takes it to 23MB, or a 20% increase.  That seems like
> > > a lot.  It is testing 128-bit and 256-bit keys --- should we do fewer
> > > tests, or just test 256, or use gzip to compress the tests by 50%?
> > > (Does every platform have gzip?)
> >
> > So the tests are about 95% of the patch ... do we really need that many
> > tests?
>
> No, I don't think so.  Stephen imported the entire NIST test suite.  It
> was so comprehensive that it detected several OpenSSL bugs for zero-length
> strings, which I already reported, but we would never be encrypting
> zero-length strings, so there wasn't a lot of value in it.

I ran the entire test suite locally to ensure everything worked, but I
didn't actually include all of it in the PR which you merged- I had
already reduced it quite a bit by removing all 'additional
authenticated data' test cases (which the tests will automatically skip
and which we haven't implemented support for in the common library
wrappers) and by removing the 192-bit cases.  This reduced the overall
test set by about two-thirds or so, as I recall.

> Anyway, I think we need to figure out how to trim.  The first part would
> be to figure out whether we need 128 _and_ 256-bit tests, and then see
> what items are really useful.  Stephen, do you have any ideas on that?
> We currently have 10296 tests, and I think we could get away with 100.

Yeah, it's probably still too much, but I don't have any particularly
justifiable suggestions as to exactly what we should remove or what we
should keep.

Perhaps it'd make sense to try and cover the cases that are more likely
to be issues between our wrapper functions and OpenSSL, and not stress
too much about constantly testing cases that should really be up to
OpenSSL.  As such, I'd propose:

- Add back in some 192-bit tests, so we cover all three bit lengths.
- Add back in some additional authenticated test cases, just to make
  sure that, until/unless we implement support, the test code properly
  skips over those.
- Keep tests for various length plaintext/ciphertext (including 0-byte
  cases, so we make sure those work, since they really should).
- Keep at least one test for each length of tag that's included in the
  test suite.

I'm not sure how many tests we'd end up with from that, but my swag /
gut feeling is that it'd probably be on the order of 100ish and a small
enough set that it won't dwarf the rest of the patch.

Would be nice if we had a way for some buildfarm animal or something to
pull in the entire suite and test it, imv..  If anyone wants to
volunteer, I'd be happy to explain how to make that happen (it's not
hard though- download/unzip the files, drop them in the directory,
update the test script to add all the files into the array).

Thanks,

Stephen

Attachment

Re: Key management with tests

From: Stephen Frost
Greetings Bruce,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Fri, Jan  1, 2021 at 01:07:50AM -0500, Bruce Momjian wrote:
> > On Thu, Dec 31, 2020 at 11:50:47PM -0500, Bruce Momjian wrote:
> > > I have completed the key management patch with tests created by Stephen
> > > Frost.  Original patch by Masahiko Sawada.  It requires the hex
> > > reorganization patch first.  The key patch is now 2.1MB because of the
> > > tests, so attaching it here seems unwise:
> > >
> > >     https://github.com/postgres/postgres/compare/master...bmomjian:hex.diff
> > >     https://github.com/postgres/postgres/compare/master...bmomjian:key.diff
> > >
> > > I will add it to the commitfest.  I think we need to figure out how much
> > > of the tests we want to add.
> >
> > I am getting regression test errors using OpenSSL 1.1.1d  10 Sep 2019
> > with zero-length input data (no -p), while Stephen is able to get those
> > tests to pass.  This needs more research, plus I think we need
> > higher-level tests.
>
> I have found the cause of the failure, which I added as a C comment:
>
>     /*
>      * OpenSSL 1.1.1d and earlier crashes on some zero-length plaintext
>      * and ciphertext strings.  It crashes on an encryption call to
>      * EVP_EncryptFinal_ex() in GCM mode on zero-length strings if
>      * plaintext is NULL, even though plaintext_len is zero.  Setting
>      * plaintext to non-NULL allows it to work.  In KW/KWP mode,
>      * zero-length strings fail if plaintext_len = 0 and plaintext is
>      * non-NULL (the opposite).  OpenSSL 1.1.1e+ is fine with all options.
>      */
>     else if (cipher == PG_CIPHER_AES_GCM)
>     {
>         plaintext_len = 0;
>         plaintext = pg_malloc0(1);
>     }
>
> All the tests pass now.  The current src/test directory is 19MB, and
> adding these tests takes it to 23MB, or a 20% increase.  That seems like
> a lot.  It is testing 128-bit and 256-bit keys --- should we do fewer
> tests, or just test 256, or use gzip to compress the tests by 50%?
> (Does every platform have gzip?)

Thanks a lot for working on this and figuring out what the issue was and
fixing it!  That's great that we got all those cases passing for you
too.

Thanks again,

Stephen

Attachment

Re: Key management with tests

From: Bruce Momjian
On Fri, Jan  8, 2021 at 03:33:44PM -0500, Stephen Frost wrote:
> > No, I don't think so.  Stephen imported the entire NIST test suite.  It
> > was so comprehensive that it detected several OpenSSL bugs for zero-length
> > strings, which I already reported, but we would never be encrypting
> > zero-length strings, so there wasn't a lot of value in it.
> 
> I ran the entire test suite locally to ensure everything worked, but I
> didn't actually include all of it in the PR which you merged- I had
> already reduced it quite a bit by removing all 'additional
> authenticated data' test cases (which the tests will automatically skip
> and which we haven't implemented support for in the common library
> wrappers) and by removing the 192-bit cases.  This reduced the overall
> test set by about two-thirds or so, as I recall.

Wow, so that was already reduced!

> > Anyway, I think we need to figure out how to trim.  The first part would
> > be to figure out whether we need 128 _and_ 256-bit tests, and then see
> > what items are really useful.  Stephen, do you have any ideas on that?
> > We currently have 10296 tests, and I think we could get away with 100.
> 
> Yeah, it's probably still too much, but I don't have any particularly
> justifiable suggestions as to exactly what we should remove or what we
> should keep.
> 
> Perhaps it'd make sense to try and cover the cases that are more likely
> to be issues between our wrapper functions and OpenSSL, and not stress
> too much about constantly testing cases that should really be up to
> OpenSSL.  As such, I'd propose:
> 
> - Add back in some 192-bit tests, so we cover all three bit lengths.
> - Add back in some additional authenticated test cases, just to make
>   sure that, until/unless we implement support, the test code properly
>   skips over those.
> - Keep tests for various length plaintext/ciphertext (including 0-byte
>   cases, so we make sure those work, since they really should).
> - Keep at least one test for each length of tag that's included in the
>   test suite.

Makes sense.  I did a simplistic trim-down to 90 tests but it was still
40% of the patch;  attached.  The hex strings are very long.

> I'm not sure how many tests we'd end up with from that, but my swag /
> gut feeling is that it'd probably be on the order of 100ish and a small
> enough set that it won't dwarf the rest of the patch.
> 
> Would be nice if we had a way for some buildfarm animal or something to
> pull in the entire suite and test it, imv..  If anyone wants to
> volunteer, I'd be happy to explain how to make that happen (it's not
> hard though- download/unzip the files, drop them in the directory,
> update the test script to add all the files into the array).

Yes, do we have a place to store more comprehensive tests outside of our
git tree?   Has this been done before?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From: Bruce Momjian
On Fri, Jan  8, 2021 at 03:34:23PM -0500, Stephen Frost wrote:
> > All the tests pass now.  The current src/test directory is 19MB, and
> > adding these tests takes it to 23MB, or a 20% increase.  That seems like
> > a lot.  It is testing 128-bit and 256-bit keys --- should we do fewer
> > tests, or just test 256, or use gzip to compress the tests by 50%? 
> > (Does every platform have gzip?)
> 
> Thanks a lot for working on this and figuring out what the issue was and
> fixing it!  That's great that we got all those cases passing for you
> too.

Yes, I was relieved.  The pattern of which modes fail on zero-length
strings is still very odd, but at least OpenSSL reports an error, so it
isn't returning incorrect data.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From: Stephen Frost
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Fri, Jan  8, 2021 at 03:33:44PM -0500, Stephen Frost wrote:
> > > Anyway, I think we need to figure out how to trim.  The first part would
> > > be to figure out whether we need 128 _and_ 256-bit tests, and then see
> > > what items are really useful.  Stephen, do you have any ideas on that?
> > > We currently have 10296 tests, and I think we could get away with 100.
> >
> > Yeah, it's probably still too much, but I don't have any particularly
> > justifiable suggestions as to exactly what we should remove or what we
> > should keep.
> >
> > Perhaps it'd make sense to try and cover the cases that are more likely
> > to be issues between our wrapper functions and OpenSSL, and not stress
> > too much about constantly testing cases that should really be up to
> > OpenSSL.  As such, I'd propose:
> >
> > - Add back in some 192-bit tests, so we cover all three bit lengths.
> > - Add back in some additional authenticated test cases, just to make
> >   sure that, until/unless we implement support, the test code properly
> >   skips over those.
> > - Keep tests for various length plaintext/ciphertext (including 0-byte
> >   cases, so we make sure those work, since they really should).
> > - Keep at least one test for each length of tag that's included in the
> >   test suite.
>
> Makes sense.  I did a simplistic trim-down to 90 tests but it was still
> 40% of the patch;  attached.  The hex strings are very long.

I don't think we actually need to stress over the size of the test data
relative to the size of the patch- it's not like it's all that much perl
code.  I can appreciate that we don't want to add megabytes worth of
test data to the git repo though.

> > I'm not sure how many tests we'd end up with from that, but my swag /
> > gut feeling is that it'd probably be on the order of 100ish and a small
> > enough set that it won't dwarf the rest of the patch.
> >
> > Would be nice if we had a way for some buildfarm animal or something to
> > pull in the entire suite and test it, imv..  If anyone wants to
> > volunteer, I'd be happy to explain how to make that happen (it's not
> > hard though- download/unzip the files, drop them in the directory,
> > update the test script to add all the files into the array).
>
> Yes, do we have a place to store more comprehensive tests outside of our
> git tree?   Has this been done before?

Not that I'm aware of.

Thanks,

Stephen

Attachment

Re: Key management with tests

From: Bruce Momjian
On Fri, Jan  1, 2021 at 01:07:50AM -0500, Bruce Momjian wrote:
> On Thu, Dec 31, 2020 at 11:50:47PM -0500, Bruce Momjian wrote:
> > I have completed the key management patch with tests created by Stephen
> > Frost.  Original patch by Masahiko Sawada.  It requires the hex
> > reorganization patch first.  The key patch is now 2.1MB because of the
> > tests, so attaching it here seems unwise:
> > 
> >     https://github.com/postgres/postgres/compare/master...bmomjian:hex.diff
> >     https://github.com/postgres/postgres/compare/master...bmomjian:key.diff
> > 
> > I will add it to the commitfest.  I think we need to figure out how much
> > of the tests we want to add.
> 
> I am getting regression test errors using OpenSSL 1.1.1d  10 Sep 2019
> with zero-length input data (no -p), while Stephen is able to get those
> tests to pass.  This needs more research, plus I think we need
> higher-level tests.

I know we are still working on the hex patch (dest-len) and the crypto
tests, but I wanted to post this so people can see where we are, and we
can get some current cfbot testing.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From: Bruce Momjian
On Sat, Jan  9, 2021 at 01:17:36PM -0500, Bruce Momjian wrote:
> On Fri, Jan  1, 2021 at 01:07:50AM -0500, Bruce Momjian wrote:
> > On Thu, Dec 31, 2020 at 11:50:47PM -0500, Bruce Momjian wrote:
> > > I have completed the key management patch with tests created by Stephen
> > > Frost.  Original patch by Masahiko Sawada.  It requires the hex
> > > reorganization patch first.  The key patch is now 2.1MB because of the
> > > tests, so attaching it here seems unwise:
> > > 
> > >     https://github.com/postgres/postgres/compare/master...bmomjian:hex.diff
> > >     https://github.com/postgres/postgres/compare/master...bmomjian:key.diff
> > > 
> > > I will add it to the commitfest.  I think we need to figure out how much
> > > of the tests we want to add.
> > 
> > I am getting regression test errors using OpenSSL 1.1.1d  10 Sep 2019
> > with zero-length input data (no -p), while Stephen is able to get those
> > tests to pass.  This needs more research, plus I think we need
> > higher-level tests.
> 
> I know we are still working on the hex patch (dest-len) and the crypto
> tests, but I wanted to post this so people can see where we are, and we
> can get some current cfbot testing.

Here is an updated version that covers all the possible
testing/configuration options.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From: Bruce Momjian
On Sat, Jan  9, 2021 at 08:08:16PM -0500, Bruce Momjian wrote:
> On Sat, Jan  9, 2021 at 01:17:36PM -0500, Bruce Momjian wrote:
> > I know we are still working on the hex patch (dest-len) and the crypto
> > tests, but I wanted to post this so people can see where we are, and we
> > can get some current cfbot testing.
> 
> Here is an updated version that covers all the possible
> testing/configuration options.

Does anyone know why the cfbot applied the patch listed second first
here?

    http://cfbot.cputube.org/patch_31_2925.log

Specifically, it applied hex..key.diff.gz before hex.diff.gz.  I assumed
it would apply attachments in the order they appear in the email.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From: Thomas Munro
On Sun, Jan 10, 2021 at 3:45 PM Bruce Momjian <bruce@momjian.us> wrote:
> Does anyone know why the cfbot applied the patch listed second first
> here?
>
>         http://cfbot.cputube.org/patch_31_2925.log
>
> Specifically, it applied hex..key.diff.gz before hex.diff.gz.  I assumed
> it would apply attachments in the order they appear in the email.

It sorts the filenames (in this case after the decompression step removes
the .gz endings).  That works pretty well for the patches that "git
format-patch" spits out, but it's a bit hit and miss with cases like
yours.



Re: Key management with tests

From: Bruce Momjian
On Sun, Jan 10, 2021 at 06:04:12PM +1300, Thomas Munro wrote:
> On Sun, Jan 10, 2021 at 3:45 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Does anyone know why the cfbot applied the patch listed second first
> > here?
> >
> >         http://cfbot.cputube.org/patch_31_2925.log
> >
> > Specifically, it applied hex..key.diff.gz before hex.diff.gz.  I assumed
> > it would apply attachments in the order they appear in the email.
> 
> It sorts the filenames (in this case after the decompression step removes
> the .gz endings).  That works pretty well for the patches that "git
> format-patch" spits out, but it's a bit hit and miss with cases like
> yours.

OK, here they are with numeric prefixes.  It was actually tricky to
figure out how to create a squashed format-patch based on another branch.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From: Masahiko Sawada
On Sun, Jan 10, 2021 at 11:51 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Sun, Jan 10, 2021 at 06:04:12PM +1300, Thomas Munro wrote:
> > On Sun, Jan 10, 2021 at 3:45 PM Bruce Momjian <bruce@momjian.us> wrote:
> > > Does anyone know why the cfbot applied the patch listed second first
> > > here?
> > >
> > >         http://cfbot.cputube.org/patch_31_2925.log
> > >
> > > Specifically, it applied hex..key.diff.gz before hex.diff.gz.  I assumed
> > > it would apply attachments in the order they appear in the email.
> >
> > It sorts the filenames (in this case after the decompression step removes
> > the .gz endings).  That works pretty well for the patches that "git
> > format-patch" spits out, but it's a bit hit and miss with cases like
> > yours.
>
> OK, here they are with numeric prefixes.  It was actually tricky to
> figure out how to create a squashed format-patch based on another branch.
>

Thank you for attaching the patches. They pass all the cfbot tests, great.

Looking at the patch, it supports three algorithms but only
PG_CIPHER_AES_KWP is used in the core for now:

+/*
+ * Supported symmetric encryption algorithm. These identifiers are passed
+ * to pg_cipher_ctx_create() function, and then actual encryption
+ * implementations need to initialize their context of the given encryption
+ * algorithm.
+ */
+#define PG_CIPHER_AES_GCM          0
+#define PG_CIPHER_AES_KW           1
+#define PG_CIPHER_AES_KWP          2
+#define PG_MAX_CIPHER_ID           3

Are we in the process of experimenting with which algorithms are better?
If we support only the one algorithm that is actually used in the core,
we could reduce the tests as well.

FWIW, I've written a PoC patch for buffer encryption to make sure the
kms patch would be workable with other components using the encryption
key managed by kmgr.

Overall it’s good. While the buffer encryption patch is still PoC
quality and there are some problems regarding nonce generation that we
need to deal with, it can easily use the relation key managed by the
kmgr to encrypt/decrypt buffers.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Attachment

Re: Key management with tests

From: Bruce Momjian
On Mon, Jan 11, 2021 at 08:12:00PM +0900, Masahiko Sawada wrote:
> On Sun, Jan 10, 2021 at 11:51 PM Bruce Momjian <bruce@momjian.us> wrote:
> > OK, here they are with numeric prefixes.  It was actually tricky to
> > figure out how to create a squashed format-patch based on another branch.
> 
> Thank you for attaching the patches. They pass all the cfbot tests, great.

Yeah, I saw that.  :-)  I had to learn a lot about how to create
squashed format-patches on non-master branches.  I have now automated it
so it will be easy going forward.

> Looking at the patch, it supports three algorithms but only
> PG_CIPHER_AES_KWP is used in the core for now:
> 
> +/*
> + * Supported symmetric encryption algorithm. These identifiers are passed
> + * to pg_cipher_ctx_create() function, and then actual encryption
> + * implementations need to initialize their context of the given encryption
> + * algorithm.
> + */
> +#define PG_CIPHER_AES_GCM          0
> +#define PG_CIPHER_AES_KW           1
> +#define PG_CIPHER_AES_KWP          2
> +#define PG_MAX_CIPHER_ID           3
> 
> Are we in the process of experimenting with which algorithms are better?
> If we support only the one algorithm that is actually used in the core,
> we could reduce the tests as well.

I think we are only using KWP (Key Wrap with Padding) because that is
for wrapping keys:

    https://csrc.nist.gov/CSRC/media/Projects/Cryptographic-Algorithm-Validation-Program/documents/mac/KWVS.pdf

I am not sure about KW.  I think we are using GCM for the WAL/heap/index
pages.  Stephen would know more.
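
For reference, a minimal sketch (plain OpenSSL, not the patch's wrapper
API; wrap_key() is a made-up name) of what KWP key wrapping looks like:

    #include <openssl/evp.h>

    /*
     * Wrap a data key under the key encryption key (KEK) using
     * AES-256-KWP (RFC 5649).  Error checks omitted; returns the
     * wrapped length, which is larger than datakey_len.
     */
    static int
    wrap_key(const unsigned char *kek,
             const unsigned char *datakey, int datakey_len,
             unsigned char *wrapped)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int             len,
                        total;

        /* EVP refuses the wrap modes unless this flag is set */
        EVP_CIPHER_CTX_set_flags(ctx, EVP_CIPHER_CTX_FLAG_WRAP_ALLOW);
        EVP_EncryptInit_ex(ctx, EVP_aes_256_wrap_pad(), NULL, kek, NULL);
        EVP_EncryptUpdate(ctx, wrapped, &len, datakey, datakey_len);
        total = len;
        EVP_EncryptFinal_ex(ctx, wrapped + len, &len);
        EVP_CIPHER_CTX_free(ctx);
        return total + len;
    }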

> FWIW, I've written a PoC patch for buffer encryption to make sure the
> kms patch would be workable with other components using the encryption
> key managed by kmgr.

Wow, it is a small patch --- nice.
 
-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From: Stephen Frost
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Mon, Jan 11, 2021 at 08:12:00PM +0900, Masahiko Sawada wrote:
> > Looking at the patch, it supports three algorithms but only
> > PG_CIPHER_AES_KWP is used in the core for now:
> >
> > +/*
> > + * Supported symmetric encryption algorithm. These identifiers are passed
> > + * to pg_cipher_ctx_create() function, and then actual encryption
> > + * implementations need to initialize their context of the given encryption
> > + * algorithm.
> > + */
> > +#define PG_CIPHER_AES_GCM          0
> > +#define PG_CIPHER_AES_KW           1
> > +#define PG_CIPHER_AES_KWP          2
> > +#define PG_MAX_CIPHER_ID           3
> >
> > Are we in the process of experimenting with which algorithms are better?
> > If we support only the one algorithm that is actually used in the core,
> > we could reduce the tests as well.
>
> I think we are only using KWP (Key Wrap with Padding) because that is
> for wrapping keys:
>
>     https://csrc.nist.gov/CSRC/media/Projects/Cryptographic-Algorithm-Validation-Program/documents/mac/KWVS.pdf

Yes.

> I am not sure about KW.  I think we are using GCM for the WAL/heap/index
> pages.  Stephen would know more.

KW was more-or-less 'for free' and there were tests for it, which is why
it was included.  Yes, GCM would be for WAL/heap/index pages, it
wouldn't be appropriate to use KW or KWP for that.  Using KW/KWP for the
key wrapping also makes the API simpler- and therefore easier for other
implementations to be written which provide the same API.

> > FWIW, I've written a PoC patch for buffer encryption to make sure the
> > kms patch would be workable with other components using the encryption
> > key managed by kmgr.
>
> Wow, it is a small patch --- nice.

I agree that the actual encryption patch, for just the main heap/index,
won't be too bad.  The larger part will be dealing with all of the
temporary files we create that have user data in them...  I've been
contemplating a way to try and make that part of the patch smaller
though and hopefully that will bear fruit and we can avoid having to
change a lot of, e.g., reorderbuffer.c and pgstat.c.

There are a few places where we need to be sure we are updating the LSN for
both logged and unlogged relations properly, including dealing with
things like the magic GIST "GistBuildLSN" fake-LSN too, and we will
absolutely need to have a bit used in the IV to distinguish if it's a
real LSN or an unlogged LSN.

Although, another approach and one that I've discussed a bit with Bruce,
is to have more keys- such as a key for temporary files, and perhaps
even a key for logged relations and a different one for unlogged..  Or
perhaps sets of keys for each which automatically are rotating every X
number of GB based on the LSN...  Which is a big part of why key
management is such an important part of this effort.

Thanks,

Stephen

Attachment

Re: Key management with tests

From: Bruce Momjian
On Mon, Jan 11, 2021 at 12:54:49PM -0500, Stephen Frost wrote:
> Although, another approach and one that I've discussed a bit with Bruce,
> is to have more keys- such as a key for temporary files, and perhaps
> even a key for logged relations and a different one for unlogged..  Or

Yes, we have to make sure the nonce (computed as LSN/pageno) is never
reused, so if we have several LSN usage "spaces", they need different
data keys. 

> perhaps sets of keys for each which automatically are rotating every X
> number of GB based on the LSN...  Which is a big part of why key
> management is such an important part of this effort.

Yes, this would avoid the need to failover to a standby for data key
rotation.
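
To make that concrete, a hypothetical sketch (names and layout invented
for illustration; lookup_data_key() stands in for whatever cache the
kmgr ends up with) of picking a data key by usage "space" and LSN range:

    #include "postgres.h"
    #include "access/xlogdefs.h"

    #define KEY_ROTATION_BYTES  ((uint64) 16 * 1024 * 1024 * 1024) /* e.g. 16GB */

    /* One key set per LSN usage "space", rotated per LSN range */
    typedef enum KeySpace
    {
        KEY_SPACE_LOGGED_REL,   /* pages of logged relations */
        KEY_SPACE_UNLOGGED_REL, /* pages of unlogged relations */
        KEY_SPACE_TEMPFILE      /* temporary files */
    } KeySpace;

    extern const unsigned char *lookup_data_key(KeySpace space,
                                                uint64 range);  /* hypothetical */

    static const unsigned char *
    select_data_key(KeySpace space, XLogRecPtr lsn)
    {
        uint64  range = lsn / KEY_ROTATION_BYTES;

        return lookup_data_key(space, range);
    }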

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From: Stephen Frost
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Mon, Jan 11, 2021 at 12:54:49PM -0500, Stephen Frost wrote:
> > Although, another approach and one that I've discussed a bit with Bruce,
> > is to have more keys- such as a key for temporary files, and perhaps
> > even a key for logged relations and a different one for unlogged..  Or
>
> Yes, we have to make sure the nonce (computed as LSN/pageno) is never
> reused, so if we have several LSN usage "spaces", they need different
> data keys.

Right, or ensure that the actual IV used is distinct (such as by using
another bit in the IV to distinguish logged-vs-unlogged), but it seems
saner to just use a different key, ultimately.

> > perhaps sets of keys for each which automatically are rotating every X
> > number of GB based on the LSN...  Which is a big part of why key
> > management is such an important part of this effort.
>
> Yes, this would avoid the need to failover to a standby for data key
> rotation.

Yes, and it avoids the issue of using a single key for too much, which
is also a concern.  The remaining larger issues are to figure out a
place to put the tag for each page, and the relatively simple matter of
programming a mechanism to cache the keys we're commonly using (current
key for encryption, recently used keys for decryption) since we'll
eventually get to a point of having written out more data than we are
going to keep keys in memory for.
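
As a rough illustration of that caching idea (hypothetical names and
sizes, not from the patch), keyed by the start of each key's LSN range:

    #include "postgres.h"

    #define KEY_CACHE_SIZE 8

    typedef struct CachedKey
    {
        uint64          lsn_range;      /* start of the key's LSN range */
        unsigned char   key[32];        /* unwrapped data key */
        uint64          last_used;      /* for LRU eviction; 0 = empty */
    } CachedKey;

    static CachedKey key_cache[KEY_CACHE_SIZE];
    static uint64    use_counter = 0;

    extern void load_and_unwrap_key(uint64 lsn_range,
                                    unsigned char *key);    /* hypothetical */

    /* Return the cached key for an LSN range, loading it (and evicting
     * the least recently used slot) on a cache miss. */
    static CachedKey *
    get_cached_key(uint64 lsn_range)
    {
        CachedKey  *victim = &key_cache[0];
        int         i;

        for (i = 0; i < KEY_CACHE_SIZE; i++)
        {
            if (key_cache[i].last_used != 0 &&
                key_cache[i].lsn_range == lsn_range)
            {
                key_cache[i].last_used = ++use_counter;
                return &key_cache[i];
            }
            if (key_cache[i].last_used < victim->last_used)
                victim = &key_cache[i];
        }
        load_and_unwrap_key(lsn_range, victim->key);
        victim->lsn_range = lsn_range;
        victim->last_used = ++use_counter;
        return victim;
    }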

Thanks,

Stephen

Attachment

Re: Key management with tests

From: Bruce Momjian
On Mon, Jan 11, 2021 at 01:23:27PM -0500, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Mon, Jan 11, 2021 at 12:54:49PM -0500, Stephen Frost wrote:
> > > Although, another approach and one that I've discussed a bit with Bruce,
> > > is to have more keys- such as a key for temporary files, and perhaps
> > > even a key for logged relations and a different one for unlogged..  Or
> > 
> > Yes, we have to make sure the nonce (computed as LSN/pageno) is never
> > reused, so if we have several LSN usage "spaces", they need different
> > data keys. 
> 
> Right, or ensure that the actual IV used is distinct (such as by using
> another bit in the IV to distinguish logged-vs-unlogged), but it seems
> saner to just use a different key, ultimately.

Yes, we have eight unused bits in the nonce right now.

> > > perhaps sets of keys for each which automatically are rotating every X
> > > number of GB based on the LSN...  Which is a big part of why key
> > > management is such an important part of this effort.
> > 
> > Yes, this would avoid the need to failover to a standby for data key
> > rotation.
> 
> Yes, and it avoids the issue of using a single key for too much, which
> is also a concern.  The remaining larger issues are to figure out a
> place to put the tag for each page, and the relatively simple matter of
> programming a mechanism to cache the keys we're commonly using (current
> key for encryption, recently used keys for decryption) since we'll
> eventually get to a point of having written out more data than we are
> going to keep keys in memory for.

I thought the LSN range would be stored with the keys, so there is no
need to tag the LSN on each page.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From: Stephen Frost
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Mon, Jan 11, 2021 at 01:23:27PM -0500, Stephen Frost wrote:
> > Yes, and it avoids the issue of using a single key for too much, which
> > is also a concern.  The remaining larger issues are to figure out a
> > place to put the tag for each page, and the relatively simple matter of
> > programming a mechanism to cache the keys we're commonly using (current
> > key for encryption, recently used keys for decryption) since we'll
> > eventually get to a point of having written out more data than we are
> > going to keep keys in memory for.
>
> I thought the LSN range would be stored with the keys, so there is no
> need to tag the LSN on each page.

Yes, LSN range would be stored with the keys in some fashion (maybe just
the start of a particular LSN range would be in the filename of the key
for that range...).  The 'tag' that I'm referring to there is one of the
outputs from the GCM encryption and is what provides the integrity /
authentication of the encrypted data to be able to detect if it's been
modified.  Unfortunately, while the page checksum will continue to be
used and available for checking against disk corruption, it's not
sufficient.  Hence, ideally, we'd find a spot to stick the 128-bit tag
on each page.
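
To illustrate with plain OpenSSL (a sketch, not the patch's wrapper;
decrypt_page_gcm() is a made-up name): the tag produced at encryption
time is handed back before the final decrypt call, which then fails if
the page bytes were modified:

    #include "postgres.h"
    #include <openssl/evp.h>

    /* Returns true only if the page authenticates against the 16-byte
     * GCM tag stored with it; ctx is an initialized AES-GCM decrypt
     * context, and other error checks are omitted. */
    static bool
    decrypt_page_gcm(EVP_CIPHER_CTX *ctx, unsigned char *page, int len,
                     unsigned char *tag)
    {
        int     outlen;

        EVP_DecryptUpdate(ctx, page, &outlen, page, len);
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_TAG, 16, tag);

        /* EVP_DecryptFinal_ex() returns 0 when the tag check fails,
         * i.e. the encrypted data was modified. */
        return EVP_DecryptFinal_ex(ctx, page + outlen, &outlen) == 1;
    }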

Given that, clearly, it's not possible to go from an unencrypted cluster
to an encrypted cluster without rewriting the entire cluster, we aren't
bound to maintain the on-disk page format, we should be able to
accommodate including the tag somewhere.  Unfortunately, it doesn't seem
quite as trivial as I'd hoped since there are parts of the code which
make assumptions about the page beyond perhaps what they should be, but
I'm still hopeful that it won't be *too* hard to do.

Thanks,

Stephen

Attachment

Re: Key management with tests

From: Bruce Momjian
On Mon, Jan 11, 2021 at 02:19:22PM -0500, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Mon, Jan 11, 2021 at 01:23:27PM -0500, Stephen Frost wrote:
> > > Yes, and it avoids the issue of using a single key for too much, which
> > > is also a concern.  The remaining larger issues are to figure out a
> > > place to put the tag for each page, and the relatively simple matter of
> > > programming a mechanism to cache the keys we're commonly using (current
> > > key for encryption, recently used keys for decryption) since we'll
> > > eventually get to a point of having written out more data than we are
> > > going to keep keys in memory for.
> > 
> > I thought the LSN range would be stored with the keys, so there is no
> > need to tag the LSN on each page.
> 
> Yes, LSN range would be stored with the keys in some fashion (maybe just
> the start of a particular LSN range would be in the filename of the key
> for that range...).  The 'tag' that I'm referring to there is one of the

Oh, that tag, yes, we need to add that to each page.  I thought you meant
an LSN-range-key tag.

> outputs from the GCM encryption and is what provides the integrity /
> authentication of the encrypted data to be able to detect if it's been
> modified.  Unfortunately, while the page checksum will continue to be
> used and available for checking against disk corruption, it's not
> sufficient.  Hence, ideally, we'd find a spot to stick the 128-bit tag
> on each page.

Agreed.  Would checksums be of any value with GCM?

> Given that, clearly, it's not possible to go from an unencrypted cluster
> to an encrypted cluster without rewriting the entire cluster, we aren't
> bound to maintain the on-disk page format, we should be able to
> accommodate including the tag somewhere.  Unfortunately, it doesn't seem
> quite as trivial as I'd hoped since there are parts of the code which
> make assumptions about the page beyond perhaps what they should be, but
> I'm still hopeful that it won't be *too* hard to do.

OK, thanks.  Are there other page improvements we should make when we
are requiring a page rewrite?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From: Stephen Frost
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Mon, Jan 11, 2021 at 02:19:22PM -0500, Stephen Frost wrote:
> > outputs from the GCM encryption and is what provides the integrity /
> > authentication of the encrypted data to be able to detect if it's been
> > modified.  Unfortunately, while the page checksum will continue to be
> > used and available for checking against disk corruption, it's not
> > sufficient.  Hence, ideally, we'd find a spot to stick the 128-bit tag
> > on each page.
>
> Agreed.  Would checksums be of any value with GCM?

The value would be to allow testing of the database integrity, to the
extent allowed by the checksum, without having access to the encryption
keys- and besides, there's not much else we'd be using those bits for.

> > Given that, clearly, it's not possible to go from an unencrypted cluster
> > to an encrypted cluster without rewriting the entire cluster, we aren't
> > bound to maintain the on-disk page format, we should be able to
> > accommodate including the tag somewhere.  Unfortunately, it doesn't seem
> > quite as trivial as I'd hoped since there are parts of the code which
> > make assumptions about the page beyond perhaps what they should be, but
> > I'm still hopeful that it won't be *too* hard to do.
>
> OK, thanks.  Are there other page improvements we should make when we
> are requiring a page rewrite?

This is an interesting question but ultimately I don't think we should
be looking at this from the perspective of allowing arbitrary changes to
the page format.  The challenge is that much of the page format, today,
is defined by a C struct and changing the way that works would require a
great deal of code to be modified and turn this into a massive effort,
assuming we wish to have the same compiled binary able to work with both
unencrypted and encrypted clusters, which I do believe is a requirement.

The thought that I had was to, instead, try to figure out if we could
fudge some space by, say, putting a 128-bit 'hole' at the end of the
page and just move pd_special back, effectively making the page seem
'smaller' to all of the code that uses it, except for the code that
knows how to do the decryption.  I ran into some trouble with that but
haven't quite sorted out what happened yet.  Other ideas would be to put
it before pd_special, or maybe somewhere else, but a lot depends on the
code's expectations.
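
A hypothetical sketch of that "hole" idea (PageGetGCMTag() is invented;
PageInit() is the existing function, here told the page is 16 bytes
smaller- whether the rest of the page code tolerates that is exactly
the unresolved part):

    #define PG_GCM_TAG_LEN  16

    /* The tag would live in the last 16 bytes of the physical page... */
    #define PageGetGCMTag(page) \
        ((unsigned char *) (page) + BLCKSZ - PG_GCM_TAG_LEN)

    /* ...while the layout code sees a smaller page, keeping pd_special
     * (and everything above it) in front of the hole, e.g.: */
    PageInit(page, BLCKSZ - PG_GCM_TAG_LEN, specialSize);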

Thanks,

Stephen

Attachment

Re: Key management with tests

From: Masahiko Sawada
On Tue, Jan 12, 2021 at 3:23 AM Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Mon, Jan 11, 2021 at 12:54:49PM -0500, Stephen Frost wrote:
> > > Although, another approach and one that I've discussed a bit with Bruce,
> > > is to have more keys- such as a key for temporary files, and perhaps
> > > even a key for logged relations and a different one for unlogged..  Or
> >
> > Yes, we have to make sure the nonce (computed as LSN/pageno) is never
> > reused, so if we have several LSN usage "spaces", they need different
> > data keys.
>
> Right, or ensure that the actual IV used is distinct (such as by using
> another bit in the IV to distinguish logged-vs-unlogged), but it seems
> saner to just use a different key, ultimately.

Agreed.

I think we also need to consider how to make sure the nonce is unique
when making a page dirty by updating hint bits. A hint bit update
changes the page contents but doesn't change the page LSN if we have
already written a full-page image. In the PoC patch, I logged a dummy
WAL record (XLOG_NOOP) just to move the page LSN forward, but since
this is required even when the change is not the first one to the page
since the last checkpoint, we might end up logging too many dummy WAL
records.

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/



Re: Key management with tests

From: Bruce Momjian
On Tue, Jan 12, 2021 at 09:32:54AM +0900, Masahiko Sawada wrote:
> On Tue, Jan 12, 2021 at 3:23 AM Stephen Frost <sfrost@snowman.net> wrote:
> > Right, or ensure that the actual IV used is distinct (such as by using
> > another bit in the IV to distinguish logged-vs-unlogged), but it seems
> > saner to just use a different key, ultimately.
> 
> Agreed.
> 
> I think we also need to consider how to make sure the nonce is unique
> when making a page dirty by updating hint bits. A hint bit update
> changes the page contents but doesn't change the page LSN if we have
> already written a full-page image. In the PoC patch, I logged a dummy
> WAL record (XLOG_NOOP) just to move the page LSN forward, but since
> this is required even when the change is not the first one to the page
> since the last checkpoint, we might end up logging too many dummy WAL
> records.

This says:

    https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements

    wal_log_hints will be enabled automatically in encryption mode. 

Does that help?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From: Neil Chen
Hi Stephen,

On Tue, Jan 12, 2021 at 10:47 AM Stephen Frost <sfrost@snowman.net> wrote:

> This is an interesting question but ultimately I don't think we should
> be looking at this from the perspective of allowing arbitrary changes to
> the page format.  The challenge is that much of the page format, today,
> is defined by a C struct and changing the way that works would require a
> great deal of code to be modified and turn this into a massive effort,
> assuming we wish to have the same compiled binary able to work with both
> unencrypted and encrypted clusters, which I do believe is a requirement.
>
> The thought that I had was to, instead, try to figure out if we could
> fudge some space by, say, putting a 128-bit 'hole' at the end of the
> page and just move pd_special back, effectively making the page seem
> 'smaller' to all of the code that uses it, except for the code that
> knows how to do the decryption.  I ran into some trouble with that but
> haven't quite sorted out what happened yet.  Other ideas would be to put
> it before pd_special, or maybe somewhere else, but a lot depends on the
> code's expectations.


I agree that we should not make too many changes that affect the use of
unencrypted clusters. But as a personal opinion only, I don't think it's
a good idea to add "implicit" tricks. As one source of inspiration, could
we add a flag to mark whether the page format has been changed:

--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -181,8 +185,9 @@ typedef PageHeaderData *PageHeader;
 #define PD_PAGE_FULL 0x0002 /* not enough free space for new tuple? */
 #define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
  * everyone */
+#define PD_PAGE_ENCRYPTED 0x0008 /* Is page encrypted? */
 
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
 
 /*
  * Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -389,6 +394,13 @@ PageValidateSpecialPointer(Page page)
 #define PageClearAllVisible(page) \
  (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
 
+#define PageIsEncrypted(page) \
+ (((PageHeader) (page))->pd_flags & PD_PAGE_ENCRYPTED)
+#define PageSetEncrypted(page) \
+ (((PageHeader) (page))->pd_flags |= PD_PAGE_ENCRYPTED)
+#define PageClearEncrypted(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_PAGE_ENCRYPTED)
+
 #define PageIsPrunable(page, oldestxmin) \
 ( \
  AssertMacro(TransactionIdIsNormal(oldestxmin)), \

In this way, I think it has little effect on the unencrypted cluster, and
we can also modify the page format as we wish. Of course, it's also
possible that I didn't understand your design correctly, or there's
something wrong with my idea. :D

--
There is no royal road to learning.
HighGo Software Co.

Re: Key management with tests

From: Masahiko Sawada
On Tue, Jan 12, 2021 at 11:09 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Jan 12, 2021 at 09:32:54AM +0900, Masahiko Sawada wrote:
> > On Tue, Jan 12, 2021 at 3:23 AM Stephen Frost <sfrost@snowman.net> wrote:
> > > Right, or ensure that the actual IV used is distinct (such as by using
> > > another bit in the IV to distinguish logged-vs-unlogged), but it seems
> > > saner to just use a different key, ultimately.
> >
> > Agreed.
> >
> > I think we also need to consider how to make sure the nonce is unique
> > when making a page dirty by updating hint bits. A hint bit update
> > changes the page contents but doesn't change the page LSN if we have
> > already written a full-page image. In the PoC patch, I logged a dummy
> > WAL record (XLOG_NOOP) just to move the page LSN forward, but since
> > this is required even when the change is not the first one to the page
> > since the last checkpoint, we might end up logging too many dummy WAL
> > records.
>
> This says:
>
>         https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
>
>         wal_log_hints will be enabled automatically in encryption mode.
>
> Does that help?

IIUC it helps but not enough. When wal_log_hints is enabled, we write
a full-page image when updating hint bits if it's the first change to
the page since the last checkpoint. But I'm concerned about what
happens if we change hint bits again after the page is flushed. We
would mark the page as dirty but not write any WAL, leaving the page
LSN as it is.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/



Re: Key management with tests

From: Stephen Frost
Greetings,

* Neil Chen (carpenter.nail.cz@gmail.com) wrote:
> On Tue, Jan 12, 2021 at 10:47 AM Stephen Frost <sfrost@snowman.net> wrote:
> > This is an interesting question but ultimately I don't think we should
> > be looking at this from the perspective of allowing arbitrary changes to
> > the page format.  The challenge is that much of the page format, today,
> > is defined by a C struct and changing the way that works would require a
> > great deal of code to be modified and turn this into a massive effort,
> > assuming we wish to have the same compiled binary able to work with both
> > unencrypted and encrypted clusters, which I do believe is a requirement.
> >
> > The thought that I had was to, instead, try to figure out if we could
> > fudge some space by, say, putting a 128-bit 'hole' at the end of the
> > page and just move pd_special back, effectively making the page seem
> > 'smaller' to all of the code that uses it, except for the code that
> > knows how to do the decryption.  I ran into some trouble with that but
> > haven't quite sorted out what happened yet.  Other ideas would be to put
> > it before pd_special, or maybe somewhere else, but a lot depends on the
> > code's expectations.
>
> I agree that we should not make too many changes that affect the use of
> unencrypted clusters. But as a personal opinion only, I don't think it's
> a good idea to add "implicit" tricks. As one source of inspiration, could
> we add a flag to mark whether the page format has been changed:

Sure, of course we could add such a flag, but I don't see how that would
actually help with the issue?

> In this way, I think it has little effect on the unencrypted cluster, and
> we can also modify the page format as we wish. Of course, it's also
> possible that I didn't understand your design correctly, or there's
> something wrong with my idea. :D

No, we can't 'modify the page format as we wish'- if we change away from
using a C structure then we're going to be modifying quite a bit of
code which otherwise doesn't need to be changed.  The proposed flag
doesn't actually make a different page format work, the only thing it
would do would be to allow some parts of the cluster to be encrypted and
other parts not be, but I don't know that that's actually a useful
capability or a good reason to use one of those bits.  Having it handled
on a cluster level, at initdb time through pg_control, seems like it'd
work just fine.

Thanks,

Stephen

Attachment

Re: Key management with tests

From: Bruce Momjian
On Sun, Jan 10, 2021 at 09:51:16AM -0500, Bruce Momjian wrote:
> On Sun, Jan 10, 2021 at 06:04:12PM +1300, Thomas Munro wrote:
> > On Sun, Jan 10, 2021 at 3:45 PM Bruce Momjian <bruce@momjian.us> wrote:
> > > Does anyone know why the cfbot applied the patch listed second first
> > > here?
> > >
> > >         http://cfbot.cputube.org/patch_31_2925.log
> > >
> > > Specifically, it applied hex..key.diff.gz before hex.diff.gz.  I assumed
> > > it would apply attachments in the order they appear in the email.
> > 
> > It sorts the filenames (in this case after the decompression step removes
> > the .gz endings).  That works pretty well for the patches that "git
> > format-patch" spits out, but it's a bit hit and miss with cases like
> > yours.
> 
> OK, here they are with numeric prefixes.  It was actually tricky to
> figure out how to create a squashed format-patch based on another branch.

Here is an updated version built on top of Michael Paquier's patch
posted here:

    https://www.postgresql.org/message-id/X/0IChOPHd+aYC1w@paquier.xyz

and included as my first attachment.  This will give Michael's patch
cfbot testing too since the second attachment calls many of the first
attachment's functions.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From: Bruce Momjian
On Tue, Jan 12, 2021 at 09:40:53PM +0900, Masahiko Sawada wrote:
> > This says:
> >
> >         https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
> >
> >         wal_log_hints will be enabled automatically in encryption mode.
> >
> > Does that help?
> 
> IIUC it helps but not enough. When wal_log_hints is enabled, we write
> a full-page image when updating hint bits if it's the first change to
> the page since the last checkpoint. But I'm concerned about what
> happens if we change hint bits again after the page is flushed. We
> would mark the page as dirty but not write any WAL, leaving the page
> LSN as it is.

I updated the wiki to be:

    https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
    
    wal_log_hints will be enabled automatically in encryption mode. However,
    more than one hint bit change between checkpoints does not cause WAL
    activity, which would cause the same LSN to be used for different page
    images. 

I think one big question is, since we are using a stream cipher, do we
care about hint bit changes being visible to users?  I actually don't
know.  If we do, some kind of dummy LSN record might be required, as you
suggested.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From: Andres Freund
On 2021-01-12 13:03:14 -0500, Bruce Momjian wrote:
> I think one big question is, since we are using a stream cipher, do we
> care about hint bit changes being visible to users?  I actually don't
> know.  If we do, some kind of dummy LSN record might be required, as you
> suggested.

That'd lead to a *massive* increase of WAL record volume. It's one thing
to WAL log hint bit writes once per page per checkpoint. It's another to
do so on every single hint bit write.



Re: Key management with tests

From: Stephen Frost
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, Jan 12, 2021 at 09:40:53PM +0900, Masahiko Sawada wrote:
> > > This says:
> > >
> > >         https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
> > >
> > >         wal_log_hints will be enabled automatically in encryption mode.
> > >
> > > Does that help?
> >
> > IIUC it helps but not enough. When wal_log_hints is enabled, we write
> > a full-page image when updating hint bits if it's the first change to
> > the page since the last checkpoint. But I'm concerned about what
> > happens if we change hint bits again after the page is flushed. We
> > would mark the page as dirty but not write any WAL, leaving the page
> > LSN as it is.
>
> I updated the wiki to be:
>
>     https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
>
>     wal_log_hints will be enabled automatically in encryption mode. However,
>     more than one hint bit change between checkpoints does not cause WAL
>     activity, which would cause the same LSN to be used for different page
>     images.
>
> I think one big question is: since we are using a stream cipher, do we
> care about hint bit changes being visible to users?  I actually don't
> know.  If we do, some kind of dummy LSN record might be required, as you
> suggested.

I don't think there's any doubt that we need to make sure that the IV is
distinct and advancing the LSN to get a new one when needed for this
case seems like it's probably the way to do that.  Hint bit change
visibility to users isn't really at issue here- we can't use the same IV
multiple times.  The two options that we have are to either not actually
update the hint bit in such a case, or to make sure to change the
LSN/IV.  Another option would be to, if we're able to make a hole to put
the GCM tag on to the page somewhere, further widen that hole to include
an additional space for a counter that would be mixed into the IV, to
avoid having to do an XLOG NOOP.

Thanks,

Stephen

Attachment

Re: Key management with tests

From
Bruce Momjian
Date:
On Tue, Jan 12, 2021 at 01:11:29PM -0500, Stephen Frost wrote:
> > I think one big question is: since we are using a stream cipher, do we
> > care about hint bit changes being visible to users?  I actually don't
> > know.  If we do, some kind of dummy LSN record might be required, as you
> > suggested.
> 
> I don't think there's any doubt that we need to make sure that the IV is
> distinct and advancing the LSN to get a new one when needed for this
> case seems like it's probably the way to do that.  Hint bit change
> visibility to users isn't really at issue here- we can't use the same IV
> multiple times.  The two options that we have are to either not actually
> update the hint bit in such a case, or to make sure to change the
> LSN/IV.  Another option would be to, if we're able to make a hole to put
> the GCM tag on to the page somewhere, further widen that hole to include
> an additional space for a counter that would be mixed into the IV, to
> avoid having to do an XLOG NOOP.

Well, we have eight unused bits in the IV, so we could just increment
that for every hint bit change that uses the same LSN, and then force a
dummy WAL record when that 8-bit counter overflows --- that seems
simpler than logging hint bits.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, Jan 12, 2021 at 01:11:29PM -0500, Stephen Frost wrote:
> > > I think one big question is: since we are using a stream cipher, do we
> > > care about hint bit changes being visible to users?  I actually don't
> > > know.  If we do, some kind of dummy LSN record might be required, as you
> > > suggested.
> >
> > I don't think there's any doubt that we need to make sure that the IV is
> > distinct and advancing the LSN to get a new one when needed for this
> > case seems like it's probably the way to do that.  Hint bit change
> > visibility to users isn't really at issue here- we can't use the same IV
> > multiple times.  The two options that we have are to either not actually
> > update the hint bit in such a case, or to make sure to change the
> > LSN/IV.  Another option would be to, if we're able to make a hole to put
> > the GCM tag on to the page somewhere, further widen that hole to include
> > an additional space for a counter that would be mixed into the IV, to
> > avoid having to do an XLOG NOOP.
>
> Well, we have eight unused bits in the IV, so we could just increment
> that for every hint bit change that uses the same LSN, and then force a
> dummy WAL record when that 8-bit counter overflows --- that seems
> simpler than logging hint bits.

Sure, as long as we have a place to store that information..  We need to
have the full IV available when we go to decrypt the page.

Thanks,

Stephen

Attachment

Re: Key management with tests

From
Bruce Momjian
Date:
On Tue, Jan 12, 2021 at 01:15:44PM -0500, Bruce Momjian wrote:
> On Tue, Jan 12, 2021 at 01:11:29PM -0500, Stephen Frost wrote:
> > I don't think there's any doubt that we need to make sure that the IV is
> > distinct and advancing the LSN to get a new one when needed for this
> > case seems like it's probably the way to do that.  Hint bit change
> > visibility to users isn't really at issue here- we can't use the same IV
> > multiple times.  The two options that we have are to either not actually
> > update the hint bit in such a case, or to make sure to change the
> > LSN/IV.  Another option would be to, if we're able to make a hole to put
> > the GCM tag on to the page somewhere, further widen that hole to include
> > an additional space for a counter that would be mixed into the IV, to
> > avoid having to do an XLOG NOOP.
> 
> Well, we have eight unused bits in the IV, so we could just increment
> that for every hint bit change that uses the same LSN, and then force a
> dummy WAL record when that 8-bit counter overflows --- that seems
> simpler than logging hint bits.

Sorry, I was incorrect.  The IV is 16 bytes, made up of the LSN (8
bytes) and the page number (4 bytes).  That leaves 4 bytes unused, or
2^32 values for hint bit changes before we have to generate a dummy LSN
record.
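
Laid out as code, the IV construction being discussed might look like
this (a rough sketch; build_page_iv, the stand-in typedefs, and the
exact byte order are illustrative, not taken from the patch):

    #include <stdint.h>
    #include <string.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for the PG typedef */
    typedef uint32_t BlockNumber;   /* stand-in for the PG typedef */

    /* 16-byte IV: page LSN (8 bytes) + page number (4) + counter (4) */
    static void
    build_page_iv(uint8_t iv[16], XLogRecPtr lsn, BlockNumber blkno,
                  uint32_t counter)
    {
        memcpy(iv, &lsn, sizeof(lsn));
        memcpy(iv + 8, &blkno, sizeof(blkno));
        memcpy(iv + 12, &counter, sizeof(counter));
    }

The counter would stay at zero until a change reuses an existing LSN,
which is where the 2^32 spare values come in.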

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Bruce Momjian
Date:
On Tue, Jan 12, 2021 at 01:44:05PM -0500, Stephen Frost wrote:
> * Bruce Momjian (bruce@momjian.us) wrote:
> > Well, we have eight unused bits in the IV, so we could just increment
> > that for every hint bit change that uses the same LSN, and then force a
> > dummy WAL record when that 8-bit counter overflows --- that seems
> > simpler than logging hint bits.
> 
> Sure, as long as we have a place to store that information..  We need to
> have the full IV available when we go to decrypt the page.

Oh, yeah, we would need that counter recorded since previously the IV
was made up of already-recorded information, i.e., the page LSN and page
number.  However, the reason we don't always WAL-log hint bits is because
we can afford to lose them, but in this case, any counter we need to
store will need to be WAL-logged since we can't afford to lose that
counter value for decryption --- that gets us back to WAL-logging
something during hint bit changes.  :-(

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, Jan 12, 2021 at 01:44:05PM -0500, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > Well, we have eight unused bits in the IV, so we could just increment
> > > that for every hint bit change that uses the same LSN, and then force a
> > > dummy WAL record when that 8-bit counter overflows --- that seems
> > > simpler than logging hint bits.
> >
> > Sure, as long as we have a place to store that information..  We need to
> > have the full IV available when we go to decrypt the page.
>
> Oh, yeah, we would need that counter recorded since previously the IV
> was made up of already-recorded information, i.e., the page LSN and page
> number.  However, the reason we don't always WAL-log hint bits is because
> we can afford to lose them, but in this case, any counter we need to
> store will need to be WAL-logged since we can't afford to lose that
> counter value for decryption --- that gets us back to WAL-logging
> something during hint bit changes.  :-(

I don't think that's actually the case..?  The hole I'm talking about is
there exclusively for post-encryption storage of the tag (and maybe this
part of the IV) and would be zeroed out in the FPIs that actually go into
the WAL (which would be encrypted with the WAL key, not the data key).
All we would need to be confident of is that if the page with the hint
bit update gets encrypted and written out, the IV counter gets
incremented and also written out as part of that write.

Thanks,

Stephen

Attachment

Re: Key management with tests

From
Bruce Momjian
Date:
On Tue, Jan 12, 2021 at 01:57:11PM -0500, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Tue, Jan 12, 2021 at 01:44:05PM -0500, Stephen Frost wrote:
> > > * Bruce Momjian (bruce@momjian.us) wrote:
> > > > Well, we have eight unused bits in the IV, so we could just increment
> > > > that for every hint bit change that uses the same LSN, and then force a
> > > > dummy WAL record when that 8-bit counter overflows --- that seems
> > > > simpler than logging hint bits.
> > > 
> > > Sure, as long as we have a place to store that information..  We need to
> > > have the full IV available when we go to decrypt the page.
> > 
> > Oh, yeah, we would need that counter recorded since previously the IV
> > was made up of already-recorded information, i.e., the page LSN and page
> > number.  However, the reason we don't always WAL-log hint bits is because
> > we can afford to lose them, but in this case, any counter we need to
> > store will need to be WAL-logged since we can't afford to lose that
> > counter value for decryption --- that gets us back to WAL-logging
> > something during hint bit changes.  :-(
> 
> I don't think that's actually the case..?  The hole I'm talking about is
> there exclusively for post-encryption storage of the tag (and maybe this
> part of the IV) and would be zeroed out in the FPIs that actually go into
> the WAL (which would be encrypted with the WAL key, not the data key).
> All we would need to be confident of is that if the page with the hint
> bit update gets encrypted and written out, the IV counter gets
> incremented and also written out as part of that write.

OK, got it.  I have added this to the wiki:

    https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
    
    wal_log_hints will be enabled automatically in encryption mode. However,
    more than one hint bit change between checkpoints does not cause WAL
    activity, which would cause the same LSN to be used for different page
    images. This means we need a page-stored counter, to be used in the four
    unused bytes of the IV. This prevents multiple page writes during the
    same checkpoint interval from using the same IV. Counter changes do not
    need to be WAL logged since we either get the page from the WAL (which
    is only encrypted with the WAL data key), or from disk, which is
    durable. 
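
As a minimal sketch of the rule this describes (all names here are
hypothetical, not taken from the patch): every write-out that reuses
the same LSN must bump the page-stored counter first, so the (LSN,
page number, counter) IV never repeats.

    #include <stdint.h>

    typedef struct PageIvState
    {
        uint64_t lsn;       /* page LSN, already stored on the page */
        uint32_t counter;   /* would live in the widened page hole */
    } PageIvState;

    /* Call before encrypting a page image for write-out. */
    static void
    bump_iv_counter(PageIvState *st, uint64_t current_lsn)
    {
        if (st->lsn == current_lsn)
            st->counter++;          /* same LSN: counter keeps the IV fresh */
        else
        {
            st->lsn = current_lsn;  /* a new LSN is itself a fresh IV */
            st->counter = 0;
        }
    }

Because the counter is read back from the durable page, or the page is
restored from WAL, it needs no WAL logging of its own, which is the
point of the wiki text above.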

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Andres Freund
Date:
Hi,

On 2021-01-11 20:12:00 +0900, Masahiko Sawada wrote:

> diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
> index 32b5d62e1f..d474af753c 100644
> --- a/contrib/bloom/blinsert.c
> +++ b/contrib/bloom/blinsert.c
> @@ -177,6 +177,7 @@ blbuildempty(Relation index)
>       * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
>       * this even when wal_level=minimal.
>       */
> +    PageEncryptInplace(metapage, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO);
>      PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
>      smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
>                (char *) metapage, true);

There are quite a few places doing encryption + checksum + smgrwrite now. I
strongly suggest splitting that off into a helper routine in a
preparatory patch.
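
Such a helper might look something like this (a sketch only, with a
hypothetical name and signature, reusing the calls visible in the diff
above):

    static void
    PageEncryptChecksumAndWrite(SMgrRelation smgr, ForkNumber forknum,
                                BlockNumber blkno, Page page, bool skipFsync)
    {
        PageEncryptInplace(page, forknum, blkno);
        PageSetChecksumInplace(page, blkno);
        smgrwrite(smgr, forknum, blkno, (char *) page, skipFsync);
    }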


> @@ -528,6 +529,8 @@ BootstrapModeMain(void)
>  
>      InitPostgres(NULL, InvalidOid, NULL, InvalidOid, NULL, false);
>  
> +    InitializeBufferEncryption();
> +
>      /* Initialize stuff for bootstrap-file processing */
>      for (i = 0; i < MAXATTR; i++)
>      {

Why are we initializing this here instead of postmaster? As far as I can
tell that just leads to redundant work instead of doing it once?


> +/*-------------------------------------------------------------------------
> + * We use both page LSN and page number to create a nonce for each page. Page
> + * LSN is 8 bytes, page number is 4 bytes, and the maximum required counter
> + * for AES-CTR is 2048, which fits in 3 bytes. Since the length of the IV is
> + * 16 bytes, it's fine. Using the LSN and page number as part of the nonce has
> + * three benefits:
> + *
> + * 1. We don't need to decrypt/re-encrypt during CREATE DATABASE since the page
> + * contents are the same in both places, and once one database changes its pages,
> + * it gets a new LSN, and hence a new nonce.
> + * 2. For each change of an 8k page, we get a new nonce, so we are not encrypting
> + * different data with the same nonce/IV.
> + * 3. We avoid requiring pg_upgrade to preserve database oids, tablespace oids,
> + * relfilenodes.

I think 3) also has a few minor downsides - by not including information
identifying a relation a potential attacker with access to the data
directory has more chances to get the database to decrypt data by
e.g. switching relation files around.



> @@ -2792,12 +2793,15 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
>       */
>      bufBlock = BufHdrGetBlock(buf);
>  
> +    bufToWrite = PageEncryptCopy((Page) bufBlock, buf->tag.forkNum,
> +                                 buf->tag.blockNum);
> +
>      /*
>       * Update page checksum if desired.  Since we have only shared lock on the
>       * buffer, other processes might be updating hint bits in it, so we must
>       * copy the page to private storage if we do checksumming.
>       */
> -    bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
> +    bufToWrite = PageSetChecksumCopy((Page) bufToWrite, buf->tag.blockNum);
>  
>      if (track_io_timing)
>          INSTR_TIME_SET_CURRENT(io_start);

So now we copy the page twice, not just once, if both checksums and
encryption are enabled? That doesn't seem right.
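
A single-copy alternative, sketched purely for illustration (the helper
name is hypothetical; the static buffer mirrors how PageSetChecksumCopy
manages its own copy):

    static char *
    PageEncryptChecksumCopy(Page page, ForkNumber forknum, BlockNumber blkno)
    {
        static char *pageCopy = NULL;

        if (pageCopy == NULL)
            pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);

        memcpy(pageCopy, (char *) page, BLCKSZ);
        PageEncryptInplace((Page) pageCopy, forknum, blkno);
        PageSetChecksumInplace((Page) pageCopy, blkno);
        return pageCopy;
    }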


> @@ -3677,6 +3683,21 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
>          {
>              dirtied = true;        /* Means "will be dirtied by this action" */
>  
> +            /*
> +             * We will dirty the page, but the page LSN is not changed
> +             * if we don't write a backup block. We don't want to encrypt
> +             * different bit streams with the same combination of nonce
> +             * and key, since in buffer encryption the page LSN is part
> +             * of the nonce. Therefore we WAL-log a no-op record just to
> +             * move the page LSN forward if we don't write a backup block,
> +             * even when this is not the first modification in this
> +             * checkpoint round.
> +             */
> +            if (XLogRecPtrIsInvalid(lsn) && DataEncryptionEnabled())
> +            {
> +                lsn = log_noop();
> +                Assert(!XLogRecPtrIsInvalid(lsn));
> +            }
> +

Aren't you doing a WAL record while holding the buffer header lock here?
You can't do things like WAL insertions while holding a spinlock.


I don't see how it is safe / correct to use a noop record here. A noop
record isn't associated with the page, so WAL replay isn't going to
perform the same LSN modification.

Also, why is it OK to modify the LSN without, if necessary, logging an FPI?



> +char *
> +PageEncryptCopy(Page page, ForkNumber forknum, BlockNumber blkno)
> +{
> +    static char *pageCopy = NULL;
> +
> +    /* If we don't need encryption, just return the passed-in data */
> +    if (PageIsNew(page) || !PageNeedsToBeEncrypted(forknum))
> +        return (char *) page;

Why is it OK to not encrypt new pages?


> +#define PageEncryptOffset    offsetof(PageHeaderData, pd_special)
> +#define SizeOfPageEncryption (BLCKSZ - PageEncryptOffset)

I think you need a detailed explanation somewhere about what you're
doing here, and why it's a good idea.
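
For context, here is roughly what those two macros imply, assuming the
standard PageHeaderData field order (an inference, not documentation
from the patch); the struct is a simplified stand-in:

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define BLCKSZ 8192

    /* Simplified stand-in for PostgreSQL's PageHeaderData. */
    typedef struct
    {
        uint64_t pd_lsn;        /* stays plaintext: feeds the IV */
        uint16_t pd_checksum;   /* stays plaintext: checkable on disk */
        uint16_t pd_flags;
        uint16_t pd_lower;
        uint16_t pd_upper;
        uint16_t pd_special;    /* encryption starts at this offset */
    } PageHeaderSketch;

    #define PageEncryptOffset    offsetof(PageHeaderSketch, pd_special)
    #define SizeOfPageEncryption (BLCKSZ - PageEncryptOffset)

    int
    main(void)
    {
        /* First 16 header bytes stay plaintext; the rest is encrypted. */
        assert(PageEncryptOffset == 16);
        assert(SizeOfPageEncryption == BLCKSZ - 16);
        return 0;
    }

Keeping pd_lsn readable matters because the LSN is part of the IV
needed to decrypt the rest of the page.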

Greetings,

Andres Freund



Re: Key management with tests

From
Neil Chen
Date:
Thank you for your reply,

On Wed, Jan 13, 2021 at 12:08 AM Stephen Frost <sfrost@snowman.net> wrote:

No, we can't 'modify the page format as we wish'- if we change away from
using a C structure then we're going to be modifying quite a bit of
code which otherwise doesn't need to be changed.  The proposed flag
doesn't actually make a different page format work, the only thing it
would do would be to allow some parts of the cluster to be encrypted and
other parts not be, but I don't know that that's actually a useful
capability or a good reason to use one of those bits.  Having it handled
on a cluster level, at initdb time through pg_control, seems like it'd
work just fine.


Yes, I realized that for cluster-level encryption it would be unwise to
flag a single page (unless we want to do it at the relation level).
Forgive me for not describing it clearly: the 'modify the page' I
mentioned means the method you described, not modifying the C structure.
My original motivation was to avoid storing data in an unconventional
format that has no description in the C structure. However, as I just
said, it seems that we should not set the flag for a single page. Maybe
it's enough to just add a comment describing this?

Re: Key management with tests

From
Bruce Momjian
Date:
On Tue, Jan 12, 2021 at 01:46:53PM -0500, Bruce Momjian wrote:
> On Tue, Jan 12, 2021 at 01:15:44PM -0500, Bruce Momjian wrote:
> > Well, we have eight unused bits in the IV, so we could just increment
> > that for every hint bit change that uses the same LSN, and then force a
> > dummy WAL record when that 8-bit counter overflows --- that seems
> > simpler than logging hint bits.
> 
> Sorry, I was incorrect.  The IV is 16 bytes, made up of the LSN (8
> bytes) and the page number (4 bytes).  That leaves 4 bytes unused, or
> 2^32 values for hint bit changes before we have to generate a dummy LSN
> record.

I just did a massive update to the Transparent Data Encryption wiki page
to make it more readable and updated it with current decisions:

    https://wiki.postgresql.org/wiki/Transparent_Data_Encryption

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Bruce Momjian
Date:
On Tue, Jan 12, 2021 at 12:04:09PM -0500, Bruce Momjian wrote:
> On Sun, Jan 10, 2021 at 09:51:16AM -0500, Bruce Momjian wrote:
> > OK, here they are with numeric prefixes.  It was actually tricky to
> > figure out how to create a squashed format-patch based on another branch.
> 
> Here is an updated version built on top of Michael Paquier's patch
> posted here:
> 
>     https://www.postgresql.org/message-id/X/0IChOPHd+aYC1w@paquier.xyz
> 
> and included as my first attachment.  This will give Michael's patch
> cfbot testing too since the second attachment calls many of the first
> attachment's functions.

Now that Michael's hex encoding patch is committed, I am reposting my
key management patch without Michael's patch.  It is improved since the
mid-December version:

*  TAP tests for encrypt/decryption, wrapped key creation and decryption,
   and KEK rotation
*  built on top of new hex encoding functions in /common
*  passes cfbot testing
*  handles disabled OpenSSL library properly
*  handles Windows builds properly

I also learned a lot about format-patch, cfbot testing, and TAP tests.
:-)

It still can't test everything, like prompting from /dev/tty.  Also, if
we don't get data encryption into PG 14, we are going to need to hide
the user interface for some of this until it is useful.  Prompting from
/dev/tty for the TLS private key passphrase already works and will be a
useful PG 14 feature, so that part of the API will be visible in PG 14.

I am planning to apply this next week.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From
Robert Haas
Date:
On Fri, Jan 15, 2021 at 3:49 PM Bruce Momjian <bruce@momjian.us> wrote:
> I am planning to apply this next week.

I don't think that's appropriate. Several prominent community members
have told you that the patch, as committed the first time, needed a
lot more work. There hasn't been enough time between then and now for
you, or anyone, to do that amount of work. This patch needs detailed
and substantial review from senior community members, and multiple
rounds of feedback and improvement, before it should be considered for
commit.

I am not even sure there is a consensus on the design, without which
any commit is always premature.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Key management with tests

From
Bruce Momjian
Date:
On Fri, Jan 15, 2021 at 04:23:22PM -0500, Robert Haas wrote:
> On Fri, Jan 15, 2021 at 3:49 PM Bruce Momjian <bruce@momjian.us> wrote:
> > I am planning to apply this next week.
> 
> I don't think that's appropriate. Several prominent community members
> have told you that the patch, as committed the first time, needed a
> lot more work. There hasn't been enough time between then and now for
> you, or anyone, to do that amount of work. This patch needs detailed
> and substantial review from senior community members, and multiple
> rounds of feedback and improvement, before it should be considered for
> commit.
> 
> I am not even sure there is a consensus on the design, without which
> any commit is always premature.

If people want changes, I need to hear about it here.  I have addressed
everything people have mentioned in these threads so far.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Robert Haas
Date:
On Fri, Jan 15, 2021 at 4:47 PM Bruce Momjian <bruce@momjian.us> wrote:
> If people want changes, I need to hear about it here.  I have addressed
> everything people have mentioned in these threads so far.

That does not match my perception of the situation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Key management with tests

From
"David G. Johnston"
Date:
On Fri, Jan 15, 2021 at 2:59 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 15, 2021 at 4:47 PM Bruce Momjian <bruce@momjian.us> wrote:
> If people want changes, I need to hear about it here.  I have addressed
> everything people have mentioned in these threads so far.

That does not match my perception of the situation.


Looking at the Commitfest, there are three authors and no reviewers.
Given the previous incident, at minimum each of the people listed in the
Commitfest should add to this thread their approval to commit this
patch.  And while committers get some leeway, in this case having a
non-author review and sign off on it being ready to commit seems like it
should be required.

David J.

Re: Key management with tests

From
Bruce Momjian
Date:
On Fri, Jan 15, 2021 at 04:59:17PM -0500, Robert Haas wrote:
> On Fri, Jan 15, 2021 at 4:47 PM Bruce Momjian <bruce@momjian.us> wrote:
> > If people want changes, I need to hear about it here.  I have addressed
> > everything people have mentioned in these threads so far.
> 
> That does not match my perception of the situation.

Well, that's not very specific, is it?  You might be confusing the POC
data encryption patch that was posted in this thread with the key
management patch that I am working on.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Andres Freund
Date:
Hi,

On 2021-01-15 16:47:19 -0500, Bruce Momjian wrote:
> On Fri, Jan 15, 2021 at 04:23:22PM -0500, Robert Haas wrote:
> > On Fri, Jan 15, 2021 at 3:49 PM Bruce Momjian <bruce@momjian.us> wrote:
> > I don't think that's appropriate. Several prominent community members
> > have told you that the patch, as committed the first time, needed a
> > lot more work. There hasn't been enough time between then and now for
> > you, or anyone, to do that amount of work. This patch needs detailed
> > and substantial review from senior community members, and multiple
> > rounds of feedback and improvement, before it should be considered for
> > commit.
> >
> > I am not even sure there is a consensus on the design, without which
> > any commit is always premature.
>
> If people want changes, I need to hear about it here.  I have addressed
> everything people have mentioned in these threads so far.

I don't even know how anybody is supposed to realistically review the
design or the patch:

This thread started at
https://postgr.es/m/20210101045047.GB30966%40momjian.us - there's no
reference to any discussion of the design at all and the supposed links
to code are dead.

The last version of the code that I see posted ([1]) has the useless
commit message of "key squash commit" - nothing else. There's no design
documentation included in the patch either, as far as I can tell.

Manually searching for the topic brings me to
https://www.postgresql.org/message-id/20201202213814.GG20285%40momjian.us
, a thread of 52 messages, which provides a bit more context, but
largely just references another thread and a wiki article. The link to
the other thread is into the middle of a 112 message thread.

The wiki page doesn't really describe a design either. It has a very
long todo, a bunch of implementation details, but no design.

Nor did 978f869b99 include much in the way of design description.

You cannot expect anybody to review a patch if developing some basic
understanding of the intended design requires reading hundreds of
messages in which the design evolved. And I don't think it's acceptable
to push it due to lack of further feedback, given this situation - the
lack of design description is a blocker in itself.


There are a few things that stand out on a very, very brief scan:
- the patch badly needs to be split up into independently reviewable
  pieces
- tests:
  - wait, a .sh test script? No, we shouldn't add any more of those,
    they're a nightmare across platforms
  - Do the tests actually do anything useful? It's not clear to me what
    they are trying to achieve. En/Decrypting test vectors doesn't seem to
    buy that much?
  - the new pg_alterckey is completely untested
  - the pg_upgrade path is untested
  - ..
- Without further comment BootStrapKmgr() does "copy cluster file
  encryption keys from an old cluster?", but there's no explanation as
  to why / when that's the case. Presumably pg_upgrade, but, uh, explain
  that.

- pg_alterckey.c
  - appears to create its own cluster lock file, using its
    own routine for doing so. How does that lock file interact with the
    running server?
  - retrieve_cluster_keys() is missing (void).

I think this is at the very least a month away from being committable,
even if the design were completely correct (which I do not know, see
above).

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20210115204926.GD8740%40momjian.us



Re: Key management with tests

From
Bruce Momjian
Date:
On Fri, Jan 15, 2021 at 02:37:56PM -0800, Andres Freund wrote:
> On 2021-01-15 16:47:19 -0500, Bruce Momjian wrote:
> > > I am not even sure there is a consensus on the design, without which
> > > any commit is always premature.
> >
> > If people want changes, I need to hear about it here.  I have addressed
> > everything people have mentioned in these threads so far.
> 
> I don't even know how anybody is supposed to realistically review the
> design or the patch:
> 
> This thread started at
> https://postgr.es/m/20210101045047.GB30966%40momjian.us - there's no
> reference to any discussion of the design at all and the supposed links
> to code are dead.

You have to understand cryptography and Postgres internals to understand
the design, and I don't think it is realistic to explain that all to the
community.  We did much of this in voice calls over months because it
was too much of a burden to explain all the cryptographic details so
everyone could follow along.

> The last version of the code that I see posted ([1]) has the useless
> commit message of "key squash commit" - nothing else. There's no design
> documentation included in the patch either, as far as I can tell.
> 
> Manually searching for the topic brings me to
> https://www.postgresql.org/message-id/20201202213814.GG20285%40momjian.us
> , a thread of 52 messages, which provides a bit more context, but
> largely just references another thread and a wiki article. The link to
> the other thread is into the middle of a 112 message thread.
> 
> The wiki page doesn't really describe a design either. It has a very
> long todo, a bunch of implementation details, but no design.

I am not sure what design document you are requesting.  I thought the
TODO was that.

> Nor did 978f869b99 include much in the way of design description.
> 
> You cannot expect anybody to review a patch if developing some basic
> understanding of the intended design requires reading hundreds of
> messages in which the design evolved. And I don't think it's acceptable
> to push it due to lack of further feedback, given this situation - the
> lack of design description is a blocker in itself.

OK, I will just move on to something else then.  It is not worth the
feature to go into that kind of discussion again.  I am willing to have
voice calls with individuals to explain the logic, but repeatedly
explaining it to the entire group I find unproductive.  I don't think
another 400-email thread would help anyone.

> There are a few things that stand out on a very, very brief scan:
> - the patch badly needs to be split up into independently reviewable
>   pieces

I can do that, but there are enough complaints above that I feel it
would not be worth it.

> - tests:
>   - wait, a .sh test script? No, we shouldn't add any more of those,
> >     they're a nightmare across platforms

The script originated from pg_upgrade.  I don't know how to do things
like initdb and stuff another way, at least in our code.

>   - Do the tests actually do anything useful? It's not clear to me what
>     they are trying to achieve. En/Decrypting test vectors doesn't seem to
>     buy that much?

Uh, that's because the key manager doesn't do anything useful yet.

>   - the new pg_alterckey is completely untested

Wow, I was so excited testing the data keys that I forgot to add the
pg_alterckey tests.  My tests had that already.  I have added it to the
attached patch.

>   - the pg_upgrade path is untested

Uh, I was waiting until we were actually encrypting some data to test
that.

>   - ..
> - Without further comment BootStrapKmgr() does "copy cluster file
>   encryption keys from an old cluster?", but there's no explanation as
>   to why / when that's the case. Presumably pg_upgrade, but, uh, explain
>   that.

Uh, the heap/index files will, in the future, be encrypted with the keys of
the old cluster, so we just copy them to the new cluster and they keep
working.  Potentially we could replace the WAL key at that point since
we don't move WAL from the old cluster to the new one, but we also need
a command-line tool to do that, so I figured I would just wait for that
to be done.

> - pg_alterckey.c
>   - appears to create its own cluster lock file, using its
>     own routine for doing so. How does that lock file interact with the
>     running server?

pg_alterckey runs fine while the old cluster is running, which is why I
used a new lock file.  The keys are only read at db boot time.

>   - retrieve_cluster_keys() is missing (void).

Oops, fixed.

> I think this is at the very least a month away from being committable,
> even if the design were completely correct (which I do not know, see
> above).

Those comments were very helpful, and I could certainly use more
feedback on the patch.  Updated patch attached.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From
Andres Freund
Date:
Hi,

On 2021-01-15 19:21:32 -0500, Bruce Momjian wrote:
> On Fri, Jan 15, 2021 at 02:37:56PM -0800, Andres Freund wrote:
> > On 2021-01-15 16:47:19 -0500, Bruce Momjian wrote:
> > > > I am not even sure there is a consensus on the design, without which
> > > > any commit is always premature.
> > >
> > > If people want changes, I need to hear about it here.  I have addressed
> > > everything people have mentioned in these threads so far.
> > 
> > I don't even know how anybody is supposed to realistically review the
> > design or the patch:
> > 
> > This thread started at
> > https://postgr.es/m/20210101045047.GB30966%40momjian.us - there's no
> > reference to any discussion of the design at all and the supposed links
> > to code are dead.
> 
> You have to understand cryptography and Postgres internals to understand
> the design, and I don't think it is realistic to explain that all to the
> community.  We did much of this in voice calls over months because it
> was too much of a burden to explain all the cryptographic details so
> everyone could follow along.

I think that's not at all acceptable. I don't mind hashing out details
on calls / off-list, but the design needs to be public, documented, and
reviewable.  And if it's something the community can't understand, then
it can't get in. We're going to have to maintain this going forward.

I don't mean to say that we need to re-hash all design details from
scratch - but that there needs to be an explanation somewhere that
describes what's being done on a medium-high level, and what drove those
design decisions.


> > The last version of the code that I see posted ([1]) has the useless
> > commit message of "key squash commit" - nothing else. There's no design
> > documentation included in the patch either, as far as I can tell.
> > 
> > Manually searching for the topic brings me to
> > https://www.postgresql.org/message-id/20201202213814.GG20285%40momjian.us
> > , a thread of 52 messages, which provides a bit more context, but
> > largely just references another thread and a wiki article. The link to
> > the other thread is into the middle of a 112 message thread.
> > 
> > The wiki page doesn't really describe a design either. It has a very
> > long todo, a bunch of implementation details, but no design.
> 
> I am not sure what design document you are requesting.  I thought the
> TODO was that.

The TODO in https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
is a design document?



> > Nor did 978f869b99 include much in the way of design description.
> > 
> > You cannot expect anybody to review a patch if developing some basic
> > understanding of the intended design requires reading hundreds of
> > messages in which the design evolved. And I don't think it's acceptable
> > to push it due to lack of further feedback, given this situation - the
> > lack of design description is a blocker in itself.
> 
> OK, I will just move on to something else then.  It is not worth the
> feature to go into that kind of discussion again.  I am willing to have
> voice calls with individuals to explain the logic, but repeatedly
> explaining it to the entire group I find unproductive.  I don't think
> another 400-email thread would help anyone.

Explaining something over voice doesn't help with people in a year or
five trying to understand the code and the design, so they can adapt it
when making half-related changes. Nor do I see why another 400 email
thread would be a necessary consequence of you explaining the design
that you came up with.

This isn't specific to this topic? I don't really understand why this
specific feature gets to avoid normal community development processes?



> > - tests:
> >   - wait, a .sh test script? No, we shouldn't add any more of those,
> >     they're a nightmare across platforms
> 
> The script originated from pg_upgrade.  I don't know how to do things
> like initdb and stuff another way, at least in our code.

We have had perl tap tests for quite a while now? And all new tests that
aren't regression / isolation tests are expected to be written in it.

Greetings,

Andres Freund



Re: Key management with tests

From
Bruce Momjian
Date:
On Fri, Jan 15, 2021 at 04:56:24PM -0800, Andres Freund wrote:
> On 2021-01-15 19:21:32 -0500, Bruce Momjian wrote:
> > You have to understand cryptography and Postgres internals to understand
> > the design, and I don't think it is realistic to explain that all to the
> > community.  We did much of this in voice calls over months because it
> > was too much of a burden to explain all the cryptographic details so
> > everyone could follow along.
> 
> I think that's not at all acceptable. I don't mind hashing out details
> on calls / off-list, but the design needs to be public, documented, and
> reviewable.  And if it's something the community can't understand, then
> it can't get in. We're going to have to maintain this going forward.

OK, so we don't want it.  That's fine with me.

> I don't mean to say that we need to re-hash all design details from
> scratch - but that there needs to be an explanation somewhere that
> describes what's being done on a medium-high level, and what drove those
> design decisions.

I thought the TODO list was that, and the email threads.

> > > The wiki page doesn't really describe a design either. It has a very
> > > long todo, a bunch of implementation details, but no design.
> > 
> > I am not sure what design document you are requesting.  I thought the
> > TODO was that.
> 
> The TODO in https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
> is a design document?

Yes.

> > > Nor did 978f869b99 include much in the way of design description.
> > > 
> > > You cannot expect anybody to review a patch if developing some basic
> > > understanding of the intended design requires reading hundreds of
> > > messages in which the design evolved. And I don't think it's acceptable
> > > to push it due to lack of further feedback, given this situation - the
> > > lack of design description is a blocker in itself.
> > 
> > OK, I will just move on to something else then.  It is not worth the
> > feature to go into that kind of discussion again.  I am willing to have
> > voice calls with individuals to explain the logic, but repeatedly
> > explaining it to the entire group I find unproductive.  I don't think
> > another 400-email thread would help anyone.
> 
> Explaining something over voice doesn't help with people in a year or
> five trying to understand the code and the design, so they can adapt it
> when making half-related changes. Nor do I see why another 400 email
> thread would be a necessary consequence of you explaining the design
> that you came up with.

I have underestimated the amount of discussion this has required
repeatedly, and I don't want to make that mistake again.

> This isn't specific to this topic? I don't really understand why this
> specific feature gets to avoid normal community development processes?

What is being avoided?

> > > - tests:
> > >   - wait, a .sh test script? No, we shouldn't add any more of those,
> > >     they're a nightmare across platforms
> > 
> > The script originated from pg_upgrade.  I don't know how to do things
> > like initdb and stuff another way, at least in our code.
> 
> We have had perl tap tests for quite a while now? And all new tests that
> aren't regression / isolation tests are expected to be written in it.

What Perl tap tests run initdb and manage the cluster?  I didn't find
any.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Andres Freund
Date:
Hi,

On 2021-01-15 20:49:10 -0500, Bruce Momjian wrote:
> On Fri, Jan 15, 2021 at 04:56:24PM -0800, Andres Freund wrote:
> > On 2021-01-15 19:21:32 -0500, Bruce Momjian wrote:
> > > You have to understand cryptography and Postgres internals to understand
> > > the design, and I don't think it is realistic to explain that all to the
> > > community.  We did much of this in voice calls over months because it
> > > was too much of a burden to explain all the cryptographic details so
> > > everyone could follow along.
> > 
> > I think that's not at all acceptable. I don't mind hashing out details
> > on calls / off-list, but the design needs to be public, documented, and
> > reviewable.  And if it's something the community can't understand, then
> > it can't get in. We're going to have to maintain this going forward.
> 
> OK, so we don't want it.  That's fine with me.

That's not what I said...


> > This isn't specific to this topic? I don't really understand why this
> > specific feature gets to avoid normal community development processes?
> 
> What is being avoided?

You previously pushed a patch without tests; now you want to push a
patch that was barely reviewed and also doesn't contain an explanation
of the design. I mean:

> > > You have to understand cryptography and Postgres internals to understand
> > > the design, and I don't think it is realistic to explain that all to the
> > > community.  We did much of this in voice calls over months because it
> > > was too much of a burden to explain all the cryptographic details so
> > > everyone could follow along.

really is very far from the normal community process. Again, how is this
supposed to be maintained in the future, if it's based on a design
that's only understandable to the people on those phone calls?


> > We have had perl tap tests for quite a while now? And all new tests that
> > aren't regression / isolation tests are expected to be written in it.
> 
> What Perl tap tests run initdb and manage the cluster?  I didn't find
> any.

find . -name '*.pl'|xargs grep 'use PostgresNode;'

should give you a nearly complete list.

Greetings,

Andres Freund



Re: Key management with tests

From
Michael Paquier
Date:
On Fri, Jan 15, 2021 at 08:20:36PM -0800, Andres Freund wrote:
> On 2021-01-15 20:49:10 -0500, Bruce Momjian wrote:
>> What Perl tap tests run initdb and manage the cluster?  I didn't find
>> any.
>
> find . -name '*.pl'|xargs grep 'use PostgresNode;'
>
> should give you a nearly complete list.

Just to add that all the perl modules we use for the tests are within
src/test/perl/.  The coolest tests are within src/bin/ and src/test/.
--
Michael

Attachment

Re: Key management with tests

From
Tom Kincaid
Date:
> > > I think that's not at all acceptable. I don't mind hashing out details
> > > on calls / off-list, but the design needs to be public, documented, and
> > > reviewable.  And if it's something the community can't understand, then
> > > it can't get in. We're going to have to maintain this going forward.
> >
> > OK, so we don't want it.  That's fine with me.
>
> That's not what I said...
>


I think the majority of us believe that it is important we take this
first step towards a solid TDE implementation in PostgreSQL that is
built around the community process, which involves general consensus.

Before this feature falls into the "we will never do it because we
will never build consensus" category and community PostgreSQL
potentially gets locked out of more deployment scenarios that require
this feature I would like to see if I can help with this current
attempt at it. I will share that I am concerned that if the people who
have been involved in this to date can’t get this in, it will never
happen.

Admittedly I am a novice on this topic, and the majority of the
PostgreSQL source code, however I am hopeful enough (those of you who
know me understand that I suffer from eternal optimism) that I am
going to attempt to help.

Is there a design document for a Postgres feature of this size and
scope that people feel would serve as a good example? Alternatively,
is there a design document template that has been successfully used in
the past? I could guess based on things I have observed reading this
list for many years. However, if there is something that those who are
deeply involved in the development effort feel would suffice as an
example of a "good design document" or a "good design template"
sharing it would be greatly appreciated.



Re: Key management with tests

From
Amit Kapila
Date:
On Sun, Jan 17, 2021 at 5:38 AM Tom Kincaid <tomjohnkincaid@gmail.com> wrote:
>
> > > > I think that's not at all acceptable. I don't mind hashing out details
> > > > on calls / off-list, but the design needs to be public, documented, and
> > > > reviewable.  And if it's something the community can't understand, then
> > > > it can't get in. We're going to have to maintain this going forward.
> > >
> > > OK, so we don't want it.  That's fine with me.
> >
> > That's not what I said...
> >
>
>
>  I think the majority of us believe that it is important we take this
> first step towards a solid TDE implementation in PostgreSQL that is
> built around the community processes which involves general consensus.
>
> Before this feature falls into the “we will never do it because we
> will never build consensus" category and community PostgreSQL
> potentially gets locked out of more deployment scenarios that require
> this feature I would like to see if I can help with this current
> attempt at it. I will share that I am concerned that if the people who
> have been involved in this to date can’t get this in, it will never
> happen.
>
> Admittedly I am a novice on this topic, and the majority of the
> PostgreSQL source code, however I am hopeful enough (those of you who
> know me understand that I suffer from eternal optimism) that I am
> going to attempt to help.
>
> Is there a design document for a Postgres feature of this size and
> scope that people feel would serve as a good example? Alternatively,
> is there a design document template that has been successfully used in
> the past?
>

We normally write the design considerations and choices we made with
the reasons in README and code comments. Personally, I am not sure if
there is a need for any specific document per se, but a README and
detailed comments in the code should suffice for what people are worried
about here. It is mostly from the perspective that other developers
reading the code, wanting to fix bugs, or later enhancing that code
should be able to understand it. One recent example I can give is
Peter's work on bottom-up deletion [1] which I have read today where I
find that the design is captured via README, appropriate comments in
the code, and documentation. This feature is quite different and
probably a lot more new concepts are being introduced but I hope that
will give you some clue.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf

--
With Regards,
Amit Kapila.



Re: Key management with tests

From
Andres Freund
Date:
Hi,

On 2021-01-17 11:54:57 +0530, Amit Kapila wrote:
> On Sun, Jan 17, 2021 at 5:38 AM Tom Kincaid <tomjohnkincaid@gmail.com> wrote:
> > Admittedly I am a novice on this topic, and the majority of the
> > PostgreSQL source code, however I am hopeful enough (those of you who
> > know me understand that I suffer from eternal optimism) that I am
> > going to attempt to help.
> >
> > Is there a design document for a Postgres feature of this size and
> > scope that people feel would serve as a good example? Alternatively,
> > is there a design document template that has been successfully used in
> > the past?
> >
> 
> We normally write the design considerations and choices we made with
> the reasons in README and code comments. Personally, I am not sure if
> there is a need for any specific document per se, but a README and
> detailed comments in the code should suffice for what people are worried
> about here.

Right. It could be a README file, or a long comment at the start of one
of the files. It doesn't matter too much. What matters is that people
who haven't been on those phone calls can understand the design and the
implications it has.


> It is mostly from the perspective that other developers
> reading the code, wanting to fix bugs, or later enhancing that code
> should be able to understand it.

I'd add the perspective of code reviewers as well.


> One recent example I can give is
> Peter's work on bottom-up deletion [1] which I have read today where I
> find that the design is captured via README, appropriate comments in
> the code, and documentation. This feature is quite different and
> probably a lot more new concepts are being introduced but I hope that
> will give you some clue.
> 
> [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf

This is a great example.

Greetings,

Andres Freund



Re: Key management with tests

From
Robert Haas
Date:
On Fri, Jan 15, 2021 at 7:56 PM Andres Freund <andres@anarazel.de> wrote:
> I think that's not at all acceptable. I don't mind hashing out details
> on calls / off-list, but the design needs to be public, documented, and
> reviewable.  And if it's something the community can't understand, then
> it can't get in. We're going to have to maintain this going forward.

I agree. If the community is unable to clearly understand what
something is, and why we should have it, then we shouldn't have it --
even if the reason is that we're too dumb to understand, as Bruce
seems to be alleging. I don't really think I believe the theory that
community members by and large are too dumb to understand encryption.
Many features have provoked long and painful discussions about the
design and yet got into the tree in the end with documentation of that
design, and I don't see why that couldn't be done for this one, too. I
think it can and should, and the fact that the work hasn't been done
is one of several blockers for this patch. But even if I'm wrong, and
the real problem is that everyone except the select group of people on
these off-list phone calls are too stupid to understand this, then
that's still a reason not to accept the patch. The code that's in our
source tree is maintained by communal effort, and that means communal
understanding is important.

Frankly, it's more important in this particular case than in some
others. TDE is in great demand, so if it gets into the tree, it's
likely to get a lot of use. The preparatory patches, such as this one,
would at that point be getting a lot of use, too. That means many
people, not just hackers, will have to understand them and answer
questions about them. They are also likely to get a lot of scrutiny
from a security point of view, so we should have a way that we can be
confident that we know why we believe them to be secure. If a security
researcher shows up and says "your stuff is broken," we are not going
to get away with saying "no it isn't, because we discussed it on a Friday
call with a closed group of people and decided it was OK." Our
reasoning is going to have to be documented. That doesn't guarantee
that it will be correct, but makes it possible to distinguish between
defects in design, defects in particular parts of the code, and
non-defects, which is otherwise impossible. Meanwhile, even if
security researchers are as happy with our TDE implementation as they
could possibly be, a feature that changes the on-disk format can't
erase our ability to solve other problems with the database. Databases
using TDE are still going to have corruption, for example, but now a
corrupted page has a good chance of being completely unreadable rather
than just garbled. You certainly aren't going to be able to just run
pg_filedump on it. I think even if we do a great job explaining to
everybody what impact TDE and its preparatory patches are likely to
have on the system, there's likely to be a lot of cases where users
blame the database for eating their data when the real culprit is the
OS or the hardware, just because such cases are bound to get harder to
investigate, which could have a very negative effect on the
perceptions of PostgreSQL's quality. But if the TDE itself is magic
that only designated smart people on special calls can understand,
then it's going to be far worse, because that'll mean when any kind of
TDE problems comes up, nobody else can help debug anything.

While I would like to have TDE in PostgreSQL, I would not like to have
it on those terms.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Key management with tests

From
Bruce Momjian
Date:
On Sun, Jan 17, 2021 at 07:50:13PM -0500, Robert Haas wrote:
> On Fri, Jan 15, 2021 at 7:56 PM Andres Freund <andres@anarazel.de> wrote:
> > I think that's not at all acceptable. I don't mind hashing out details
> > on calls / off-list, but the design needs to be public, documented, and
> > reviewable.  And if it's something the community can't understand, then
> > it can't get in. We're going to have to maintain this going forward.
> 
> I agree. If the community is unable to clearly understand what
> something is, and why we should have it, then we shouldn't have it --
> even if the reason is that we're too dumb to understand, as Bruce

I am not sure why you are bringing intelligence into this discussion.
You have to understand Postgres internals and cryptography tradeoffs to
understand why some of the design decisions were made.  It is a
knowledge issue, not an intelligence issue.  The wiki page is the result
of those phone discussions.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Bruce Momjian
Date:
On Sun, Jan 17, 2021 at 11:54:57AM +0530, Amit Kapila wrote:
> > Is there a design document for a Postgres feature of this size and
> > scope that people feel would serve as a good example? Alternatively,
> > is there a design document template that has been successfully used in
> > the past?
> 
> We normally write the design considerations and choices we made with
> the reasons in README and code comments. Personally, I am not sure if
> there is a need for any specific document per se, but a README and
> detailed comments in the code should suffice for what people are worried
> about here. It is mostly from the perspective that other developers
> reading the code, wanting to fix bugs, or later enhancing that code
> should be able to understand it. One recent example I can give is
> Peter's work on bottom-up deletion [1] which I have read today where I
> find that the design is captured via README, appropriate comments in
> the code, and documentation. This feature is quite different and
> probably a lot more new concepts are being introduced but I hope that
> will give you some clue.
> 
> [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf

OK, I looked at that and it is good, and I see my patch is missing that.
Are people looking for me to take the wiki content, expand on it and tie
it to the code that will be applied, or something else like all the
various crypto options and why we chose what we did beyond what is
already on the wiki?  I can easily go from what we have on the wiki to
implementation code steps, but the other part is harder to explain and
that is why I offered to talk to people via voice.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Bruce Momjian
Date:
On Sat, Jan 16, 2021 at 10:58:47PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2021-01-17 11:54:57 +0530, Amit Kapila wrote:
> > On Sun, Jan 17, 2021 at 5:38 AM Tom Kincaid <tomjohnkincaid@gmail.com> wrote:
> > > Admittedly I am a novice on this topic, and the majority of the
> > > PostgreSQL source code, however I am hopeful enough (those of you who
> > > know me understand that I suffer from eternal optimism) that I am
> > > going to attempt to help.
> > >
> > > Is there a design document for a Postgres feature of this size and
> > > scope that people feel would serve as a good example? Alternatively,
> > > is there a design document template that has been successfully used in
> > > the past?
> > >
> > 
> > We normally write the design considerations and the choices we made,
> > with the reasons, in a README and code comments. Personally, I am not
> > sure if there is a need for any specific document per se, but a README
> > and detailed comments in the code should suffice to address what
> > people are worried about here.
> 
> Right. It could be a README file, or a long comment at a start of one of
> the files. It doesn't matter too much. What matters is that people that
> haven't been on those phone call can understand the design and the
> implications it has.

OK, so does the wiki page contain most of what you want, but is missing
the connection between the design and the code?

    https://wiki.postgresql.org/wiki/Transparent_Data_Encryption

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Bruce Momjian
Date:
On Mon, Jan 18, 2021 at 10:50:37AM -0500, Bruce Momjian wrote:
> OK, I looked at that and it is good, and I see my patch is missing that.
> Are people looking for me to take the wiki content, expand on it and tie
> it to the code that will be applied, or something else like all the
> various crypto options and why we chose what we did beyond what is
> already on the wiki?  I can easily go from what we have on the wiki to
> implementation code steps, but the other part is harder to explain and
> that is why I offered to talk to people via voice.

Just to clarify why voice calls can be helpful --- if you have to get
into "you have to understand X to understand Y", that's where a voice
call works best, because understanding X will require understanding
A/B/C, and everyone's missing pieces are different, so you have to
customize it for the individual.  

You can explain some of this in a README, but trying to cover all of it
leads to a combinatorial problem of trying to explain everything. 
Ideally the wiki page can be expanded so people can ask and answer all
posted issues, perhaps in a Q&A format.  Someone could go through the
archives and post why certain decisions were made, and link to the
original emails.

I have to admit I was kind of baffled that the wiki page wasn't
sufficient, because it is one of the longest Postgres feature
explanations I have seen, but I now think the missing part is tying
the wiki contents to the code implementation.  If that is it, please
confirm.  If it is something else, also explain.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Andres Freund
Date:
Hi,

On 2021-01-18 12:06:35 -0500, Bruce Momjian wrote:
> On Mon, Jan 18, 2021 at 10:50:37AM -0500, Bruce Momjian wrote:
> > OK, I looked at that and it is good, and I see my patch is missing that.
> > Are people looking for me to take the wiki content, expand on it and tie
> > it to the code that will be applied, or something else like all the
> > various crypto options and why we chose what we did beyond what is
> > already on the wiki?  I can easily go from what we have on the wiki to
> > implementation code steps, but the other part is harder to explain and
> > that is why I offered to talk to people via voice.
> 
> Just to clarify why voice calls can be helpful --- if you have to get
> into "you have to understand X to understand Y", that's where a voice
> call works best, because understanding X will require understanding
> A/B/C, and everyone's missing pieces are different, so you have to
> customize it for the individual.  

I don't think anybody argued against having voice calls.


> You can explain some of this in a README, but trying to cover all of it
> leads to a combinatorial problem of trying to explain everything. 
> Ideally the wiki page can be expanded so people can ask and answer all
> posted issues, perhaps in a Q&A format.  Someone could go through the
> archives and post why certain decisions were made, and link to the
> original emails.
> 
> I have to admit I was kind of baffled that the wiki page wasn't
> sufficient, because it is one of the longest Postgres feature
> explanations I have seen, but I now think the missing part is tying
> the wiki contents to the code implementation.  If that is it, please
> confirm.  If it is something else, also explain.

I don't think the wiki right now covers what's needed. The "Overview",
"Threat model" and "Scope of TDE" are a start, but beyond that it's
missing a bunch of things. And it's not in the source tree (we'll soon
have multiple versions of postgres with increasing levels of TDE
features; the wiki doesn't help with that).

Missing:
- talks about cluster wide encryption being simpler, without mentioning
  what it's being compared to, and what makes it simpler
- no differentiation from file system / block level encryption
- there's no explanation of which/why specific crypto primitives were
  chosen, what the tradeoffs are
- no explanation which keys exists, stored where
- the key management patch introduces new files, not documented
- there's new types of lock files, possibility of interrupted
  operations, ... - no documentation of what that means
- there's no documentation what "key wrapping" actually precisely is,
  what the danger of the two-tier model is, ...
- are there dangers in not encrypting zero pages etc?
- ...



Personally, but I admit that there's legitimate reasons to differ on
that note, I don't think it's reasonable for a feature this invasive to
commit preliminary patches without the major subsequent patches being in
a shape that allows reviewing the whole picture.

Greetings,

Andres Freund



Re: Key management with tests

From
Bruce Momjian
Date:
On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote:
> Personally, but I admit that there's legitimate reasons to differ on
> that note, I don't think it's reasonable for a feature this invasive to
> commit preliminary patches without the major subsequent patches being in
> a shape that allows reviewing the whole picture.

OK, if that is a requirement, I can't help anymore since there are
already complaints that the patch is too large to review, even if broken
into pieces.  Please let me know what the community decides.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Tom Kincaid
Date:
> > I have to admit I was kind of baffled that the wiki page wasn't
> > sufficient, because it is one of the longest Postgres feature
> > explanations I have seen, but I now think the missing part is tying
> > the wiki contents to the code implementation.  If that is it, please
> > confirm.  If it is something else, also explain.
>
> I don't think the wiki right now covers what's needed. The "Overview",
> "Threat model" and "Scope of TDE" are a start, but beyond that it's
> missing a bunch of things. And it's not in the source tree (we'll soon
> have multiple versions of postgres with increasing levels of TDE
> features; the wiki doesn't help with that).
>

Thanks, the versioning issue makes sense as a reason for the design
document needing to be part of the source tree.


As I was reading the README for the patch Amit referenced and as I am
going through this patch, I feel the desire to incorporate diagrams.
Are design diagrams ever incorporated in the source tree as a part of
the design description of a feature? If not, any concerns about doing
that? I think that is likely where I can contribute the most.


> Missing:
> - talks about cluster wide encryption being simpler, without mentioning
>   what it's being compared to, and what makes it simpler
> - no differentiation from file system / block level encryption
> - there's no explanation of which/why specific crypto primitives were
>   chosen, what the tradeoffs are
> - no explanation which keys exists, stored where
> - the key management patch introduces new files, not documented
> - there's new types of lock files, possibility of interrupted
>   operations, ... - no documentation of what that means
> - there's no documentation what "key wrapping" actually precisely is,
>   what the danger of the two-tier model is, ...
> - are there dangers in not encrypting zero pages etc?
> - ...
>

Some of the missing things you mention above are about the design of
the TDE feature in general. However, this patch is about Key Management,
which is going to be part of the larger TDE feature. So it feels as though
there is the need for a general design document about the overall
vision / approach for TDE and a specific design doc. for Key
Management. Is it appropriate to include both of those in the same
patch?

Something along the lines of: here is the overall design of TDE, and here
is how the Key Management portion is designed and implemented. I guess
in that case, follow on patches for TDE could refer to the overall
design described in this patch.




>
>
> Personally, but I admit that there's legitimate reasons to differ on
> that note, I don't think it's reasonable for a feature this invasive to
> commit preliminary patches without the major subsequent patches being in
> a shape that allows reviewing the whole picture.
>
> Greetings,
>
> Andres Freund



-- 
Thomas John Kincaid



Re: Key management with tests

From
Andres Freund
Date:
On 2021-01-18 13:58:20 -0500, Bruce Momjian wrote:
> On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote:
> > Personally, but I admit that there's legitimate reasons to differ on
> > that note, I don't think it's reasonable for a feature this invasive to
> > commit preliminary patches without the major subsequent patches being in
> > a shape that allows reviewing the whole picture.
> 
> OK, if that is a requirement, I can't help anymore since there are
> already complaints that the patch is too large to review, even if broken
> into pieces.  Please let me know what the community decides.

Those aren't conflicting demands. Having later patches around to
validate the design of earlier patches doesn't necessitate that the
later patches need to be reviewed at the same time.



Re: Key management with tests

From
Robert Haas
Date:
On Mon, Jan 18, 2021 at 2:00 PM Tom Kincaid <tomjohnkincaid@gmail.com> wrote:
> Some of the missing things you mention above are about the design of
> the TDE feature in general. However, this patch is about Key Management,
> which is going to be part of the larger TDE feature. So it feels as though
> there is the need for a general design document about the overall
> vision / approach for TDE and a specific design doc. for Key
> Management. Is it appropriate to include both of those in the same
> patch?

To me, it wouldn't make sense to commit a full README for a TDE
feature that we don't have yet with a key management patch, but the
way that they'll interact with each other has to be clear. The
doc/database-encryption.sgml file that Bruce included in the patch is
a decent start on explaining the design, though I think it needs more
work and more details, perhaps including some of the things Andres
mentioned.

To be honest, after reading over that SGML documentation a bit, I'm
somewhat skeptical about whether it really makes sense to think about
committing the key management part separately. It seems to have no use
independent of the main feature, and it in fact embeds very specific
details of how the main feature is expected to work. For example, the
documentation says that key #0 will be used for data files, and key #1
for WAL. There seems to be no suggestion that the key management
portion of this can be used to manage encryption keys generally for
whatever purposes someone might have; it's all about the needs of a
particular TDE implementation. Typically, we would not commit
something like that separately, or only once the main patch was done,
with the two commits occurring in a relatively short time period.
Otherwise, as Bruce already noted, we can end up with something that
is documented and visible to users but doesn't actually work yet.

Some more specific comments on data-encryption.sgml:

* The documentation explains that the purpose of having a WAL key
separate from the data file key is so that the data file keys can
"eventually" be rotated. It's not clear whether this means that we
might eventually have that feature or that we might eventually be able
to rotate, after failing over. If this kind of thing is possible,
we'll eventually need documentation on how to do it.

* The reasons for using a DEK and a KEK are not explained. I realize
it's not an uncommon practice and that other systems do it, but I
think a few sentences of explanation wouldn't be a bad idea. Even if
we are supposing that hackers who want to have input into this feature
have to be knowledgeable about cryptography, I don't think we can
reasonably suppose that for users.

* "For example" is at one point followed by a period rather than a
colon or comma.

* In the "Internals" subsection, the last sentence doesn't seem to be
grammatical. I wonder if it's missing the word "or".

* The part about integrity-checking keys on startup isn't clear. It
makes it sound like we still have a copy of the KEK lying around
someplace against which we can compare, which I assume is not the case
since it would be really insecure.

* I think it's going to be pretty important that we can easily switch
to other cryptographic algorithms as they are discovered, so I don't
like the fact that this is tied specifically to AES. (In fact,
kmgr_utils.h makes it sound like we're specifically locked into
AES256, but that contradicts the documentation, so I guess there's
some clarification needed here about what exactly KMGR_CLUSTER_KEY_LEN
is doing.) As far as possible we should try to make this generic, like
supporting any cipher that SSL has which has property X. It seems
relatively inevitable that every currently popular cryptographic
algorithm will at some point in the future be judged weak and
worthless, just as has already happened with MD5 and some variants of
SHA, both of which used to be considered state of the art. It seems
equally inevitable that new and stronger algorithms will continue to
be devised, and we'll want to adopt those easily.

I'm not sure to what extent this is a serious flaw in the patch and to
what extent it's just a matter of tweaking the wording of some things,
but I think this is actually an extremely critical design point where
we had better be certain we've got it right. Few things would be
sadder than to get a TDE feature and then have to rip it out again
because it couldn't be upgraded to work with newer crypto algorithms
with reasonable effort.

Notes on other parts of the documentation:

* The documentation for initdb -K doesn't list the valid values of the
parameter, only the default. Probably we should be specifying an
algorithm here and not just a bit count. Otherwise, like I say above,
what happens when AES gives way to something else? It'd be easy to say
-K BFT256 instead of -K AES256, but if AES is assumed and it's no
longer what we want, then we have problems. This kind of thing probably
needs to be cleaned up in a bunch of places.

* I don't see the point of saying "a passphrase or PIN." We don't need
to document that your passphrase might happen to only contain digits.

* pg_alterckey's description of "repair" is hard to understand. It
doesn't really explain why or how this would be necessary, and it begs
the question of why we'd ever leave things in a state that requires
repair. This is sketched out in code comments elsewhere, but I think
at least some of it needs to be explained in the documentation as
well. (Incidentally, I don't think the comments at the top of
recover_failure will survive a visit from pgindent, though I might be
wrong about that.)

* The changes to config.sgml say "Sample script" instead of "Sample scripts".

* I don't think that the documentation of %R is very clear, or
adequate for someone to make effective use of it. If I wanted to use
%R, how would I ensure that a value is available?

* The changes to allfiles.sgml add pg_alterckey.sgml in the wrong
place and include an incorrect whitespace change.

* It's odd that "pg_alterckey" describes itself as "technically"
changing the KEK. Isn't that just what it does, not a technicality? I
imagine we'll ultimately need a way to change a DEK as well, because
otherwise the use of a separate key for the WAL wouldn't accomplish
the intended goal.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Key management with tests

From
Tom Kincaid
Date:
I met with Bruce and Stephen this afternoon to discuss the feedback we
received so far on this patch (prior to Robert's note, which I haven't
fully digested yet).

Here is what we plan to do:

1) Bruce is going to gather all the details from the Wiki and build a
README for the TDE Key Management patch. In addition, it will include
details about the implementation, the data structures involved and the
locks that are taken and general technical implementation approach.

2) Stephen is going to write up the overall design of TDE.

Between these two patches, we hope to cover what Andres is asking for
and what Robert is asking for in his reply on this thread which I
haven't fully digested yet.


Stephen's documentation patch will also make reference to Neil Chen's
TDE prototype for making use of this Key Management patch to encrypt
and decrypt heap pages as well as index pages.

https://www.postgresql.org/message-id/CAA3qoJ=qtO5JcSBjqFDBT9iKUX9XKmC5bXCrd7rysE+XSMEuTg@mail.gmail.com

3) Tom will work to find somebody who will sign up as a reviewer upon
the next submission of this patch. (Somebody who is not an author).

Could we get feedback if this feels like enough to get this patch
(which will include just the Key Management portion of TDE) to a state
where it can be reviewed and assuming the review issues are resolved
with consensus be committed?

On Mon, Jan 18, 2021 at 2:00 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2021-01-18 13:58:20 -0500, Bruce Momjian wrote:
> > On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote:
> > > Personally, but I admit that there's legitimate reasons to differ on
> > > that note, I don't think it's reasonable for a feature this invasive to
> > > commit preliminary patches without the major subsequent patches being in
> > > a shape that allows reviewing the whole picture.
> >
> > OK, if that is a requirement, I can't help anymore since there are
> > already complaints that the patch is too large to review, even if broken
> > into pieces.  Please let me know what the community decides.
>
> Those aren't conflicting demands. Having later patches around to
> validate the design of earlier patches doesn't necessitate that the
> later patches need to be reviewed at the same time.



-- 
Thomas John Kincaid



Re: Key management with tests

From
Bruce Momjian
Date:
On Mon, Jan 18, 2021 at 04:38:47PM -0500, Robert Haas wrote:
> To me, it wouldn't make sense to commit a full README for a TDE
> feature that we don't have yet with a key management patch, but the
> way that they'll interact with each other has to be clear. The
> doc/database-encryption.sgml file that Bruce included in the patch is
> a decent start on explaining the design, though I think it needs more
> work and more details, perhaps including some of the things Andres
> mentioned.

Sure.

> To be honest, after reading over that SGML documentation a bit, I'm
> somewhat skeptical about whether it really makes sense to think about
> committing the key management part separately. It seems to have no use
> independent of the main feature, and it in fact embeds very specific

For usefulness, it does enable passphrase prompting for the TLS private
key.

> details of how the main feature is expected to work. For example, the
> documentation says that key #0 will be used for data files, and key #1
> for WAL. There seems to be no suggestion that the key management
> portion of this can be used to manage encryption keys generally for
> whatever purposes someone might have; it's all about the needs of a
> particular TDE implementation. Typically, we would not commit

We originally were going to have SQL-level keys, but many felt they
weren't useful.

> something like that separately, or only once the main patch was done,
> with the two commits occurring in a relatively short time period.
> Otherwise, as Bruce already noted, we can end up with something that
> is documented and visible to users but doesn't actually work yet.

Yep, that is the risk.

> Some more specific comments on data-encryption.sgml:
> 
> * The documentation explains that the purpose of having a WAL key
> separate from the data file key is so that the data file keys can
> "eventually" be rotated. It's not clear whether this means that we
> might eventually have that feature or that we might eventually be able
> to rotate, after failing over. If this kind of thing is possible,
> we'll eventually need documentation on how to do it.

I have clarified that by saying "future release".

> * The reasons for using a DEK and a KEK are not explained. I realize
> it's not an uncommon practice and that other systems do it, but I
> think a few sentences of explanation wouldn't be a bad idea. Even if
> we are supposing that hackers who want to have input into this feature
> have to be knowledgeable about cryptography, I don't think we can
> reasonably suppose that for users.

I added a little about that in the docs.
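
To summarize the approach for the archives: the KEK, derived from the
cluster key command output, wraps the randomly-generated DEKs, so
rotating the passphrase only requires re-wrapping a few small key files
rather than re-encrypting the cluster.  A minimal sketch of that wrap
step using OpenSSL's RFC 3394 key wrap (illustrative only, not the
patch's actual code):

    #include <openssl/evp.h>

    /*
     * Wrap a DEK with a 32-byte KEK using AES-256 key wrap.  The output
     * is dek_len + 8 bytes.  Returns the wrapped length, or -1 on error.
     */
    static int
    wrap_dek(const unsigned char *kek, const unsigned char *dek,
             int dek_len, unsigned char *wrapped)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int         len,
                    final_len;

        /* key-wrap modes must be explicitly allowed before init */
        EVP_CIPHER_CTX_set_flags(ctx, EVP_CIPHER_CTX_FLAG_WRAP_ALLOW);

        if (EVP_EncryptInit_ex(ctx, EVP_aes_256_wrap(), NULL, kek, NULL) != 1 ||
            EVP_EncryptUpdate(ctx, wrapped, &len, dek, dek_len) != 1 ||
            EVP_EncryptFinal_ex(ctx, wrapped + len, &final_len) != 1)
        {
            EVP_CIPHER_CTX_free(ctx);
            return -1;
        }

        EVP_CIPHER_CTX_free(ctx);
        return len + final_len;
    }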

> * "For example" is at one point followed by a period rather than a
> colon or comma.

Fixed.

> * In the "Internals" subsection, the last sentence doesn't seem to be
> grammatical. I wonder if it's missing the word "or".

Fixed.

> * The part about integrity-checking keys on startup isn't clear. It
> makes it sound like we still have a copy of the KEK lying around
> someplace against which we can compare, which I assume is not the case
> since it would be really insecure.

I reworded that entire section.  See if it is better now.

> * I think it's going to be pretty important that we can easily switch
> to other cryptographic algorithms as they are discovered, so I don't
> like the fact that this is tied specifically to AES. (In fact,
> kmgr_utils.h makes it sound like we're specifically locked into
> AES256, but that contradicts the documentation, so I guess there's
> some clarification needed here about what exactly KMGR_CLUSTER_KEY_LEN
> is doing.) As far as possible we should try to make this generic, like
> supporting any cipher that SSL has which has property X. It seems
> relatively inevitable that every currently popular cryptographic
> algorithm will at some point in the future be judged weak and
> worthless, just as has already happened with MD5 and some variants of
> SHA, both of which used to be considered state of the art. It seems
> equally inevitable that new and stronger algorithms will continue to
> be devised, and we'll want to adopt those easily.

That is a nifty idea.  Right now I just pass the integer length around,
and store it in pg_control, but if we define macros, we can easily
abstract this and easily allow for new methods.  If others like that, I
will start on it now.
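
Roughly what I have in mind, with hypothetical names (a sketch only;
the real table would live next to the cipher code):

    /* sketch: table of supported methods instead of a bare bit count */
    typedef struct encryption_method
    {
        const char *name;       /* what initdb -K would accept, e.g. "AES256" */
        int         key_length; /* key length in bytes, stored in pg_control */
    } encryption_method;

    static const encryption_method encryption_methods[] = {
        {"AES128", 16},
        {"AES256", 32},
    };

Adding a future cipher would then mostly be a matter of adding a table
entry plus the cipher glue, rather than touching every place that
currently assumes AES.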

> I'm not sure to what extent this is a serious flaw in the patch and to
> what extent it's just a matter of tweaking the wording of some things,
> but I think this is actually an extremely critical design point where
> we had better be certain we've got it right. Few things would be
> sadder than to get a TDE feature and then have to rip it out again
> because it couldn't be upgraded to work with newer crypto algorithms
> with reasonable effort.

Yep.

> Notes on other parts of the documentation:
> 
> * The documentation for initdb -K doesn't list the valid values of the
> parameter, only the default. Probably we should be specifying an

Fixed.

> algorithm here and not just a bit count. Otherwise, like I say above,
> what happens when AES gives way to something else? It'd be easy to say
> -K BFT256 instead of -K AES256, but if AES is assumed and it's no
> longer what we want, then we have problems. This kind of thing probably
> needs to be cleaned up in a bunch of places.

Again, I can do that if people like it.

> * I don't see the point of saying "a passphrase or PIN." We don't need
> to document that your passphrase might happen to only contain digits.

Well, PIN is what the Yubikey and PIV devices call it, so I thought I
should give specific examples of inputs.

> * pg_alterckey's description of "repair" is hard to understand. It
> doesn't really explain why or how this would be necessary, and it begs
> the question of why we'd ever leave things in a state that requires
> repair. This is sketched out in code comments elsewhere, but I think
> at least some of it needs to be explained in the documentation as
> well. (Incidentally, I don't think the comments at the top of
> recover_failure will survive a visit from pgindent, though I might be
> wrong about that.)

Fixed with rewording.  Better?

> * The changes to config.sgml say "Sample script" instead of "Sample scripts".

Fixed.

> * I don't think that the documentation of %R is very clear, or
> adequate for someone to make effective use of it. If I wanted to use
> %R, how would I ensure that a value is available?

Fixed, use -R on server start.

> * The changes to allfiles.sgml add pg_alterckey.sgml in the wrong
> place and include an incorrect whitespace change.

Uh, the whitespace change was to align the column.  I will review and
push that separately.

> * It's odd that "pg_alterckey" describes itself as "technically"
> changing the KEK. Isn't that just what it does, not a technicality? I
> imagine we'll ultimately need a way to change a DEK as well, because
> otherwise the use of a separate key for the WAL wouldn't accomplish
> the intended goal.

"technically" removed.  I kind of wanted to say "in detail" or something
like that, but removing the word is fine.  Change-only patch attached so
you can see the changes more easily.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From
Bruce Momjian
Date:
On Mon, Jan 18, 2021 at 05:47:34PM -0500, Tom Kincaid wrote:
>  I met with Bruce and Stephen this afternoon to discuss the feedback
> we received so far (prior to Robert's note which I haven't fully
> digested yet)
> on this patch.
> 
> Here is what we plan to do:
> 
> 1) Bruce is going to gather all the details from the Wiki and build a
> README for the TDE Key Management patch. In addition, it will include
> details about the implementation, the data structures involved and the
> locks that are taken and general technical implementation approach.
...
> Could we get feedback if this feels like enough to get this patch
> (which will include just the Key Management portion of TDE) to a state
> where it can be reviewed and assuming the review issues are resolved
> with consensus be committed?

Attached is an updated patch that has the requested changes:

*  broken into seven parts
*  test script converted from shell to Perl
*  added README for every new directory
*  moved text from wiki to READMEs where appropriate
*  included Robert's suggestions, including the ability to add
   future non-AES crypto methods
*  fixes for pg_alterckey PGDATA arg processing

The patch is attached, and is also here:

    https://github.com/postgres/postgres/compare/master...bmomjian:key.patch

Questions:

*  What changes do people want to this patch set?
*  Do we want it applied, even though it might need to be hidden for PG
   14?
*  If not, how do people build on this patch?  Using the commitfest
   links or github URL?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From
Alvaro Herrera
Date:
In patch 1,

* The docs are not clear on what happens if --auth-prompt is not given
but an auth prompt is required for the program to work.  Should it exit
with a status other than 0?

* BootStrapKmgr claims it is called by initdb, but that doesn't seem to
be the case.

* Also, BootStrapKmgr is the only one that checks USE_OPENSSL; what if a
with-openssl build inits the datadir, and then a non-openssl runs it?
What if it's the other way around?  I think you'd get a failure in
stat() ...

* ... oh, KMGR_DIR_PID is used but not defined anywhere.  Is it defined
in some later commit?  If so, then I think you've chosen to split the
patch series wrong.


May I suggest to use "git format-patch" to produce the patch files?  When
working with a series like this, trying to do patch handling manually
like you seem to be doing, is much more time-consuming and error prone.
For example, with a branch containing individual commits, you could use 
  git rebase -i origin/master -x "make install check-world"
or similar, so that each commit is built and tested individually.

-- 
Álvaro Herrera       Valdivia, Chile
In the beginning there was UNIX, and UNIX spoke and said: "Hello world\n".
It did not say "Hello New Jersey\n", nor "Hello USA\n".



Re: Key management with tests

From
Bruce Momjian
Date:
On Mon, Jan 25, 2021 at 08:12:01PM -0300, Álvaro Herrera wrote:
> In patch 1,
> 
> * The docs are not clear on what happens if --auth-prompt is not given
> but an auth prompt is required for the program to work.  Should it exit
> with a status other than 0?

Uh, I think the docs talk about this:

    It can prompt from the terminal if
<option>--authprompt</option> is used.  In the parameter
    value, <literal>%R</literal> is replaced by a file descriptor
    number opened to the terminal that started the server.  A file
    descriptor is only available if enabled at server start via
    <option>-R</option>.  If <literal>%R</literal> is specified and
    no file descriptor is available, the server will not start.

The code is:

    case 'R':
    {
        char fd_str[20];

        /* %R is only usable if a terminal fd was passed down via -R/--authprompt */
        if (terminal_fd == -1)
        {
            ereport(ERROR,
                    (errcode(ERRCODE_INTERNAL_ERROR),
                     errmsg("cluster key command referenced %%R, but --authprompt not specified")));
        }

Does that help?

> * BootStrapKmgr claims it is called by initdb, but that doesn't seem to
> be the case.

Well, initdb runs the backend in --boot mode, and that calls
BootStrapKmgr().  Does that help?

> * Also, BootStrapKmgr is the only one that checks USE_OPENSSL; what if a
> with-openssl build inits the datadir, and then a non-openssl runs it?
> What if it's the other way around?  I think you'd get a failure in
> stat() ...

Wow, I never considered that.  I have added a check to InitializeKmgr().
Thanks.

> * ... oh, KMGR_DIR_PID is used but not defined anywhere.  Is it defined
> in some later commit?  If so, then I think you've chosen to split the
> patch series wrong.

OK, fixed.  It is in include/common/kmgr_utils.h, which was in #3.

> May I suggest to use "git format-patch" to produce the patch files?  When
> working with a series like this, trying to do patch handling manually
> like you seem to be doing, is much more time-consuming and error prone.
> For example, with a branch containing individual commits, you could use 
>   git rebase -i origin/master -x "make install check-world"
> or similar, so that each commit is built and tested individually.

I used "git format-patch".  Are you asking for seven commits that then
generate seven files via one format-patch run?  Or is the primary issue
that you want compile testing for each patch?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Bruce Momjian
Date:
On Mon, Jan 25, 2021 at 07:09:44PM -0500, Bruce Momjian wrote:
> > May I suggest to use "git format-patch" to produce the patch files?  When
> > working with a series like this, trying to do patch handling manually
> > like you seem to be doing, is much more time-consuming and error prone.
> > For example, with a branch containing individual commits, you could use 
> >   git rebase -i origin/master -x "make install check-world"
> > or similar, so that each commit is built and tested individually.
> 
> I used "git format-patch".  Are you asking for seven commits that then
> generate seven files via one format-patch run?  Or is the primary issue
> that you want compile testing for each patch?

The attached patch meets both criteria.  I also clarified the README on
how initdb calls those functions.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From
Bruce Momjian
Date:
On Mon, Jan 25, 2021 at 10:27:18PM -0500, Bruce Momjian wrote:
> On Mon, Jan 25, 2021 at 07:09:44PM -0500, Bruce Momjian wrote:
> > > May I suggest to use "git format-patch" to produce the patch files?  When
> > > working with a series like this, trying to do patch handling manually
> > > like you seem to be doing, is much more time-consuming and error prone.
> > > For example, with a branch containing individual commits, you could use 
> > >   git rebase -i origin/master -x "make install check-world"
> > > or similar, so that each commit is built and tested individually.
> > 
> > I used "git format-patch".  Are you asking for seven commits that then
> > generate seven files via one format-patch run?  Or is the primary issue
> > that you want compile testing for each patch?
> 
> The attached patch meets both criteria.  I also clarified the README on
> how initdb calls those functions.

This version fixes OpenSSL detection and improves docs for initdb
interactions.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From
Robert Haas
Date:
On Tue, Jan 26, 2021 at 11:15 AM Bruce Momjian <bruce@momjian.us> wrote:
> This version fixes OpenSSL detection and improves docs for initdb
> interactions.

Hi,

I'm wondering whether you've considered storing all the keys in one
file instead of a file per key. The reason I ask is that it seems to
me that the key rotation procedure would be a lot simpler if it were
all in one file. You could just create a temporary file and atomically
rename it over the existing file. If you see a temporary file you're
always free to remove it. This is a lot simpler than what you have
right now. The "repair" concept pretty much goes away completely,
which seems nice. Granted I don't know exactly how to store multiple
keys in one file, but I bet there's some way to do it.
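
Just to sketch what I mean, assuming a hypothetical single key file and
invented helper names, using the existing durable_rename() machinery:

    /* write the new wrapped keys beside the old file, then swap atomically */
    static void
    replace_key_file(const char *path, const void *newkeys, Size len)
    {
        char        tmppath[MAXPGPATH];
        int         fd;

        snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

        fd = OpenTransientFile(tmppath, O_WRONLY | O_CREAT | O_TRUNC | PG_BINARY);
        if (fd < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not create file \"%s\": %m", tmppath)));

        if (write(fd, newkeys, len) != (ssize_t) len)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not write file \"%s\": %m", tmppath)));

        if (pg_fsync(fd) != 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not fsync file \"%s\": %m", tmppath)));

        CloseTransientFile(fd);

        /* atomic rename; also fsyncs the directory entry */
        durable_rename(tmppath, path, ERROR);
    }

A crash at any point leaves either the complete old file or the complete
new file, so startup only ever has to delete a leftover *.tmp file.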

The way in which you are posting these patches is quite unlike what
most people do when posting patches to this list. You seem to have
generated a bunch of patches using 'git format-patch' but then
concatenated them all together in a single file. It would be helpful
if you could do this more like the way that is now standard on this
list. Not only that, but the patches don't have meaningful commit
messages in them, and don't seem to be meaningfully split for easy
review. They just say things like 'crypto squash commit'. Compare this
to for example what I did on the "cleaning up a few CLOG-related
things" thread where the commits appear in a logical sequence, and
each one has a meaningful commit message. Or here's an example from
someone else --
http://postgr.es/m/be72abfa-e62e-eb81-4e70-1b57fe6dc9e2@amazon.com --
and note the inclusion of authorship information in the commit
messages, so that the source of the code can be easily understood.

The README in src/backend/crypto does not explain how the scripts in
that directory are intended to be used. If I want to use AWS Secrets
Manager with this feature, I can see that I should use
ckey_aws.sh.sample as a basis for that integration, but I don't know
what I should do with the script because the README says nothing about
it. I am frankly pretty doubtful about the idea of shipping a bunch of
/bin/sh scripts as a best practice; for one thing, that's totally
unusable on Windows, and it also means that this is dependent on
/bin/sh existing and having the behavior we expect and on all the
other executables in these scripts as well. But, at the very least,
there needs to be a clearer explanation of how the scripts are
intended to be used, which parts people are supposed to modify, what
arguments they're going to get called with, and things like that.

The comments in cipher.c and cipher_openssl.c could be improved to
explain that they are alternatives to each other. Perhaps the former
could be renamed to something like cipher_failure.c or cipher_noimpl.c
for clarity.

I believe that a StaticAssertStmt could be used to check the length of
the encryption_methods[] array, so that if someone changes
NUM_ENCRYPTION_METHODS without updating the array, compilation fails.
See UserAuthName[] for an example of how to do this.
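
Something like this, assuming the array and constant names from the
patch (StaticAssertDecl is the file-scope variant):

    StaticAssertDecl(lengthof(encryption_methods) == NUM_ENCRYPTION_METHODS,
                     "encryption_methods[] must match NUM_ENCRYPTION_METHODS");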

You seem to have omitted to update the documentation with the names of
the new wait events that you added.

In process_postgres_switches(), when there's a multi-line comment
followed by a single line of actual code, I prefer to include braces
around the whole thing. There might be some disagreement on what is
best here, though.

What are the consequences of the placement of the code in
PostgresMain() for processes other than user backends and walsenders?
I think that the way you have it, background workers would not have
access to keys, nor auxiliary processes like the checkpointer ... at
least in the EXEC_BACKEND case. In the non-EXEC_BACKEND case you have
the postmaster doing it, so then I'm not sure why it has to be redone
for every backend. Won't they just inherit the data from the
postmaster? Has this code been meaningfully tested on Windows? How do
we know that it works? Maybe we need to think about adding some
asserts that guarantee that any process that attempts to access a
buffer has the key manager initialized; I bet such assertions would
fail at least on Windows as the code looks now.

I don't think it makes sense to think about committing this to v14. I
believe it only makes sense if we have a TDE patch that is relatively
close to committable that can be used with it. I also don't think that
this patch is in good enough shape to commit yet in terms of where
it's at in terms of quality; I think it needs more review first,
hopefully including review from people who can comment intelligently
specifically on the cryptography aspects of it. However, the
challenges don't seem insurmountable. There's also still some question
in my mind about whether the design choices here (one KEK, 2 DEKs, one
for data and one for WAL) have enough consensus. I don't have a
considered opinion on that, partly because I'm not quite sure what the
reasonable alternatives are, but it seems that other people had some
questions about it, IIUC.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Key management with tests

From
Bruce Momjian
Date:
On Tue, Jan 26, 2021 at 03:24:30PM -0500, Robert Haas wrote:
> On Tue, Jan 26, 2021 at 11:15 AM Bruce Momjian <bruce@momjian.us> wrote:
> > This version fixes OpenSSL detection and improves docs for initdb
> > interactions.
> 
> Hi,
> 
> I'm wondering whether you've considered storing all the keys in one
> file instead of a file per key. The reason I ask is that it seems to
> me that the key rotation procedure would be a lot simpler if it were
> all in one file. You could just create a temporary file and atomically
> rename it over the existing file. If you see a temporary file you're
> always free to remove it. This is a lot simpler than what you have
> right now. The "repair" concept pretty much goes away completely,
> which seems nice. Granted I don't know exactly how to store multiple
> keys in one file, but I bet there's some way to do it.

We envisioned allowing heap/index key rotation by having a standby with
the same WAL key as the primary but different heap/index keys so that we
can fail over to the standby to change the heap/index key and then change
the WAL key.  This separation allows that.  We also might need some
additional keys later and this allows that.  I do like simplicity, but
the complexity here seems to serve a need.

> The way in which you are posting these patches is quite unlike what
> most people do when posting patches to this list. You seem to have
> generated a bunch of patches using 'git format-patch' but then
> concatenated them all together in a single file. It would be helpful
> if you could do this more like the way that is now standard on this
> list. Not only that, but the patches don't have meaningful commit

What is the standard?  You want seven separate files?  I can do that.

> messages in them, and don't seem to be meaningfully split for easy
> review. They just say things like 'crypto squash commit'. Compare this

Yes, the feature is at the backend, common, /bin, and test levels.  I
was able to separate out the bin, pg_alterckey and test stuff, but the
backend interactions were hard to split.

> to for example what I did on the "cleaning up a few CLOG-related
> things" thread where the commits appear in a logical sequence, and
> each one has a meaningful commit message. Or here's an example from
> someone else --
> http://postgr.es/m/be72abfa-e62e-eb81-4e70-1b57fe6dc9e2@amazon.com --
> and note the inclusion of authorship information in the commit
> messages, so that the source of the code can be easily understood.

I see.  I am not sure how to do that easily for all the pieces.

> The README in src/backend/crypto does not explain how the scripts in
> that directory are intended to be used. If I want to use AWS Secrets
> Manager with this feature, I can see that I should use
> ckey_aws.sh.sample as a basis for that integration, but I don't know
> what I should do with the script because the README says nothing about
> it. I am frankly pretty doubtful about the idea of shipping a bunch of
> /bin/sh scripts as a best practice; for one thing, that's totally
> unusable on Windows, and it also means that this is dependent on
> /bin/sh existing and having the behavior we expect and on all the
> other executables in these scripts as well. But, at the very least,
> there needs to be a clearer explanation of how the scripts are
> intended to be used, which parts people are supposed to modify, what
> arguments they're going to get called with, and things like that.

I added comments to most of the scripts.  I don't know what more I can
do, or what other language would be appropriate.

> The comments in cipher.c and cipher_openssl.c could be improved to
> explain that they are alternatives to each other. Perhaps the former
> could be renamed to something like cipher_failure.c or cipher_noimpl.c
> for clarity.

This follows the way cryptohash.c and cryptohash_openssl.c are done.  I
did just add comments to the top of cipher.c and cipher_openssl.c to be
just like cryptohash versions.

> I believe that a StaticAssertStmt could be used to check the length of
> the encryption_methods[] array, so that if someone changes
> NUM_ENCRYPTION_METHODS without updating the array, compilation fails.
> See UserAuthName[] for an example of how to do this.

Sure, good idea, done.

> You seem to have omitted to update the documentation with the names of
> the new wait events that you added.

OK, added.

> In process_postgres_switches(), when there's a multi-line comment
> followed by a single line of actual code, I prefer to include braces
> around the whole thing. There might be some disagreement on what is
> best here, though.

OK, done.

> What are the consequences of the placement of the code in
> PostgresMain() for processes other than user backends and walsenders?
> I think that the way you have it, background workers would not have
> access to keys, nor auxiliary processes like the checkpointer ... at

Well, there are three cases, --boot mode, postmaster mode, and postgres
single-user mode.  I tried to have all those cases only unwrap the keys
once and store them in shared memory, or in boot mode, in local memory.
As far as I know, the startup process does it once and everyone else uses
shared memory to access it.

> least in the EXEC_BACKEND case. In the non-EXEC_BACKEND case you have
> the postmaster doing it, so then I'm not sure why it has to be redone
> for every backend. Won't they just inherit the data from the

For postgres --single.

> postmaster? Has this code been meaningfully tested on Windows? How do

No, just by the cfbot Windows machine.

> we know that it works? Maybe we need to think about adding some
> asserts that guarantee that any process that attempts to access a
> buffer has the key manager initialized; I bet such assertions would
> fail at least on Windows as the code looks now.

Are you saying we should set a global variable and throw an error if it
is accessed without the array being initialized?
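
Something like this, perhaps (hypothetical names)?

    /* set true once the DEKs have been unwrapped into (shared) memory */
    extern PGDLLIMPORT bool kmgr_keys_ready;

    /* would be called before any buffer encryption or decryption */
    static inline void
    check_kmgr_keys_ready(void)
    {
        Assert(kmgr_keys_ready);
    }

That would at least make a missed initialization fail loudly in
assert-enabled builds, including on Windows.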

> I don't think it makes sense to think about committing this to v14. I
> believe it only makes sense if we have a TDE patch that is relatively
> close to committable that can be used with it. I also don't think that
> this patch is in good enough shape to commit yet in terms of where
> it's at in terms of quality; I think it needs more review first,
> hopefully including review from people who can comment intelligently
> specifically on the cryptography aspects of it. However, the
> challenges don't seem insurmountable. There's also still some question
> in my mind about whether the design choices here (one KEK, 2 DEKs, one
> for data and one for WAL) have enough consensus. I don't have a
> considered opinion on that, partly because I'm not quite sure what the
> reasonable alternatives are, but it seems that other people had some
> questions about it, IIUC.

While I am willing to make requested adjustments to the patch, I don't
plan to work on this feature any further, assuming your analysis above is
correct.  If after years we are still not sure this is the right
direction, I don't see any point in moving forward with the later
pieces, which are even more complicated.  I will join the group of
people that feel there will never be consensus on implementing this
feature in the community, so it is not worth trying.

I would also like to add a "not wanted" entry for this feature on the
TODO list, based on the feature's limited usefulness, but I already
asked about that and no one seems to feel we don't want it.

I now better understand why the OpenSSL project has had such serious
problems in the past.

Updated patch attached as seven attachments.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From
Bruce Momjian
Date:
On Tue, Jan 26, 2021 at 05:53:01PM -0500, Bruce Momjian wrote:
> On Tue, Jan 26, 2021 at 03:24:30PM -0500, Robert Haas wrote:
> > I'm wondering whether you've considered storing all the keys in one
> > file instead of a file per key. The reason I ask is that it seems to
> > me that the key rotation procedure would be a lot simpler if it were
> > all in one file. You could just create a temporary file and atomically
> > rename it over the existing file. If you see a temporary file you're
> > always free to remove it. This is a lot simpler than what you have
> > right now. The "repair" concept pretty much goes away completely,
> > which seems nice. Granted I don't know exactly how to store multiple
> > keys in one file, but I bet there's some way to do it.
> 
> We envisioned allowing heap/index key rotation by having a standby with
> the same WAL key as the primary but different heap/index keys so that we
> can fail over to the standby to change the heap/index key and then change
> the WAL key.  This separation allows that.  We also might need some
> additional keys later and this allows that.  I do like simplicity, but
> the complexity here seems to serve a need.

Just to close this issue, several scripts, e.g., PIV and AWS, need to store
data to indicate the cluster encryption key used, and those need to be
kept synchronized with the wrapped data keys.  Having separate
directories for each cluster key version allows that to work cleanly.

> > The README in src/backend/crypto does not explain how the scripts in
> > that directory are intended to be used. If I want to use AWS Secrets
> > Manager with this feature, I can see that I should use
> > ckey_aws.sh.sample as a basis for that integration, but I don't know
> > what I should do with the script because the README says nothing about
> > it. I am frankly pretty doubtful about the idea of shipping a bunch of
> > /bin/sh scripts as a best practice; for one thing, that's totally
> > unusable on Windows, and it also means that this is dependent on
> > /bin/sh existing and having the behavior we expect and on all the
> > other executables in these scripts as well. But, at the very least,
> > there needs to be a clearer explanation of how the scripts are
> > intended to be used, which parts people are supposed to modify, what
> > arguments they're going to get called with, and things like that.
> 
> I added comments to most of the scripts.  I don't know what more I can
> do, or what other language would be appropriate.

I think someone would need to write Windows versions of these scripts.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Tom Kincaid
Date:


Hello,


> I don't think it makes sense to think about committing this to v14. I
> believe it only makes sense if we have a TDE patch that is relatively
> close to committable that can be used with it. I also don't think that
> this patch is in good enough shape to commit yet in terms of where
> it's at in terms of quality; I think it needs more review first,
> hopefully including review from people who can comment intelligently
> specifically on the cryptography aspects of it. However, the
> challenges don't seem insurmountable. There's also still some question
> in my mind about whether the design choices here (one KEK, 2 DEKs, one
> for data and one for WAL) have enough consensus. I don't have a
> considered opinion on that, partly because I'm not quite sure what the
> reasonable alternatives are, but it seems that other people had some
> questions about it, IIUC.

While I am willing to make requested adjustments to the patch, I don't
plan to work on this feature any further, assuming your analysis above is
correct.  If after years we are still not sure this is the right
direction, I don't see any point in moving forward with the later
pieces, which are even more complicated.  I will join the group of
people that feel there will never be consensus on implementing this
feature in the community, so it is not worth trying.

I would also like to add a "not wanted" entry for this feature on the
TODO list, based on the feature's limited usefulness, but I already
asked about that and no one seems to feel we don't want it.

I want to avoid seeing this happen. As a result of a lot of customer and user discussions around their criteria for choosing a database, I believe TDE is an important feature, and having it appear with a "not-wanted" tag will keep the version of PostgreSQL released by the community out of a certain (and possibly growing) number of deployment scenarios, which I don't think anybody wants to see.

I think the current situation is as follows (if I missed something, please let me know):

1) We need to get the current patch for Key Management reviewed and tested further. 

I spoke to Bruce just now he will see if can get somebody to do this.


2) We need to start working on the actual TDE implementation and get it pretty close to final before we start committing smaller portions of the feature.

Unfortunately, on this front, the only things, I think I can offer are:

a) Ask for volunteers to work on the TDE implementation.
b) Facilitate the work between volunteers.
c) Prod folks along and cheer as we go.

So I will start with (a), do we have any volunteers who feel they can contribute regularly for a while and would like to be part of a team that moves this forward?



I now better understand why the OpenSSL project has had such serious
problems in the past.

Updated patch attached as seven attachments.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee



--
Thomas John Kincaid

Re: Key management with tests

From
Bruce Momjian
Date:
On Thu, Jan 28, 2021 at 02:41:09PM -0500, Tom Kincaid wrote:
>     I would also like to add a "not wanted" entry for this feature on the
>     TODO list, based on the feature's limited usefulness, but I already
>     asked about that and no one seems to feel we don't want it.
> 
> 
> I want to avoid seeing this happen. As a result of a lot of customer and user
> discussions around their criteria for choosing a database, I believe TDE is an
> important feature, and having it appear with a "not-wanted" tag will keep the
> version of PostgreSQL released by the community out of a certain (and possibly
> growing) number of deployment scenarios, which I don't think anybody wants to
> see.

With pg_upgrade, I could work on it out of the tree until it became
popular, with a small non-user-visible part in the backend.  With the
Windows port, the port wasn't really visible to users until it was ready.

For the key management part of TDE, it can't be done outside the tree,
and it is user-visible before it is useful, so that restricts how much
incremental work can be committed to the tree for TDE.  I highlighted
that concern in emails months ago, but never got any feedback --- now it
seems people are realizing the ramifications of that.

> I think the current situation is as follows (if I missed something, please
> let me know):
> 
> 1) We need to get the current patch for Key Management reviewed and tested
> further. 
> 
> I spoke to Bruce just now he will see if can get somebody to do this.

Well, if we don't get anyone committed to working on the data encryption
part of TDE, the key management part is useless, so why review/test it
further?

Although Sawada-san and Stephen Frost worked on the patch, they have not
commented much on my additions, and only a few others have commented on
the code, and there has been no discussion on who is working on the next
steps.  This indicates to me that there is little interest in moving
this feature forward, which is why I started asking if it could be
labeled as "not wanted".

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Masahiko Sawada
Date:
On Fri, Jan 29, 2021 at 5:22 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Jan 28, 2021 at 02:41:09PM -0500, Tom Kincaid wrote:
> >     I would also like to add a "not wanted" entry for this feature on the
> >     TODO list, based on the feature's limited usefulness, but I already
> >     asked about that and no one seems to feel we don't want it.
> >
> >
> > I want to avoid seeing this happen. As a result of a lot of customer and user
> > discussions around their criteria for choosing a database, I believe TDE is an
> > important feature, and having it appear with a "not-wanted" tag will keep the
> > version of PostgreSQL released by the community out of a certain (and possibly
> > growing) number of deployment scenarios, which I don't think anybody wants to
> > see.
>
> With pg_upgrade, I could work on it out of the tree until it became
> popular, with a small non-user-visible part in the backend.  With the
> > Windows port, the port wasn't really visible to users until it was ready.
>
> For the key management part of TDE, it can't be done outside the tree,
> and it is user-visible before it is useful, so that restricts how much
> incremental work can be committed to the tree for TDE.  I highlighted
> > that concern in emails months ago, but never got any feedback --- now it
> seems people are realizing the ramifications of that.
>
> > I think the current situation is as follows (if I missed something, please
> > let me know):
> >
> > 1) We need to get the current patch for Key Management reviewed and tested
> > further.
> >
> > I spoke to Bruce just now; he will see if he can get somebody to do this.
>
> Well, if we don't get anyone committed to working on the data encryption
> part of TDE, the key management part is useless, so why review/test it
> further?
>
> Although Sawada-san and Stephen Frost worked on the patch, they have not
> commented much on my additions, and only a few others have commented on
> the code, and there has been no discussion on who is working on the next
> steps.  This indicates to me that there is little interest in moving
> this feature forward,

TBH I’m confused a bit about the recent situation of this patch, but I
can contribute to KMS work by discussing, writing, reviewing, and
testing the patch. Also, I can work on the data encryption part of TDE
(we need more discussion on that though). If the community is concerned
about the high-level design and thinks the design reviews by
cryptography experts are still needed, we would need to do that first,
since the data encryption part of TDE depends on KMS. As far as I
know, we have done that many times on pgsql-hackers, off-line, and in
discussions of past proposals, but given that the community still has
concerns, it seems that we haven’t been able to share enough of the
details of the discussion that led to the design decisions, or the
design is still not good. Honestly, I’m not sure how
this feature can get consensus. But maybe we would need to have a
break from refining the patch now and we need to marshal the
discussions so far and the point behind the design so that everyone
can understand why this feature is designed in that way. To do that,
it might be a good start to sort the wiki page since it has data
encryption part, KMS, and ToDo mixed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Key management with tests

From
Stephen Frost
Date:
Greetings,

* Masahiko Sawada (sawada.mshk@gmail.com) wrote:
> On Fri, Jan 29, 2021 at 5:22 AM Bruce Momjian <bruce@momjian.us> wrote:
> > On Thu, Jan 28, 2021 at 02:41:09PM -0500, Tom Kincaid wrote:
> > >     I would also like to add a "not wanted" entry for this feature on the
> > >     TODO list, based on the feature's limited usefulness, but I already
> > >     asked about that and no one seems to feel we don't want it.
> > >
> > >
> > > I want to avoid seeing this happen. As a result of a lot of customer and user
> > > discussions around their criteria for choosing a database, I believe TDE is an
> > > important feature, and having it appear with a "not-wanted" tag will keep the
> > > version of PostgreSQL released by the community out of a certain (and possibly
> > > growing) number of deployment scenarios, which I don't think anybody wants to
> > > see.
> >
> > With pg_upgrade, I could work on it out of the tree until it became
> > popular, with a small non-user-visible part in the backend.  With the
> > Windows port, the port wasn't really visible to users until it was ready.
> >
> > For the key management part of TDE, it can't be done outside the tree,
> > and it is user-visible before it is useful, so that restricts how much
> > incremental work can be committed to the tree for TDE.  I highlighted
> > that concern in emails months ago, but never got any feedback --- now it
> > seems people are realizing the ramifications of that.
> >
> > > I think the current situation is as follows (if I missed something, please
> > > let me know):
> > >
> > > 1) We need to get the current patch for Key Management reviewed and tested
> > > further.
> > >
> > > I spoke to Bruce just now; he will see if he can get somebody to do this.
> >
> > Well, if we don't get anyone committed to working on the data encryption
> > part of TDE, the key management part is useless, so why review/test it
> > further?
> >
> > Although Sawada-san and Stephen Frost worked on the patch, they have not
> > commented much on my additions, and only a few others have commented on
> > the code, and there has been no discussion on who is working on the next
> > steps.  This indicates to me that there is little interest in moving
> > this feature forward,
>
> TBH I’m confused a bit about the recent situation of this patch, but I
> can contribute to KMS work by discussing, writing, reviewing, and
> testing the patch. Also, I can work on the data encryption part of TDE
> (we need more discussion on that though). If the community is concerned
> about the high-level design and thinks the design reviews by
> cryptography experts are still needed, we would need to do that first,
> since the data encryption part of TDE depends on KMS. As far as I
> know, we have done that many times on pgsql-hackers, off-line, and in
> discussions of past proposals, but given that the community still has
> concerns, it seems that we haven’t been able to share enough of the
> details of the discussion that led to the design decisions, or the
> design is still not good. Honestly, I’m not sure how
> this feature can get consensus. But maybe we would need to have a
> break from refining the patch now and we need to marshal the
> discussions so far and the point behind the design so that everyone
> can understand why this feature is designed in that way. To do that,
> it might be a good start to sort the wiki page since it has data
> encryption part, KMS, and ToDo mixed.

I hope it's pretty clear that I'm also very much in support of both this
effort with the KMS and of TDE in general- TDE is specifically,
repeatedly, called out as a capability whose lack is blocking PG from
being able to be used for certain use-cases that it would otherwise be
well suited for, and that's really unfortunate.

I appreciate the recent discussion and reviews of the KMS in particular,
and of the patches which have been sent enabling TDE based on the KMS
patches.  Having them be relatively independent seems to be an ongoing
concern and perhaps we should figure out a way to more clearly put them
together.  That is- the KMS patches have been posted on one thread, and
TDE PoC patches which use the KMS patches have been on another thread,
leading some to not realize that there's already been TDE PoC work done
based on the KMS patches.  Seems like it might make sense to get one
patch set which goes all the way from the KMS and includes the TDE PoC,
even if they don't all go in at once.

I'm happy to go look over the KMS patches again if that'd be helpful and
to comment on the TDE PoC.  I can also spend some time trying to improve
on each, as I've already done.  A few of the larger concerns that I have
revolve around how to store integrity information (I've tried to find a
way to make room for such information in our existing page layout and,
perhaps unsurprisingly, it's far from trivial to do so in a way that will
avoid breaking the existing page layout, or where the same set of
binaries could work on both unencrypted pages and encrypted pages with
integrity validation information, and that's a problem that we really
should consider trying to solve...), and how to automate key rotation
(one of the nice things about Bruce's approach to storing the keys is
that we're leveraging the filesystem as an index- it's easy to see how
we might extend the key-per-file approach to allow us to, say, have a
different key for every 32GB of LSN, but if we tried to put all of the
keys into a single file then we'd have to figure out an indexing
solution for it which would allow us to find the key we need to decrypt
a given page...).  I tend to agree with Bruce that we need to take
these things in steps, getting each piece implemented as we go.  Maybe
we can do that in a separate repo for a time and then bring it all
together, as a few on this thread have voiced, but there's no doubt that
this is a large project and it's hard to see how we could possibly
commit all of it at once.
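
To sketch the key-per-file idea in code, here is a minimal example of
what a per-LSN-range lookup could look like (the directory layout, file
naming, and function are hypothetical, not from the posted patches):

    /*
     * Sketch only: map an LSN to a per-range key file, assuming one data
     * key per 32GB of WAL.  The pg_cryptokeys/wal-XXXXXXXX layout is
     * hypothetical.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define KEY_RANGE_BYTES (32ULL * 1024 * 1024 * 1024)    /* 32GB per key */

    static void
    key_path_for_lsn(uint64_t lsn, char *buf, size_t buflen)
    {
        /* The filesystem acts as the index: one key file per LSN range. */
        snprintf(buf, buflen, "pg_cryptokeys/wal-%08llX",
                 (unsigned long long) (lsn / KEY_RANGE_BYTES));
    }

Finding the key for a given page is then just computing a path, which
is the filesystem-as-index property described above.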

Thanks!

Stephen

Attachment

Re: Key management with tests

From
Tom Kincaid
Date:




Thanks Stephen, Bruce and Masahiko,


> > discussions so far and the point behind the design so that everyone
> > can understand why this feature is designed in that way. To do that,
> > it might be a good start to sort the wiki page since it has data
> > encryption part, KMS, and ToDo mixed.

> I hope it's pretty clear that I'm also very much in support of both this
> effort with the KMS and of TDE in general- TDE is specifically,
> repeatedly, called out as a capability whose lack is blocking PG from
> being able to be used for certain use-cases that it would otherwise be
> well suited for, and that's really unfortunate.

It is clear you are supportive.

As you know, I share your point of view that PG adoption is suffering for certain use cases because it does not have TDE.

> I appreciate the recent discussion and reviews of the KMS in particular,
> and of the patches which have been sent enabling TDE based on the KMS
> patches.  Having them be relatively independent seems to be an ongoing
> concern and perhaps we should figure out a way to more clearly put them
> together.  That is- the KMS patches have been posted on one thread, and
> TDE PoC patches which use the KMS patches have been on another thread,
> leading some to not realize that there's already been TDE PoC work done
> based on the KMS patches.  Seems like it might make sense to get one
> patch set which goes all the way from the KMS and includes the TDE PoC,
> even if they don't all go in at once.

Sounds good, thanks Masahiko; let's see if we can get consensus on the approach for moving this forward (see below).
 

> together, as a few on this thread have voiced, but there's no doubt that
> this is a large project and it's hard to see how we could possibly
> commit all of it at once.

I propose that we meet to discuss what approach we want to use to move TDE forward.  We then start a new thread with a proposal on the approach and finalize it via community consensus. I will invite Bruce, Stephen and Masahiko to this meeting. If anybody else would like to participate in this discussion and subsequently in the effort to get TDE in PG1x, please let me know. Assuming Bruce, Stephen and Masahiko are down for this, I (or a volunteer from this meeting) will post the proposal for how we move this patch forward in another thread. Hopefully, we can get consensus on that and subsequently restart the execution of delivering this feature.





> Thanks!
>
> Stephen


--
Thomas John Kincaid

Re: Key management with tests

From
"Moon, Insung"
Date:
Dear All.

Thank you for all the opinions and discussions regarding the KMS/TDE
feature.

First of all, to get to the point of this email:
I want to participate in any way I can (review or development)
when TDE-related development is in progress.

I didn't understand the KMS well and didn't participate in its
development directly, so I haven't commented on anything so far.
Still, when TDE development starts, I would like to join the
discussions and meetings if there is anything I can do.
However, since my English is not good enough for communicating in
voice and video meetings, I will probably rarely say anything there,
but I would like to attend even if only to listen.

Also, once the wiki page and other mail threads related to TDE start,
I'll join the discussions if there is anything I can do.

Best regards.
Moon.

On Sat, Jan 30, 2021 at 10:23 PM Tom Kincaid <tomjohnkincaid@gmail.com> wrote:
>
>
>
>
>
> Thanks Stephen, Bruce and Masahiko,
>
>>
>> > discussions so far and the point behind the design so that everyone
>> > can understand why this feature is designed in that way. To do that,
>> > it might be a good start to sort the wiki page since it has data
>> > encryption part, KMS, and ToDo mixed.
>>
>> I hope it's pretty clear that I'm also very much in support of both this
>> effort with the KMS and of TDE in general- TDE is specifically,
>> repeatedly, called out as a capability whose lack is blocking PG from
>> being able to be used for certain use-cases that it would otherwise be
>> well suited for, and that's really unfortunate.
>
>
> It is clear you are supportive.
>
> As you know, I share your point of view that PG adoption is suffering for
> certain use cases because it does not have TDE.
>
>> I appreciate the recent discussion and reviews of the KMS in particular,
>> and of the patches which have been sent enabling TDE based on the KMS
>> patches.  Having them be relatively independent seems to be an ongoing
>> concern and perhaps we should figure out a way to more clearly put them
>> together.  That is- the KMS patches have been posted on one thread, and
>> TDE PoC patches which use the KMS patches have been on another thread,
>> leading some to not realize that there's already been TDE PoC work done
>> based on the KMS patches.  Seems like it might make sense to get one
>> patch set which goes all the way from the KMS and includes the TDE PoC,
>> even if they don't all go in at once.
>
>
> Sounds good, thanks Masahiko; let's see if we can get consensus on the approach
> for moving this forward (see below).
>
>>
>>
>> together, as a few on this thread have voiced, but there's no doubt that
>> this is a large project and it's hard to see how we could possibly
>> commit all of it at once.
>
>
> I propose that we meet to discuss what approach we want to use to move
> TDE forward.  We then start a new thread with a proposal on the approach
> and finalize it via community consensus. I will invite Bruce, Stephen and
> Masahiko to this meeting. If anybody else would like to participate in
> this discussion and subsequently in the effort to get TDE in PG1x, please
> let me know. Assuming Bruce, Stephen and Masahiko are down for this, I
> (or a volunteer from this meeting) will post the proposal for how we move
> this patch forward in another thread. Hopefully, we can get consensus on
> that and subsequently restart the execution of delivering this feature.
>
>
>
>
>>
>> Thanks!
>>
>> Stephen
>
>
>
> --
> Thomas John Kincaid
>



Re: Key management with tests

From
Bruce Momjian
Date:
On Fri, Jan 29, 2021 at 05:05:06PM +0900, Masahiko Sawada wrote:
> TBH I’m confused a bit about the recent situation of this patch, but I

Yes, it is easy to get confused.

> can contribute to KMS work by discussing, writing, reviewing, and
> testing the patch. Also, I can work on the data encryption part of TDE

Great.

> (we need more discussion on that though). If the community is concerned
> about the high-level design and thinks the design reviews by
> cryptography experts are still needed, we would need to do that first,
> since the data encryption part of TDE depends on KMS. As far as I

I totally agree.  While we don't need to commit the key management patch
to the tree before moving forward, we should have agreement on the key
management patch before doing more work on this.  If we can't agree on
the key management part, there is no value in working further, as I
stated in an earlier email.

> know, we have done that many times on pgsql-hackers, off-line, and in
> discussions of past proposals, but given that the community still has
> concerns, it seems that we haven’t been able to share enough of the
> details of the discussion that led to the design decisions, or the
> design is still not good. Honestly, I’m not sure how
> this feature can get consensus. But maybe we would need to have a

Yes, I am also confused.

> break from refining the patch now and we need to marshal the
> discussions so far and the point behind the design so that everyone
> can understand why this feature is designed in that way. To do that,
> it might be a good start to sort the wiki page since it has data
> encryption part, KMS, and ToDo mixed.

What I ended up doing is moving the majority of the
non-data-encryption part of the wiki into the patch, either in docs or
README files, since people asked for more of this in the patch, and
having the information in two places is confusing.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Bruce Momjian
Date:
On Fri, Jan 29, 2021 at 05:40:37PM -0500, Stephen Frost wrote:
> I hope it's pretty clear that I'm also very much in support of both this
> effort with the KMS and of TDE in general- TDE is specifically,

Yes, thanks.  I know we have privately talked about this recently, but
it is nice to have it in public like this.

> repeatedly, called out as a capability whose lack is blocking PG from
> being able to be used for certain use-cases that it would otherwise be
> well suited for, and that's really unfortunate.

So, below, I am going to copy two doc paragraphs from the patch:

  The purpose of cluster file encryption is to prevent users with read
  access to the directories used to store database files and write-ahead
  log files from being able to access the data stored in those files.
  For example, when using cluster file encryption, users who have read
  access to the cluster directories for backup purposes will not be able
  to decrypt the data stored in these files.  It also protects against
  decrypted data access after media theft.

  File system write access can allow for unauthorized file system data
  decryption if the writes can be used to weaken the system's security
  and this weakened system is later supplied with externally-stored keys.
  This also does not protect from users who have read access to system
  memory.  This also does not detect or protect against users with write
  access from removing or modifying database files.

Given what I said above, is the value of this feature for compliance, or
for actual additional security?  If it is just compliance, are we willing
to add all of this code just for that, even if it has limited security
value?  We should answer this question now, and if we don't want it,
let's document that so users know and can consider alternatives.

FYI, I don't think we can detect or protect against writers modifying
the data files --- even if we could do it on a block level, they could
remove trailing pages (might cause index lookup failures) or copy
pages from other tables at the same offset.  Therefore, I think we can
only offer viewing security, not modification detection/prevention.

> I appreciate the recent discussion and reviews of the KMS in particular,
> and of the patches which have been sent enabling TDE based on the KMS
> patches.  Having them be relatively independent seems to be an ongoing

I was thinking some more and I have received productive feedback from at
least eight people on the key management patch, which is very good.

> concern and perhaps we should figure out a way to more clearly put them
> together.  That is- the KMS patches have been posted on one thread, and
> TDE PoC patches which use the KMS patches have been on another thread,
> leading some to not realize that there's already been TDE PoC work done
> based on the KMS patches.  Seems like it might make sense to get one
> patch set which goes all the way from the KMS and includes the TDE PoC,
> even if they don't all go in at once.

Uh, it is worse than that.  Some people saw comments about the TDE PoC
patch (e.g., buffer pins) and thought they were related to the KMS
patch, so they thought the KMS patch wasn't ready.  Now, I am not saying
the KMS patch is ready, but comments on the TDE PoC patch are unrelated
to the KMS patch being ready.

I think the TDE PoC was a big positive because it showed the KMS patch
being used for the actual use-case we are planning, so it was truly a
proof-of-concept.

> I'm happy to go look over the KMS patches again if that'd be helpful and
> to comment on the TDE PoC.  I can also spend some time trying to improve

I think we eventually need a full review of the TDE PoC, combined with
the Cybertec patch, and the wiki, to get them all aligned.  However, as
I said already, let's get the KMS patch approved, even if we don't apply
it now, so we know we are on an approved foundation.

> on each, as I've already done.  A few of the larger concerns that I have
> revolve around how to store integrity information (I've tried to find a
> way to make room for such information in our existing page layout and,
> perhaps unsurprisingly, it's far from trivial to do so in a way that will
> avoid breaking the existing page layout, or where the same set of
> binaries could work on both unencrypted pages and encrypted pages with
> integrity validation information, and that's a problem that we really

As stated above, I think we only need a byte or two for the hint bit
counter (used in the IV), as I don't think the GCM verification bytes
will add any additional security, and I bet we can find a byte or two. 
We do need a separate discussion on this, either here or privately.

> should consider trying to solve...), and how to automate key rotation
> (one of the nice things about Bruce's approach to storing the keys is
> that we're leveraging the filesystem as an index- it's easy to see how
> we might extend the key-per-file approach to allow us to, say, have a
> different key for every 32GB of LSN, but if we tried to put all of the
> keys into a single file then we'd have to figure out an indexing
> solution for it which would allow us to find the key we need to decrypt
> a given page...).  I tend to agree with Bruce that we need to take

Yeah, yuck on that plan.  I was very happy how the per-version directory
worked with scripts that needed to store matching state.

> these things in steps, getting each piece implemented as we go.  Maybe
> we can do that in a separate repo for a time and then bring it all
> together, as a few on this thread have voiced, but there's no doubt that
> this is a large project and it's hard to see how we could possibly
> commit all of it at once.

I was putting stuff in a git tree/URL;  you can see it here:

   https://github.com/postgres/postgres/compare/master...bmomjian:key.diff
   https://github.com/postgres/postgres/compare/master...bmomjian:key.patch
   https://github.com/postgres/postgres/compare/master...bmomjian:key

However, people wanted persistent patches attached, so I started doing that.
Attached is the current patch set.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment

Re: Key management with tests

From
Bruce Momjian
Date:
On Sat, Jan 30, 2021 at 08:23:11AM -0500, Tom Kincaid wrote:
> I propose that we meet to discuss what approach we want to use to move TDE
> forward.  We then start a new thread with a proposal on the approach
> and finalize it via community consensus. I will invite Bruce, Stephen and
> Masahiko to this meeting. If anybody else would like to participate in this
> discussion and subsequently in the effort to get TDE in PG1x, please let me
> know. Assuming Bruce, Stephen and Masahiko are down for this, I (or a volunteer
> from this meeting) will post the proposal for how we move this patch forward in
> another thread. Hopefully, we can get consensus on that and subsequently
> restart the execution of delivering this feature.

We got complaints that decisions were not publicly discussed, or that
the discussions were too long, so I am not sure this helps.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Fri, Jan 29, 2021 at 05:40:37PM -0500, Stephen Frost wrote:
> > I hope it's pretty clear that I'm also very much in support of both this
> > effort with the KMS and of TDE in general- TDE is specifically,
>
> Yes, thanks.  I know we have privately talked about this recently, but
> it is nice to have it in public like this.

Certainly happy to lend my support and to spend some time working on
this to move it forward.

> > repeatedly, called out as a capability whose lack is blocking PG from
> > being able to be used for certain use-cases that it would otherwise be
> > well suited for, and that's really unfortunate.
>
> So, below, I am going to copy two doc paragraphs from the patch:
>
>   The purpose of cluster file encryption is to prevent users with read
>   access to the directories used to store database files and write-ahead
>   log files from being able to access the data stored in those files.
>   For example, when using cluster file encryption, users who have read
>   access to the cluster directories for backup purposes will not be able
>   to decrypt the data stored in these files.  It also protects against
>   decrypted data access after media theft.

That's one valid use-case and it particularly makes sense to consider,
now that we support group read-access to the data cluster.  The last
line seems a bit unclear- I would update it to say:

Cluster file encryption also provides data-at-rest security, protecting
users from data loss should the physical media on which the cluster is
stored be stolen, improperly deprovisioned (not wiped or destroyed), or
otherwise end up in the hands of an attacker.

>   File system write access can allow for unauthorized file system data
>   decryption if the writes can be used to weaken the system's security
>   and this weakened system is later supplied with externally-stored keys.

This isn't very clear as to exactly what the concern is or how an
attacker would be able to thwart the system if they had write access to
it.  An attacker with write access could possibly attempt to replace the
existing keys, but with the key wrapping that we're using, that should
result in just a decryption failure (unless, of course, the attacker has
the actual KEK that was used, but that's not terribly interesting to
worry about since then they could just go access the files directly).

Until and unless we solve the issue around storing the GCM tags for each
page, we will have the risk that an attacker could modify a page in a
manner that we wouldn't detect.  This is the biggest concern that I have
currently with the existing TDE patch sets.

There are two options that I see around how to address that issue- either
we arrange to create space in the page for the tag, such as by making
the 'special' space on a page a bit bigger and making sure that
everything understands that, or we'll need to add another fork in which
we store the tags (and possibly other TDE/encryption related
information).  If we go with a fork then it should be possible to do WAL
streaming from an unencrypted cluster to an encrypted one, which would
be pretty neat, but it means another fork and another page that has to
be read/written every time we modify a page.  Getting some input into
the trade-offs here would be really helpful.  I don't think it's really
reasonable to go out with TDE without having figured out the integrity
side.  Certainly, when I review things like NIST 800-53, it's very clear
that the requirement is for both confidentiality *and* integrity.
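
For reference, a minimal sketch of producing that per-page tag with
OpenSSL (error handling omitted; the key and IV here are placeholders,
not the patch's actual key derivation):

    #include <openssl/evp.h>

    /*
     * Encrypt one 8kB page with AES-256-GCM and fetch the 16-byte
     * authentication tag that would have to be stored somewhere,
     * either in the page's special space or in a separate fork.
     */
    static void
    encrypt_page(const unsigned char *key, const unsigned char iv[12],
                 const unsigned char *page, unsigned char *out,
                 unsigned char tag[16])
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int         len;

        EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);
        EVP_EncryptUpdate(ctx, out, &len, page, 8192);
        EVP_EncryptFinal_ex(ctx, out + len, &len);
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
        EVP_CIPHER_CTX_free(ctx);
    }

Whichever layout is chosen, those 16 bytes per page are what need a
home.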

>   This also does not protect from users who have read access to system
>   memory.  This also does not detect or protect against users with write
>   access from removing or modifying database files.

The last seems a bit obvious, but the first sentence quoted above is
important to make clear.  I might even say:

All of the pages in memory and all of the keys which are used for the
encryption and decryption are stored in the clear in memory and
therefore an attacker who is able to read the memory allocated by
PostgreSQL would be able to decrypt the entire cluster.

> Given what I said above, is the value of this feature for compliance, or
> for actual additional security?  If it is just compliance, are we willing
> to add all of this code just for that, even if it has limited security
> value?  We should answer this question now, and if we don't want it,
> let's document that so users know and can consider alternatives.

The feature is for both compliance and additional security.  While there
are other ways to achieve data-at-rest encryption, they are not always
available, for a variety of reasons.

> FYI, I don't think we can detect or protect against writers modifying
> the data files --- even if we could do it on a block level, they could
> remove trailing pages (might cause index lookup failures) or copy
> pages from other tables at the same offset.  Therefore, I think we can
> only offer viewing security, not modification detection/prevention.

Protecting against file modification isn't about finding some way to
make it so that an attacker isn't able to modify the files, it's about
detecting the case where an unauthorized modification has happened.
Clearly if an attacker has gained write access to the system then we
can't protect against the attacker using the access they've gained, but
we can in most cases detect it and that's what we should be doing.  It
would be really unfortunate to end up with a solution here that only
provides confidentiality and doesn't address integrity at all, and I
don't really think it's *that* hard to do both.  That said, if we must
work at this in pieces and we can get agreement to handle
confidentiality initially and then add integrity later, that might be
reasonable.

> > I appreciate the recent discussion and reviews of the KMS in particular,
> > and of the patches which have been sent enabling TDE based on the KMS
> > patches.  Having them be relatively independent seems to be an ongoing
>
> I was thinking some more and I have received productive feedback from at
> least eight people on the key management patch, which is very good.

Agreed.

> > concern and perhaps we should figure out a way to more clearly put them
> > together.  That is- the KMS patches have been posted on one thread, and
> > TDE PoC patches which use the KMS patches have been on another thread,
> > leading some to not realize that there's already been TDE PoC work done
> > based on the KMS patches.  Seems like it might make sense to get one
> > patch set which goes all the way from the KMS and includes the TDE PoC,
> > even if they don't all go in at once.
>
> Uh, it is worse than that.  Some people saw comments about the TDE PoC
> patch (e.g., buffer pins) and thought they were related to the KMS
> patch, so they thought the KMS patch wasn't ready.  Now, I am not saying
> the KMS patch is ready, but comments on the TDE PoC patch are unrelated
> to the KMS patch being ready.

I do agree with that and that it can lend to some confusion.  I'm not
sure what the right solution there is except to continue to try and work
with those who are interested and to clarify the separation.

> I think the TDE PoC was a big positive because it showed the KMS patch
> being used for the actual use-case we are planning, so it was truly a
> proof-of-concept.

Agreed.

> > I'm happy to go look over the KMS patches again if that'd be helpful and
> > to comment on the TDE PoC.  I can also spend some time trying to improve
>
> I think we eventually need a full review of the TDE PoC, combined with
> the Cybertec patch, and the wiki, to get them all aligned.  However, as
> I said already, let's get the KMS patch approved, even if we don't apply
> it now, so we know we are on an approved foundation.

While the Cybertec patch is interesting, I'd really like to see
something that's a bit less invasive when it comes to how temporary
files are handled.  In particular, I think it'd be possible to have an
API that's very similar to the existing one for serial reading and
writing of files which wouldn't require nearly as many changes to things
like reorderbuffer.c.  I also believe there's some things we could do to
avoid having to modify quite as many places when it comes to LSN
assignment, so the base patch isn't as big.

> > on each, as I've already done.  A few of the larger concerns that I have
> > revolve around how to store integrity information (I've tried to find a
> > way to make room for such information in our existing page layout and,
> > perhaps unsurprisingly, it's far from trivial to do so in a way that will
> > avoid breaking the existing page layout, or where the same set of
> > binaries could work on both unencrypted pages and encrypted pages with
> > integrity validation information, and that's a problem that we really
>
> As stated above, I think we only need a byte or two for the hint bit
> counter (used in the IV), as I don't think the GCM verification bytes
> will add any additional security, and I bet we can find a byte or two.
> We do need a separate discussion on this, either here or privately.

I have to disagree here- the GCM tag adds integrity which is really
quite important.  Happy to chat about it independently, of course.

> > should consider trying to solve...), and how to automate key rotation
> > (one of the nice things about Bruce's approach to storing the keys is
> > that we're leveraging the filesystem as an index- it's easy to see how
> > we might extend the key-per-file approach to allow us to, say, have a
> > different key for every 32GB of LSN, but if we tried to put all of the
> > keys into a single file then we'd have to figure out an indexing
> > solution for it which would allow us to find the key we need to decrypt
> > a given page...).  I tend to agree with Bruce that we need to take
>
> Yeah, yuck on that plan.  I was very happy how the per-version directory
> worked with scripts that needed to store matching state.

I don't know that it's going to ultimately be the best answer, as we're
essentially using the filesystem as an index, as I mentioned above, but,
yeah, trying to do all of that ourselves during WAL replay doesn't seem
like it would be fun to try and figure out.  This is an area that I
would think we'd be able to improve on in the future too- if someone
wants to spend the time coming up with a single-file format that is
indexed in some manner and still provides the guarantees that we need,
we could very likely teach pg_upgrade how to handle that and the data
set we're talking about here is quite small, even if we've got a bunch
of key rotation that's happened.
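
As a rough illustration of what such an indexed single-file format
might look like (all names and sizes here are hypothetical; nothing
like this exists in the posted patches):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical on-disk entry: keys sorted by the first LSN covered. */
    typedef struct KeyFileEntry
    {
        uint64_t    start_lsn;          /* first WAL position this key covers */
        uint8_t     wrapped_key[48];    /* data key, wrapped by the KEK */
    } KeyFileEntry;

    /* Return the entry covering 'lsn': the greatest start_lsn <= lsn. */
    static const KeyFileEntry *
    key_for_lsn(const KeyFileEntry *entries, int nentries, uint64_t lsn)
    {
        int         lo = 0,
                    hi = nentries - 1,
                    best = -1;

        while (lo <= hi)
        {
            int         mid = (lo + hi) / 2;

            if (entries[mid].start_lsn <= lsn)
            {
                best = mid;
                lo = mid + 1;
            }
            else
                hi = mid - 1;
        }
        return (best >= 0) ? &entries[best] : NULL;
    }

pg_upgrade could then, in principle, migrate the key-per-file directory
into such a file without changing any of the guarantees.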

> > these things in steps, getting each piece implemented as we go.  Maybe
> > we can do that in a separate repo for a time and then bring it all
> > together, as a few on this thread have voiced, but there's no doubt that
> > this is a large project and it's hard to see how we could possibly
> > commit all of it at once.
>
> I was putting stuff in a git tree/URL;  you can see it here:
>
>    https://github.com/postgres/postgres/compare/master...bmomjian:key.diff
>    https://github.com/postgres/postgres/compare/master...bmomjian:key.patch
>    https://github.com/postgres/postgres/compare/master...bmomjian:key
>
> However, people wanted persistent patches attached, so I started doing that.
> Attached is the current patch set.

Doing both seems likely to be the best option and hopefully will help
everyone see the complete picture.

Thanks,

Stephen

Attachment

Re: Key management with tests

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Sat, Jan 30, 2021 at 08:23:11AM -0500, Tom Kincaid wrote:
> > I propose that we meet to discuss what approach we want to use to move TDE
> > forward.  We then start a new thread with a proposal on the approach
> > and finalize it via community consensus. I will invite Bruce, Stephen and
> > Masahiko to this meeting. If anybody else would like to participate in this
> > discussion and subsequently in the effort to get TDE in PG1x, please let me
> > know. Assuming Bruce, Stephen and Masahiko are down for this, I (or a volunteer
> > from this meeting) will post the proposal for how we move this patch forward in
> > another thread. Hopefully, we can get consensus on that and subsequently
> > restart the execution of delivering this feature.
>
> We got complaints that decisions were not publicly discussed, or that
> the discussions were too long, so I am not sure this helps.

If the notes are published afterwards as an explanation of why certain
choices were made, I suspect it'd be reasonably well received.  The
concern about back-room discussions is more that decisions are made
without explanation as to why; provided we avoid that, I believe they
can be helpful.

So, +1 for my part to have the conversation.

Thanks,

Stephen

Attachment

Re: Key management with tests

From
Bruce Momjian
Date:
On Mon, Feb  1, 2021 at 06:34:53PM -0500, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Sat, Jan 30, 2021 at 08:23:11AM -0500, Tom Kincaid wrote:
> > > I propose that we meet to discuss what approach we want to use to move TDE
> > > forward.  We then start a new thread with a proposal on the approach
> > > and finalize it via community consensus. I will invite Bruce, Stephen and
> > > Masahiko to this meeting. If anybody else would like to participate in this
> > > discussion and subsequently in the effort to get TDE in PG1x, please let me
> > > know. Assuming Bruce, Stephen and Masahiko are down for this, I (or a volunteer
> > > from this meeting) will post the proposal for how we move this patch forward in
> > > another thread. Hopefully, we can get consensus on that and subsequently
> > > restart the execution of delivering this feature.
> > 
> > We got complaints that decisions were not publicly discussed, or that
> > the discussions were too long, so I am not sure this helps.
> 
> If the notes are published afterwards as an explanation of why certain
> choices were made, I suspect it'd be reasonably well received.  The
> concern about back-room discussions is more that decisions are made
> without explanation as to why; provided we avoid that, I believe they
> can be helpful.

Well, I thought that was what the wiki was, but I guess not.  I did
remove some of the decision logic recently since we had made a final
decision.  However, most of the questions were not covered on the wiki,
since, as I said, everyone comes with a different need for details.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Bruce Momjian
Date:
On Mon, Feb  1, 2021 at 06:31:32PM -0500, Stephen Frost wrote:
> * Bruce Momjian (bruce@momjian.us) wrote:
> >   The purpose of cluster file encryption is to prevent users with read
> >   access to the directories used to store database files and write-ahead
> >   log files from being able to access the data stored in those files.
> >   For example, when using cluster file encryption, users who have read
> >   access to the cluster directories for backup purposes will not be able
> >   to decrypt the data stored in these files.  It also protects against
> >   decrypted data access after media theft.
> 
> That's one valid use-case and it particularly makes sense to consider,
> now that we support group read-access to the data cluster.  The last

Do enough people use group read-access to be useful?

> line seems a bit unclear- I would update it to say:
> Cluster file encryption also provides data-at-rest security, protecting
> users from data loss should the physical media on which the cluster is
> stored be stolen, improperly deprovisioned (not wiped or destroyed), or
> otherwise end up in the hands of an attacker.

I have split the section into three paragraphs, trimmed down some of the
suggested text, and added it.  Full version below.

> >   File system write access can allow for unauthorized file system data
> >   decryption if the writes can be used to weaken the system's security
> >   and this weakened system is later supplied with externally-stored keys.
> 
> This isn't very clear as to exactly what the concern is or how an
> attacker would be able to thwart the system if they had write access to
> it.  An attacker with write access could possibly attempt to replace the
> existing keys, but with the key wrapping that we're using, that should
> result in just a decryption failure (unless, of course, the attacker has
> the actual KEK that was used, but that's not terribly interesting to
> worry about since then they could just go access the files directly).

Uh, well, they could modify postgresql.conf to change the script to save
the secret returned by the script before returning it to the PG server. 
We could require postgresql.conf to be somewhere secure, but then how do
we know that is secure?  I just don't see a clean solution here, but the
idea that you write and then wait for the key to show up seems like a
very valid way of attack, and it took me a while to be able to
articulate it.

> Until and unless we solve the issue around storing the GCM tags for each
> page, we will have the risk that an attacker could modify a page in a
> manner that we wouldn't detect.  This is the biggest concern that I have
> currently with the existing TDE patch sets.

Well, GCM certainly can detect page modification, but it can't detect
removing pages from the end of the table, and, since the nonce is
LSN/pageno, an attacker could copy a page from one table into another
table at the same offset, particularly with partitioning, where the
tables have the same columns.  We might be able to protect against the
latter with some kind of table-id in the nonce, but I don't see how table
truncation can be detected without adding a whole lot of overhead and
complexity.  And if we can't protect against those two, why bother with
detecting single-page modifications?  We have to do a full job for it to
be useful.
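
As a sketch of what putting a table-id in the nonce might look like
(the layout is hypothetical; note that GCM IVs other than 12 bytes are
run through GHASH first, so with OpenSSL this would need
EVP_CTRL_GCM_SET_IVLEN):

    #include <stdint.h>
    #include <string.h>

    /*
     * Illustrative only: a 16-byte GCM IV built from LSN, block number,
     * and a table identifier, so the same offset in two different tables
     * gets a different nonce.  The current patch uses LSN/pageno alone.
     */
    static void
    build_page_iv(uint64_t lsn, uint32_t blkno, uint32_t table_id,
                  unsigned char iv[16])
    {
        memcpy(iv, &lsn, 8);            /* bytes 0-7:   page LSN */
        memcpy(iv + 8, &blkno, 4);      /* bytes 8-11:  block number */
        memcpy(iv + 12, &table_id, 4);  /* bytes 12-15: table id */
    }

That would close the copy-between-tables hole, but, as I said, it does
nothing for truncation.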

> There are two options that I see around how to address that issue- either
> we arrange to create space in the page for the tag, such as by making
> the 'special' space on a page a bit bigger and making sure that
> everything understands that, or we'll need to add another fork in which
> we store the tags (and possibly other TDE/encryption related
> information).  If we go with a fork then it should be possible to do WAL
> streaming from an unencrypted cluster to an encrypted one, which would
> be pretty neat, but it means another fork and another page that has to
> be read/written every time we modify a page.  Getting some input into
> the trade-offs here would be really helpful.  I don't think it's really
> reasonable to go out with TDE without having figured out the integrity
> side.  Certainly, when I review things like NIST 800-53, it's very clear
> that the requirement is for both confidentiality *and* integrity.

Wow, well, if they are both required, and we can't do both, is it
valuable to do just one?  Yes, we can do something later, but what if we
have no idea how to implement the second part?  Your fork idea above
might need to store some table-id used for the nonce (to prevent copying
from another table) and the number of pages in the table, which fixes
the integrity check issue, but adds a lot of complexity and perhaps
overhead.

> >   This also does not protect from users who have read access to system
> >   memory.  This also does not detect or protect against users with write
> >   access from removing or modifying database files.
> 
> The last seems a bit obvious, but the first sentence quoted above is
> important to make clear.  I might even say:
> 
> All of the pages in memory and all of the keys which are used for the
> encryption and decryption are stored in the clear in memory and
> therefore an attacker who is able to read the memory allocated by
> PostgreSQL would be able to decrypt the entire cluster.

Same as above, full version below.

> > Given what I said above, is the value of this feature for compliance, or
> > for actual additional security?  If it is just compliance, are we willing
> > to add all of this code just for that, even if it has limited security
> > value?  We should answer this question now, and if we don't want it,
> > let's document that so users know and can consider alternatives.
> 
> The feature is for both compliance and additional security.  While there
> are other ways to achieve data-at-rest encryption, they are not always
> available, for a variety of reasons.

True.

> > FYI, I don't think we can detect or protect against writers modifying
> > the data files --- even if we could do it on a block level, they could
> > remove trailing pages (might cause index lookup failures) or copy
> > pages from other tables at the same offset.  Therefore, I think we can
> > only offer viewing security, not modification detection/prevention.
> 
> Protecting against file modification isn't about finding some way to
> make it so that an attacker isn't able to modify the files, it's about
> detecting the case where an unauthorized modification has happened.
> Clearly if an attacker has gained write access to the system then we
> can't protect against the attacker using the access they've gained, but
> we can in most cases detect it and that's what we should be doing.  It
> would be really unfortunate to end up with a solution here that only
> provides confidentiality and doesn't address integrity at all, and I
> don't really think it's *that* hard to do both.  That said, if we must
> work at this in pieces and we can get agreement to handle
> confidentiality initially and then add integrity later, that might be
> reasonable.

See above.

> > > I'm happy to go look over the KMS patches again if that'd be helpful and
> > > to comment on the TDE PoC.  I can also spend some time trying to improve
> > 
> > I think we eventually need a full review of the TDE PoC, combined with
> > the Cybertec patch, and the wiki, to get them all aligned.  However, as
> > I said already, let's get the KMS patch approved, even if we don't apply
> > it now, so we know we are on an approved foundation.
> 
> While the Cybertec patch is interesting, I'd really like to see
> something that's a bit less invasive when it comes to how temporary
> files are handled.  In particular, I think it'd be possible to have an
> API that's very similar to the existing one for serial reading and
> writing of files which wouldn't require nearly as many changes to things
> like reorderbuffer.c.  I also believe there's some things we could do to
> avoid having to modify quite as many places when it comes to LSN
> assignment, so the base patch isn't as big.

Yes, I think we would get the best ideas from all patches.

> > > on each, as I've already done.  A few of the larger concerns that I have
> > > revolve around how to store integrity information (I've tried to find a
> > > way to make room for such information in our existing page layout and,
> > > perhaps unsurprisingly, it's far from trivial to do so in a way that will
> > > avoid breaking the existing page layout, or where the same set of
> > > binaries could work on both unencrypted pages and encrypted pages with
> > > integrity validation information, and that's a problem that we really
> > 
> > As stated above, I think we only need a byte or two for the hint bit
> > counter (used in the IV), as I don't think the GCM verification bytes
> > will add any additional security, and I bet we can find a byte or two. 
> > We do need a separate discussion on this, either here or privately.
> 
> I have to disagree here- the GCM tag adds integrity which is really
> quite important.  Happy to chat about it independently, of course.

Yeah, see above.

> > > should consider trying to solve...), and how to automate key rotation
> > > (one of the nice things about Bruce's approach to storing the keys is
> > > that we're leveraging the filesystem as an index- it's easy to see how
> > > we might extend the key-per-file approach to allow us to, say, have a
> > > different key for every 32GB of LSN, but if we tried to put all of the
> > > keys into a single file then we'd have to figure out an indexing
> > > solution for it which would allow us to find the key we need to decrypt
> > > a given page...).  I tend to agree with Bruce that we need to take
> > 
> > Yeah, yuck on that plan.  I was very happy how the per-version directory
> > worked with scripts that needed to store matching state.
> 
> I don't know that it's going to ultimately be the best answer, as we're
> essentially using the filesystem as an index, as I mentioned above, but,
> yeah, trying to do all of that ourselves during WAL replay doesn't seem
> like it would be fun to try and figure out.  This is an area that I
> would think we'd be able to improve on in the future too- if someone
> wants to spend the time coming up with a single-file format that is
> indexed in some manner and still provides the guarantees that we need,
> we could very likely teach pg_upgrade how to handle that and the data
> set we're talking about here is quite small, even if we've got a bunch
> of key rotation that's happened.

I thought we were going to use failover to a standby as our data key
rotation method.

Here is the full doc part you wanted improved:

  The purpose of cluster file encryption is to prevent users with read
  access to the directories used to store database files and write-ahead
  log files from being able to access the data stored in those files.
  For example, when using cluster file encryption, users who have read
  access to the cluster directories for backup purposes will not be able
  to decrypt the data stored in these files.  It also provides data-at-rest
  security, protecting users from data loss should the physical storage
  media be stolen or improperly erased before disposal.

  File system write access can allow for unauthorized file system data
  decryption if the writes can be used to weaken the system's security
  and this weakened system is later supplied with externally-stored keys.
  This also does not always detect if users with write access remove or
  modify database files.

  This also does not protect from users who have read access to system
  memory — all in-memory data pages and data encryption keys are
  stored unencrypted in memory, so an attacker who is able to read the
  PostgreSQL process's memory can decrypt the entire cluster.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Bruce Momjian
Date:
On Mon, Feb  1, 2021 at 07:47:57PM -0500, Bruce Momjian wrote:
> On Mon, Feb  1, 2021 at 06:31:32PM -0500, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > >   The purpose of cluster file encryption is to prevent users with read
> > >   access to the directories used to store database files and write-ahead
> > >   log files from being able to access the data stored in those files.
> > >   For example, when using cluster file encryption, users who have read
> > >   access to the cluster directories for backup purposes will not be able
> > >   to decrypt the data stored in these files.  It also protects against
> > >   decrypted data access after media theft.
> > 
> > That's one valid use-case and it particularly makes sense to consider,
> > now that we support group read-access to the data cluster.  The last
> 
> Do enough people use group read-access to be useful?

I am thinking group read-access might be a requirement for cluster file
encryption to be effective.

> > line seems a bit unclear- I would update it to say:
> > Cluster file encryption also provides data-at-rest security, protecting
> > users from data loss should the physical media on which the cluster is
> > stored be stolen, improperly deprovisioned (not wiped or destroyed), or
> > otherwise end up in the hands of an attacker.
> 
> I have split the section into three paragraphs, trimmed down some of the
> suggested text, and added it.  Full version below.

Here is an updated doc description of memory reading:

    This also does not protect against users who have read access to
    database process memory — all in-memory data pages and data
    encryption keys are stored unencrypted in memory, so an attacker who
-->    is able to read memory can decrypt the entire cluster.  The Postgres
-->    operating system user and the operating system administrator, e.g.,
-->    the <literal>root</literal> user, have such access.

> > >   File system write access can allow for unauthorized file system data
> > >   decryption if the writes can be used to weaken the system's security
> > >   and this weakened system is later supplied with externally-stored keys.
> > 
> > This isn't very clear as to exactly what the concern is or how an
> > attacker would be able to thwart the system if they had write access to
> > it.  An attacker with write access could possibly attempt to replace the
> > existing keys, but with the key wrapping that we're using, that should
> > result in just a decryption failure (unless, of course, the attacker has
> > the actual KEK that was used, but that's not terribly interesting to
> > worry about since then they could just go access the files directly).
> 
> Uh, well, they could modify postgresql.conf to change the script to save
> the secret returned by the script before returning it to the PG server. 
> We could require postgresql.conf to be somewhere secure, but then how do
> we know that is secure?  I just don't see a clean solution here, but the
> idea that you write and then wait for the key to show up seems like a
> very valid way of attack, and it took me a while to be able to
> articulate it.

Let's suppose you lock down your cluster --- the non-PGDATA files are
owned by root, postgresql.conf and pg_hba.conf are moved out of PGDATA
and are not writable by the database OS user, or we have the PGDATA
directory on another server, so the adversary can only write to the
remote PGDATA directory.

What can they do?  Well, they can't modify pg_proc to add a shared
library since pg_proc is encrypted, so we have to focus on files needed
before encryption starts or files that can't be easily encrypted.  They
could create postgresql.auto.conf in PGDATA, and modify
cluster_key_command to capture the key, or they could modify preload
libraries or archive command to call a command to read memory as the PG
OS user and write the key out somewhere, or use the key to rewrite the
database files --- those wouldn't even need a database restart, just a
reload.
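
To make that concrete, the kind of one-line change being described
could look like this in a writable postgresql.auto.conf (the command
and paths are illustrative, not the patch's actual syntax):

    # set by the administrator
    cluster_key_command = '/usr/local/bin/get_cluster_key'

    # after the attack: the same command, wrapped to capture the key
    cluster_key_command = 'sh -c "/usr/local/bin/get_cluster_key | tee /tmp/.k"'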

They could also modify pg_xact files so that, even though the heap/index
files are encrypted, how the contents of those files are interpreted
would change.

In summary, to detect malicious user writes, you would need to protect
the files used before encryption starts (root owned or owned by another
user?), and encrypt all files after encryption starts --- any other
approach would probably leave open attack vectors, and I don't think
there is sufficient community desire to add such boundaries.

How do other database systems guarantee to detect malicious writes?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Mon, Feb  1, 2021 at 07:47:57PM -0500, Bruce Momjian wrote:
> > On Mon, Feb  1, 2021 at 06:31:32PM -0500, Stephen Frost wrote:
> > > * Bruce Momjian (bruce@momjian.us) wrote:
> > > >   The purpose of cluster file encryption is to prevent users with read
> > > >   access to the directories used to store database files and write-ahead
> > > >   log files from being able to access the data stored in those files.
> > > >   For example, when using cluster file encryption, users who have read
> > > >   access to the cluster directories for backup purposes will not be able
> > > >   to decrypt the data stored in these files.  It also protects against
> > > >   decrypted data access after media theft.
> > >
> > > That's one valid use-case and it particularly makes sense to consider,
> > > now that we support group read-access to the data cluster.  The last
> >
> > Do enough people use group read-access to be useful?
>
> I am thinking group read-access might be a requirement for cluster file
> encryption to be effective.

People certainly do use group read-access, but I don't see that as being
a requirement for cluster file encryption to be effective, it's just one
thing TDE can address, among others, as discussed.

> > > line seems a bit unclear- I would update it to say:
> > > Cluster file encryption also provides data-at-rest security, protecting
> > > users from data loss should the physical media on which the cluster is
> > > stored be stolen, improperly deprovisioned (not wiped or destroyed), or
> > > otherwise ends up in the hands of an attacker.
> >
> > I have split the section into three paragraphs, trimmed down some of the
> > suggested text, and added it.  Full version below.
>
> Here is an updated doc description of memory reading:
>
>     This also does not protect against users who have read access to
>     database process memory — all in-memory data pages and data
>     encryption keys are stored unencrypted in memory, so an attacker who
> -->    is able to read memory can decrypt the entire cluster.  The Postgres
> -->    operating system user and the operating system administrator, e.g.,
> -->    the <literal>root</literal> user, have such access.

That's helpful, +1.

> > > >   File system write access can allow for unauthorized file system data
> > > >   decryption if the writes can be used to weaken the system's security
> > > >   and this weakened system is later supplied with externally-stored keys.
> > >
> > > This isn't very clear as to exactly what the concern is or how an
> > > attacker would be able to thwart the system if they had write access to
> > > it.  An attacker with write access could possibly attempt to replace the
> > > existing keys, but with the key wrapping that we're using, that should
> > > result in just a decryption failure (unless, of course, the attacker has
> > > the actual KEK that was used, but that's not terribly interesting to
> > > worry about since then they could just go access the files directly).
> >
> > Uh, well, they could modify postgresql.conf to change the script to save
> > the secret returned by the script before returning it to the PG server.
> > We could require postgresql.conf to be somewhere secure, but then how do
> > we know that is secure?  I just don't see a clean solution here, but the
> > idea that you write and then wait for the key to show up seems like a
> > very plausible attack vector, and it took me a while to be able to
> > articulate it.

postgresql.conf isn't always writable by the postgres user, though
postgresql.auto.conf is likely to always be.  I'm not sure how much of a
concern that is, but if we wanted to take steps to explicitly address
this issue, we could have some kind of 'secure' postgresql.conf file
which we would encourage users to make owned by root and whose values
wouldn't be allowed to be overridden once set.
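
As a rough illustration, the server could refuse to read such a file
unless it is root-owned and not writable by anyone else.  A minimal
sketch, with a hypothetical helper name:

    #include <stdbool.h>
    #include <sys/stat.h>

    /*
     * Illustrative sketch only: trust a "secure" configuration file
     * only if it is owned by root and not group/other writable.
     */
    static bool
    config_file_is_secure(const char *path)
    {
        struct stat st;

        if (stat(path, &st) != 0)
            return false;
        if (st.st_uid != 0)
            return false;       /* must be owned by root */
        if (st.st_mode & (S_IWGRP | S_IWOTH))
            return false;       /* no group/other write access */
        return true;
    }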

> Let's suppose you lock down your cluster --- the non-PGDATA files are
> owned by root, postgresql.conf and pg_hba.conf are moved out of PGDATA
> and are not writable by the database OS user, or we have the PGDATA
> directory on another server, so the adversary can only write to the
> remote PGDATA directory.
>
> What can they do?  Well, they can't modify pg_proc to add a shared
> library since pg_proc is encrypted, so we have to focus on files needed
> before encryption starts or files that can't be easily encrypted.

This isn't accurate- just because it's encrypted doesn't mean they can't
modify it.  That's exactly why integrity is important, because an
attacker absolutely could modify the files directly and potentially
exploit the system through those modifications.
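
To illustrate why: with an authenticated mode such as AES-GCM, a
modified page fails to decrypt instead of silently yielding altered
plaintext.  A minimal sketch using OpenSSL's EVP API (names are
hypothetical; this is a sketch, not the patch's implementation):

    #include <openssl/evp.h>

    /*
     * Illustrative sketch only: decrypt one AES-256-GCM-encrypted
     * page.  Returns 1 if the page is intact, or 0 if the ciphertext
     * or tag was modified, because tag verification fails in
     * EVP_DecryptFinal_ex().
     */
    static int
    decrypt_page_verified(const unsigned char *key, const unsigned char *iv,
                          const unsigned char *ct, int ct_len,
                          unsigned char *tag, unsigned char *pt)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int     len = 0;
        int     ok;

        EVP_DecryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);
        EVP_DecryptUpdate(ctx, pt, &len, ct, ct_len);
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_TAG, 16, tag);
        ok = EVP_DecryptFinal_ex(ctx, pt + len, &len);
        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }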

> They could create postgresql.auto.conf in PGDATA, and modify
> cluster_key_command to capture the key, or they could modify preload
> libraries or archive command to call a command to read memory as the PG
> OS user and write the key out somewhere, or use the key to rewrite the
> database files --- those wouldn't even need a database restart, just a
> reload.

They would need to actually be able to effect that reload though.  This
is where the question comes up as to just what attack vector we're
trying to address.  It's certainly possible that an attacker has only
access to the stored data in an off-line fashion (eg: a hard drive that
was mistakenly thrown away without being properly wiped) and that's one
of the cases which is addressed by cluster encryption.  An attacker
might have access to the LUN that PG is running on but not to the
running server itself, which it seems like is what you're contemplating
here.  That's a much harder attack vector to fully protect against and
we might need to do more than we're currently contemplating to address
it- but I don't think we necessarily must solve for all cases in the
first pass at this.

> They could also modify pg_xact files so that, even though the heap/index
> files are encrypted, how the contents of those files are interpreted
> would change.

Yes, ideally, we'd encrypt/integrity check just about every part of the
running system and that's one area the patch doesn't address- things
like temporary files and other parts.

> In summary, to detect malicious user writes, you would need to protect
> the files used before encryption starts (root owned or owned by another
> user?), and encrypt all files after encryption starts --- any other
> approach would probably leave open attack vectors, and I don't think
> there is sufficient community desire to add such boundaries.

There's going to be some attack vectors that TDE doesn't address.  We
should identify and document those where we're able to.  We could offer
up some mitigations (eg: strongly suggest monitoring of key utilization
such that if the KEK is used without a reboot of the system or similar
happening that it is reported and someone goes to look into it).  While
such mitigations aren't perfect, they can be enough to allow approval of
a system to go operational (ultimately it comes down to what the
relevant security officer is willing to accept).

> How do other database systems guarantee detection of malicious writes?

I doubt anyone would actually stipulate that they *guarantee* detection
of malicious writes, and I don't think we should either, but certainly
the other systems which provide TDE do so in a manner that provides both
confidentiality and integrity.  The big O, at least, documents that they
use SHA-1 for their integrity checking, though they also provide an
option which disables it.  If we used an additional fork to provide the
integrity then we could also give users the option of either having
integrity included or not.
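
For instance, an integrity fork could store a per-page MAC computed
with a dedicated key.  A minimal sketch using OpenSSL's one-shot
HMAC() with SHA-256 (names are hypothetical, not from the patch):

    #include <openssl/evp.h>
    #include <openssl/hmac.h>

    #define BLCKSZ 8192

    /*
     * Illustrative sketch only: compute a per-page HMAC-SHA-256 tag
     * with a dedicated MAC key; the 32-byte tag would be stored in a
     * separate relation fork and verified on read.
     */
    static void
    compute_page_mac(const unsigned char *page,
                     const unsigned char *mac_key, int mac_key_len,
                     unsigned char tag[32])
    {
        unsigned int tag_len = 0;

        HMAC(EVP_sha256(), mac_key, mac_key_len, page, BLCKSZ,
             tag, &tag_len);
    }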

Thanks,

Stephen


Re: Key management with tests

From
Bruce Momjian
Date:
On Wed, Feb  3, 2021 at 10:33:57AM -0500, Stephen Frost wrote:
> > I am thinking group read-access might be a requirement for cluster file
> > encryption to be effective.
> 
> People certainly do use group read-access, but I don't see that as being
> a requirement for cluster file encryption to be effective, it's just one
> thing TDE can address, among others, as discussed.

Agreed.

> >     This also does not protect against users who have read access to
> >     database process memory — all in-memory data pages and data
> >     encryption keys are stored unencrypted in memory, so an attacker who
> > -->    is able to read memory can decrypt the entire cluster.  The Postgres
> > -->    operating system user and the operating system administrator, e.g.,
> > -->    the <literal>root</literal> user, have such access.
> 
> That's helpful, +1.

Good.

> > > Uh, well, they could modify postgresql.conf to change the script to save
> > > the secret returned by the script before returning it to the PG server. 
> > > We could require postgresql.conf to be somewhere secure, but then how do
> > > we know that is secure?  I just don't see a clean solution here, but the
> > > idea that you write and then wait for the key to show up seems like a
> > > very plausible attack vector, and it took me a while to be able to
> > > articulate it.
> 
> postgresql.conf isn't always writable by the postgres user, though
> postgresql.auto.conf is likely to always be.  I'm not sure how much of a
> concern that is, but if we wanted to take steps to explicitly address
> this issue, we could have some kind of 'secure' postgresql.conf file
> which we would encourage users to make owned by root and whose values
> wouldn't be allowed to be overridden once set.

Well, I think there is a lot more than postgresql.conf to worry about ---
see below.

> > Let's suppose you lock down your cluster --- the non-PGDATA files are
> > owned by root, postgresql.conf and pg_hba.conf are moved out of PGDATA
> > and are not writable by the database OS user, or we have the PGDATA
> > directory on another server, so the adversary can only write to the
> > remote PGDATA directory.
> > 
> > What can they do?  Well, they can't modify pg_proc to add a shared
> > library since pg_proc is encrypted, so we have to focus on files needed
> > before encryption starts or files that can't be easily encrypted.
> 
> This isn't accurate- just because it's encrypted doesn't mean they can't
> modify it.  That's exactly why integrity is important, because an
> attacker absolutely could modify the files directly and potentially
> exploit the system through those modifications.

They can't easily modify it to inject a shared object reference into a
system column, was my point --- also see below.

> > They could create postgresql.auto.conf in PGDATA, and modify
> > cluster_key_command to capture the key, or they could modify preload
> > libraries or archive command to call a command to read memory as the PG
> > OS user and write the key out somewhere, or use the key to rewrite the
> > database files --- those wouldn't even need a database restart, just a
> > reload.
> 
> They would need to actually be able to effect that reload though.  This
> is where the question comes up as to just what attack vector we're
> trying to address.  It's certainly possible that an attacker has only
> access to the stored data in an off-line fashion (eg: a hard drive that
> was mistakenly thrown away without being properly wiped) and that's one
> of the cases which is addressed by cluster encryption.  An attacker
> might have access to the LUN that PG is running on but not to the
> running server itself, which it seems like is what you're contemplating
> here.  That's a much harder attack vector to fully protect against and
> we might need to do more than we're currently contemplating to address
> it- but I don't think we necessarily must solve for all cases in the
> first pass at this.

See below.

> > They could also modify pg_xact files so that, even though the heap/index
> > files are encrypted, how the contents of those files are interpreted
> > would change.
> 
> Yes, ideally, we'd encrypt/integrity check just about every part of the
> running system and that's one area the patch doesn't address- things
> like temporary files and other parts.

It is worse than that --- see below.

> > In summary, to detect malicious user writes, you would need to protect
> > the files used before encryption starts (root owned or owned by another
> > user?), and encrypt all files after encryption starts --- any other
> > approach would probably leave open attack vectors, and I don't think
> > there is sufficient community desire to add such boundaries.
> 
> There's going to be some attack vectors that TDE doesn't address.  We
> should identify and document those where we're able to.  We could offer
> up some mitigations (eg: strongly suggest monitoring of key utilization
> such that if the KEK is used without a reboot of the system or similar
> happening that it is reported and someone goes to look into it).  While
> such mitigations aren't perfect, they can be enough to allow approval of
> a system to go operational (ultimately it comes down to what the
> relevant security officer is willing to accept).

I ended up adding to the feature description in the docs to clearly
outline what this feature provides, and what it does not:

    The purpose of cluster file encryption is to prevent users with read
    access on the directories used to store database files and write-ahead
    log files from being able to access the data stored in those files.
    For example, when using cluster file encryption, users who have read
    access to the cluster directories for backup purposes will not be able
    to decrypt the data stored in these files.  Read-only access for a group
    of users can be enabled using the <application>initdb</application>
    <option>--allow-group-access</option> option.  Cluster file encryption
    also provides data-at-rest security, protecting users from data loss
    should the physical storage media be stolen or improperly erased before
    disposal.
    
    Cluster file encryption does not protect against unauthorized file
    system writes.  Such writes can allow data decryption if used to weaken
    the system's security and the weakened system is later supplied with
    the externally-stored cluster encryption key.  This also does not always
    detect if users with write access remove or modify database files.
    
    This also does not protect against users who have read access to database
    process memory because all in-memory data pages and data encryption keys
    are stored unencrypted in memory.  Therefore, an attacker who is able
    to read memory can read the data encryption keys and decrypt the entire
    cluster.  The Postgres operating system user and the operating system
    administrator, e.g., the <literal>root</literal> user, have such access.

> > How do other database systems guarantee detection of malicious writes?
> 
> I doubt anyone would actually stipulate that they *guarantee* detection
> of malicious writes, and I don't think we should either, but certainly
> the other systems which provide TDE do so in a manner that provides both
> confidentiality and integrity.  The big O, at least, documents that they
> use SHA-1 for their integrity checking, though they also provide an
> option which disables it.  If we used an additional fork to provide the
> integrity then we could also give users the option of either having
> integrity included or not.

I thought more about this at an abstract level.  If you are worried
about malicious users _reading_ data, you can encrypt the sensitive
parts, e.g., heap/index/WAL/temp, and leave some unencrypted, like
pg_xact.  Reading pg_xact is pretty useless if you can't read the heap
pages.  Reading postgresql.auto.conf, the external key retrieval
scripts, etc. are useless too.

However, when you are trying to protect against write access, you have
to really encrypt _everything_, because the system is very
interdependent, and changing one part where _reading_ is safe can affect
other parts that must remain secure.  You can modify
postgresql.auto.conf to capture the cluster key, or maybe even change
something to dump out the data keys from memory.  You can modify pg_xact
to affect how heap pages are interpreted.

My point is that being able to detect malicious heap/index writes really
doesn't gain us any security since there are much more serious writes
that can be made, and protecting against those more serious writes would
cause unacceptable Postgres source code changes which will probably
never be implemented.

My summary point is that we should clearly spell out exactly what
protections we are offering, and an estimate of the code impact, before
moving forward so the community can agree it is worthwhile to add this.

Also, looking at the PCI DSS 3.2.1 spec from May 2018 (click-through
required):

    https://www.pcisecuritystandards.org/document_library?category=pcidss&document=pci_dss#agreement

or open PDF link here:

    https://commerce.uwo.ca/pdf/PCI_DSS_v3-2-1.pdf

Page 41 covers what they expect from an encrypted file system, and from
the key encryption key and data encryption keys.  There is a v4.0 spec in
draft but I can't find a PDF available online.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Bruce Momjian
Date:
On Wed, Feb  3, 2021 at 01:16:32PM -0500, Bruce Momjian wrote:
> On Wed, Feb  3, 2021 at 10:33:57AM -0500, Stephen Frost wrote:
> > I doubt anyone would actually stipulate that they *guarantee* detection
> > of malicious writes, and I don't think we should either, but certainly
> > the other systems which provide TDE do so in a manner that provides both
> > confidentiality and integrity.  The big O, at least, documents that they
> > use SHA-1 for their integrity checking, though they also provide an
> > option which disables it.  If we used an additional fork to provide the
> > integrity then we could also give users the option of either having
> > integrity included or not.
> 
> I thought more about this at an abstract level.  If you are worried
> about malicious users _reading_ data, you can encrypt the sensitive
> parts, e.g., heap/index/WAL/temp, and leave some unencrypted, like
> pg_xact.  Reading pg_xact is pretty useless if you can't read the heap
> pages.  Reading postgresql.auto.conf, the external key retrieval
> scripts, etc. are useless too.
> 
> However, when you are trying to protect against write access, you have
> to really encrypt _everything_, because the system is very
> interdependent, and changing one part where _reading_ is safe can affect
> other parts that must remain secure.  You can modify
> postgresql.auto.conf to capture the cluster key, or maybe even change
> something to dump out the data keys from memory.  You can modify pg_xact
> to affect how heap pages are interpreted.
> 
> My point is that being able to detect malicious heap/index writes really
> doesn't gain us any security since there are much more serious writes
> that can be made, and protecting against those more serious writes would
> cause unacceptable Postgres source code changes which will probably
> never be implemented.

I looked further.  First, I don't think we are going to be able to
protect at all against users who have _write_ access on the OS running
Postgres.  It would be too easy to just read process memory, or modify
~/.profile.

I think the only possible option would be to try to give some protection
against users with write access to PGDATA, where PGDATA is on another
server, e.g., via NFS.  We can't protect against all db modifications,
for reasons outlined above, but we might be able to protect against
write users being able to _read_ the keys and therefore decrypt data. 
Looking at PGDATA, we have, at least:

    postgresql.conf
    pg_hba.conf
    postmaster.opts
    postgresql.auto.conf

which could be exploited to cause reading of the cluster key or process
memory.  The first two can be located outside of PGDATA but the last two
currently cannot.

The problem is that this is a limited use-case, and there are probably
other problems I am not considering.  It seems too error-prone to even
try to protect against this, but it does limit the value of this feature.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Wed, Feb  3, 2021 at 01:16:32PM -0500, Bruce Momjian wrote:
> > On Wed, Feb  3, 2021 at 10:33:57AM -0500, Stephen Frost wrote:
> > > I doubt anyone would actually stipulate that they *guarantee* detection
> > > of malicious writes, and I don't think we should either, but certainly
> > > the other systems which provide TDE do so in a manner that provides both
> > > confidentiality and integrity.  The big O, at least, documents that they
> > > use SHA-1 for their integrity checking, though they also provide an
> > > option which disables it.  If we used an additional fork to provide the
> > > integrity then we could also give users the option of either having
> > > integrity included or not.
> >
> > I thought more about this at an abstract level.  If you are worried
> > about malicious users _reading_ data, you can encrypt the sensitive
> > parts, e.g., heap/index/WAL/temp, and leave some unencrypted, like
> > pg_xact.  Reading pg_xact is pretty useless if you can't read the heap
> > pages.  Reading postgresql.auto.conf, the external key retrieval
> > scripts, etc. are useless too.
> >
> > However, when you are trying to protect against write access, you have
> > to really encrypt _everything_, because the system is very
> > interdependent, and changing one part where _reading_ is safe can affect
> > other parts that must remain secure.  You can modify
> > postgresql.auto.conf to capture the cluster key, or maybe even change
> > something to dump out the data keys from memory.  You can modify pg_xact
> > to affect how heap pages are interpreted.
> >
> > My point is that being able to detect malicious heap/index writes really
> > doesn't gain us any security since there are much more serious writes
> > that can be made, and protecting against those more serious writes would
> > cause unacceptable Postgres source code changes which will probably
> > never be implemented.
>
> I looked further.  First, I don't think we are going to be able to
> protect at all against users who have _write_ access on the OS running
> Postgres.  It would be too easy to just read process memory, or modify
> ~/.profile.

I don't think anyone is really expecting that we'll be able to come up
with a way to protect against attackers who have fully compromised the
OS to the point where they can read/write OS memory, or even the PG unix
account.  I'm certainly not suggesting that there is a way to do that or
that it's an attack vector we are trying to address here.

> I think the only possible option would be to try to give some protection
> against users with write access to PGDATA, where PGDATA is on another
> server, e.g., via NFS.  We can't protect against all db modifications,
> for reasons outlined above, but we might be able to protect against
> write users being able to _read_ the keys and therefore decrypt data.

That certainly seems like a worthy goal.  I also really want to stress
that I don't think anyone is expecting us to be able to "protect"
against users who have write access to the system- write access to files
is really an OS level issue and there's not much we can do once someone
has found a way to circumvent that (we can try to help the OS by doing
things like using SELinux, of course, but that's a different
discussion).  At the point that an attacker has gotten write access, the
best we can do is complain loudly if we detect unexpected modifications.
Ideally, we would be able to do that for everything, but certainly doing
it for the principal data would go a long way and is far better than
nothing.

Now, that said, I don't know that we absolutely must have that in the
first release of TDE support for PG.  In thinking about this, I would
say we have two basic options:

- Keep the same page layout, requiring that integrity data must be
  stored elsewhere, eg: another fork
- Use a different page layout when TDE is enabled, making room for
  integrity information to be included on each page

There's a set of pros and cons for these:

Same page layout pros:

- Simpler and less impactful on the overall system
- With integrity data stored elsewhere, could possibly be something
  that's optional to enable/disable on a per-table basis
- Potential to do things like have an unencrypted primary and an
  encrypted replica, providing an easier migration path

Same page layout cons:

- Integrity information must be stored elsewhere
- Increases the reads/memory that is needed, since we have to look up
  the integrity information on every read.
- Increases the writes that have to be done since we'd be dirtying
  multiple pages instead of just the main fork (though this isn't
  exactly unusual- there's the vis map, and indexes, etc, but it'd be
  yet another thing we're updating)

Different page layout pros:

- Avoids extra reads/writes for the integrity information
- Once done, this might provide us with a way to add other page level
  information in the future while still being able to work with older
  page formats

Different page layout cons:

- Wouldn't be able to have an encrypted replica follow an unencrypted
  primary, migration would require logical replication or similar
- More core code changes, and extensions, to handle a different page
  layout when cluster is initialized with TDE+integrity

While I've been thinking about this, I have to admit that either
approach could be done later and it's probably best to accept that and
push it off until we have the initial TDE work done.  I had been
thinking that changing the page layout would be better to do in the same
release as TDE, but having been playing around with that approach for a
while it just seems like it's too much to try and include at the same
time.  We should be sure to be clear and document that though.
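
To make the second option concrete, a different layout might reserve
a fixed trailer on each page for integrity data.  This is purely an
illustrative sketch, not a proposed on-disk format:

    #include <stdint.h>

    /*
     * Illustrative sketch only: reserve a fixed trailer on each 8kB
     * page for a 16-byte AES-GCM authentication tag, shrinking the
     * usable page size accordingly.
     */
    typedef struct TdePageTrailer
    {
        uint8_t     auth_tag[16];   /* AES-GCM tag over the page */
    } TdePageTrailer;

    #define BLCKSZ              8192
    #define TDE_USABLE_PAGESZ   (BLCKSZ - sizeof(TdePageTrailer))
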

> Looking at PGDATA, we have, at least:
>
>     postgresql.conf
>     pg_hba.conf
>     postmaster.opts
>     postgresql.auto.conf
>
> which could be exploited to cause reading of the cluster key or process
> memory.  The first two can be located outside of PGDATA but the last two
> currently cannot.

There are certainly already users out there who intentionally make
postgresql.auto.conf owned by root/root, zero-sized, and monitor it to
make sure that it isn't updated.  postgresql.conf actually is also often
monitored for changes by a change management system of some kind and may
also be owned by root/root already.  I suspect that postmaster.opts is
not monitored as closely, but that's probably due more to the fact that
we don't really document it as a system configuration file and it can't
be put outside of PGDATA.  Having a way to move it outside of PGDATA or
just not have it be used at all (do we really need it..?) would be
another way to address that risk though.

> The problem is that this is a limited use-case, and there are probably
> other problems I am not considering.  It seems too error-prone to even
> try to protect against this, but it does limit the value of this feature.

I don't think we need to consider it a failing of the capability every
time we think of something else that really should be addressed when
considering this attack vector.  We aren't going to be releasing this
and saying "we guarantee that this protects against an attacker who has
write access to PGDATA".  Instead, we would be documenting "XYZ, when
enabled, is used to validate the integrity of ABC data.  Individuals
concerned with unexpected modifications to their system should consider
independently monitoring files D, E, F.  Note that there is currently no
explicit protection against or detection of unexpected or malicious
modification of other parts of the system such as the transaction
record.", or something along those lines.  Hardening guidelines would
also recommend things like having postgresql.conf moved out of PGDATA
and owned by root/root, etc.  Users would then have the ability to
evaluate if what we're providing is sufficient for their requirements
or not, and to then provide us with feedback about what they feel is
still missing before they would be able to use PG for their use-case.

To that end, I would hope that we'd eventually develop a way to detect
unexpected modifications in other parts of the system, both as a way to
discover filesystem corruption earlier but also in the case of a
malicious attacker.  The latter would involve more work, of course, but
it doesn't seem insurmountable.  I don't think it's necessary to get
into that today though.

I am concerned when statements are made that we are just never going to
do something-or-other because we think it'd be a lot of source code
changes or won't be completely perfect against every attack we can think
of.  There was a good bit of that with RLS which also made it a
particularly difficult feature to push forward, but, thanks to clearly
documenting what was and wasn't addressed, clearly admitting that there
are covert channel attacks that might be possible due to how it works,
it's been pretty well accepted and there hasn't been some huge number of
issues or CVEs that have been associated with it or mismatched
expectations that users of it have had regarding what it does and
doesn't protect against.

Thanks,

Stephen


Re: Key management with tests

From
Bruce Momjian
Date:
On Fri, Feb  5, 2021 at 01:14:35PM -0500, Stephen Frost wrote:
> > I looked further.  First, I don't think we are going to be able to
> > protect at all against users who have _write_ access on the OS running
> > Postgres.  It would be too easy to just read process memory, or modify
> > ~/.profile.
> 
> I don't think anyone is really expecting that we'll be able to come up
> with a way to protect against attackers who have fully compromised the
> OS to the point where they can read/write OS memory, or even the PG unix
> account.  I'm certainly not suggesting that there is a way to do that or
> that it's an attack vector we are trying to address here.

OK, that's good.

> > I think the only possible option would be to try to give some protection
> > against users with write access to PGDATA, where PGDATA is on another
> > server, e.g., via NFS.  We can't protect against all db modifications,
> > for reasons outlined above, but we might be able to protect against
> > write users being able to _read_ the keys and therefore decrypt data. 
> 
> That certainly seems like a worthy goal.  I also really want to stress
> that I don't think anyone is expecting us to be able to "protect"
> against users who have write access to the system- write access to files
> is really an OS level issue and there's not much we can do once someone
> has found a way to circumvent that (we can try to help the OS by doing
> things like using SELinux, of course, but that's a different
> discussion).  At the point that an attacker has gotten write access, the

Agreed.

> best we can do is complain loudly if we detect unexpected modifications.
> Ideally, we would be able to do that for everything, but certainly doing
> it for the principal data would go a long way and is far better than
> nothing.

I disagree.  If we only warn about some parts, attackers will just
attack other parts.  It will also give users a false sense of security. 
If you can get the keys, it doesn't matter if there is one or ten ways
of getting them, if they are all of equal difficulty.  Same with
modifying the system files.

> Now, that said, I don't know that we absolutely must have that in the
> first release of TDE support for PG.  In thinking about this, I would
> say we have two basic options:

I skipped this part since I think we need a fully secure plan before
considering page format changes.  We don't need it for our currently
outlined feature-set.

> > Looking at PGDATA, we have, at least:
> > 
> >     postgresql.conf
> >     pg_hba.conf
> >     postmaster.opts
> >     postgresql.auto.conf
> > 
> > which could be exploited to cause reading of the cluster key or process
> > memory.  The first two can be located outside of PGDATA but the last two
> > currently cannot.
> 
> There are certainly already users out there who intentionally make
> postgresql.auto.conf owned by root/root, zero-sized, and monitor it to
> make sure that it isn't updated.  postgresql.conf actually is also often
> monitored for changes by a change management system of some kind and may
> also be owned by root/root already.  I suspect that postmaster.opts is
> not monitored as closely, but that's probably due more to the fact that
> we don't really document it as a system configuration file and it can't
> be put outside of PGDATA.  Having a way to move it outside of PGDATA or
> just not have it be used at all (do we really need it..?) would be
> another way to address that risk though.

I think postmaster.opts is used for pg_ctl reload.  I think the question
is whether the value of maliciously writable PGDATA being able to read
the keys, while not protecting or detecting all malicious
writes/db-modifications, is worth it.  And, while I listed the files
above, there are probably many more ways to break the system.

> > The problem is that this is a limited use-case, and there are probably
> > other problems I am not considering.  It seems too error-prone to even
> > try to protect against this, but it does limit the value of this feature.
> 
> I don't think we need to consider it a failing of the capability every
> time we think of something else that really should be addressed when
> considering this attack vector.  We aren't going to be releasing this
> and saying "we guarantee that this protects against an attacker who has
> write access to PGDATA".  Instead, we would be documenting "XYZ, when
> enabled, is used to validate the integrity of ABC data.  Individuals
> concerned with unexpected modifications to their system should consider
> independently monitoring files D, E, F.  Note that there is currently no
> explicit protection against or detection of unexpected or malicious
> modification of other parts of the system such as the transaction
> record.", or something along those lines.  Hardening guidelines would
> also recommend things like having postgresql.conf moved out of PGDATA
> and owned by root/root, etc.  Users would then have the ability to
> evaluate if what we're providing is sufficient for their requirements
> or not, and to then provide us with feedback about what they feel is
> still missing before they would be able to use PG for their use-case.

See above --- I think we can't just say we close _most_ of the doors
here, and I am afraid there will be more and more cases we miss.  It
feels too open-ended.  For example, imagine modifying a PGDATA file so
it is a symbolic link to another file that is not in PGDATA?  Seems that
would break all sorts of security restrictions, and that's just a new
idea I came up with today.

What I don't want to do is to add a lot of complexity to the system, and
not really gain any meaningful security.

> To that end, I would hope that we'd eventually develop a way to detect
> unexpected modifications in other parts of the system, both as a way to
> discover filesystem corruption earlier but also in the case of a
> malicious attacker.  The latter would involve more work, of course, but
> it doesn't seem insurmountable.  I don't think it's necessary to get
> into that today though.
> 
> I am concerned when statements are made that we are just never going to
> do something-or-other because we think it'd be a lot of source code
> changes or won't be completely perfect against every attack we can think
> of.  There was a good bit of that with RLS which also made it a
> particularly difficult feature to push forward, but, thanks to clearly
> documenting what was and wasn't addressed, clearly admitting that there
> are covert channel attacks that might be possible due to how it works,
> it's been pretty well accepted and there hasn't been some huge number of
> issues or CVEs that have been associated with it or mismatched
> expectations that users of it have had regarding what it does and
> doesn't protect against.

Oh, that is a very meaningful lesson.  I do think that for cluster file
encryption, if we have a vulnerability, someone will write a script for
it, and it could be widely exploited.  I think RLS gets a little more
flexibility since someone is already in the database when using it.

I am not against adding more security features, but I need agreement
that the existing features/protections, with the planned source code
impact, are acceptable.  I don't want to go down the road of getting the
feature with the _hope_ that later changes will make the feature
acceptable --- for me, either what we are planning now is acceptable
given its code impact, or it is not.  If the feature is not sufficient,
then I would not move forward until we had a reasonable plan of when the
feature would have acceptable usefulness, and acceptable source code
impact.

The big problem, as you outlined above, is that adding to the
protections, like malicious write detection for a remote PGDATA, greatly
increases the code impact, and ultimately, might be unsolvable.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Fri, Feb  5, 2021 at 01:14:35PM -0500, Stephen Frost wrote:
> > > I looked further.  First, I don't think we are going to be able to
> > > protect at all against users who have _write_ access on the OS running
> > > Postgres.  It would be too easy to just read process memory, or modify
> > > ~/.profile.
> >
> > I don't think anyone is really expecting that we'll be able to come up
> > with a way to protect against attackers who have fully compromised the
> > OS to the point where they can read/write OS memory, or even the PG unix
> > account.  I'm certainly not suggesting that there is a way to do that or
> > that it's an attack vector we are trying to address here.
>
> OK, that's good.
>
> > > I think the only possible option would be to try to give some protection
> > > against users with write access to PGDATA, where PGDATA is on another
> > > server, e.g., via NFS.  We can't protect against all db modifications,
> > > for reasons outlined above, but we might be able to protect against
> > > write users being able to _read_ the keys and therefore decrypt data.
> >
> > That certainly seems like a worthy goal.  I also really want to stress
> > that I don't think anyone is expecting us to be able to "protect"
> > against users who have write access to the system- write access to files
> > is really an OS level issue and there's not much we can do once someone
> > has found a way to circumvent that (we can try to help the OS by doing
> > things like using SELinux, of course, but that's a different
> > discussion).  At the point that an attacker has gotten write access, the
>
> Agreed.
>
> > best we can do is complain loudly if we detect unexpected modifications.
> > Ideally, we would be able to do that for everything, but certainly doing
> > it for the principal data would go a long way and is far better than
> > nothing.
>
> I disagree.  If we only warn about some parts, attackers will just
> attack other parts.  It will also give users a false sense of security.
> If you can get the keys, it doesn't matter if there is one or ten ways
> of getting them, if they are all of equal difficulty.  Same with
> modifying the system files.

I agree that there's an additional concern around the keys and that we
would want to have a solid way to avoid having them be compromised.  We
might not be able to guarantee that attackers who can write to PGDATA
can't gain access to the keys in the first implementation, but I don't
see that as a problem- the TDE capability would still provide protection
against improper disposal and some other use-cases, which is useful.  I
do think it'd be useful to consider how we could provide protection
against an attacker who has write access from being able to acquire the
keys, but that seems like a tractable problem.  Following that, we could
look at how to provide integrity checking for principal data, using one
of the outlined approaches or maybe something else entirely.  Lastly,
perhaps we can find a way to provide confidentiality and integrity for
other parts of the system.

Each of these steps is a useful improvement in its own right and will
open up more opportunities for PG to be used.  It wasn't my intent to
suggest otherwise, but rather to see if there was an opportunity to get
a few things done at once if it wasn't too impactful.  I agree now that
it makes sense to focus on the first step, so we can hopefully get that
accomplished.

> > There are certainly already users out there who intentionally make
> > postgresql.auto.conf owned by root/root, zero-sized, and monitor it to
> > make sure that it isn't updated.  postgresql.conf actually is also often
> > monitored for changes by a change management system of some kind and may
> > also be owned by root/root already.  I suspect that postmaster.opts is
> > not monitored as closely, but that's probably due more to the fact that
> > we don't really document it as a system configuration file and it can't
> > be put outside of PGDATA.  Having a way to move it outside of PGDATA or
> > just not have it be used at all (do we really need it..?) would be
> > another way to address that risk though.
>
> I think postmaster.opts is used for pg_ctl reload.  I think the question
> is whether the value of maliciously writable PGDATA being able to read
> the keys, while not protecting or detecting all malicious
> writes/db-modifications, is worth it.  And, while I listed the files
> above, there are probably many more ways to break the system.

postmaster.opts is used for pg_ctl restart, just to be clear.

As I try to state above- I don't think we need to provide any specific
protections against a malicious writer for plain encryption to be
useful for some important use-cases.  Providing protections against a
malicious writer being able to access the keys is certainly important
as, if they acquire the keys, they would be able to trivially both
decrypt the data and modify any other data they wished to, so it seems
likely that solving that would be the first step towards protecting
against a malicious writer, after which it's useful to think about what
else we could provide integrity checking of, and principal data strikes
me as the next sensible step, followed by what's essentially metadata.

> > > The problem is that this is a limited use-case, and there are probably
> > > other problems I am not considering.  It seems too error-prone to even
> > > try to protect against this, but it does limit the value of this feature.
> >
> > I don't think we need to consider it a failing of the capability every
> > time we think of something else that really should be addressed when
> > considering this attack vector.  We aren't going to be releasing this
> > and saying "we guarantee that this protects against an attacker who has
> > write access to PGDATA".  Instead, we would be documenting "XYZ, when
> > enabled, is used to validate the integrity of ABC data.  Individuals
> > concerned with unexpected modifications to their system should consider
> > independently monitoring files D, E, F.  Note that there is currently no
> > explicit protection against or detection of unexpected or malicious
> > modification of other parts of the system such as the transaction
> > record.", or something along those lines.  Hardening guidelines would
> > also recommend things like having postgresql.conf moved out of PGDATA
> > and owned by root/root, etc.  Users would then have the ability to
> > evaluate if what we're providing is sufficient for their requirements
> > or not, and to then provide us with feedback about what they feel is
> > still missing before they would be able to use PG for their use-case.
>
> See above --- I think we can't just say we close _most_ of the doors
> here, and I am afraid there will be more and more cases we miss.  It
> feels too open-ended.  For example, imagine modifying a PGDATA file so
> it is a symbolic link to another file that is not in PGDATA?  Seems that
> would break all sorts of security restrictions, and that's just a new
> idea I came up with today.

It's not clear how that would provide the attacker with much, if
anything.

> What I don't want to do is to add a lot of complexity to the system, and
> not really gain any meaningful security.

Integrity is very meaningful to security, but key management would
certainly come first because if an attacker is able to acquire the keys
then they can circumvent any integrity check being done by simply using
the key.  I appreciate that protecting the keys is non-trivial but it's
absolutely critical as everything else falls apart if the key is
compromised.  I don't think we should be thinking that we're going to be
done with key management or with providing ways to acquire keys even if
the currently proposed patches go in- we'll undoubtedly need to provide
other options in the future.  There's an interesting point in this
regarding how the flexibility of the shell-script based approach also
introduces this risk that an attacker could modify it and write the key
out to somewhere that they could get at pretty easily.  Having support
for directly fetching the key from the Linux kernel or the various
vaulting systems would avoid this risk, I would think.  Maybe there's a
way to get PG to dump the key out of system memory by modifying other
files in PGDATA but that's surely quite a bit more difficult.
Ultimately, I don't think this voids the proposed approach but I do
think it means we'll want to improve on this in the future.
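
For example, fetching the key directly from the Linux kernel keyring
via libkeyutils would take the writable shell script out of the trust
path entirely.  A minimal sketch (link with -lkeyutils; the key
description "pg:cluster_key" is hypothetical):

    #include <stddef.h>
    #include <keyutils.h>

    /*
     * Illustrative sketch only: read the cluster key from the Linux
     * kernel keyring.  Returns the key length, or -1 on error.
     */
    static long
    fetch_cluster_key(unsigned char *buf, size_t buflen)
    {
        key_serial_t key;

        key = request_key("user", "pg:cluster_key", NULL,
                          KEY_SPEC_USER_KEYRING);
        if (key < 0)
            return -1;
        return keyctl_read(key, (char *) buf, buflen);
    }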

> > To that end, I would hope that we'd eventually develop a way to detect
> > unexpected modifications in other parts of the system, both as a way to
> > discover filesystem corruption earlier but also in the case of a
> > malicious attacker.  The latter would involve more work, of course, but
> > it doesn't seem insurmountable.  I don't think it's necessary to get
> > into that today though.
> >
> > I am concerned when statements are made that we are just never going to
> > do something-or-other because we think it'd be a lot of source code
> > changes or won't be completely perfect against every attack we can think
> > of.  There was a good bit of that with RLS which also made it a
> > particularly difficult feature to push forward, but, thanks to clearly
> > documenting what was and wasn't addressed, clearly admitting that there
> > are covert channel attacks that might be possible due to how it works,
> > it's been pretty well accepted and there hasn't been some huge number of
> > issues or CVEs that have been associated with it or mismatched
> > expectations that users of it have had regarding what it does and
> > doesn't protect against.
>
> Oh, that is a very meaningful lesson.  I do think that for cluster file
> encryption, if we have a vulnerability, someone will write a script for
> it, and it could be widely exploited.  I think RLS gets a little more
> flexibility since someone is already in the database when using it.

In the current attack we're contemplating, the attacker's got write
access to the filesystem and if that's happening then they've managed to
get through a few layers already, I would think, so it seems unlikely
that it would be widely exploited.  Of course, we'd like to avoid having
vulnerabilities where we can, but a particular behavior is only a
vulnerability if there's an expectation that we protect against that kind
of attack, which is why documentation is extremely important, which is
what I was trying to get at with the RLS example.

> I am not against adding more security features, but I need agreement
> that the existing features/protections, with the planned source code
> impact, are acceptable.  I don't want to go down the road of getting the
> feature with the _hope_ that later changes will make the feature
> acceptable --- for me, either what we are planning now is acceptable
> given its code impact, or it is not.  If the feature is not sufficient,
> then I would not move forward until we had a reasonable plan of when the
> feature would have acceptable usefulness, and acceptable source code
> impact.

See above.  I do think that the proposed approach is a valuable
capability and improvement in its own right.  It seems likely that this
first step, as proposed, would allow us to support use-cases such as the
PCI one you mentioned previously.  Taking it further and adding
integrity validation would move us into even more use-cases as it would
address NIST requirements which explicitly call for confidentiality and
integrity.

> The big problem, as you outlined above, is that adding to the
> protections, like malicious write detection for a remote PGDATA, greatly
> increases the code impact, and ultimately, might be unsolvable.

I don't think we really know that it increases the code impact hugely or
is unsolvable, but ultimately those are really debates for another day
at this point.

Thanks,

Stephen


Re: Key management with tests

From
Bruce Momjian
Date:
On Fri, Feb  5, 2021 at 05:21:22PM -0500, Stephen Frost wrote:
> > I disagree.  If we only warn about some parts, attackers will just
> > attack other parts.  It will also give users a false sense of security. 
> > If you can get the keys, it doesn't matter if there is one or ten ways
> > of getting them, if they are all of equal difficulty.  Same with
> > modifying the system files.
> 
> I agree that there's an additional concern around the keys and that we
> would want to have a solid way to avoid having them be compromised.  We
> might not be able to guarantee that attackers who can write to PGDATA
> can't gain access to the keys in the first implementation, but I don't
> see that as a problem- the TDE capability would still provide protection
> against improper disposal and some other use-cases, which is useful.  I

Agreed.

> do think it'd be useful to consider how we could provide protection
> against an attacker who has write access from being able to acquire the
> keys, but that seems like a tractable problem.  Following that, we could
> look at how to provide integrity checking for principal data, using one
> of the outlined approaches or maybe something else entirely.  Lastly,
> perhaps we can find a way to provide confidentiality and integrity for
> other parts of the system.

Yes, we should consider it, and I want to have this discussion.  Ideally
we could implement that now, because it might be harder later.  However,
I don't see how we can add additional security protections without
adding a lot more complexity.  You are right we might have better ideas
later.

> Each of these steps is a useful improvement in its own right and will
> open up more opportunities for PG to be used.  It wasn't my intent to
> suggest otherwise, but rather to see if there was an opportunity to get
> a few things done at once if it wasn't too impactful.  I agree now that
> it makes sense to focus on the first step, so we can hopefully get that
> accomplished.

OK, good.

> > I think postmaster.opts is used for pg_ctl reload.  I think the question
> > is whether the value of maliciously writable PGDATA being able to read
> > the keys, while not protecting or detecting all malicious
> > writes/db-modifications, is worth it.  And, while I listed the files
> > above, there are probably many more ways to break the system.
> 
> postmaster.opts is used for pg_ctl restart, just to be clear.

Yes, sorry, "restart".

> As I try to state above- I don't think we need to provide any specific
> protections against a malicious writer for plain encryption to be
> useful for some important use-cases.  Providing protections against a
> malicious writer being able to access the keys is certainly important
> as, if they acquire the keys, they would be able to trivially both
> decrypt the data and modify any other data they wished to, so it seems
> likely that solving that would be the first step towards protecting
> against a malicious writer, after which it's useful to think about what
> else we could provide integrity checking of, and principal data strikes
> me as the next sensible step, followed by what's essentially metadata.

Agreed.

> > See above --- I think we can't just say we close _most_ of the doors
> > here, and I am afraid there will be more and more cases we miss.  It
> > feels too open-ended.  For example, imagine modifying a PGDATA file so
> > it is a symbolic link to another file that is not in PGDATA?  Seems that
> > would break all sorts of security restrictions, and that's just a new
> > idea I came up with today.
> 
> It's not clear how that would provide the attacker with much, if
> anything.

Not sure myself either.

> > What I don't want to do is to add a lot of complexity to the system, and
> > not really gain any meaningful security.
> 
> Integrity is very meaningful to security, but key management would
> certainly come first because if an attacker is able to acquire the keys
> then they can circumvent any integrity check being done by simply using
> the key.  I appreciate that protecting the keys is non-trivial but it's
> absolutely critical as everything else falls apart if the key is
> compromised.  I don't think we should be thinking that we're going to be

Agreed.

> done with key management or with providing ways to acquire keys even if
> the currently proposed patches go in- we'll undoubtedly need to provide
> other options in the future.  There's an interesting point in this
> regarding how the flexibility of the shell-script based approach also
> introduces this risk that an attacker could modify it and write the key
> out to somewhere that they could get at pretty easily.  Having support
> for directly fetching the key from the Linux kernel or the various
> vaulting systems would avoid this risk, I would think.  Maybe there's a

Agreed.

> way to get PG to dump the key out of system memory by modifying other
> files in PGDATA but that's surely quite a bit more difficult.
> Ultimately, I don't think this voids the proposed approach but I do
> think it means we'll want to improve on this in the future.

OK.  I was just saying we can't be sure we can improve it.

> > Oh, that is a very meaningful lesson.  I do think that for cluster file
> > encryption, if we have a vulnerability, someone will write a script for
> > it, and it could be widely exploited.  I think RLS gets a little more
> > flexibility since someone is already in the database when using it.
> 
> In the current attack we're contemplating, the attacker's got write
> access to the filesystem and if that's happening then they've managed to
> get through a few layers already, I would think, so it seems unlikely
> that it would be widely exploited.  Of course, we'd like to avoid having

Agreed.

> vulnerabilities where we can, but a particular behavior is only a
> vulnerability if there's an expectation that we protect against that kind
> of attack, which is why documentation is extremely important, which is
> what I was trying to get at with the RLS example.

True.

> > I am not against adding more security features, but I need agreement
> > that the existing features/protections, with the planned source code
> > impact, are acceptable.  I don't want to go down the road of getting the
> > feature with the _hope_ that later changes will make the feature
> > acceptable --- for me, either what we are planning now is acceptable
> > given its code impact, or it is not.  If the feature is not sufficient,
> > then I would not move forward until we had a reasonable plan of when the
> > feature would have acceptable usefulness, and acceptable source code
> > impact.
> 
> See above.  I do think that the proposed approach is a valuable
> capability and improvement in its own right.  It seems likely that this
> first step, as proposed, would allow us to support use-cases such as the
> PCI one you mentioned previously.  Taking it further and adding
> integrity validation would move us into even more use-cases as it would
> address NIST requirements which explicitly call for confidentiality and
> integrity.

Good.  I wanted to express this so everyone is clear on what we are
doing, and what we are not doing but might be able to do in the future.


> > The big problem, as you outlined above, is that adding to the
> > protections, like malicious write detection for a remote PGDATA, greatly
> > increases the code impact, and ultimately, might be unsolvable.
> 
> I don't think we really know that it increases the code impact hugely or
> is unsolvable, but ultimately those are really debates for another day
> at this point.

True.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Bruce Momjian
Date:
On Fri, Feb  5, 2021 at 07:53:18PM -0500, Bruce Momjian wrote:
> On Fri, Feb  5, 2021 at 05:21:22PM -0500, Stephen Frost wrote:
> > > I disagree.  If we only warn about some parts, attackers will just
> > > attack other parts.  It will also give users a false sense of security. 
> > > If you can get the keys, it doesn't matter if there is one or ten ways
> > > of getting them, if they are all of equal difficulty.  Same with
> > > modifying the system files.
> > 
> > I agree that there's an additional concern around the keys and that we
> > would want to have a solid way to avoid having them be compromised.  We
> > might not be able to guarantee that attackers who can write to PGDATA
> > can't gain access to the keys in the first implementation, but I don't
> > see that as a problem- the TDE capability would still provide protection
> > against improper disposal and some other use-cases, which is useful.  I
> 
> Agreed.
> 
> > do think it'd be useful to consider how we could provide protection
> > against an attacker who has write access from being able to acquire the
> > keys, but that seems like a tractable problem.  Following that, we could
> > look at how to provide integrity checking for principal data, using one
> > of the outlined approaches or maybe something else entirely.  Lastly,
> > perhaps we can find a way to provide confidentiality and integrity for
> > other parts of the system.
> 
> Yes, we should consider it, and I want to have this discussion.  Ideally
> we could implement that now, because it might be harder later.  However,
> I don't see how we can add additional security protections without
> adding a lot more complexity.  You are right we might have better ideas
> later.

I added a Limitations section so we can consider future improvements:

    https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Limitations

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Key management with tests

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Thu, Mar 11, 2021 at 10:31:28PM -0500, Bruce Momjian wrote:
> > I have made significant progress on the cluster file encryption feature so
> > it is time for me to post a new set of patches.
>
> Here is a rebase, to keep the cfbot green.

Good stuff.

> >From 110358c9ce8764f0c41c12dd37dabde57a92cf1f Mon Sep 17 00:00:00 2001
> From: Bruce Momjian <bruce@momjian.us>
> Date: Mon, 15 Mar 2021 10:20:32 -0400
> Subject: [PATCH] cfe-11-persistent_over_cfe-10-hint squash commit
>
> ---
>  src/backend/access/gist/gistutil.c       |  2 +-
>  src/backend/access/heap/heapam_handler.c |  2 +-
>  src/backend/catalog/pg_publication.c     |  2 +-
>  src/backend/commands/tablecmds.c         | 10 +++++-----
>  src/backend/optimizer/util/plancat.c     |  3 +--
>  src/backend/utils/cache/relcache.c       |  2 +-
>  src/include/utils/rel.h                  | 10 ++++++++--
>  src/include/utils/snapmgr.h              |  3 +--
>  8 files changed, 19 insertions(+), 15 deletions(-)

This particular patch (introducing the RelationIsPermanent() macro)
seems like it'd be a nice thing to commit independently of the rest,
reducing the size of this patch set..?

Thanks!

Stephen

Attachment

Re: Key management with tests

From
Bruce Momjian
Date:
On Thu, Mar 18, 2021 at 11:31:34AM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Thu, Mar 11, 2021 at 10:31:28PM -0500, Bruce Momjian wrote:
> > > I have made significant progress on the cluster file encryption feature so
> > > it is time for me to post a new set of patches.
> > 
> > Here is a rebase, to keep the cfbot green.
> 
> Good stuff.

Yes, I was happy I got to a stage where the encryption actually did
something useful.

> > >From 110358c9ce8764f0c41c12dd37dabde57a92cf1f Mon Sep 17 00:00:00 2001
> > From: Bruce Momjian <bruce@momjian.us>
> > Date: Mon, 15 Mar 2021 10:20:32 -0400
> > Subject: [PATCH] cfe-11-persistent_over_cfe-10-hint squash commit
> > 
> > ---
> >  src/backend/access/gist/gistutil.c       |  2 +-
> >  src/backend/access/heap/heapam_handler.c |  2 +-
> >  src/backend/catalog/pg_publication.c     |  2 +-
> >  src/backend/commands/tablecmds.c         | 10 +++++-----
> >  src/backend/optimizer/util/plancat.c     |  3 +--
> >  src/backend/utils/cache/relcache.c       |  2 +-
> >  src/include/utils/rel.h                  | 10 ++++++++--
> >  src/include/utils/snapmgr.h              |  3 +--
> >  8 files changed, 19 insertions(+), 15 deletions(-)
> 
> This particular patch (introducing the RelationIsPermanent() macro)
> seems like it'd be a nice thing to commit independently of the rest,
> reducing the size of this patch set..? 

OK, if no one objects I will apply it in the next few days. The macro is
used more in my later patches, which I will not apply now.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: Key management with tests

From
Alvaro Herrera
Date:
Patch 10 uses the term "WAL-skip relations".  What does that mean?  Is
it "relations that are not WAL-logged"?  I suppose we already have a
term for this; I'm not sure it's a good idea to invent a different term
that is only used in this new place.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W



Re: Key management with tests

From
Stephen Frost
Date:
Greetings,

* Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
> Patch 10 uses the term "WAL-skip relations".  What does that mean?  Is
> it "relations that are not WAL-logged"?  I suppose we already have a
> term for this; I'm not sure it's a good idea to invent a different term
> that is only used in this new place.

This is discussed in src/backend/access/transam/README, specifically the
section that talks about Skipping WAL for New RelFileNode.  Basically,
it's the 'wal_level=minimal' optimization which allows WAL to be
skipped.

Thanks!

Stephen

Attachment

Re: Key management with tests

From
Alvaro Herrera
Date:
On 2021-Mar-18, Stephen Frost wrote:

> * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
> > Patch 10 uses the term "WAL-skip relations".  What does that mean?  Is
> > it "relations that are not WAL-logged"?  I suppose we already have a
> > term for this; I'm not sure it's a good idea to invent a different term
> > that is only used in this new place.
> 
> This is discussed in src/backend/access/transam/README, specifically the
> section that talks about Skipping WAL for New RelFileNode.  Basically,
> it's the 'wal_level=minimal' optimization which allows WAL to be
> skipped.

Hmm ... that talks about WAL-skipping *changes*, not WAL-skipping
*relations*.  I thought WAL-skipping meant unlogged relations, but
I understand now that that's unrelated.  In the transam/README, WAL-skip
means a change in a transaction in a relfilenode that, if rolled back,
would disappear; and I'm not sure I understand how the code is handling
the case that a relation is under that condition.

This caught my attention because a comment says "encryption does not
support WAL-skipped relations", but there's no direct change to the
definition of RelFileNodeSkippingWAL() to account for that.  Perhaps I
am just overlooking something, since I'm just skimming anyway.

-- 
Álvaro Herrera       Valdivia, Chile



Re: Key management with tests

From
Stephen Frost
Date:
Greetings,

* Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
> On 2021-Mar-18, Stephen Frost wrote:
>
> > * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
> > > Patch 10 uses the term "WAL-skip relations".  What does that mean?  Is
> > > it "relations that are not WAL-logged"?  I suppose we already have a
> > > term for this; I'm not sure it's a good idea to invent a different term
> > > that is only used in this new place.
> >
> > This is discussed in src/backend/access/transam/README, specifically the
> > section that talks about Skipping WAL for New RelFileNode.  Basically,
> > it's the 'wal_level=minimal' optimization which allows WAL to be
> > skipped.
>
> Hmm ... that talks about WAL-skipping *changes*, not WAL-skipping
> *relations*.  I thought WAL-skipping meant unlogged relations, but
> I understand now that that's unrelated.  In the transam/README, WAL-skip
> means a change in a transaction in a relfilenode that, if rolled back,
> would disappear; and I'm not sure I understand how the code is handling
> the case that a relation is under that condition.
>
> This caught my attention because a comment says "encryption does not
> support WAL-skipped relations", but there's no direct change to the
> definition of RelFileNodeSkippingWAL() to account for that.  Perhaps I
> am just overlooking something, since I'm just skimming anyway.

This is relatively current activity and so it's entirely possible
comments and perhaps code need further updating in this area, but to
explain what's going on in a bit more detail-

Ultimately, we need to make sure that LSNs aren't re-used.  There's two
sources of LSNs today: those for relations which are being written into
the WAL and those for relations which are not (UNLOGGED relations,
specifically).  The 'minimal' WAL level introduces complications with
this requirement because tables created (or truncated) inside a
transaction are considered permanent once they're committed, but the
data pages in those relations don't go into the WAL and the LSNs on the
pages of those relations aren't guaranteed to be either unique or even
necessarily set, and if we were to generate LSNs for those it would be
required to be done by actually advancing the WAL LSN, which would
require writing into the WAL and therefore wouldn't be quite the
optimization that's expected.

I'm not sure if it's been explicitly done yet but I believe the idea is,
based on my last discussion with Bruce, at least initially, simply
disallow encrypted clusters from running with wal_level=minimal to avoid
this issue.

Thanks,

Stephen

Attachment

Re: Key management with tests

From
Bruce Momjian
Date:
On Thu, Mar 18, 2021 at 02:37:43PM -0300, Álvaro Herrera wrote:
> On 2021-Mar-18, Stephen Frost wrote:
> > This is discussed in src/backend/access/transam/README, specifically the
> > section that talks about Skipping WAL for New RelFileNode.  Basically,
> > it's the 'wal_level=minimal' optimization which allows WAL to be
> > skipped.
> 
> Hmm ... that talks about WAL-skipping *changes*, not WAL-skipping
> *relations*.  I thought WAL-skipping meant unlogged relations, but
> I understand now that that's unrelated.  In the transam/README, WAL-skip
> means a change in a transaction in a relfilenode that, if rolled back,
> would disappear; and I'm not sure I understand how the code is handling
> the case that a relation is under that condition.
> 
> This caught my attention because a comment says "encryption does not
> support WAL-skipped relations", but there's no direct change to the
> definition of RelFileNodeSkippingWAL() to account for that.  Perhaps I
> am just overlooking something, since I'm just skimming anyway.

First, thanks for looking at these patches --- I know it isn't easy.

Second, you are right that I equated WAL-skipping relfilenodes with
relations, and this was wrong.  I have updated the attached patch to use
the term WAL-skipping "relfilenodes", and checked the rest of the
patches for any incorrect 'skipping' term, but didn't find any.

If "WAL-skipping relfilenodes" is not clear enough, we should probably
rename RelFileNodeSkippingWAL().

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.


Attachment

Re: Key management with tests

From
Bruce Momjian
Date:
On Thu, Mar 18, 2021 at 01:46:28PM -0400, Stephen Frost wrote:
> * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
> > This caught my attention because a comment says "encryption does not
> > support WAL-skipped relations", but there's no direct change to the
> > definition of RelFileNodeSkippingWAL() to account for that.  Perhaps I
> > am just overlooking something, since I'm just skimming anyway.
> 
> This is relatively current activity and so it's entirely possible
> comments and perhaps code need further updating in this area, but to
> explain what's going on in a bit more detail- 
> 
> Ultimately, we need to make sure that LSNs aren't re-used.  There's two
> sources of LSNs today: those for relations which are being written into
> the WAL and those for relations which are not (UNLOGGED relations,
> specifically).  The 'minimal' WAL level introduces complications with

Well, the story is a little more complex than that --- we currently have
four LSN uses:

1.  real LSNs for WAL-logged relfilenodes
2.  real LSNs for GiST indexes for non-WAL-logged relfilenodes of permanent relations
3.  fake LSNs for GiST indexes for relfilenodes of non-permanent relations
4.  zero LSNs for non-GiST non-permanent relations

This patch changes it so #4 gets fake LSNs, and slightly adjusts #2 & #3
so the LSNs are always unique.
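
(For reference, the fake-LSN machinery already exists in core:
GetFakeLSNForUnloggedRel() in src/backend/access/transam/xlog.c hands
out unique values from a counter kept apart from the real WAL insert
position, roughly:)

    XLogRecPtr
    GetFakeLSNForUnloggedRel(void)
    {
        XLogRecPtr  nextUnloggedLSN;

        /* increment the unloggedLSN counter, need SpinLock */
        SpinLockAcquire(&XLogCtl->ulsn_lck);
        nextUnloggedLSN = XLogCtl->unloggedLSN++;
        SpinLockRelease(&XLogCtl->ulsn_lck);

        return nextUnloggedLSN;
    }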

> I'm not sure if it's been explicitly done yet but I believe the idea is,
> based on my last discussion with Bruce, at least initially, simply
> disallow encrypted clusters from running with wal_level=minimal to avoid
> this issue.

I adjusted the hint bit code so it potentially could work with wal_level
minimal (just for safety), but the code disallows wal_level minimal, and
is documented as such.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: Key management with tests

From
Bruce Momjian
Date:
On Thu, Mar 18, 2021 at 11:31:34AM -0400, Stephen Frost wrote:
> >  src/backend/access/gist/gistutil.c       |  2 +-
> >  src/backend/access/heap/heapam_handler.c |  2 +-
> >  src/backend/catalog/pg_publication.c     |  2 +-
> >  src/backend/commands/tablecmds.c         | 10 +++++-----
> >  src/backend/optimizer/util/plancat.c     |  3 +--
> >  src/backend/utils/cache/relcache.c       |  2 +-
> >  src/include/utils/rel.h                  | 10 ++++++++--
> >  src/include/utils/snapmgr.h              |  3 +--
> >  8 files changed, 19 insertions(+), 15 deletions(-)
> 
> This particular patch (introducing the RelationIsPermanent() macro)
> seems like it'd be a nice thing to commit independently of the rest,
> reducing the size of this patch set..? 

Committed as suggested.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: Key management with tests

From
Bruce Momjian
Date:
On Mon, Mar 22, 2021 at 08:38:37PM -0400, Bruce Momjian wrote:
> > This particular patch (introducing the RelationIsPermanent() macro)
> > seems like it'd be a nice thing to commit independently of the rest,
> > reducing the size of this patch set..? 
> 
> Committed as suggested.

Also, I have written a short presentation on where I think we are with
cluster file encryption:

    https://momjian.us/main/writings/pgsql/cfe.pdf

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: Key management with tests

From
Neil Chen
Date:
Hi Bruce,

I went through these patches and executed the test script you added for the KMS section, which looks all good. 

This is a point that looks like a bug: in patch 10, you changed the location and use of *RelFileNodeSkippingWAL()*, but the modified logic seems different from the original when encryption is not enabled. After applying this patch, the set-LSN code path will still execute when RelFileNodeSkippingWAL() returns true and encryption is not enabled.
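
A hypothetical sketch of the guard that seems intended (the placement
follows the existing RelFileNodeSkippingWAL() check in
MarkBufferDirtyHint(); the "FileEncryptionEnabled" flag name is a guess,
not the actual patch code):

    /*
     * Without encryption there is no nonce to maintain, so a WAL-skipped
     * relfilenode must bail out before the page LSN is touched, exactly
     * as the unpatched code does.
     */
    if (RelFileNodeSkippingWAL(bufHdr->tag.rnode) && !FileEncryptionEnabled)
        return;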



On Thu, Apr 1, 2021 at 2:47 PM Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Mar 11, 2021 at 10:31:28PM -0500, Bruce Momjian wrote:
> I have made significant progress on the cluster file encryption feature so
> it is time for me to post a new set of patches.

Here is a rebase, to keep the cfbot green.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.



--
There is no royal road to learning.
HighGo Software Co.

Re: Key management with tests

From
Bruce Momjian
Date:
On Tue, Apr  6, 2021 at 04:56:36PM +0800, Neil Chen wrote:
> Hi Bruce,
> 
> I went through these patches and executed the test script you added for the KMS
> section, which looks all good. 

Thank you for checking it.  The src/test/crypto/t/003_clusterkey.pl test
is one of the craziest tests I have ever written, so I am glad it worked
for you.

> This is a point that looks like a bug: in patch 10, you changed the location
> and use of *RelFileNodeSkippingWAL()*, but the modified logic seems
> different from the original when encryption is not enabled. After applying
> this patch, the set-LSN code path will still execute when
> RelFileNodeSkippingWAL() returns true and encryption is not enabled.

You are very correct.  That 'return' inside the 'if' statement gave me
trouble, and MarkBufferDirtyHint() was the hardest function I had to
deal with.  Attached is an updated version of patches with a rebase; the
GitHub links listed on the wiki are updated too.

Thanks for your help.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.


Attachment

storing an explicit nonce

From
Robert Haas
Date:
On Thu, Mar 18, 2021 at 2:59 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Ultimately, we need to make sure that LSNs aren't re-used.  There's two
> > sources of LSNs today: those for relations which are being written into
> > the WAL and those for relations which are not (UNLOGGED relations,
> > specifically).  The 'minimal' WAL level introduces complications with
>
> Well, the story is a little more complex than that --- we currently have
> four LSN uses:
>
> 1.  real LSNs for WAL-logged relfilenodes
> > 2.  real LSNs for GiST indexes for non-WAL-logged relfilenodes of permanent relations
> > 3.  fake LSNs for GiST indexes for relfilenodes of non-permanent relations
> > 4.  zero LSNs for non-GiST non-permanent relations
>
> This patch changes it so #4 gets fake LSNs, and slightly adjusts #2 & #3
> so the LSNs are always unique.

Hi!

This approach has a few disadvantages. For example, right now, we only
need to WAL log hints for the first write to each page after a
checkpoint, but in this approach, if the same page is written multiple
times per checkpoint cycle, we'd need to log hints every time. In some
workloads that could be quite expensive, especially if we log an FPI
every time.

Also, I think that all sorts of non-permanent relations currently get
zero LSNs, not just GiST. Every unlogged table and every temporary
table would need to use fake LSNs. Moreover, for unlogged tables, the
buffer manager would need changes, because it is otherwise going to
assume that anything it sees in the pd_lsn field other than a zero is
a real LSN.

So I would like to propose an alternative: store the nonce in the
page. Now the next question is where to put it. I think that putting
it into the page header would be far too invasive, so I propose that
we instead store it at the end of the page, as part of the special
space. That makes an awful lot of code not really notice that anything
is different, because it always thought that the usable space on the
page ended where the special space begins, and it doesn't really care
where that is exactly. The code that knows about the special space
might care a little bit, but whatever private data it's storing is
going to be at the beginning of the special space, and the nonce would
be stored - in this proposal - at the end of the special space. So it
turns out that it doesn't really care that much either.
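
To make the layout concrete, a minimal sketch (the constant and function
names below are illustrative, not necessarily what the attached patches
use):

    /*
     * Page layout with the nonce at the very end of the special space:
     *
     *  +--------+-----------+ ... +-----------------+-------+
     *  | header | line ptrs |     | AM special data | nonce |
     *  +--------+-----------+ ... +-----------------+-------+
     *           ^ pd_lower        ^ pd_special             ^ BLCKSZ
     */
    #define TDE_NONCE_SIZE  16      /* illustrative fixed size */

    static inline char *
    PageGetNonce(Page page)
    {
        /* the nonce occupies the last TDE_NONCE_SIZE bytes of the page */
        return (char *) page + BLCKSZ - TDE_NONCE_SIZE;
    }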

Attached are a few WIP/POC patches from my colleague Bharath
implementing this. There are certainly some implementation
deficiencies here, which can be corrected if we decide this approach
is worth pursuing, but I think they are sufficient to show that the
approach is viable and also some of the consequences of going this
way. One thing that happens is that a bunch of values that used to be
constant - like TOAST_INDEX_TARGET and GinDataPageMaxDataSize - become
non-constant. I suggested to Bharath that he handle this by changing
those macros to take the nonce size as an argument, which is what the
patch does, although it missed pushing that idea down all the way in
some obscure cases (e.g. SIGLEN_MAX). That has the downside that we
will now have more computation to do at runtime vs. compile-time. I am
unclear whether there would be enough impact to get exercised about,
but I'm hopeful that the answer is "no".

As written, the patch makes initdb take a --tde-nonce-size argument,
but that's really just for demonstration purposes. I assume that, if
we decide to go this way, we'd have an initdb option that selects
whether to use encryption, or perhaps the specific encryption
algorithm to be used, and then the nonce size would be computed based
on that, or else set to 0 if encryption is not in use.

Comments?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-25 12:46:45 -0400, Robert Haas wrote:
> This approach has a few disadvantages. For example, right now, we only
> need to WAL log hints for the first write to each page after a
> checkpoint, but in this approach, if the same page is written multiple
> times per checkpoint cycle, we'd need to log hints every time. In some
> workloads that could be quite expensive, especially if we log an FPI
> every time.

Yes. I think it'd likely be prohibitively expensive in some situations.


> So I would like to propose an alternative: store the nonce in the
> page. Now the next question is where to put it. I think that putting
> it into the page header would be far too invasive, so I propose that
> we instead store it at the end of the page, as part of the special
> space. That makes an awful lot of code not really notice that anything
> is different, because it always thought that the usable space on the
> page ended where the special space begins, and it doesn't really care
> where that is exactly. The code that knows about the special space
> might care a little bit, but whatever private data it's storing is
> going to be at the beginning of the special space, and the nonce would
> be stored - in this proposal - at the end of the special space. So it
> turns out that it doesn't really care that much either.

The obvious concerns are issues around binary upgrades for cases that
already use the special space? Are you planning to address that by not
having that path? Or by storing the nonce at the "start" of the special
space (i.e. [normal data][nonce][existing special])?

Is there an argument for generalizing the nonce approach to replace
fake LSNs for unlogged relations?

Why is using pd_special better than finding space for a flag bit in the
header indicating whether it has a nonce? Using pd_special will burden
all code using special space, and maybe even some that does not (think
empty pages now occasionally having a non-zero pd_special), whereas
implementing it on the page level wouldn't quite have the same concerns.


> One thing that happens is that a bunch of values that used to be
> constant - like TOAST_INDEX_TARGET and GinDataPageMaxDataSize - become
> non-constant. I suggested to Bharath that he handle this by changing
> those macros to take the nonce size as an argument, which is what the
> patch does, although it missed pushing that idea down all the way in
> some obscure cases (e.g. SIGLEN_MAX). That has the downside that we
> will now have more computation to do at runtime vs. compile-time. I am
> unclear whether there would be enough impact to get exercised about,
> but I'm hopeful that the answer is "no".
> 
> As written, the patch makes initdb take a --tde-nonce-size argument,
> but that's really just for demonstration purposes. I assume that, if
> we decide to go this way, we'd have an initdb option that selects
> whether to use encryption, or perhaps the specific encryption
> algorithm to be used, and then the nonce size would be computed based
> on that, or else set to 0 if encryption is not in use.

I do suspect having only the "no nonce" or "nonce is a compile-time
constant" cases would be good performance-wise. Stuff like

> +#define MaxHeapTupleSizeLimit  (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + \
> +                           sizeof(ItemIdData)))
> +#define MaxHeapTupleSize(tdeNonceSize)  (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + \
> +                           sizeof(ItemIdData)) - MAXALIGN(tdeNonceSize))

won't be free.
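
(A sketch of the compile-time-constant alternative: pick the nonce size
once at build time so the arithmetic folds back into a constant; the
macro names here are hypothetical.)

    #ifdef USE_TDE
    #define TDE_NONCE_SIZE  16
    #else
    #define TDE_NONCE_SIZE  0
    #endif

    #define MaxHeapTupleSize  (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + \
                            sizeof(ItemIdData)) - MAXALIGN(TDE_NONCE_SIZE))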

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Robert Haas
Date:
On Tue, May 25, 2021 at 1:37 PM Andres Freund <andres@anarazel.de> wrote:
> The obvious concerns are issues around binary upgrades for cases that
> already use the special space? Are you planning to address that by not
> having that path? Or by storing the nonce at the "start" of the special
> space (i.e. [normal data][nonce][existing special])?

Well, there aren't any existing encrypted clusters, so what is the
scenario exactly? Perhaps you are thinking that we'd have a pg_upgrade
option that would take an unencrypted cluster and encrypt all the
pages, without any other page format changes. If so, this design would
preclude that choice, because there might be no free space available.

> Is there an argument for generalizing the nonce approach to replace
> fake LSNs for unlogged relations?

I hadn't thought about that. Maybe. But that would require including
the nonce always, rather than only when TDE is selected, or including
it always in some kinds of pages and only conditionally in others,
which seems more complex.

> Why is using pd_special better than finding space for a flag bit in the
> header indicating whether it has a nonce? Using pd_special will burden
> all code using special space, and maybe even some that does not (think
> empty pages now occasionally having a non-zero pd_special), whereas
> implementing it on the page level wouldn't quite have the same concerns.

Well, I think there's a lot of code that knows where the line pointer
array starts, and all those calculations will have to become more
complex at runtime if we put the nonce anywhere near the start of the
page. I think there are way fewer things that care about the end of
the page. I dislike the idea that every call to PageGetItem() would
need to know the nonce size - there are hundreds of those calls and
making them more expensive seems a lot worse than the stuff this patch
changes.

It's always possible that I'm confused here, either about what you are
proposing or how impactful it would actually be...

> I do suspect having only the "no nonce" or "nonce is a compile time
> constant" cases would be good performance wise. Stuff like
>
> > +#define MaxHeapTupleSizeLimit  (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + \
> > +                                                sizeof(ItemIdData)))
> > +#define MaxHeapTupleSize(tdeNonceSize)  (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + \
> > +                                                sizeof(ItemIdData)) - MAXALIGN(tdeNonceSize))
>
> won't be free.

One question here is whether we're comfortable saying that the nonce
is entirely constant. I wasn't sure about that. It seems possible to
me that different encryption algorithms might want nonces of different
sizes, either now or in the future. I am not a cryptographer, but that
seemed like a bit of a limiting assumption. So Bharath and I decided
to make the POC cater to a fully variable-size nonce rather than
zero-or-some-constant. However, if the consensus is that
zero-or-some-constant is better, fair enough! The patch can certainly
be adjusted to work that way.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 12:46:45PM -0400, Robert Haas wrote:
> On Thu, Mar 18, 2021 at 2:59 PM Bruce Momjian <bruce@momjian.us> wrote:
> > > Ultimately, we need to make sure that LSNs aren't re-used.  There's two
> > > sources of LSNs today: those for relations which are being written into
> > > the WAL and those for relations which are not (UNLOGGED relations,
> > > specifically).  The 'minimal' WAL level introduces complications with
> >
> > Well, the story is a little more complex than that --- we currently have
> > four LSN uses:
> >
> > 1.  real LSNs for WAL-logged relfilenodes
> > 2.  real LSNs for GiST indexes for non-WAL-logged relfilenodes of permanent relations
> > 3.  fake LSNs for GiST indexes for relfilenodes of non-permanent relations
> > 4.  zero LSNs for non-GiST non-permanent relations
> >
> > This patch changes it so #4 gets fake LSNs, and slightly adjusts #2 & #3
> > so the LSNs are always unique.
> 
> Hi!
> 
> This approach has a few disadvantages. For example, right now, we only
> need to WAL log hints for the first write to each page after a
> checkpoint, but in this approach, if the same page is written multiple
> times per checkpoint cycle, we'd need to log hints every time. In some
> workloads that could be quite expensive, especially if we log an FPI
> every time.

Well, if we create a separate nonce counter, we still need to make sure
it doesn't go backwards during a crash, so we have to WAL log it
somehow, perhaps at a certain interval, like every 1k values, advancing
the counter by 1k during crash recovery, like we do with the oid counter
now, I think.
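
For comparison, a rough sketch of how that could mirror the existing oid
logic, where XLogPutNextOid() logs the counter VAR_OID_PREFETCH values
ahead; the names below are hypothetical:

    #define NONCE_PREFETCH  1024    /* log the counter this far ahead */

    static uint64 nextNonce;        /* next value to hand out */
    static uint64 loggedNonce;      /* highest value covered by WAL */

    static uint64
    GetNextNonce(void)
    {
        if (nextNonce >= loggedNonce)
        {
            /* after a crash, redo restarts the counter at loggedNonce */
            loggedNonce = nextNonce + NONCE_PREFETCH;
            XLogPutNonceCounter(loggedNonce);   /* hypothetical record */
        }
        return nextNonce++;
    }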

The buffer encryption overhead is 2-4%, and WAL encryption is going to
add to that, so I thought hint bit logging overhead would be minimal
in comparison.

> Also, I think that all sorts of non-permanent relations currently get
> zero LSNs, not just GiST. Every unlogged table and every temporary
> table would need to use fake LSNs. Moreover, for unlogged tables, the
> buffer manager would need changes, because it is otherwise going to
> assume that anything it sees in the pd_lsn field other than a zero is
> a real LSN.

Have you looked at the code, specifically EncryptPage():

    https://github.com/postgres/postgres/compare/bmomjian:cfe-11-gist..bmomjian:_cfe-12-rel.patch

+    if (!relation_is_permanent && !is_gist_page_or_similar)
+        PageSetLSN(page, LSNForEncryption(relation_is_permanent));


It assigns an LSN to unlogged pages.  As for the buffer manager seeing
fake LSNs, that already happens for GiST indexes, so I just built on
that --- it seemed to work fine.

> So I would like to propose an alternative: store the nonce in the
> page. Now the next question is where to put it. I think that putting
> it into the page header would be far too invasive, so I propose that
> we instead store it at the end of the page, as part of the special
> space. That makes an awful lot of code not really notice that anything
> is different, because it always thought that the usable space on the
> page ended where the special space begins, and it doesn't really care
> where that is exactly. The code that knows about the special space
> might care a little bit, but whatever private data it's storing is
> going to be at the beginning of the special space, and the nonce would
> be stored - in this proposal - at the end of the special space. So it
> turns out that it doesn't really care that much either.

I think the big problem with that is that it adds a new counter, with
new code, and it makes adding encryption offline, like we do for adding
checksums, pretty much impossible since the page might not have space
for a nonce.  It also makes the idea of adding encryption as part of
pg_upgrade's non-link mode impossible, at least for me.  ;-)

I have to ask why we should consider adding it to the special space,
since my current version seems fine, has minimal code impact, and
has some advantages over using the special space.  Is it because of the
WAL hint overhead, or for a cleaner API, or something else?

Also, I need help with all the XXX comments I have in my patches before
I can move forward:

    https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Patches

I stopped working on this to get beta out the door, but next week it
would be nice to continue on this.  However, I want to get this patch
into a state where everyone is happy with it, rather than adding more
code with an unclear future.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 02:25:21PM -0400, Robert Haas wrote:
> One question here is whether we're comfortable saying that the nonce
> is entirely constant. I wasn't sure about that. It seems possible to
> me that different encryption algorithms might want nonces of different
> sizes, either now or in the future. I am not a cryptographer, but that
> seemed like a bit of a limiting assumption. So Bharath and I decided
> to make the POC cater to a fully variable-size nonce rather than
> zero-or-some-constant. However, if the consensus is that
> zero-or-some-constant is better, fair enough! The patch can certainly
> be adjusted to work that way.

A 16-byte nonce is sufficient for AES and I doubt we will need anything
stronger than AES256 anytime soon.  Making the nonce variable length
seems like it just adds complexity for little purpose.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 10:37:32AM -0700, Andres Freund wrote:
> The obvious concerns are issues around binary upgrades for cases that
> already use the special space? Are you planning to address that by not
> having that path? Or by storing the nonce at the "start" of the special
> space (i.e. [normal data][nonce][existing special])?
> 
> Is there an argument for generalizing the nonce approach to replace
> fake LSNs for unlogged relations?
> 
> Why is using pd_special better than finding space for a flag bit in the
> header indicating whether it has a nonce? Using pd_special will burden
> all code using special space, and maybe even some that does not (think
> empty pages now occasionally having a non-zero pd_special), whereas
> implementing it on the page level wouldn't quite have the same concerns.

My code can already identify if the LSN is fake or not --- why can't we
build on that?  Can someone show that logging WAL hint bits causes
unacceptable overhead beyond the encryption overhead?  I don't think we
even know that since we don't know the overhead of encrypting WAL.

One crazy idea would be to not log WAL hints, but rather use an LSN
range that will never be valid for real LSNs, like the high bit being
set.  That special range would need to be WAL-logged, but again, perhaps
only every 1k values, incrementing by 1k after a crash.

This discussion has cemented what I had already considered --- that
doing a separate nonce will make this feature less usable/upgradable,
and take it beyond my ability or desire to complete.

Ideally, what I would like to do is to resolve my XXX questions in my
patches, get everyone happy with what we have, then let me do the WAL
encryption.  We can then see if logging hint bits is significant
overhead, and if it is, go with a special LSN range for fake LSNs.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Robert Haas
Date:
On Tue, May 25, 2021 at 2:45 PM Bruce Momjian <bruce@momjian.us> wrote:
> Well, if we create a separate nonce counter, we still need to make sure
> it doesn't go backwards during a crash, so we have to WAL log it

I think we don't really need a global counter, do we? We could simply
increment the nonce every time we write the page. If we want to avoid
using the same IV for different pages, then 8 bytes of the nonce could
store a value that's different for every page, and the other 8 bytes
could store a counter. Presumably we won't manage to write the same
page more than 2^64 times, since LSNs are limited to be <2^64, and
those are consumed more than 1 byte at a time for every change to any
page anywhere.
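
As a sketch of that layout, with purely illustrative names:

    /*
     * 16-byte nonce: 8 bytes identifying the page plus an 8-byte
     * per-page write counter, per the scheme described above.
     */
    typedef struct PageNonce
    {
        uint64      page_id;        /* unique per page, never reused */
        uint64      write_count;    /* bumped on every write of the page */
    } PageNonce;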

> The buffer encryption overhead is 2-4%, and WAL encryption is going to
> add to that, so I thought hint bit logging overhead would be minimal
> in comparison.

I think it depends. If buffer evictions are rare, then it won't matter
much. But if they are common, then using the LSN as the nonce will add
a lot of overhead.

> Have you looked at the code, specifically EncryptPage():
>
>         https://github.com/postgres/postgres/compare/bmomjian:cfe-11-gist..bmomjian:_cfe-12-rel.patch
>
> +       if (!relation_is_permanent && !is_gist_page_or_similar)
> +               PageSetLSN(page, LSNForEncryption(relation_is_permanent));
>
>
> It assigns an LSN to unlogged pages.  As for the buffer manager seeing
> fake LSNs, that already happens for GiST indexes, so I just built on
> that --- it seemed to work fine.

I had not, but I don't see why this issue is specific to GiST rather
than common to every kind of unlogged and temporary relation.

> I have to ask why we should consider adding it to the special space,
> since my current version seems fine, and has minimal code impact, and
> has some advantages over using the special space.  Is it because of the
> WAL hint overhead, or for a cleaner API, or something else?

My concern is about the overhead, and also the code complexity. I
think that making sure that the LSN gets changed in all cases may be
fairly tricky.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 03:09:03PM -0400, Robert Haas wrote:
> On Tue, May 25, 2021 at 2:45 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Well, if we create a separate nonce counter, we still need to make sure
> > it doesn't go backwards during a crash, so we have to WAL log it
> 
> I think we don't really need a global counter, do we? We could simply
> increment the nonce every time we write the page. If we want to avoid
> using the same IV for different pages, then 8 bytes of the nonce could
> store a value that's different for every page, and the other 8 bytes
> could store a counter. Presumably we won't manage to write the same
> page more than 2^64 times, since LSNs are limited to be <2^64, and
> those are consumed more than 1 byte at a time for every change to any
> page anywhere.

The issue we had here is: what do you use as a special value for each
relation, and where do you store it if it is not computed?  You can use
a global counter to assign the per-page nonce that doesn't change when
the page is updated, but that would still be a global counter.

Also, when you change hint bits, either you don't change the nonce/LSN,
and don't re-encrypt the page (and the hint bit changes are visible), or
you change the nonce and re-encrypt the page, and you are then WAL
logging the page.  I don't see how having a nonce different from the LSN
helps here.

> > The buffer encryption overhead is 2-4%, and WAL encryption is going to
> > add to that, so I thought hint bit logging overhead would be minimal
> > in comparison.
> 
> I think it depends. If buffer evictions are rare, then it won't matter
> much. But if they are common, then using the LSN as the nonce will add
> a lot of overhead.

Well, see above.  A separate nonce somewhere else doesn't help much, as
I see it.

> > Have you looked at the code, specifically EncryptPage():
> >
> >         https://github.com/postgres/postgres/compare/bmomjian:cfe-11-gist..bmomjian:_cfe-12-rel.patch
> >
> > +       if (!relation_is_permanent && !is_gist_page_or_similar)
> > +               PageSetLSN(page, LSNForEncryption(relation_is_permanent));
> >
> >
> > It assigns an LSN to unlogged pages.  As for the buffer manager seeing
> > fake LSNs, that already happens for GiST indexes, so I just built on
> > that --- it seemed to work fine.
> 
> I had not, but I don't see why this issue is specific to GiST rather
> than common to every kind of unlogged and temporary relation.
> 
> > I have to ask why we should consider adding it to the special space,
> > since my current version seems fine, and has minimal code impact, and
> > has some advantages over using the special space.  Is it because of the
> > WAL hint overhead, or for a cleaner API, or something else?
> 
> My concern is about the overhead, and also the code complexity. I
> think that making sure that the LSN gets changed in all cases may be
> fairly tricky.

Please look over the patch to see if I missed anything --- for me, it
seemed quite clear, and I am not an expert in that area of the code.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 03:20:06PM -0400, Bruce Momjian wrote:
> Also, when you change hint bits, either you don't change the nonce/LSN,
> and don't re-encrypt the page (and the hint bit changes are visible), or
> you change the nonce and re-encrypt the page, and you are then WAL
> logging the page.  I don't see how having a nonce different from the LSN
> helps here.

Let me go into more detail here.  The general rule is that you never
encrypt _different_ data with the same key/nonce.  Now, since a hint bit
change changes the data, it should get a new nonce, and since it is a
newly encrypted page (using a new nonce), it should be WAL logged
because a torn page would make the data unreadable.

Now, if we want to consult some security experts and have them tell us
the hint bit visibility is not a problem, we could get by without using a
new nonce for hint bit changes, and in that case it doesn't matter if we
have a separate LSN or custom nonce --- it doesn't get changed for hint
bit changes.

My point is that we have to full-page-write in cases where we change the
nonce --- we get a new LSN/nonce for free if we are using the LSN as the
nonce.  What has made this approach much easier is that you basically
tie a change of the nonce to require a change of LSN, since you are WAL
logging it and every nonce change has to be full-page-write WAL logged. 
This makes the LSN-as-nonce less fragile to breakage than a custom
nonce, in my opinion, which may explain why my patch is so small.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 03:34:04PM -0400, Bruce Momjian wrote:
> Let me go into more detail here.  The general rule is that you never
> encrypt _different_ data with the same key/nonce.  Now, since a hint bit
> change changes the data, it should get a new nonce, and since it is a
> newly encrypted page (using a new nonce), it should be WAL logged
> because a torn page would make the data unreadable.
> 
> Now, if we want to consult some security experts and have them tell us
> the hint bit visibility is not a problem, we could get by without using a
> new nonce for hint bit changes, and in that case it doesn't matter if we
> have a separate LSN or custom nonce --- it doesn't get changed for hint
> bit changes.
> 
> My point is that we have to full-page-write in cases where we change the
> nonce --- we get a new LSN/nonce for free if we are using the LSN as the
> nonce.  What has made this approach much easier is that you basically
> tie a change of the nonce to require a change of LSN, since you are WAL
> logging it and every nonce change has to be full-page-write WAL logged. 
> This makes the LSN-as-nonce less fragile to breakage than a custom
> nonce, in my opinion, which may explain why my patch is so small.

This issue is covered at the bottom of this patch to the README file:

    https://github.com/postgres/postgres/compare/bmomjian:cfe-01-doc..bmomjian:_cfe-02-internaldoc.patch

    Hint Bits
    - - - - -
    
    For hint bit changes, the LSN normally doesn't change, which is
    a problem.  By enabling wal_log_hints, you get full page writes
    to the WAL after the first hint bit change of the checkpoint.
    This is useful for two reasons.  First, it generates a new LSN,
    which is needed for the IV to be secure.  Second, full page images
    protect against torn pages, which is an even bigger requirement for
    encryption because the new LSN is re-encrypting the entire page,
    not just the hint bit changes.    You can safely lose the hint bit
    changes, but you need to use the same LSN to decrypt the entire
    page, so a torn page with an LSN change cannot be decrypted.
    To prevent this, wal_log_hints guarantees that the pre-hint-bit
    version (and previous LSN version) of the page is restored.
    
    However, if a hint-bit-modified page is written to the file system
    during a checkpoint, and there is a later hint bit change switching
    the same page from clean to dirty during the same checkpoint, we
    need a new LSN, and wal_log_hints doesn't give us a new LSN here.
    The fix for this is to update the page LSN by writing a dummy
    WAL record via xloginsert.c::LSNForEncryption() in such cases.

Let me know if it needs more detail.
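
For concreteness, such a dummy record can be as small as the existing
XLOG_SWITCH pattern in RequestXLogSwitch(); a hedged sketch, with the
choice of record type being illustrative only:

    XLogRecPtr
    LSNForEncryption(void)
    {
        /* a payload-free record whose only job is to consume an LSN */
        XLogBeginInsert();
        return XLogInsert(RM_XLOG_ID, XLOG_NOOP);
    }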

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

On Tue, May 25, 2021 at 14:56 Bruce Momjian <bruce@momjian.us> wrote:
On Tue, May 25, 2021 at 02:25:21PM -0400, Robert Haas wrote:
> One question here is whether we're comfortable saying that the nonce
> is entirely constant. I wasn't sure about that. It seems possible to
> me that different encryption algorithms might want nonces of different
> sizes, either now or in the future. I am not a cryptographer, but that
> seemed like a bit of a limiting assumption. So Bharath and I decided
> to make the POC cater to a fully variable-size nonce rather than
> zero-or-some-constant. However, if the consensus is that
> zero-or-some-constant is better, fair enough! The patch can certainly
> be adjusted to work that way.

A 16-byte nonce is sufficient for AES and I doubt we will need anything
stronger than AES256 anytime soon.  Making the nonce variable length
seems like it just adds complexity for little purpose.

I’d like to review this more and make sure using the special space is possible, but if it is, it opens up a huge new possibility: we could use it for both the nonce AND an appropriately sized tag, giving us integrity along with encryption, which would be a very significant additional feature.  I’d considered using a fork instead, but having it on the page would be far better.

I’ll also note that we could possibly even find an alternative use for the space used for checksums, or leave them as they are today, though at that point they’d be redundant with the tag.

Lastly, if the special space is actually able to be variable in size and we could, say, store a flag in pg_class which tells us what’s in the special space, then we could possibly give users the option of including the tag on each page, or the choice of tag size, or use the space for other interesting things in the future outside of encryption and data integrity.

Overall, I’m quite interested in the idea of making the special space able to be variable.  I do accept that this would rule out things like physical replication between an unencrypted cluster and an encrypted one, but the advantages seem worthwhile, and users would still be able to leverage logical replication to perform such a migration with relatively little downtime.

Thanks!

Stephen

Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

On Tue, May 25, 2021 at 15:09 Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 25, 2021 at 2:45 PM Bruce Momjian <bruce@momjian.us> wrote:
> Well, if we create a separate nonce counter, we still need to make sure
> it doesn't go backwards during a crash, so we have to WAL log it

I think we don't really need a global counter, do we? We could simply
increment the nonce every time we write the page. If we want to avoid
using the same IV for different pages, then 8 bytes of the nonce could
store a value that's different for every page, and the other 8 bytes
could store a counter. Presumably we won't manage to write the same
page more than 2^64 times, since LSNs are limited to be <2^64, and
those are consumed more than 1 byte at a time for every change to any
page anywhere.

The nonce does need to be absolutely unique for a given encryption key and therefore needs to be global in some form.

Thanks!

Stephen

Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-25 15:34:04 -0400, Bruce Momjian wrote:
> My point is that we have to full-page-write in cases where we change the
> nonce --- we get a new LSN/nonce for free if we are using the LSN as the
> nonce.  What has made this approach much easier is that you basically
> tie a change of the nonce to require a change of LSN, since you are WAL
> logging it and every nonce change has to be full-page-write WAL logged.
> This makes the LSN-as-nonce less fragile to breakage than a custom
> nonce, in my opinion, which may explain why my patch is so small.

This disregards that we need to be able to increment nonces on standbys
/ during crash recovery.

It may look like that's not needed, with a (wrong!) argument like: The
only writes come from crash recovery, which are always associated with a
WAL record, guaranteeing nonce increases. Hint bits are not an issue
because they don't mark the buffer dirty.

But unfortunately that analysis is wrong. Consider the following
sequence:

1) replay record LSN X affecting page Y (FPI replay)
2) write out Y, encrypt Y using X as nonce
3) crash
4) replay record LSN X affecting page Y (FPI replay)
5) hint bit update to Y, resulting in Y'
6) write out Y', encrypt Y' using X as nonce

While 5) did not mark the page as dirty, it still modified the page
contents. Which means that we'd encrypt different content with the same
nonce - which is not allowed.

I'm pretty sure that there are several other ways to end up with page
contents that differ, despite the LSN not changing.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 01:54:21PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2021-05-25 15:34:04 -0400, Bruce Momjian wrote:
> > My point is that we have to full-page-write in cases where we change the
> > nonce --- we get a new LSN/nonce for free if we are using the LSN as the
> > nonce.  What has made this approach much easier is that you basically
> > tie a change of the nonce to require a change of LSN, since you are WAL
> > logging it and every nonce change has to be full-page-write WAL logged.
> > This makes the LSN-as-nonce less fragile to breakage than a custom
> > nonce, in my opinion, which may explain why my patch is so small.
> 
> This disregards that we need to be able to increment nonces on standbys
> / during crash recovery.
> 
> It may look like that's not needed, with a (wrong!) argument like: The
> only writes come from crash recovery, which are always associated with a
> WAL record, guaranteeing nonce increases. Hint bits are not an issue
> because they don't mark the buffer dirty.
> 
> But unfortunately that analysis is wrong. Consider the following
> sequence:
> 
> 1) replay record LSN X affecting page Y (FPI replay)
> 2) write out Y, encrypt Y using X as nonce
> 3) crash
> 4) replay record LSN X affecting page Y (FPI replay)
> 5) hint bit update to Y, resulting in Y'
> 6) write out Y', encrypt Y' using X as nonce
> 
> While 5) did not mark the page as dirty, it still modified the page
> contents. Which means that we'd encrypt different content with the same
> nonce - which is not allowed.
> 
> I'm pretty sure that there are several other ways to end up with page
> contents that differ, despite the LSN not changing.

Yes, I can see that happening.  I consider occasional leakage of hint
bit changes to be acceptable.  We might decide they are all acceptable.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 03:20:06PM -0400, Bruce Momjian wrote:
> > Also, when you change hint bits, either you don't change the nonce/LSN,
> > and don't re-encrypt the page (and the hint bit changes are visible), or
> > you change the nonce and re-encrypt the page, and you are then WAL
> > logging the page.  I don't see how having a nonce different from the LSN
> > helps here.
>
> Let me go into more detail here.  The general rule is that you never
> encrypt _different_ data with the same key/nonce.  Now, since a hint bit
> change changes the data, it should get a new nonce, and since it is a
> newly encrypted page (using a new nonce), it should be WAL logged
> because a torn page would make the data unreadable.

Right.

> Now, if we want to consult some security experts and have them tell us
> the hint bit visibility is not a problem, we could get by without using a
> new nonce for hint bit changes, and in that case it doesn't matter if we
> have a separate LSN or custom nonce --- it doesn't get changed for hint
> bit changes.

I do think it's reasonable to consider having hint bits not included in
the encrypted part of the page and therefore remove the need to produce
a new nonce for each hint bit change.  Naturally, there's always an
increased risk when any data in the system isn't encrypted but given
the other parts of the system which aren't being encrypted as part of
this effort it hardly seems like a significant increase of overall risk.
I don't believe that any of the auditors and security teams I've
discussed TDE with would have issue with hint bits not being encrypted-
the principal concern has always been the primary data.

Naturally, the more we are able to encrypt and the more we can do to
provide data integrity validation, the more places PG may be usable,
which argues for having some way of making these choices options which
a user could decide at initdb time, or at least contemplating a road
map toward offering users the option to have other parts of the system
be encrypted and ideally have data integrity checks.  I don't think we
necessarily have to solve everything right now in that regard- just
having TDE in some form will open up quite a few new possibilities for
v15, even if it doesn't include data integrity validation beyond our
existing checksums and doesn't encrypt hint bits.

Thanks,

Stephen

Attachment

Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 04:29:08PM -0400, Stephen Frost wrote:
> Greetings,
> 
> On Tue, May 25, 2021 at 14:56 Bruce Momjian <bruce@momjian.us> wrote:
> 
>     On Tue, May 25, 2021 at 02:25:21PM -0400, Robert Haas wrote:
>     > One question here is whether we're comfortable saying that the nonce
>     > is entirely constant. I wasn't sure about that. It seems possible to
>     > me that different encryption algorithms might want nonces of different
>     > sizes, either now or in the future. I am not a cryptographer, but that
>     > seemed like a bit of a limiting assumption. So Bharath and I decided
>     > to make the POC cater to a fully variable-size nonce rather than
>     > zero-or-some-constant. However, if the consensus is that
>     > zero-or-some-constant is better, fair enough! The patch can certainly
>     > be adjusted to work that way.
> 
>     A 16-byte nonce is sufficient for AES and I doubt we will need anything
>     stronger than AES256 anytime soon.  Making the nonce variable length
>     seems to just add complexity for little purpose.
> 
> 
> I’d like to review this more and make sure using the special space is possible
> but if it is then it opens up a huge new possibility that we could use it for
> both the nonce AND an appropriately sized tag, giving us integrity along with
> encryption which would be a very significant additional feature.  I’d
> considered using a fork instead but having it on the page would be far better.

We already discussed that there are too many other ways to break system
integrity that are not encrypted/integrity-checked, e.g., changes to
clog.  Do you disagree?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 05:04:50PM -0400, Stephen Frost wrote:
> > Now, if we want to consult some security experts and have them tell us
> > the hint bit visibility is not a problem, we could get by without using a
> > new nonce for hint bit changes, and in that case it doesn't matter if we
> > have a separate LSN or custom nonce --- it doesn't get changed for hint
> > bit changes.
> 
> I do think it's reasonable to consider having hint bits not included in
> the encrypted part of the page and therefore remove the need to produce
> a new nonce for each hint bit change.  Naturally, there's always an
> increased risk when any data in the system isn't encrypted but given
> the other parts of the system which aren't being encrypted as part of
> this effort it hardly seems like a significant increase of overall risk.
> I don't believe that any of the auditors and security teams I've
> discussed TDE with would have issue with hint bits not being encrypted-
> the principal concern has always been the primary data.

OK, this is good to know.  I know the never-reuse rule, so it is good to
know it can be relaxed for certain data without causing problems in
other places.  Should I modify my patch to do this?

FYI, technically, the hint bit is still encrypted, but could _flip_ in
the encrypted file if changed, so that's why we say it is visible.  If
we used a block cipher instead of a streaming one (CTR), this might not
work because later blocks are based on the output of earlier blocks.

> Naturally, the more we are able to encrypt and the more we can do to
> provide data integrity validation, the more possibilities open up for PG
> to be used in even more places, which argues for having some way of making
> these choices be options which a user could decide at initdb time, or at
> least contemplating a road map to where we could offer users the option
> to have other parts of the system be encrypted and ideally have data
> integrity checks, but I don't think we necessarily have to solve
> everything right now in that regard- just having TDE in some form will
> open up quite a few new possibilities for v15, even if it doesn't
> include data integrity validation beyond our existing checksums and
> doesn't encrypt hint bits.

I am thinking full file-system encryption should still be used by people
needing that.  I am concerned that if we add too many
restrictions/additions on this feature, it will not be very useful.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 01:54:21PM -0700, Andres Freund wrote:
> > On 2021-05-25 15:34:04 -0400, Bruce Momjian wrote:
> > > My point is that we have to full-page-write cases where we change the
> > > nonce --- we get a new LSN/nonce for free if we are using the LSN as the
> > > nonce.  What has made this approach much easier is that you basically
> > > tie a change of the nonce to require a change of LSN, since you are WAL
> > > logging it and every nonce change has to be full-page-write WAL logged.
> > > This makes the LSN-as-nonce less fragile to breakage than a custom
> > > nonce, in my opinion, which may explain why my patch is so small.
> >
> > This disregards that we need to be able to increment nonces on standbys
> > / during crash recovery.
> >
> > It may look like that's not needed, with a (wrong!) argument like: The
> > only writes come from crash recovery, which always are associated with a
> > WAL record, guaranteeing nonce increases. Hint bits are not an issue
> > because they don't mark the buffer dirty.
> >
> > But unfortunately that analysis is wrong. Consider the following
> > sequence:
> >
> > 1) replay record LSN X affecting page Y (FPI replay)
> > 2) write out Y, encrypt Y using X as nonce
> > 3) crash
> > 4) replay record LSN X affecting page Y (FPI replay)
> > 5) hint bit update to Y, resulting in Y'
> > 6) write out Y', encrypt Y' using X as nonce
> >
> > While 5) did not mark the page as dirty, it still modified the page
> > contents. Which means that we'd encrypt different content with the same
> > nonce - which is not allowed.
> >
> > I'm pretty sure that there's several other ways to end up with page
> > contents that differ, despite the LSN not changing.
>
> Yes, I can see that happening.  I think occasional leakage of hint bit
> changes is acceptable.  We might decide they are all acceptable.

I don't think that I agree with the idea that this would ultimately only
leak the hint bits- I'm fairly sure that this would make it relatively
trivial for an attacker to be able to deduce the contents of the entire
8k page.  I don't know that we should be willing to accept that as a
part of regular operation (which we generally view crashes as being).  I
had thought there was something in place to address this though.  If
not, it does seem like there should be.

Thanks,

Stephen


Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 04:29:08PM -0400, Stephen Frost wrote:
> > On Tue, May 25, 2021 at 14:56 Bruce Momjian <bruce@momjian.us> wrote:
> >
> >     On Tue, May 25, 2021 at 02:25:21PM -0400, Robert Haas wrote:
> >     > One question here is whether we're comfortable saying that the nonce
> >     > is entirely constant. I wasn't sure about that. It seems possible to
> >     > me that different encryption algorithms might want nonces of different
> >     > sizes, either now or in the future. I am not a cryptographer, but that
> >     > seemed like a bit of a limiting assumption. So Bharath and I decided
> >     > to make the POC cater to a fully variable-size nonce rather than
> >     > zero-or-some-constant. However, if the consensus is that
> >     > zero-or-some-constant is better, fair enough! The patch can certainly
> >     > be adjusted to work that way.
> >
> >     A 16-byte nonce is sufficient for AES and I doubt we will need anything
> >     stronger than AES256 anytime soon.  Making the nonce variable length
> >     seems to just add complexity for little purpose.
> >
> >
> > I’d like to review this more and make sure using the special space is possible
> > but if it is then it opens up a huge new possibility that we could use it for
> > both the nonce AND an appropriately sized tag, giving us integrity along with
> > encryption which would be a very significant additional feature.  I’d
> > considered using a fork instead but having it on the page would be far better.
>
> We already discussed that there are too many other ways to break system
> integrity that are not encrypted/integrity-checked, e.g., changes to
> clog.  Do you disagree?

We had agreed that this wasn't something that was strictly required in
the first version and I continue to agree with that.  On the other hand,
if we decide that we ultimately need to use an independent nonce and
further that we can make room in the special space for it, then it's
trivial to also include the tag and we absolutely should (or make it
optional to do so) in that case.
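
For what it's worth, producing the tag is nearly free once you're doing
authenticated encryption anyway.  A rough, self-contained sketch of the
idea using OpenSSL's EVP interface (the function names here are
hypothetical, not from any patch): encrypt the page body with
AES-256-GCM using the nonce kept in the special space, store the
returned 16-byte tag beside it, and refuse to decrypt if either the
ciphertext or the tag has been altered-

    #include <openssl/evp.h>

    /* Hypothetical helper: returns 1 on success, 0 on failure. */
    static int
    page_encrypt_gcm(const unsigned char *key,
                     const unsigned char *nonce, int nonce_len,
                     const unsigned char *plain, int len,
                     unsigned char *cipher, unsigned char *tag)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int         outlen;
        int         ok;

        EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, NULL, NULL);
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_IVLEN, nonce_len, NULL);
        EVP_EncryptInit_ex(ctx, NULL, NULL, key, nonce);
        EVP_EncryptUpdate(ctx, cipher, &outlen, plain, len);
        ok = EVP_EncryptFinal_ex(ctx, cipher + outlen, &outlen) == 1 &&
            EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag) == 1;
        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }

    /* Hypothetical helper: returns 1 only if ciphertext and tag verify. */
    static int
    page_decrypt_gcm(const unsigned char *key,
                     const unsigned char *nonce, int nonce_len,
                     const unsigned char *cipher, int len,
                     unsigned char *plain, unsigned char *tag)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int         outlen;
        int         ok;

        EVP_DecryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, NULL, NULL);
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_IVLEN, nonce_len, NULL);
        EVP_DecryptInit_ex(ctx, NULL, NULL, key, nonce);
        EVP_DecryptUpdate(ctx, plain, &outlen, cipher, len);
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_TAG, 16, tag);
        /* the final call fails if the tag does not match */
        ok = EVP_DecryptFinal_ex(ctx, plain + outlen, &outlen) > 0;
        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }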

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 05:14:24PM -0400, Stephen Frost wrote:
> * Bruce Momjian (bruce@momjian.us) wrote:
> > Yes, I can see that happening.  I think occasional leakage of hint bit
> > changes is acceptable.  We might decide they are all acceptable.
> 
> I don't think that I agree with the idea that this would ultimately only
> leak the hint bits- I'm fairly sure that this would make it relatively
> trivial for an attacker to be able to deduce the contents of the entire
> 8k page.  I don't know that we should be willing to accept that as a
> part of regular operation (which we generally view crashes as being).  I
> had thought there was something in place to address this though.  If
> not, it does seem like there should be.

Uh, can you please explain more?  Would the hint bits leak?  In another
email you said hint bit leaking was OK.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 05:15:55PM -0400, Stephen Frost wrote:
> > We already discussed that there are too many other ways to break system
> > integrity that are not encrypted/integrity-checked, e.g., changes to
> > clog.  Do you disagree?
> 
> We had agreed that this wasn't something that was strictly required in
> the first version and I continue to agree with that.  On the other hand,
> if we decide that we ultimately need to use an independent nonce and
> further that we can make room in the special space for it, then it's
> trivial to also include the tag and we absolutely should (or make it
> optional to do so) in that case.

Well, if we can't really say the data has integrity, what do the
validation bytes accomplish?  And if we are going to encrypt everything
that would allow integrity, we need to encrypt almost the entire file
system.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 05:04:50PM -0400, Stephen Frost wrote:
> > > Now, if we want to consult some security experts and have them tell us
> > > the hint bit visibility is not a problem, we could get by without using a
> > > new nonce for hint bit changes, and in that case it doesn't matter if we
> > > have a separate LSN or custom nonce --- it doesn't get changed for hint
> > > bit changes.
> >
> > I do think it's reasonable to consider having hint bits not included in
> > the encrypted part of the page and therefore remove the need to produce
> > a new nonce for each hint bit change.  Naturally, there's always an
> > increased risk when any data in the system isn't encrypted but given
> > the other parts of the system which aren't being encrypted as part of
> > this effort it hardly seems like a significant increase of overall risk.
> > I don't believe that any of the auditors and security teams I've
> > discussed TDE with would have issue with hint bits not being encrypted-
> > the principal concern has always been the primary data.
>
> OK, this is good to know.  I know the never-reuse rule, so it is good to
> know it can be relaxed for certain data without causing problems in
> other places.  Should I modify my patch to do this?

Err, to be clear, I was saying that we could exclude the hint bits
*entirely* from what's being encrypted and I don't think that would be a
huge issue.  We still absolutely need to continue to implement a
never-reuse rule when it comes to nonces and making sure that we don't
encrypt different sets of data with the same key+nonce, it's just that
if we exclude the hint bits from encryption then we don't need to worry
about making sure to use a different nonce each time the hint bits
change- because they're no longer relevant.

> FYI, technically, the hint bit is still encrypted, but could _flip_ in
> the encrypted file if changed, so that's why we say it is visible.  If
> we used a block cipher instead of a streaming one (CTR), this might not
> work because later blocks are based on the output of earlier blocks.

No, in what I'm talking about, the hint bits would be entirely excluded
and therefore not encrypted.  I don't think we should keep the hint bits
as part of what's encrypted but not increase the nonce, that's dangerous
imv.

> > Naturally, the more we are able to encrypt and the more we can do to
> > provide data integrity validation, the more possibilities open up for PG
> > to be used in even more places, which argues for having some way of making
> > these choices be options which a user could decide at initdb time, or at
> > least contemplating a road map to where we could offer users the option
> > to have other parts of the system be encrypted and ideally have data
> > integrity checks, but I don't think we necessarily have to solve
> > everything right now in that regard- just having TDE in some form will
> > open up quite a few new possibilities for v15, even if it doesn't
> > include data integrity validation beyond our existing checksums and
> > doesn't encrypt hint bits.
>
> I am thinking full file-system encryption should still be used by people
> needing that.  I am concerned that if we add too many
> restrictions/additions on this feature, it will not be very useful.

I disagree in the long term but I'm fine with paring down what we
specifically work to address for v15.

Thanks,

Stephen


Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 05:14:24PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > Yes, I can see that happening.  I think occasional leakage of hint bit
> > > changes is acceptable.  We might decide they are all acceptable.
> >
> > I don't think that I agree with the idea that this would ultimately only
> > leak the hint bits- I'm fairly sure that this would make it relatively
> > trivial for an attacker to be able to deduce the contents of the entire
> > 8k page.  I don't know that we should be willing to accept that as a
> > part of regular operation (which we generally view crashes as being).  I
> > had thought there was something in place to address this though.  If
> > not, it does seem like there should be.
>
> Uh, can you please explain more?  Would the hint bits leak?  In another
> email you said hint bit leaking was OK.

See my recent email, think I clarified it well over there.

Thanks,

Stephen


Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 05:15:55PM -0400, Stephen Frost wrote:
> > > We already discussed that there are too many other ways to break system
> > > integrity that are not encrypted/integrity-checked, e.g., changes to
> > > clog.  Do you disagree?
> >
> > We had agreed that this wasn't something that was strictly required in
> > the first version and I continue to agree with that.  On the other hand,
> > if we decide that we ultimately need to use an independent nonce and
> > further that we can make room in the special space for it, then it's
> > trivial to also include the tag and we absolutely should (or make it
> > optional to do so) in that case.
>
> Well, if we can't really say the data has integrity, what do the
> validation bytes accomplish?  And if we are going to encrypt everything
> that would allow integrity, we need to encrypt almost the entire file
> system.

I'm not following this logic.  The primary data would be guaranteed to
be unchanged and there is absolutely value in that, even if the metadata
is not guaranteed to be unmolested.  Security always comes with a lot of
tradeoffs.  RLS doesn't prevent certain side-channel attacks but it
still is extremely useful in a great many cases.

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 05:22:43PM -0400, Stephen Frost wrote:
> * Bruce Momjian (bruce@momjian.us) wrote:
> > OK, this is good to know.  I know the never-reuse rule, so it is good to
> > know it can be relaxed for certain data without causing problems in
> > other places.  Should I modify my patch to do this?
> 
> Err, to be clear, I was saying that we could exclude the hint bits
> *entirely* from what's being encrypted and I don't think that would be a
> huge issue.  We still absolutely need to continue to implement a
> never-reuse rule when it comes to nonces and making sure that we don't
> encrypt different sets of data with the same key+nonce, it's just that
> if we exclude the hint bits from encryption then we don't need to worry
> about making sure to use a different nonce each time the hint bits
> change- because they're no longer relevant.

So, let me ask --- I thought CTR basically took an encrypted stream of
bits and XOR'ed them with the data.  If that is true, then why is
changing hint bits a problem?  We already can see some of the bit stream
by knowing some bytes of the page.  I do think skipping encryption of
just the hint bits is more complex, so I want to understand why it is
needed.  (This is a question I eventually wanted to discuss, just like
my XXX questions.)

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 05:25:36PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Tue, May 25, 2021 at 05:15:55PM -0400, Stephen Frost wrote:
> > > > We already discussed that there are too many other ways to break system
> > > > integrity that are not encrypted/integrity-checked, e.g., changes to
> > > > clog.  Do you disagree?
> > > 
> > > We had agreed that this wasn't something that was strictly required in
> > > the first version and I continue to agree with that.  On the other hand,
> > > if we decide that we ultimately need to use an independent nonce and
> > > further that we can make room in the special space for it, then it's
> > > trivial to also include the tag and we absolutely should (or make it
> > > optional to do so) in that case.
> > 
> > Well, if we can't really say the data has integrity, what do the
> > validation bytes accomplish?  And if we are going to encrypt everything
> > that would allow integrity, we need to encrypt almost the entire file
> > system.
> 
> I'm not following this logic.  The primary data would be guaranteed to
> be unchanged and there is absolutely value in that, even if the metadata
> is not guaranteed to be unmolested.  Security always comes with a lot of
> tradeoffs.  RLS doesn't prevent certain side-channel attacks but it
> still is extremely useful in a great many cases.

Well, changing the clog would change how the integrity-protected data is
interpreted, so I don't see much value in it.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-25 16:34:10 -0400, Stephen Frost wrote:
> The nonce does need to be absolutely unique for a given encryption key and
> therefore needs to be global in some form.

You can achieve that without a global counter though, by prepending a
per-relation nonce with some local counter.
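
Purely to illustrate what I mean (none of these names exist anywhere),
the 16 bytes could be carved up like this:

    #include <stdint.h>
    #include <string.h>

    /* hypothetical layout: per-relation component + local counter */
    typedef struct PageNonce
    {
        uint32_t    relid;      /* per-relation component */
        uint32_t    blkno;      /* block number */
        uint64_t    counter;    /* bumped on every re-encryption */
    } PageNonce;

    /* pack the pieces into the 16-byte IV handed to the cipher */
    static void
    nonce_to_bytes(const PageNonce *n, unsigned char out[16])
    {
        memcpy(out, &n->relid, sizeof(n->relid));
        memcpy(out + 4, &n->blkno, sizeof(n->blkno));
        memcpy(out + 8, &n->counter, sizeof(n->counter));
    }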

I'm doubtful it's worth it though - compared to all the other costs, one
shared atomic increment is a pretty OK price to pay, I think.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-25 17:04:50 -0400, Stephen Frost wrote:
> I do think it's reasonable to consider having hint bits not included in
> the encrypted part of the page and therefore remove the need to produce
> a new nonce for each hint bit change.

Huh. How are you going to track that efficiently? Do you want to mask
them out before writing? As far as I understand you can't just
re-encrypt a page with the same nonce, but different contents, without
leaking information that you can't have leaked, even if the differences
are not of a secret nature.

I don't think hint bits are the only way to end up with needing to
re-write a page with slightly different content, but the same LSN,
during recovery, after a crash.

I think it's just not going to fly to use LSNs as nonces, and that it's
not worth butchering all kinds of aspect of the system to make it appear
to work.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-25 17:22:43 -0400, Stephen Frost wrote:
> Err, to be clear, I was saying that we could exclude the hint bits
> *entirely* from what's being encrypted and I don't think that would be a
> huge issue.

It's a *huge* issue. For one, the computational effort of doing so would
be a problem. But there's a more fundamental issue: We don't even know
the type of the page at the time we write data out! We can't do a lookup
of pg_class in the checkpointer to see whether the page is a heap page
where we need to mask out hint bits.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-25 17:29:03 -0400, Bruce Momjian wrote:
> So, let me ask --- I thought CTR basically took an encrypted stream of
> > bits and XOR'ed them with the data.  If that is true, then why is
> changing hint bits a problem?  We already can see some of the bit stream
> by knowing some bytes of the page.

A *single* reuse of the nonce in CTR reveals nearly all of the
plaintext. As you say, the data is XORed with the key stream. Reusing
the nonce means that you reuse the key stream. Which in turn allows you
to do:
  (data ^ stream) ^ (data' ^ stream)
which can be simplified to
  (data ^ data')
thereby leaking all of data except the difference between data and
data'. That's why it's so crucial to ensure that stream *always* differs
between two rounds of encrypting "related" data.

We can't just "hope" that data doesn't change and use CTR.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 05:22:43PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > OK, this is good to know.  I know the never-reuse rule, so it is good to
> > > know it can be relaxed for certain data without causing problems in
> > > other places.  Should I modify my patch to do this?
> >
> > Err, to be clear, I was saying that we could exclude the hint bits
> > *entirely* from what's being encrypted and I don't think that would be a
> > huge issue.  We still absolutely need to continue to implement a
> > never-reuse rule when it comes to nonces and making sure that we don't
> > encrypt different sets of data with the same key+nonce, it's just that
> > if we exclude the hint bits from encryption then we don't need to worry
> > about making sure to use a different nonce each time the hint bits
> > change- because they're no longer relevant.
>
> So, let me ask --- I thought CTR basically took an encrypted stream of
> bits and XOR'ed them with the data.  If that is true, then why is
> changing hint bits a problem?  We already can see some of the bit stream
> by knowing some bytes of the page.  I do think skipping encryption of
> just the hint bits is more complex, so I want to understand why it is
> needed.  (This is a question I eventually wanted to discuss, just like
> my XXX questions.)

That's how CTR works, yes.  The issue that you run into is that once
you've got two pages which have different data but were encrypted with
the same key and nonce then you can use crib-dragging.

A good example of how this works is here:

http://travisdazell.blogspot.com/2012/11/many-time-pad-attack-crib-drag.html

Once you've got the two different pages which had the same key+nonce
used, you can XOR them together and then start cribbing, scanning the
page for legitimate data which doesn't have to be in the part of the
data that was different between the two original pages.
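
As a toy illustration of the cribbing step (nothing here is from a
patch), suppose the attacker has only x = c1 XOR c2 from two pages that
reused a CTR key+nonce; guessing a fragment of one plaintext at some
offset immediately yields the other plaintext at that offset:

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        /* the attacker never sees these two directly */
        const char *p1 = "the quick brown fox jumps over..";
        const char *p2 = "attack at dawn, bring the files.";
        const char *crib = "attack at dawn";    /* a guess at p2 */
        unsigned char x[32];
        char        recovered[32] = {0};
        size_t      i;

        /* c1 ^ c2 == p1 ^ p2, since the CTR key stream cancels out */
        for (i = 0; i < 32; i++)
            x[i] = p1[i] ^ p2[i];

        /* dragging the crib over x exposes p1 wherever the guess fits */
        for (i = 0; i < strlen(crib); i++)
            recovered[i] = x[i] ^ crib[i];

        printf("recovered from p1: \"%s\"\n", recovered);
        return 0;
    }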

Not sure what you're referring to in the second half ... simply knowing
that some of the data has a given plaintext (such as having a really
good idea that the word 'the' exists in a given message) doesn't provide
you the same level of information as two pages encrypted with the same
key+nonce but having different data.  Indeed, AES is generally believed
to be quite effective against even given plaintext attacks:


https://math.stackexchange.com/questions/51960/is-it-possible-to-guess-an-aes-key-from-a-series-of-messages-encrypted-with-that/57428

Thanks,

Stephen


Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 05:25:36PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > On Tue, May 25, 2021 at 05:15:55PM -0400, Stephen Frost wrote:
> > > > > We already discussed that there are too many other ways to break system
> > > > > integrity that are not encrypted/integrity-checked, e.g., changes to
> > > > > clog.  Do you disagree?
> > > >
> > > > We had agreed that this wasn't something that was strictly required in
> > > > the first version and I continue to agree with that.  On the other hand,
> > > > if we decide that we ultimately need to use an independent nonce and
> > > > further that we can make room in the special space for it, then it's
> > > > trivial to also include the tag and we absolutely should (or make it
> > > > optional to do so) in that case.
> > >
> > > Well, if we can't really say the data has integrity, what do the
> > > validation bytes accomplish?  And if we are going to encrypt everything
> > > that would allow integrity, we need to encrypt almost the entire file
> > > system.
> >
> > I'm not following this logic.  The primary data would be guaranteed to
> > be unchanged and there is absolutely value in that, even if the metadata
> > is not guaranteed to be unmolested.  Security always comes with a lot of
> > tradeoffs.  RLS doesn't prevent certain side-channel attacks but it
> > still is extremely useful in a great many cases.
>
> Well, changing the clog would change how the integrity-protected data is
> interpreted, so I don't see much value in it.

I hate to have to say it, but no, it's simply not correct to presume
that the ability to manipulate any data means that it's not valuable to
protect anything.  Further, while clog could be manipulated today,
hopefully one day it would become quite difficult to do so.  I'm not
asking for that today, or to be in v15, but if we do come down on the
side of making space in the special area for a nonce, then, even if you
don't feel it's useful, I would strongly argue to have an option for
space to also exist for a tag to go.

Even if your claim that it's useless until clog is addressed were
correct, which I dispute, surely if we do one day have such validation
of clog we would also need a tag in the regular user pages, so why not
add the option while it's easy to do and let users decide if it's useful
to them or not?

This does presume that we ultimately agree on the approach which
involves the special area, of course.

Thanks,

Stephen


Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2021-05-25 16:34:10 -0400, Stephen Frost wrote:
> > The nonce does need to be absolutely unique for a given encryption key and
> > therefore needs to be global in some form.
>
> You can achieve that without a global counter though, by prepending a
> per-relation nonce with some local counter.
>
> I'm doubtful it's worth it though - compared to all the other costs, one
> shared atomic increment is a pretty OK price to pay, I think.

Yes, I tend to agree.

Thanks,

Stephen


Re: storing an explicit nonce

From
Andres Freund
Date:
On 2021-05-25 19:48:54 -0400, Stephen Frost wrote:
> That's how CTR works, yes.  The issue that you run into is that once
> you've got two pages which have different data but were encrypted with
> the same key and nonce then you can use crib-dragging.
> 
> A good example of how this works is here:
> 
> http://travisdazell.blogspot.com/2012/11/many-time-pad-attack-crib-drag.html
> 
> Once you've got the two different pages which had the same key+nonce
> used, you can XOR them together and then start cribbing, scanning the
> page for legitimate data which doesn't have to be in the part of the
> data that was different between the two original pages.

IOW, pure hint-bit changes are the *dream* case for an attacker,
because any difference can just be ignored. All an attacker has to do is
to look at the writes, see if an IV repeats for a block, and the
attacker will get the *entire* page's worth of data. Either minus hint
bits (which are irrelevant), or with a trivial bit of inference even
that (because hint bits can only change in one direction).

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2021-05-25 17:04:50 -0400, Stephen Frost wrote:
> > I do think it's reasonable to consider having hint bits not included in
> > the encrypted part of the page and therefore remove the need to produce
> > a new nonce for each hint bit change.
>
> Huh. How are you going to track that efficiently? Do you want to mask
> them out before writing? As far as I understand you can't just
> re-encrypt a page with the same nonce, but different contents, without
> leaking information that must not be leaked, even if the differences
> are not of a secret nature.

The simple thought I had was masking them out, yes.  No, you can't
re-encrypt a different page with the same nonce.  (Re-encrypting the
exact same page with the same nonce, however, just yields the same
cryptotext and therefore is fine).
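
By masking I mean something along the lines of what
wal_consistency_checking already does for heap pages- a rough sketch
(simplified, and not taken from any patch) that clears the visibility
hint bits in a scratch copy of a heap page before that copy gets
encrypted:

    #include "postgres.h"

    #include "access/htup_details.h"
    #include "storage/bufpage.h"

    #define HEAP_HINT_BITS  (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID | \
                             HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID)

    /*
     * Clear the visibility hint bits on a copy of a heap page so that
     * hint-bit-only changes don't alter the cipher's input.  The hard
     * part, as noted elsewhere in the thread, is that the writer does
     * not always know whether the page is a heap page at all.
     */
    static void
    mask_heap_hint_bits(Page page)
    {
        OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
        OffsetNumber off;

        for (off = FirstOffsetNumber; off <= maxoff; off++)
        {
            ItemId      itemid = PageGetItemId(page, off);

            if (ItemIdIsNormal(itemid))
            {
                HeapTupleHeader htup = (HeapTupleHeader)
                    PageGetItem(page, itemid);

                htup->t_infomask &= ~HEAP_HINT_BITS;
            }
        }
    }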

> I don't think hint bits are the only way to end up with needing to
> re-write a page with slightly different content, but the same LSN,
> during recovery, after a crash.

Any other cases would have to be addressed if we were to use LSNs, of
course.

> I think it's just not going to fly to use LSNs as nonces, and that it's
> not worth butchering all kinds of aspect of the system to make it appear
> to work.

I do agree that we'd want to avoid "butchering all kinds of aspects of
the system" if possible. :)

Thanks!

Stephen


Re: storing an explicit nonce

From
Andres Freund
Date:
On 2021-05-25 17:15:55 -0400, Stephen Frost wrote:
> * Bruce Momjian (bruce@momjian.us) wrote:
> > We already discussed that there are too many other ways to break system
> > integrity that are not encrypted/integrity-checked, e.g., changes to
> > clog.  Do you disagree?
> 
> We had agreed that this wasn't something that was strictly required in
> the first version and I continue to agree with that.  On the other hand,
> if we decide that we ultimately need to use an independent nonce and
> further that we can make room in the special space for it, then it's
> trivial to also include the tag and we absolutely should (or make it
> optional to do so) in that case.

The page format for clog and that for relation data are unrelated.



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2021-05-25 17:22:43 -0400, Stephen Frost wrote:
> > Err, to be clear, I was saying that we could exclude the hint bits
> > *entirely* from what's being encrypted and I don't think that would be a
> > huge issue.
>
> It's a *huge* issue. For one, the computational effort of doing so would
> be a problem. But there's a more fundamental issue: We don't even know
> the type of the page at the time we write data out! We can't do a lookup
> of pg_class in the checkpointer to see whether the page is a heap page
> where we need to mask out hint bits.

Yeah, I hadn't been contemplating the challenge in figuring out if the
changes were hint bit changes or if it was some other page- merely
reflecting on the question of if hint bits, themselves, could possibly
be excluded.

Thanks,

Stephen


Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2021-05-25 17:15:55 -0400, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > We already discussed that there are too many other ways to break system
> > > integrity that are not encrypted/integrity-checked, e.g., changes to
> > > clog.  Do you disagree?
> >
> > We had agreed that this wasn't something that was strictly required in
> > the first version and I continue to agree with that.  On the other hand,
> > if we decide that we ultimately need to use an independent nonce and
> > further that we can make room in the special space for it, then it's
> > trivial to also include the tag and we absolutely should (or make it
> > optional to do so) in that case.
>
> The page format for clog and that for relation data are unrelated.

Indeed they are, but that's not relevant to the thrust of this specific
debate.

Bruce is arguing that because clog is unprotected, it's not useful
to protect relation data, with regard to data integrity validation as
provided by AES-GCM using/storing tags.  I dispute this, as relation
data is primary data while clog, for all its value, is still metadata.
Yes, impacting the metadata has an impact on the primary data, but it
doesn't *change* that primary data at its core (and it's also more
likely to be detected than random bit flipping in the relation data
would be, which is possible if you're only encrypting and not providing
any integrity validation).

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 08:03:14PM -0400, Stephen Frost wrote:
> Indeed they are, but that's not relevant to the thrust of this specific
> debate.
> 
> Bruce is arguing that because clog is unprotected, it's not useful
> to protect relation data, with regard to data integrity validation as
> provided by AES-GCM using/storing tags.  I dispute this, as relation
> data is primary data while clog, for all its value, is still metadata.
> Yes, impacting the metadata has an impact on the primary data, but it
> doesn't *change* that primary data at its core (and it's also more
> likely to be detected than random bit flipping in the relation data
> would be, which is possible if you're only encrypting and not providing
> any integrity validation).

Even if you can protect clog, this documentation paragraph makes it
clear that if you can modify the cluster, you can weaken security enough
to read and write any data you want:

https://github.com/postgres/postgres/compare/master..bmomjian:_cfe-01-doc.patch

    Cluster file encryption does not protect against unauthorized
    file system writes.  Such writes can allow data decryption if
    used to weaken the system's security and the weakened system is
    later supplied with the externally-stored cluster encryption key.
    This also does not always detect if users with write access remove
    or modify database files.

I know of no way to make that safer, so again, I don't see the value in
modification detection.  Maybe someday we would find a way, but it seems
so remote as to not warrant consideration.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2021-05-25 17:29:03 -0400, Bruce Momjian wrote:
> > So, let me ask --- I thought CTR basically took an encrypted stream of
> > bits and XOR'ed them with the data.  If that is true, then why is
> > changing hint bits a problem?  We already can see some of the bit stream
> > by knowing some bytes of the page.
> 
> A *single* reuse of the nonce in CTR reveals nearly all of the
> plaintext. As you say, the data is XORed with the key stream. Reusing
> the nonce means that you reuse the key stream. Which in turn allows you
> to do:
>   (data ^ stream) ^ (data' ^ stream)
> which can be simplified to
>   (data ^ data')
> thereby leaking all of data except the difference between data and
> data'. That's why it's so crucial to ensure that stream *always* differs
> between two rounds of encrypting "related" data.
> 
> We can't just "hope" that data doesn't change and use CTR.

My point was about whether we need to change the nonce, and hence
WAL-log full page images if we change hint bits.  If we don't and
re-encrypt the page with the same nonce, don't we only expose the hint
bits?  I was not suggesting we avoid changing the nonce in non-hint-bit
cases.

I don't understand your computation above.  You decrypt the page into
shared buffers, you change a hint bit, and rewrite the page.  You are
re-XOR'ing the buffer copy with the same key and nonce.  Doesn't that
only change the hint bits in the new write?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 08:03:14PM -0400, Stephen Frost wrote:
> > Indeed they are, but that's not relevant to the thrust of this specific
> > debate.
> >
> > Bruce is arguing that because clog is unprotected, it's not useful
> > to protect relation data, with regard to data integrity validation as
> > provided by AES-GCM using/storing tags.  I dispute this, as relation
> > data is primary data while clog, for all its value, is still metadata.
> > Yes, impacting the metadata has an impact on the primary data, but it
> > doesn't *change* that primary data at its core (and it's also more
> > likely to be detected than random bit flipping in the relation data
> > would be, which is possible if you're only encrypting and not providing
> > any integrity validation).
>
> Even if you can protect clog, this documentation paragraph makes it
> clear that if you can modify the cluster, you can weaken security enough
> to read and write any data you want:
>
> https://github.com/postgres/postgres/compare/master..bmomjian:_cfe-01-doc.patch
>
>     Cluster file encryption does not protect against unauthorized
>     file system writes.  Such writes can allow data decryption if
>     used to weaken the system's security and the weakened system is
>     later supplied with the externally-stored cluster encryption key.
>     This also does not always detect if users with write access remove
>     or modify database files.

This is clearly a different consideration than the concern around clog
and speaks to the issues with how we fetch and maintain the key- things
which we can and really should be better about than what is currently
being done, and which I do believe we will improve upon.

> I know of no way to make that safer, so again, I don't see the value in
> modification detection.  Maybe someday we would find a way, but it seems
> so remote as to not warrant consideration.

I'm rather baffled by the comment that there's 'no way to make that
safer'.  Giving users a way to segregate actual data from configuration
and commands would greatly improve the situation by making it much more
difficult for a user who only has access to the data directory, where
much of the data is encrypted and protected against data manipulation
using proper tags, to capture the encryption key.

The concerns which are not actually discussed in the paragraph above
relate to how the key is handled- specifically that we run some external
command that the user provides to fetch it, and that command can be
overridden via postgresql.auto.conf that lives in the data directory.
That's not a terribly safe thing to do and we can certainly do better,
and without all that much difficulty if we actually look at doing so.

A very simple approach would be to just require that the command to
fetch the encryption key come from postgresql.conf and then simply
encrypt+protect postgresql.auto.conf.  We'd then document that the user
needs to ensure they have appropriate protection of postgresql.conf,
which could and probably should live elsewhere.
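
As a purely hypothetical illustration (the parameter name and path here
are made up, just to show the shape of it), something like:

    # postgresql.conf, kept outside the data directory and read-only to
    # the database account; hypothetical parameter name
    cluster_key_command = '/usr/local/sbin/fetch_cluster_key'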

I'd like to see us incrementally move in the direction of providing a
way for users, probably advanced ones to start but hopefully eventually
getting to a point that you don't have to be an advanced user, to
implement a reasonably secure solution which provides both
confidentiality and integrity.  We do not have to solve all of these
things in the first release, but I don't think we should be talking
today about tossing out the idea that, some day down the road, we could
have a robust system which provides both.

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 07:48:54PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Tue, May 25, 2021 at 05:22:43PM -0400, Stephen Frost wrote:
> > > * Bruce Momjian (bruce@momjian.us) wrote:
> > > > OK, this is good to know.  I know the never-reuse rule, so it is good to
> > > > know it can be relaxed for certain data without causing problems in
> > > > other places.  Should I modify my patch to do this?
> > > 
> > > Err, to be clear, I was saying that we could exclude the hint bits
> > > *entirely* from what's being encrypted and I don't think that would be a
> > > huge issue.  We still absolutely need to continue to implement a
> > > never-reuse rule when it comes to nonces and making sure that we don't
> > > encrypt different sets of data with the same key+nonce, it's just that
> > > if we exclude the hint bits from encryption then we don't need to worry
> > > about making sure to use a different nonce each time the hint bits
> > > change- because they're no longer relevant.
> > 
> > So, let me ask --- I thought CTR basically took an encrypted stream of
> > bits and XOR'ed them with the data.  If that is true, then why is
> > changing hint bits a problem?  We already can see some of the bit stream
> > by knowing some bytes of the page.  I do think skipping encryption of
> > just the hint bits is more complex, so I want to understand why it is
> > needed.  (This is a question I eventually wanted to discuss, just like
> > my XXX questions.)
> 
> That's how CTR works, yes.  The issue that you run into is that once
> you've got two pages which have different data but were encrypted with
> the same key and nonce then you can use crib-dragging.
> 
> A good example of how this works is here:
> 
> http://travisdazell.blogspot.com/2012/11/many-time-pad-attack-crib-drag.html
> 
> Once you've got the two different pages which had the same key+nonce
> used, you can XOR them together and then start cribbing, scanning the
> page for legitimate data which doesn't have to be in the part of the
> data that was different between the two original pages.
> 
> Not sure what you're referring to in the second half ... simply knowing
> that some of the data has a given plaintext (such as having a really
> good idea that the word 'the' exists in a given message) doesn't provide
> you the same level of information as two pages encrypted with the same
> key+nonce but having different data.  Indeed, AES is generally believed
> to be quite effective against even given plaintext attacks:
> 
>
https://math.stackexchange.com/questions/51960/is-it-possible-to-guess-an-aes-key-from-a-series-of-messages-encrypted-with-that/57428

Agreed.  I was just reinforcing that, and trying to say that hint bit
changes might also be considered known information.

Anyway, if you think the hint bit changes would leak, I can accept that.
It means we need to WAL-log hint bit changes, no matter if the nonce is
the LSN or a custom one.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 07:48:54PM -0400, Stephen Frost wrote:
> > Not sure what you're referring to in the second half ... simply knowing
> > that some of the data has a given plaintext (such as having a really
> > good idea that the word 'the' exists in a given message) doesn't provide
> > you the same level of information as two pages encrypted with the same
> > key+nonce but having different data.  Indeed, AES is generally believed
> > to be quite effective against even given plaintext attacks:
> >
> >
https://math.stackexchange.com/questions/51960/is-it-possible-to-guess-an-aes-key-from-a-series-of-messages-encrypted-with-that/57428
>
> Agreed.  I was just reinforcing that, and trying to say that hint bit
> changes might also be considered known information.
>
> Anyway, if you think the hint bit changes would leak, I can accept that.
> It means we need to WAL-log hint bit changes, no matter if the nonce is
> the LSN or a custom one.

The nonce needs to be a new one, if we include the hint bits in the set
of data which is encrypted.

However, what I believe folks are getting at here is that we could keep
the LSN the same, but increase the nonce when the hint bits change, but
*not* WAL log either the nonce change or the hint bit change (unless
it's being logged for some other reason, in which case log both), thus
reducing the amount of WAL being produced.  What would matter is that
both the hint bit change and the new nonce hit disk at the same time, or
neither do, or we replay back to some state where the nonce and the hint
bits 'match up' so that the page decrypts (and the integrity check
works).

That generally seems pretty reasonable to me and basically makes the
increase in nonce work very much in the same manner that the hint bits
themselves do- sometimes it changes even when the LSN doesn't but, in
such cases, we don't actually WAL it, and that's ok because we don't
actually care about it being updated- what's in the WAL when the page is
replayed is perfectly fine and we'll just update the hint bits again
when and if we decide we need to based on the actual visibility
information at that time.

Now, making sure that we don't end up re-using the same nonce over again
is a concern and we'd want to address that somehow- as suggested
earlier, perhaps by simply incrementing it, durably noting whenever we'd
crossed some threshold (each 1k or whatever), and then on crash recovery
making sure we bump past that, but that seems entirely
doable.
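
A minimal sketch of that idea (all names hypothetical, and the on-disk
variable below stands in for a small fsync'd file): durably reserve a
batch of values ahead of use, and on crash recovery jump to the reserved
high-water mark, so a value can never be handed out twice even though
most increments never touch disk-

    #include <stdint.h>

    #define NONCE_BATCH 1024

    static uint64_t nonce_counter;          /* in shared memory */
    static uint64_t disk_high_water_mark;   /* stand-in for durable state */

    static uint64_t
    next_nonce_counter(void)
    {
        uint64_t    n = ++nonce_counter;

        /*
         * Durably reserve the next batch before using values past the
         * current reservation; a real implementation would fsync here.
         */
        if (n >= disk_high_water_mark)
            disk_high_water_mark = n + NONCE_BATCH;
        return n;
    }

    static void
    recover_after_crash(void)
    {
        /* skip everything that might have been handed out pre-crash */
        nonce_counter = disk_high_water_mark;
    }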

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 09:42:48PM -0400, Stephen Frost wrote:
> The nonce needs to be a new one, if we include the hint bits in the set
> of data which is encrypted.
> 
> However, what I believe folks are getting at here is that we could keep
> the LSN the same, but increase the nonce when the hint bits change, but
> *not* WAL log either the nonce change or the hint bit change (unless
> it's being logged for some other reason, in which case log both), thus
> reducing the amount of WAL being produced.  What would matter is that
> both the hint bit change and the new nonce hit disk at the same time, or
> neither do, or we replay back to some state where the nonce and the hint
> bits 'match up' so that the page decrypts (and the integrity check
> works).

How do we prevent torn pages if we are writing the page with a new
nonce, and no WAL-logged full page image?

> That generally seems pretty reasonable to me and basically makes the
> increase in nonce work very much in the same manner that the hint bits
> themselves do- sometimes it changes even when the LSN doesn't but, in
> such cases, we don't actually WAL it, and that's ok because we don't
> actually care about it being updated- what's in the WAL when the page is
> replayed is perfectly fine and we'll just update the hint bits again
> when and if we decide we need to based on the actual visibility
> information at that time.

We get away with this because hint-bit-only changes only change single
bytes on the page, and we can't tear a page between bytes, but if we
change the nonce, the entire page will have different bytes.  What am I
missing here?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Andres Freund
Date:
On 2021-05-25 21:51:31 -0400, Bruce Momjian wrote:
> How do we prevent torn pages if we are writing the page with a new
> nonce, and no WAL-logged full page image?

That should only arise if we are guaranteed to replay from a redo point
that is followed by at least one FPI for the page we're about to write?

- Andres



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, May 25, 2021 at 09:42:48PM -0400, Stephen Frost wrote:
> > The nonce needs to be a new one, if we include the hint bits in the set
> > of data which is encrypted.
> >
> > However, what I believe folks are getting at here is that we could keep
> > the LSN the same, but increase the nonce when the hint bits change, but
> > *not* WAL log either the nonce change or the hint bit change (unless
> > it's being logged for some other reason, in which case log both), thus
> > reducing the amount of WAL being produced.  What would matter is that
> > both the hint bit change and the new nonce hit disk at the same time, or
> > neither do, or we replay back to some state where the nonce and the hint
> > bits 'match up' so that the page decrypts (and the integrity check
> > works).
>
> How do we prevent torn pages if we are writing the page with a new
> nonce, and no WAL-logged full page image?

err, we'd still WAL the FPI, same as we do for checksums, that's what I
would expect and would think we'd need.  As long as the FPI is in the
WAL since the last checkpoint, later changes to hint bits or the nonce
wouldn't matter- we'll replay the FPI and that'll have the right nonce
for the hint bits that were part of the FPI.

Any subsequent changes to the hint bits wouldn't be WAL'd though and
neither would the changes to the nonce and that all should be fine
because we'll blow away the entire page on crash recovery to push it
back to what it was when we first wrote the page after the last
checkpoint.  Naturally, other changes which have to be WAL'd would still
be done but those would be replayed in shared buffers on top of the
prior FPI and the nonce set to some $new value (one which we know
couldn't have been used prior, by incrementing by some value) when we go
to write out that new page.

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 09:58:22PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Tue, May 25, 2021 at 09:42:48PM -0400, Stephen Frost wrote:
> > > The nonce needs to be a new one, if we include the hint bits in the set
> > > of data which is encrypted.
> > > 
> > > However, what I believe folks are getting at here is that we could keep
> > > the LSN the same, but increase the nonce when the hint bits change, but
> > > *not* WAL log either the nonce change or the hint bit change (unless
> > > it's being logged for some other reason, in which case log both), thus
> > > reducing the amount of WAL being produced.  What would matter is that
> > > both the hint bit change and the new nonce hit disk at the same time, or
> > > neither do, or we replay back to some state where the nonce and the hint
> > > bits 'match up' so that the page decrypts (and the integrity check
> > > works).
> > 
> > How do we prevent torn pages if we are writing the page with a new
> > nonce, and no WAL-logged full page image?
> 
> err, we'd still WAL the FPI, same as we do for checksums, that's what I
> would expect and would think we'd need.  As long as the FPI is in the
> WAL since the last checkpoint, later changes to hint bits or the nonce
> wouldn't matter- we'll replay the FPI and that'll have the right nonce
> for the hint bits that were part of the FPI.
> 
> Any subsequent changes to the hint bits wouldn't be WAL'd though and
> neither would the changes to the nonce and that all should be fine
> because we'll blow away the entire page on crash recovery to push it
> back to what it was when we first wrote the page after the last
> checkpoint.  Naturally, other changes which have to be WAL'd would still
> be done but those would be replayed in shared buffers on top of the
> prior FPI and the nonce set to some $new value (one which we know
> couldn't have been used prior, by incrementing by some value) when we go
> to write out that new page.

OK, I see what you are saying.  If we use a nonce that is not the full
page write LSN then we can use it for hint bit changes _after_ the first
full page write during the checkpoint, and we don't need to WAL log that
since it isn't a real LSN and we can throw it away on crash recovery. 
This is not possible if we are using the full-page-write LSN for the
hint bit nonce, though we could use a dummy WAL record to
generate an LSN for this, right?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Andres Freund
Date:
On 2021-05-25 22:11:46 -0400, Bruce Momjian wrote:
> This is not possible if we are using the full-page-write LSN for the
> hint bit nonce, though we could use a dummy WAL record to
> generate an LSN for this, right?

We cannot use a dummy WAL record, see my explanation about the standby /
crash recovery issues.



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

On Tue, May 25, 2021 at 22:11 Bruce Momjian <bruce@momjian.us> wrote:
On Tue, May 25, 2021 at 09:58:22PM -0400, Stephen Frost wrote:
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Tue, May 25, 2021 at 09:42:48PM -0400, Stephen Frost wrote:
> > > The nonce needs to be a new one, if we include the hint bits in the set
> > > of data which is encrypted.
> > >
> > > However, what I believe folks are getting at here is that we could keep
> > > the LSN the same, but increase the nonce when the hint bits change, but
> > > *not* WAL log either the nonce change or the hint bit change (unless
> > > it's being logged for some other reason, in which case log both), thus
> > > reducing the amount of WAL being produced.  What would matter is that
> > > both the hint bit change and the new nonce hit disk at the same time, or
> > > neither do, or we replay back to some state where the nonce and the hint
> > > bits 'match up' so that the page decrypts (and the integrity check
> > > works).
> >
> > How do we prevent torn pages if we are writing the page with a new
> > nonce, and no WAL-logged full page image?
>
> err, we'd still WAL the FPI, same as we do for checksums, that's what I
> would expect and would think we'd need.  As long as the FPI is in the
> WAL since the last checkpoint, later changes to hint bits or the nonce
> wouldn't matter- we'll replay the FPI and that'll have the right nonce
> for the hint bits that were part of the FPI.
>
> Any subsequent changes to the hint bits wouldn't be WAL'd though and
> neither would the changes to the nonce and that all should be fine
> because we'll blow away the entire page on crash recovery to push it
> back to what it was when we first wrote the page after the last
> checkpoint.  Naturally, other changes which have to be WAL'd would still
> be done but those would be replayed in shared buffers on top of the
> prior FPI and the nonce set to some $new value (one which we know
> couldn't have been used prior, by incrementing by some value) when we go
> to write out that new page.

OK, I see what you are saying.  If we use a nonce that is not the full
page write LSN then we can use it for hint bit changes _after_ the first
full page write during the checkpoint, and we don't need to WAL log that
since it isn't a real LSN and we can throw it away on crash recovery.
This is not possible if we are using the full-page-write LSN as the
hint-bit nonce, though we could use a dummy WAL record to generate an
LSN for this, right?

Yes, I think you’ve got it.  To do it using LSNs and ensure that the nonce is always unique, we’d have to generate dummy WAL just to get new LSNs, and that wouldn’t be great.

Andres mentioned other possible cases where the LSN doesn’t change even though we change the page, and since he’s probably right, we would have to figure out a solution in those cases too (potentially including cases like crash recovery or replay on a replica, where we can’t really just go around creating dummy WAL records to get new LSNs..).  If the nonce isn’t the LSN then suddenly those cases are fine: the LSN can stay the same, and it doesn’t matter that the nonce is changed when we write out the page during crash recovery, because it’s not tied to the WAL/LSN stream.

If I’ve got it right, that does mean that the nonces on the replica might differ from those on the primary though and I’m not completely sure how I feel about that. We might wish to explicitly document that, due to such risk, users should use unique and distinct keys on each replica that are different from the primary and each other (not a bad idea in general anyway, but would be quite important with this strategy).

Thanks,

Stephen

Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 10:23:46PM -0400, Stephen Frost wrote:
> If I’ve got it right, that does mean that the nonces on the replica might
> differ from those on the primary though and I’m not completely sure how I feel
> about that. We might wish to explicitly document that, due to such risk, users
> should use unique and distinct keys on each replica that are different from the
> primary and each other (not a bad idea in general anyway, but would be quite
> important with this strategy).

I have to think more about this, but we were planning to allow different
primary and replica relation encryption keys to allow for relation key
rotation.  The WAL key has to be the same for both.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 09:31:02PM -0400, Bruce Momjian wrote:
> I don't understand your computation above.  You decrypt the page into
> shared buffers, you change a hint bit, and rewrite the page.  You are
> re-XOR'ing the buffer copy with the same key and nonce.  Doesn't that
> only change the hint bits in the new write?

Can someone explain the hint bit exploit using the process I describe
here?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Antonin Houska
Date:
Bruce Momjian <bruce@momjian.us> wrote:

> On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2021-05-25 17:29:03 -0400, Bruce Momjian wrote:
> > > So, let me ask --- I thought CTR basically took an encrypted stream of
> > > bits and XOR'ed them with the data.  If that is true, then why are
> > > changing hint bits a problem?  We already can see some of the bit stream
> > > by knowing some bytes of the page.
> > 
> > A *single* reuse of the nonce in CTR reveals nearly all of the
> > plaintext. As you say, the data is XORed with the key stream. Reusing
> > the nonce means that you reuse the key stream. Which in turn allows you
> > to do:
> >   (data ^ stream) ^ (data' ^ stream)
> > which can be simplified to
> >   (data ^ data')
> > thereby leaking all of data except the difference between data and
> > data'. That's why it's so crucial to ensure that stream *always* differs
> > between two rounds of encrypting "related" data.
> > 
> > We can't just "hope" that data doesn't change and use CTR.
> 
> My point was about whether we need to change the nonce, and hence
> WAL-log full page images if we change hint bits.  If we don't and
> reencrypt the page with the same nonce, don't we only expose the hint
> bits?  I was not suggesting we avoid changing the nonce in non-hint-bit
> cases.
> 
> I don't understand your computation above.  You decrypt the page into
> shared buffers, you change a hint bit, and rewrite the page.  You are
> re-XOR'ing the buffer copy with the same key and nonce.  Doesn't that
> only change the hint bits in the new write?

The way I view things is that the CTR mode encrypts each individual bit,
independent from any other bit on the page. For non-hint bits data=data', so
(data ^ data') is always zero, regardless the actual values of the data. So I
agree with you that by reusing the nonce we only expose the hint bits.
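
To make that arithmetic concrete, here is a tiny self-contained C sketch
(mine, purely illustrative; the fixed byte array stands in for the
AES-CTR keystream that a reused key+nonce would produce):

    #include <stdio.h>

    int
    main(void)
    {
        /* stand-in for the keystream a reused key+nonce would generate */
        unsigned char stream[4] = {0x9f, 0x12, 0x55, 0xe0};
        unsigned char data[4] = {0x01, 0x02, 0x03, 0x04};   /* original bytes */
        unsigned char data2[4] = {0x01, 0x02, 0x03, 0x05};  /* hint bit flipped */

        for (int i = 0; i < 4; i++)
        {
            unsigned char ct1 = data[i] ^ stream[i];
            unsigned char ct2 = data2[i] ^ stream[i];

            /* the keystream cancels out: ct1 ^ ct2 == data[i] ^ data2[i] */
            printf("%02x ", ct1 ^ ct2);     /* prints: 00 00 00 01 */
        }
        printf("\n");
        return 0;
    }

The XOR of the two ciphertexts is zero wherever the plaintexts agree and
non-zero exactly where they differ - the (data ^ data') leak described
above.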

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: storing an explicit nonce

From
Robert Haas
Date:
On Tue, May 25, 2021 at 7:58 PM Stephen Frost <sfrost@snowman.net> wrote:
> The simple thought I had was masking them out, yes.  No, you can't
> re-encrypt a different page with the same nonce.  (Re-encrypting the
> exact same page with the same nonce, however, just yields the same
> cryptotext and therefore is fine).

In the interest of not being viewed as too much of a naysayer, let me
first reiterate that I am generally in favor of TDE going forward and
am not looking to throw up unnecessary obstacles in the way of making
that happen.

That said, I don't see how this particular idea can work. When we want
to write a page out to disk, we need to identify which bits in the
page are hint bits, so that we can avoid including them in what is
encrypted, which seems complicated and expensive. But even worse, when
we then read a page back off of disk, we'd need to decrypt everything
except for the hint bits, but how do we know which bits are hint bits
if the page isn't decrypted yet? We can't annotate an 8kB page that
might be full with enough extra information to say where the
non-encrypted parts are and still have the result be guaranteed to fit
within 8kb.

Also, it's not just hint bits per se, but anything that would cause us
to use MarkBufferDirtyHint(). For a btree index, per  _bt_check_unique
and _bt_killitems, that includes the entire line pointer array,
because of how ItemIdMarkDead() is used. Even apart from the problem
of how decryption would know which things we encrypted and which
things we didn't, I really have a hard time believing that it's OK to
exclude the entire line pointer array in every btree page from
encryption from a security perspective. Among other potential
problems, that's leaking all the information an attacker could
possibly want to have about where their known plaintext might occur in
the page.

However, I believe that if we store the nonce in the page explicitly,
as proposed here, rather trying to derive it from the LSN, then we
don't need to worry about this kind of masking, which I think is
better from both a security perspective and a performance perspective.
There is one thing I'm not quite sure about, though. I had previously
imagined that each page would have a nonce and we could just do
nonce++ each time we write the page. But that doesn't quite work if
the standby can do more writes of the same page than the master. One
vague idea I have for fixing this is: let each page's 16-byte nonce
consist of 8 random bytes and an 8-byte counter that will be
incremented on every write. But, the first time a standby writes each
page, force a "key rotation" where the 8-byte random value is replaced
with a new one, different one from what the master is using for that
page. Detecting this is a bit expensive, because it probably means we
need to store the TLI that last wrote each page on every page too, but
maybe it could be made to work; we're talking about a feature that is
expensive by nature. However, I'm a little worried about the
cryptographic properties of this approach. It would often mean that an
attacker who has full filesystem access can get multiple encrypted
images of the same data, each encrypted with a different nonce. I
don't know whether that's a hazard or not, but it feels like the sort
of thing that, if I were a cryptographer, I would be pleased to have.

Another idea might be - instead of doing nonce++ every time we write
the page, do nonce=random(). That's eventually going to repeat a
value, but it's extremely likely to take a *super* long time if there
are enough bits. A potentially rather large problem, though, is that
generating random numbers in large quantities isn't very cheap.

Anybody got a better idea?

I really like your (Stephen's) idea of including something in the
special space that permits integrity checking. One thing that is quite
nice about that is we could do it first, as an independent patch,
before we did TDE. It would be an independently useful feature, and it
would mean that if there are any problems with the code that injects
stuff into the special space, we could try to track those down in a
non-TDE context. That's really good, because in a TDE context, the
pages are going to be garbled and unreadable (we hope, anyway). If we
have a problem that we can reproduce with just an integrity-checking
token shoved into every page, you can look at the page and try to
understand what went wrong. So I really like this direction both from
the point of view of improving integrity checking, and also from the
point of view of being able to debug problems.

Now, one downside of this approach is that if we have the ability to
turn integrity-checking tokens on and off, and separately we can turn
encryption on and off, then we can't simplify down to two cases as
Andres was advocating above; you have to cater to a variety of
possible values of how-much-stuff-we-squeezed-into-the-special-space.
At that point you kind of end up with the approach the draft patches
were already taking, which Andres was worried would be expensive.

I am not entirely certain, however, that I understand what the
proposal is here exactly for integrity verification. I Googled
"AES-GCM using/storing tags" but it didn't help me that much, because
I don't really know the subject area. A really simple integrity
verifier for a page would be to store the db OID, ts OID, relfilenode,
and block number in the page, and check them on read, preventing
blocks from moving around without us noticing. But I gather that
perhaps the idea here is to store something like
hash(db_oid||ts_oid||relfilenode||block||block_contents) in each page,
basically a beefed-up checksum that is too wide to fake easily. It's
probably more complicated than that, though: I admit to having limited
knowledge of modern cryptography.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, May 25, 2021 at 7:58 PM Stephen Frost <sfrost@snowman.net> wrote:
> > The simple thought I had was masking them out, yes.  No, you can't
> > re-encrypt a different page with the same nonce.  (Re-encrypting the
> > exact same page with the same nonce, however, just yields the same
> > cryptotext and therefore is fine).
>
> In the interest of not being viewed as too much of a naysayer, let me
> first reiterate that I am generally in favor of TDE going forward and
> am not looking to throw up unnecessary obstacles in the way of making
> that happen.

Quite glad to hear that.  Hopefully we'll all be able to get on the same
page to move TDE forward.

> That said, I don't see how this particular idea can work. When we want
> to write a page out to disk, we need to identify which bits in the
> page are hint bits, so that we can avoid including them in what is
> encrypted, which seems complicated and expensive. But even worse, when
> we then read a page back off of disk, we'd need to decrypt everything
> except for the hint bits, but how do we know which bits are hint bits
> if the page isn't decrypted yet? We can't annotate an 8kB page that
> might be full with enough extra information to say where the
> non-encrypted parts are and still have the result be guaranteed to fit
> within 8kb.

Yeah, Andres pointed that out and it's certainly an issue with this
general idea.

> Also, it's not just hint bits per se, but anything that would cause us
> to use MarkBufferDirtyHint(). For a btree index, per  _bt_check_unique
> and _bt_killitems, that includes the entire line pointer array,
> because of how ItemIdMarkDead() is used. Even apart from the problem
> of how decryption would know which things we encrypted and which
> things we didn't, I really have a hard time believing that it's OK to
> exclude the entire line pointer array in every btree page from
> encryption from a security perspective. Among other potential
> problems, that's leaking all the information an attacker could
> possibly want to have about where their known plaintext might occur in
> the page.

Also a good point.

> However, I believe that if we store the nonce in the page explicitly,
> as proposed here, rather trying to derive it from the LSN, then we
> don't need to worry about this kind of masking, which I think is
> better from both a security perspective and a performance perspective.
> There is one thing I'm not quite sure about, though. I had previously
> imagined that each page would have a nonce and we could just do
> nonce++ each time we write the page. But that doesn't quite work if
> the standby can do more writes of the same page than the master. One
> vague idea I have for fixing this is: let each page's 16-byte nonce
> consist of 8 random bytes and an 8-byte counter that will be
> incremented on every write. But, the first time a standby writes each
> page, force a "key rotation" where the 8-byte random value is replaced
> with a new one, different from what the master is using for that
> page. Detecting this is a bit expensive, because it probably means we
> need to store the TLI that last wrote each page on every page too, but
> maybe it could be made to work; we're talking about a feature that is
> expensive by nature. However, I'm a little worried about the
> cryptographic properties of this approach. It would often mean that an
> attacker who has full filesystem access can get multiple encrypted
> images of the same data, each encrypted with a different nonce. I
> don't know whether that's a hazard or not, but it feels like the sort
> of thing that, if I were a cryptographer, I would be pleased to have.

I do agree that, in general, this is a feature that's expensive to begin
with and folks are generally going to be accepting of that.  Encrypting
the same data with different nonces will produce different results and
shouldn't be an issue.  The nonces really do need to be unique for a
given key though.

> Another idea might be - instead of doing nonce++ every time we write
> the page, do nonce=random(). That's eventually going to repeat a
> value, but it's extremely likely to take a *super* long time if there
> are enough bits. A potentially rather large problem, though, is that
> generating random numbers in large quantities isn't very cheap.

There's specific discussion about how to choose a nonce in NIST
publications and using a properly random one that's large enough is
one accepted approach, though my recollection was that the preference
was to use an incrementing guaranteed-unique nonce and using a random
one was more of a "if you can't coordinate using an incrementing one
then you can do this".  I can try to hunt for the specifics on that
though.

The issue of getting large amounts of cryptographically random numbers
seems very likely to make this not work so well though.

> Anybody got a better idea?

If we stipulate (and document) that all replicas need their own keys
then we no longer need to worry about nonce re-use between the primary
and the replica.  Not sure that's *better*, per se, but I do think it's
worth consideration.  Teaching pg_basebackup how to decrypt and then
re-encrypt with a different key wouldn't be challenging.

> I really like your (Stephen's) idea of including something in the
> special space that permits integrity checking. One thing that is quite
> nice about that is we could do it first, as an independent patch,
> before we did TDE. It would be an independently useful feature, and it
> would mean that if there are any problems with the code that injects
> stuff into the special space, we could try to track those down in a
> non-TDE context. That's really good, because in a TDE context, the
> pages are going to be garbled and unreadable (we hope, anyway). If we
> have a problem that we can reproduce with just an integrity-checking
> token shoved into every page, you can look at the page and try to
> understand what went wrong. So I really like this direction both from
> the point of view of improving integrity checking, and also from the
> point of view of being able to debug problems.

I agree with all of this.

> Now, one downside of this approach is that if we have the ability to
> turn integrity-checking tokens on and off, and separately we can turn
> encryption on and off, then we can't simplify down to two cases as
> Andres was advocating above; you have to cater to a variety of
> possible values of how-much-stuff-we-squeezed-into-the-special-space.
> At that point you kind of end up with the approach the draft patches
> were already taking, which Andres was worried would be expensive.

Yes, if the amount of space available is variable then there's an added
cost for that.  While I appreciate the concern about having that be
expensive, for my 2c at least, I like to think that having this sudden
space that's available for use may lead to other really interesting
capabilities beyond the ones we're talking about here, so I'm not really
thrilled with the idea of boiling it down to just two cases.

> I am not entirely certain, however, that I understand what the
> proposal is here exactly for integrity verification. I Googled
> "AES-GCM using/storing tags" but it didn't help me that much, because
> I don't really know the subject area. A really simple integrity
> verifier for a page would be to store the db OID, ts OID, relfilenode,
> and block number in the page, and check them on read, preventing
> blocks from moving around without us noticing. But I gather that
> perhaps the idea here is to store something like
> hash(db_oid||ts_oid||relfilenode||block||block_contents) in each page,
> basically a beefed-up checksum that is too wide to fake easily. It's
> probably more complicated than that, though: I admit to having limited
> knowledge of modern cryptography.

Happy to help on this bit.  Probably the simplest way to explain what's
going on here is that you have two functions- encrypt and decrypt.  The
encrypt function takes: (key, nonce, plaintext) and returns (ciphertext,
tag).  The decrypt function takes: (key, nonce, ciphertext, tag) and
returns: (plaintext) ... OR an error saying "data integrity check
failed".

As an example, here's a test case from NIST for AES GCM *encryption*:

Key = 31bdadd96698c204aa9ce1448ea94ae1fb4a9a0b3c9d773b51bb1822666b8f22
IV = 0d18e06c7c725ac9e362e1ce
PT = 2db5168e932556f8089a0622981d017d
AAD =
CT = fa4362189661d163fcd6a56d8bf0405a
Tag = d636ac1bbedd5cc3ee727dc2ab4a9489

key/IV (aka nonce)/PT are inputs, CT and Tag are outputs.

Then an example for AES GCM *decryption*:

Key = 4c8ebfe1444ec1b2d503c6986659af2c94fafe945f72c1e8486a5acfedb8a0f8
IV = 473360e0ad24889959858995
CT = d2c78110ac7e8f107c0df0570bd7c90c
AAD =
Tag = c26a379b6d98ef2852ead8ce83a833a7
PT = 7789b41cb3ee548814ca0b388c10b343

Key/IV/CT/Tag are inputs, PT is the output

... but, a more interesting one when considering the tag is:

Key = c997768e2d14e3d38259667a6649079de77beb4543589771e5068e6cd7cd0b14
IV = 835090aed9552dbdd45277e2
CT = 9f6607d68e22ccf21928db0986be126e
AAD =
Tag = f32617f67c574fd9f44ef76ff880ab9f
FAIL

Again, Key/IV/CT/Tag are inputs, but there's no PT output and instead
you just get FAIL and that's because the data integrity check failed.

Exactly how the tag is generated is discussed here if you're really
curious:

https://en.wikipedia.org/wiki/Galois/Counter_Mode

but the gist of that is that it's done as part of the encryption.  Note
that you can include additional data beyond just what you're encrypting
in the tag.  In our case, we would probably include the LSN, which would
mean that the LSN would be confirmed to be correct additional
information that wasn't actually encrypted.  The "AAD" above is
"Additional Authenticated Data".

One thing to be absolutely clear about here though is that simply taking
a hash() of the ciphertext and storing that with the data does *not*
provide cryptographic data integrity validation for the page because it
doesn't involve the actual key or IV at all and the hash is done after
the ciphertext is generated- therefore an attacker can change the data
and just change the hash to match and you'd never know.

Now, when it comes to hashing the *plaintext* data and storing that, you
have to be careful there because you can very easily fall into the trap
of giving away information about the plaintext data that way if an
attacker can reason about what the plaintext might look like.  If I know
the block contains just a single English word and all we've done is
sha256'd it then I can just run sha256 on all English words and figure
out what it is, so to protect the data you need to incorporate the key,
nonce, etc, somehow into the hash (that is- something that should be
very hard for the attacker to discover) and suddenly you're doing what
AES-GCM *already* does for you, except you're trying to hack it yourself
instead of using the tools available which were written by experts.

The way that I tend to look at this area is that everyone used to try
and do encryption and data integrity independently and the result was a
bunch of different implementations, some good, some bad (and therefore
leaked sensitive information) and the crypto folks basically said "ok,
let's take the *good* implementations and bake that in, because
otherwise people are going to just keep screwing up and using bad
approaches for this."

What this means for your proposal above is that the actual data
validation information will be generated in two different ways depending
on if we're using AES-GCM and doing TDE, or if we're doing just the data
validation piece and not encrypting anything.  That's maybe not ideal
but I don't think it's a huge issue either and your proposal will still
address the question of if we end up missing anything when it comes to
how the special area is handled throughout the code.

If it'd help, I'd be happy to jump on a call to discuss further.  Also
happy to continue on this thread too, of course.

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, May 26, 2021 at 07:14:47AM +0200, Antonin Houska wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
> 
> > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2021-05-25 17:29:03 -0400, Bruce Momjian wrote:
> > > > So, let me ask --- I thought CTR basically took an encrypted stream of
> > > > bits and XOR'ed them with the data.  If that is true, then why are
> > > > changing hint bits a problem?  We already can see some of the bit stream
> > > > by knowing some bytes of the page.
> > > 
> > > A *single* reuse of the nonce in CTR reveals nearly all of the
> > > plaintext. As you say, the data is XORed with the key stream. Reusing
> > > the nonce means that you reuse the key stream. Which in turn allows you
> > > to do:
> > >   (data ^ stream) ^ (data' ^ stream)
> > > which can be simplified to
> > >   (data ^ data')
> > > thereby leaking all of data except the difference between data and
> > > data'. That's why it's so crucial to ensure that stream *always* differs
> > > between two rounds of encrypting "related" data.
> > > 
> > > We can't just "hope" that data doesn't change and use CTR.
> > 
> > My point was about whether we need to change the nonce, and hence
> > WAL-log full page images if we change hint bits.  If we don't and
> > reencrypt the page with the same nonce, don't we only expose the hint
> > bits?  I was not suggesting we avoid changing the nonce in non-hint-bit
> > cases.
> > 
> > I don't understand your computation above.  You decrypt the page into
> > shared buffers, you change a hint bit, and rewrite the page.  You are
> > re-XOR'ing the buffer copy with the same key and nonce.  Doesn't that
> > only change the hint bits in the new write?
> 
> The way I view things is that the CTR mode encrypts each individual bit,
> independent from any other bit on the page. For non-hint bits data=data', so
> (data ^ data') is always zero, regardless the actual values of the data. So I
> agree with you that by reusing the nonce we only expose the hint bits.

OK, that's what I thought.  We already expose the clog and fsm, so
exposing the hint bits seems acceptable.  If everyone agrees, I will
adjust my patch to not WAL log hint bit changes.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Stephen Frost (sfrost@snowman.net) wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > Another idea might be - instead of doing nonce++ every time we write
> > the page, do nonce=random(). That's eventually going to repeat a
> > value, but it's extremely likely to take a *super* long time if there
> > are enough bits. A potentially rather large problem, though, is that
> > generating random numbers in large quantities isn't very cheap.
>
> There's specific discussion about how to choose a nonce in NIST
> publications and using a properly random one that's large enough is
> one accepted approach, though my recollection was that the preference
> was to use an incrementing guaranteed-unique nonce and using a random
> one was more of a "if you can't coordinate using an incrementing one
> then you can do this".  I can try to hunt for the specifics on that
> though.

Discussion of generating IVs here:

https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf

section 8.2 specifically.

Note that 8.3 also discusses subsequent limitations which one should
follow when using a random nonce, to reduce the chances of a collision.
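
(For the record, the arithmetic behind that limitation is essentially the
birthday bound - this is my arithmetic, not a quote from the document:
with a 96-bit random IV, the chance that n encryptions under one key
repeat an IV is roughly n(n-1)/2^97, so capping n at 2^32 invocations, as
the publication does, keeps the collision probability around 2^-33, below
its 2^-32 ceiling.)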

Thanks,

Stephen


Re: storing an explicit nonce

From
Robert Haas
Date:
On Wed, May 26, 2021 at 2:37 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Anybody got a better idea?
>
> If we stipulate (and document) that all replicas need their own keys
> then we no longer need to worry about nonce re-use between the primary
> and the replica.  Not sure that's *better*, per se, but I do think it's
> worth consideration.  Teaching pg_basebackup how to decrypt and then
> re-encrypt with a different key wouldn't be challenging.

I agree that we could do that and that it's possibly worth
considering. However, it would be easy - and tempting - for users to
violate the no-nonce-reuse principle. For example, consider a
hypothetical user who takes a backup on Monday via a filesystem
snapshot - which might be either (a) a snapshot of the cluster while
it is stopped, or (b) a snapshot of the cluster while it's running,
from which crash recovery can be safely performed as long as it's a
true atomic snapshot, or (c) a snapshot taken between pg_start_backup
and pg_stop_backup which will be used just like a backup taken by
pg_basebackup. In any of these cases, there's no opportunity for a
tool we provide to intervene and re-key. Now, we would provide a tool
that re-keys in such situations and tell people to be sure they run it
before using any of those backups, and maybe that's the best we can
do. However, that tool is going to run for a good long time because it
has to rewrite the entire cluster, so someone with a terabyte-scale
database is going to be sorely tempted to skip this "unnecessary" and
time-consuming step. If it were possible to set things up so that good
things happen automatically and without user action, that would be
swell.

Here's another idea: suppose that a nonce is 128 bits, 64 of which are
randomly generated at server startup, and the other 64 of which are a
counter. If you're willing to assume that the 64 bits generated
randomly at server startup are not going to collide in practice,
because the number of server lifetimes per key should be very small
compared to 2^64, then this gets you the benefits of a
randomly-generated nonce without needing to keep on generating new
cryptographically strong random numbers, and pretty much regardless of
what users do with their backups. If you replay an FPI, you can write
out the page exactly as you got it from the master, without
re-encrypting. If you modify and then write a page, you generate a
nonce for it containing your own server lifetime identifier.
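
To illustrate, a minimal sketch of that nonce layout (names and types are
mine, purely illustrative, not from any posted patch):

    #include <stdint.h>

    /*
     * Hypothetical 128-bit page nonce per the scheme above: 8 bytes chosen
     * randomly once at server startup, plus an 8-byte counter bumped on
     * every write of the page.
     */
    typedef struct PageNonce
    {
        uint64_t    server_lifetime_id; /* random at startup, fixed for run */
        uint64_t    write_counter;      /* incremented on every page write */
    } PageNonce;

    /* on each write: keep the lifetime id, advance the counter */
    static inline void
    page_nonce_next(PageNonce *nonce)
    {
        nonce->write_counter++;
    }

A standby replaying an FPI would keep the nonce it received, while a
standby writing its own modification of the page would first stamp in its
own server_lifetime_id.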

> Yes, if the amount of space available is variable then there's an added
> cost for that.  While I appreciate the concern about having that be
> expensive, for my 2c at least, I like to think that having this sudden
> space that's available for use may lead to other really interesting
> capabilities beyond the ones we're talking about here, so I'm not really
> thrilled with the idea of boiling it down to just two cases.

Although I'm glad you like some things about this idea, I think the
proposed system will collapse if we press it too hard. We're going to
need to be judicious.

> One thing to be absolutely clear about here though is that simply taking
> a hash() of the ciphertext and storing that with the data does *not*
> provide cryptographic data integrity validation for the page because it
> doesn't involve the actual key or IV at all and the hash is done after
> the ciphertext is generated- therefore an attacker can change the data
> and just change the hash to match and you'd never know.

Ah, right. So you'd actually want something more like
hash(dboid||tsoid||relfilenode||blockno||block_contents||secret).
Maybe not generated exactly that way: perhaps the secret is really the
IV for the hash function rather than part of the hashed data, or
whatever. However you do it exactly, it prevents someone from
verifying - or faking - a signature unless they have the secret.
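
If I understand the experts correctly, the standard tool for exactly that
kind of keyed hash is an HMAC. A rough sketch of the shape of it using
OpenSSL (the field list and function name are mine, illustrative only,
error checks omitted):

    #include <stdint.h>
    #include <openssl/hmac.h>

    static void
    page_verifier(const unsigned char *secret, int secret_len,
                  uint32_t dboid, uint32_t tsoid, uint32_t relfilenode,
                  uint32_t blockno,
                  const unsigned char *block, size_t blocklen,
                  unsigned char out[32])
    {
        unsigned int outlen = 32;
        HMAC_CTX   *ctx = HMAC_CTX_new();

        HMAC_Init_ex(ctx, secret, secret_len, EVP_sha256(), NULL);
        HMAC_Update(ctx, (const unsigned char *) &dboid, sizeof(dboid));
        HMAC_Update(ctx, (const unsigned char *) &tsoid, sizeof(tsoid));
        HMAC_Update(ctx, (const unsigned char *) &relfilenode, sizeof(relfilenode));
        HMAC_Update(ctx, (const unsigned char *) &blockno, sizeof(blockno));
        HMAC_Update(ctx, block, blocklen);
        HMAC_Final(ctx, out, &outlen);
        HMAC_CTX_free(ctx);
    }

Without the secret, an attacker can neither verify nor forge the 32-byte
output, which is exactly the property a plain hash of the contents lacks.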

> very hard for the attacker to discover) and suddenly you're doing what
> AES-GCM *already* does for you, except you're trying to hack it yourself
> instead of using the tools available which were written by experts.

I am all in favor of using the expert-written tools provided we can
figure out how to do it in a way we all agree is correct.

> What this means for your proposal above is that the actual data
> validation information will be generated in two different ways depending
> on if we're using AES-GCM and doing TDE, or if we're doing just the data
> validation piece and not encrypting anything.  That's maybe not ideal
> but I don't think it's a huge issue either and your proposal will still
> address the question of if we end up missing anything when it comes to
> how the special area is handled throughout the code.

Hmm. Is there no expert-written method for this sort of thing without
encryption? One thing that I think would be really helpful is to be
able to take a TDE-ified cluster and run it through decryption, ending
up with a cluster that still has extra special space but which isn't
actually encrypted any more. Ideally it can end up in a state where
integrity validation still works. This might be something people just
Want To Do, and they're willing to sacrifice the space. But it would
also be real nice for testing and debugging. Imagine for example that
the data on page X is physiologically corrupted i.e. decryption
produces something that looks like a page, but there's stuff wrong
with it, like the item pointers point to a page offset greater than
the page size. Well, what you really want to do with this page is run
pg_filedump on it, or hexdump, or od, or pg_hexedit, or whatever your
favorite tool is, so that you can figure out what's going on, but
that's going to be hard if the pages are all encrypted.

I guess nothing in what you are saying really precludes that, but I
agree that if we have to switch up the method for creating the
integrity verifier thing in this situation, that's not great.

> If it'd help, I'd be happy to jump on a call to discuss further.  Also
> happy to continue on this thread too, of course.

I am finding the written discussion to be helpful right now, and it
has the advantage of being easy to refer back to later, so my vote
would be to keep doing this for now and we can always reassess if it
seems to make sense.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> OK, that's what I thought.  We already expose the clog and fsm, so
> exposing the hint bits seems acceptable.  If everyone agrees, I will
> adjust my patch to not WAL log hint bit changes.

Robert pointed out that it's not just hint bits where this is happening
though, but it can also happen with btree line pointer arrays.  Even if
we were entirely comfortable accepting that the hint bits are leaked
because of this, leaking the btree line pointer array doesn't seem like
it could possibly be acceptable..

I've not run down that code myself, but I don't have any reason to doubt
Robert's assessment.

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, May 26, 2021 at 01:56:38PM -0400, Robert Haas wrote:
> However, I believe that if we store the nonce in the page explicitly,
> as proposed here, rather trying to derive it from the LSN, then we
> don't need to worry about this kind of masking, which I think is
> better from both a security perspective and a performance perspective.

You are saying that by using a non-LSN nonce, you can write out the page
with a new nonce, but the same LSN, and also discard the page during
crash recovery and use the WAL copy?

I am confused why checksums, which are widely used, acceptably require
wal_log_hints, but there is concern that file encryption, which is
heavier, cannot acceptably require wal_log_hints.  I must be missing
something.

Why can't checksums also throw away hint bit changes like you want to do
for file encryption and not require wal_log_hints?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, May 26, 2021 at 2:37 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > Anybody got a better idea?
> >
> > If we stipulate (and document) that all replicas need their own keys
> > then we no longer need to worry about nonce re-use between the primary
> > and the replica.  Not sure that's *better*, per se, but I do think it's
> > worth consideration.  Teaching pg_basebackup how to decrypt and then
> > re-encrypt with a different key wouldn't be challenging.
>
> I agree that we could do that and that it's possibly worth
> considering. However, it would be easy - and tempting - for users to
> violate the no-nonce-reuse principle. For example, consider a

(guessing you meant no-key-reuse above)

> hypothetical user who takes a backup on Monday via a filesystem
> snapshot - which might be either (a) a snapshot of the cluster while
> it is stopped, or (b) a snapshot of the cluster while it's running,
> from which crash recovery can be safely performed as long as it's a
> true atomic snapshot, or (c) a snapshot taken between pg_start_backup
> and pg_stop_backup which will be used just like a backup taken by
> pg_basebackup. In any of these cases, there's no opportunity for a
> tool we provide to intervene and re-key. Now, we would provide a tool
> that re-keys in such situations and tell people to be sure they run it
> before using any of those backups, and maybe that's the best we can
> do. However, that tool is going to run for a good long time because it
> has to rewrite the entire cluster, so someone with a terabyte-scale
> database is going to be sorely tempted to skip this "unnecessary" and
> time-consuming step. If it were possible to set things up so that good
> things happen automatically and without user action, that would be
> swell.

Yes, if someone were to use a snapshot and set up a replica from it
they'd end up with the same key being used and potentially have an issue
with the key+nonce combination being re-used between the primary and
replica with different data leading to a possible data leak.

> Here's another idea: suppose that a nonce is 128 bits, 64 of which are
> randomly generated at server startup, and the other 64 of which are a
> counter. If you're willing to assume that the 64 bits generated
> randomly at server startup are not going to collide in practice,
> because the number of server lifetimes per key should be very small
> compared to 2^64, then this gets you the benefits of a
> randomly-generate nonce without needing to keep on generating new
> cryptographically strong random numbers, and pretty much regardless of
> what users do with their backups. If you replay an FPI, you can write
> out the page exactly as you got it from the master, without
> re-encrypting. If you modify and then write a page, you generate a
> nonce for it containing your own server lifetime identifier.

Yes, this kind of approach is discussed in the NIST publication in
section 8.2.2.  We'd have to keep track of what nonce we used for which
page, of course, but that should be alright using the special space as
discussed.

> > Yes, if the amount of space available is variable then there's an added
> > cost for that.  While I appreciate the concern about having that be
> > expensive, for my 2c at least, I like to think that having this sudden
> > space that's available for use may lead to other really interesting
> > capabilities beyond the ones we're talking about here, so I'm not really
> > thrilled with the idea of boiling it down to just two cases.
>
> Although I'm glad you like some things about this idea, I think the
> proposed system will collapse if we press it too hard. We're going to
> need to be judicious.

Sure.

> > One thing to be absolutely clear about here though is that simply taking
> > a hash() of the ciphertext and storing that with the data does *not*
> > provide cryptographic data integrity validation for the page because it
> > doesn't involve the actual key or IV at all and the hash is done after
> > the ciphertext is generated- therefore an attacker can change the data
> > and just change the hash to match and you'd never know.
>
> Ah, right. So you'd actually want something more like
> hash(dboid||tsoid||relfilenode||blockno||block_contents||secret).
> Maybe not generated exactly that way: perhaps the secret is really the
> IV for the hash function rather than part of the hashed data, or
> whatever. However you do it exactly, it prevents someone from
> verifying - or faking - a signature unless they have the secret.
>
> > very hard for the attacker to discover) and suddently you're doing what
> > AES-GCM *already* does for you, except you're trying to hack it yourself
> > instead of using the tools available which were written by experts.
>
> I am all in favor of using the expert-written tools provided we can
> figure out how to do it in a way we all agree is correct.

In the patch set that Bruce has which uses the OpenSSL functions to do
AES GCM with tag there is included a test suite which works with the
NIST published test vectors to verify that it all works correctly with
the key, nonce/IV, plaintext, tag, ciphertext, etc.  The patch set
includes a subset of the NIST tests since we rely on OpenSSL for the
heavy lifting there, but the entire test suite passes if you pull down
the test vectors and run them.

> > What this means for your proposal above is that the actual data
> > validation information will be generated in two different ways depending
> > on if we're using AES-GCM and doing TDE, or if we're doing just the data
> > validation piece and not encrypting anything.  That's maybe not ideal
> > but I don't think it's a huge issue either and your proposal will still
> > address the question of if we end up missing anything when it comes to
> > how the special area is handled throughout the code.
>
> Hmm. Is there no expert-written method for this sort of thing without
> encryption? One thing that I think would be really helpful is to be
> able to take a TDE-ified cluster and run it through decryption, ending
> up with a cluster that still has extra special space but which isn't
> actually encrypted any more. Ideally it can end up in a state where
> integrity validation still works. This might be something people just
> Want To Do, and they're willing to sacrifice the space. But it would
> also be real nice for testing and debugging. Imagine for example that
> the data on page X is physiologically corrupted i.e. decryption
> produces something that looks like a page, but there's stuff wrong
> with it, like the item pointers point to a page offset greater than
> the page size. Well, what you really want to do with this page is run
> pg_filedump on it, or hexdump, or od, or pg_hexedit, or whatever your
> favorite tool is, so that you can figure out what's going on, but
> that's going to be hard if the pages are all encrypted.

So ... yes and no.

If you want to actually verify that the data is valid and unmolested by
virtue of a key being involved, then you can actually use AES GCM and
simply only feed it AADlen.  The NIST examples have test cases for
exactly this too:

Count = 0
Key = 78dc4e0aaf52d935c3c01eea57428f00ca1fd475f5da86a49c8dd73d68c8e223
IV = d79cf22d504cc793c3fb6c8a
PT =
AAD = b96baa8c1c75a671bfb2d08d06be5f36
CT =
Tag = 3e5d486aa2e30b22e040b85723a06e76

Note that in this case there's a key and an IV/nonce, but there isn't
any plaintext while there *is* AAD ("Additional Authenticated Data").
We could certainly do that too, the downside there is mostly that we'd
still need a key and an IV and those seem like odd parameters to
require when we aren't doing encryption, but it would mean we'd be using
the exact same functions with OpenSSL that we would be in the TDE case,
just passing in the block as AAD instead of as plaintext to be
encrypted, so there is that advantage to it.
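
In code, that AAD-only mode differs from ordinary GCM encryption in just
one call: the block is handed to EVP_EncryptUpdate with a NULL output
buffer. A minimal sketch (mine, error checks omitted; the dummy buffer in
the final call sidesteps the zero-length-buffer quirks older OpenSSL
releases had):

    #include <openssl/evp.h>

    static void
    gcm_tag_only(const unsigned char key[32], const unsigned char iv[12],
                 const unsigned char *block, int blocklen,
                 unsigned char tag[16])
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        unsigned char dummy[16];
        int         len;

        EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);
        /* a NULL output buffer means "authenticate this input as AAD" */
        EVP_EncryptUpdate(ctx, NULL, &len, block, blocklen);
        EVP_EncryptFinal_ex(ctx, dummy, &len);
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
        EVP_CIPHER_CTX_free(ctx);
    }

Verification is symmetric: feed the same AAD to the decrypt side, set the
stored tag with EVP_CTRL_GCM_SET_TAG, and check that EVP_DecryptFinal_ex
succeeds.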

> I guess nothing in what you are saying really precludes that, but I
> agree that if we have to switch up the method for creating the
> integrity verifier thing in this situation, that's not great.

I had been imagining that we wouldn't want to require a key and have to
calculate an IV/nonce for the "not doing TDE" case, so I was figuring
we'd just use a hash and it'd be very much like our existing checksum
and not provide any real protection against an attacker intentionally
molesting the page (since they can just calculate a new checksum that
includes whatever their changes were).

At the end of the day though, I'm fine with either (or both, for that
matter; I don't see any of these aspects being the difficult to
implement bits, the question is mainly what do we give our users the
ability to do, what do we just use for development, etc).

> > If it'd help, I'd be happy to jump on a call to discuss further.  Also
> > happy to continue on this thread too, of course.
>
> I am finding the written discussion to be helpful right now, and it
> has the advantage of being easy to refer back to later, so my vote
> would be to keep doing this for now and we can always reassess if it
> seems to make sense.

Sure.

Thanks!

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, May 26, 2021 at 03:49:43PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > OK, that's what I thought.  We already expose the clog and fsm, so
> > exposing the hint bits seems acceptable.  If everyone agrees, I will
> > adjust my patch to not WAL log hint bit changes.
> 
> Robert pointed out that it's not just hint bits where this is happening
> though, but it can also happen with btree line pointer arrays.  Even if
> we were entirely comfortable accepting that the hint bits are leaked
> because of this, leaking the btree line pointer array doesn't seem like
> it could possibly be acceptable..
> 
> I've not run down that code myself, but I don't have any reason to doubt
> Robert's assessment.

OK, I guess we could split out wal_log_hints to maybe just FPW-log btree
changes or something, but my recent email questions why wal_log_hints is
an issue anyway.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, May 26, 2021 at 04:40:48PM -0400, Bruce Momjian wrote:
> On Wed, May 26, 2021 at 01:56:38PM -0400, Robert Haas wrote:
> > However, I believe that if we store the nonce in the page explicitly,
> > as proposed here, rather trying to derive it from the LSN, then we
> > don't need to worry about this kind of masking, which I think is
> > better from both a security perspective and a performance perspective.
> 
> You are saying that by using a non-LSN nonce, you can write out the page
> with a new nonce, but the same LSN, and also discard the page during
> crash recovery and use the WAL copy?
> 
> I am confused why checksums, which are widely used, acceptably require
> wal_log_hints, but there is concern that file encryption, which is
> heavier, cannot acceptably require wal_log_hints.  I must be missing
> something.
> 
> Why can't checksums also throw away hint bit changes like you want to do
> for file encryption and not require wal_log_hints?

One detail might be this extra hint bit FPW case:

    https://github.com/postgres/postgres/compare/bmomjian:cfe-01-doc..bmomjian:_cfe-02-internaldoc.patch
    
    However, if a hint-bit-modified page is written to the file system
    during a checkpoint, and there is a later hint bit change switching
    the same page from clean to dirty during the same checkpoint, we
    need a new LSN, and wal_log_hints doesn't give us a new LSN here.
    The fix for this is to update the page LSN by writing a dummy
    WAL record via xloginsert.c::LSNForEncryption() in such cases.

Is this how file encryption differs from the checksum wal_log_hints
case, and is this the big concern?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, May 26, 2021 at 01:56:38PM -0400, Robert Haas wrote:
> In the interest of not being viewed as too much of a naysayer, let me
> first reiterate that I am generally in favor of TDE going forward and
> am not looking to throw up unnecessary obstacles in the way of making
> that happen.

Rather than surprise anyone, I might as well just come out and say some
things.  First, I have always admitted this feature has limited
usefulness.  

I think a non-LSN nonce adds a lot of code complexity, which adds a code
and maintenance burden.  It also prevents the creation of an encrypted
replica from a non-encrypted primary using binary replication, which
makes deployment harder.

Take a feature of limited usefulness, add code complexity and deployment
difficulty, and the feature becomes even less useful.

For these reasons, if we decide to go in the direction of using a
non-LSN nonce, I no longer plan to continue working on this feature. I
would rather work on things that have a more positive impact.  Maybe a
non-LSN nonce is a better long-term plan, but there are too many
unknowns and complexity for me to feel comfortable with it.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-26 07:14:47 +0200, Antonin Houska wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
> > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote:
> > My point was about whether we need to change the nonce, and hence
> > WAL-log full page images if we change hint bits.  If we don't and
> > reencrypt the page with the same nonce, don't we only expose the hint
> > bits?  I was not suggesting we avoid changing the nonce in non-hint-bit
> > cases.
> >
> > I don't understand your computation above.  You decrypt the page into
> > shared buffers, you change a hint bit, and rewrite the page.  You are
> > re-XOR'ing the buffer copy with the same key and nonce.  Doesn't that
> > only change the hint bits in the new write?

Yea, I had a bit of a misfire there. Sorry.

I suspect that if we try not to disclose data when an attacker has write
access, this still leaves us with issues around nonce reuse, unless we
also employ integrity measures.  That's particularly true with CTR mode,
which makes it easy to manipulate individual parts of the encrypted page
without causing the decrypted page to be invalid. E.g. the attacker can
just update pd_upper on the page by a small offset, and suddenly the
replay will insert the tuple at a slightly shifted offset - which then
seems to leak enough data to actually analyze things?

As the patch stands that seems trivially doable because, as I read it,
most of the page header is not encrypted, and checksums are computed
over the already-encrypted data. But even if that weren't the case,
brute forcing 16 bits worth of checksum isn't too bad, even though it
would obviously make an attack a lot more noisy.
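
To spell out the malleability: in CTR, flipping any ciphertext bit flips
exactly the corresponding plaintext bit and nothing else. A toy C sketch
(mine; a fixed byte stands in for the keystream):

    #include <stdio.h>

    int
    main(void)
    {
        unsigned char stream = 0xa7;    /* stand-in for one keystream byte */
        unsigned char pt = 0x10;        /* e.g. a byte of pd_upper */
        unsigned char ct = pt ^ stream; /* "encrypt" */

        ct ^= 0x04;                     /* attacker flips one ciphertext bit */

        /* decrypts to 0x14 with no error - the offset silently moved */
        printf("%02x\n", (unsigned char) (ct ^ stream));
        return 0;
    }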


https://github.com/bmomjian/postgres/commit/7b43d37a5edb91c29ab6b4bb00def05def502c33#diff-0dcb5b2f36c573e2a7787994690b8fe585001591105f78e58ae3accec8f998e0R92
    /*
     * Check if the page has a special size == GISTPageOpaqueData, a valid
     * GIST_PAGE_ID, no invalid GiST flag bits are set, and a valid LSN.  This
     * is true for all GiST pages, and perhaps a few pages that are not.  The
     * only downside of guessing wrong is that we might not update the LSN for
     * some non-permanent relation page changes, and therefore reuse the IV,
     * which seems acceptable.
     */

Huh?

Regards,

Andres



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-25 22:23:46 -0400, Stephen Frost wrote:
> Andres mentioned other possible cases where the LSN doesn’t change even
> though we change the page, and since he’s probably right, we would have
> to figure out a solution in those cases too (potentially including cases
> like crash recovery or replay on a replica, where we can’t really just go
> around creating dummy WAL records to get new LSNs..).

Yea, I think there's quite a few of those. For one, we don't guarantee
that the hole between pd_lower/upper is zeroes. It e.g. contains
old tuple data after deleted tuples are pruned away. But when logging an
FPI, we omit that range. Which means that after crash recovery the area
is zeroed out. There's several cases where padding can result in the
same.

Just look at checkXLogConsistency(), heap_mask() et al for all the
differences that can occur and that need to be ignored for the recovery
consistency checking to work.

Particularly the hole issue seems trivial to exploit, because we know
the plaintext of the hole after crash recovery (0s).


I don't see how using the LSN alone is salvagable.


Greetings,

Andres Freund



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> If we used a block cipher instead of a streaming one (CTR), this might
> not work because the earlier blocks can be based in the output of
> later blocks.

What made us choose CTR for WAL & data file encryption? I checked the
README in the patchset and the wiki page, and neither seem to discuss
that.

The dangers around nonce reuse, the space overhead of storing the nonce,
the fact that single bit changes in the encrypted data don't propagate
seem not great?  Why aren't we using something like XTS? It has obvious
issues as well, but CTR's weaknesses seem at least as great. And if we
want a MAC, then we don't want CTR either.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Neil Chen
Date:
Greetings,

On Thu, May 27, 2021 at 4:52 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> I am confused why checksums, which are widely used, acceptably require
> wal_log_hints, but there is concern that file encryption, which is
> heavier, cannot acceptably require wal_log_hints.  I must be missing
> something.
>
> Why can't checksums also throw away hint bit changes like you want to do
> for file encryption and not require wal_log_hints?


I'm really confused about it, too. I have read the discussion above, and I'm not sure my understanding is correct... What we are facing is not only changes to flags such as *pd_flags*, but also other changes like the btree line pointer array changes Robert mentioned. We don't want those to have to write a WAL record.

I have an immature idea: could we use LSN+blkno+checksum as the nonce when checksums are enabled? And when checksums are disabled, we could just use a global counter to generate a number as a fake checksum value... Then we would also use LSN+blkno+fake_checksum as the nonce. Is there anything wrong with that?
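
Roughly, the layout I have in mind would be something like this (a sketch
only, names and padding illustrative):

    #include <stdint.h>

    /*
     * Sketch of the proposed 16-byte nonce: LSN + block number + checksum
     * (the real page checksum when checksums are enabled, a
     * counter-generated fake value otherwise), padded out to 128 bits.
     */
    typedef struct ProposedNonce
    {
        uint64_t    lsn;        /* page LSN */
        uint32_t    blkno;      /* block number */
        uint16_t    checksum;   /* page checksum, or fake counter value */
        uint16_t    padding;    /* zero */
    } ProposedNonce;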

--
There is no royal road to learning.
HighGo Software Co.

Re: storing an explicit nonce

From
Robert Haas
Date:
On Wed, May 26, 2021 at 4:40 PM Bruce Momjian <bruce@momjian.us> wrote:
> You are saying that by using a non-LSN nonce, you can write out the page
> with a new nonce, but the same LSN, and also discard the page during
> crash recovery and use the WAL copy?

I don't know what "discard the page during crash recovery and use the
WAL copy" means.

> I am confused why checksums, which are widely used, acceptably require
> wal_log_hints, but there is concern that file encryption, which is
> heavier, cannot acceptably require wal_log_hints.  I must be missing
> something.

I explained this in the first complete paragraph of my first email
with this subject line: "For example, right now, we only need to WAL
log hints for the first write to each page after a checkpoint, but in
this approach, if the same page is written multiple times per
checkpoint cycle, we'd need to log hints every time." That's a huge
difference. Page eviction in some workloads can push the same pages
out of shared buffers every few seconds, whereas something that has to
be done once per checkpoint cycle cannot affect each page nearly so
often. A checkpoint is only going to occur every 5 minutes by default,
or more realistically every 10-15 minutes in a well-tuned production
system. In other words, we're not holding up some kind of double
standard, where the existing feature is allowed to depend on doing a
certain thing but your feature isn't allowed to depend on the same
thing. Your design depends on doing something which is potentially
100x+ more expensive than the existing thing. It's not always going to
be that expensive, but it can be.

> Why can't checksums also throw away hint bit changes like you want to do
> for file encryption and not require wal_log_hints?

Well, I don't want to throw away hint bit changes, just like we don't
throw them away right now. And I want to do that by making sure that
each time the page is written, we use a different nonce, but without
the expense of having to advance the LSN.

Now, another option is to do what you suggest here. We could say that
if a dirty page is evicted, but the page is only dirty because of
hint-type changes, we don't actually write it out. That does avoid
using the same nonce for multiple writes, because now there's only one
write. It also fixes the problem on standbys that Andres was
complaining about, because on a standby, the only way a page can
possibly be dirtied without an associated WAL record is through a
hint-type change. However, I think we'd find that this, too, is pretty
expensive in certain workloads. It's useful to write hint bits -
that's why we do it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, May 26, 2021 at 04:26:01PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2021-05-26 07:14:47 +0200, Antonin Houska wrote:
> > Bruce Momjian <bruce@momjian.us> wrote:
> > > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote:
> > > My point was about whether we need to change the nonce, and hence
> > > WAL-log full page images if we change hint bits.  If we don't and
> > > reencrypt the page with the same nonce, don't we only expose the hint
> > > bits?  I was not suggesting we avoid changing the nonce in non-hint-bit
> > > cases.
> > >
> > > I don't understand your computation above.  You decrypt the page into
> > > shared buffers, you change a hint bit, and rewrite the page.  You are
> > > re-XOR'ing the buffer copy with the same key and nonce.  Doesn't that
> > > only change the hint bits in the new write?
> 
> Yea, I had a bit of a misfire there. Sorry.
> 
> I suspect that if we try to not disclose data if an attacker has write
> access, this still leaves us with issues around nonce reuse, unless we
> also employ integrity measures. Particularly due to CTR mode, which
> makes it easy to manipulate individual parts of the encrypted page
> without causing the decrypted page to be invalid. E.g. the attacker can
> just update pd_upper on the page by a small offset, and suddenly the
> replay will insert the tuple at a slightly shifted offset - which then
> seems to leak enough data to actually analyze things?

Yes, I don't think protecting from write access is a realistic goal at
this point, and frankly ever.  I think write access protection needs
all-cluster-file encryption.  This is documented:

    https://github.com/postgres/postgres/compare/master..bmomjian:_cfe-01-doc.patch

    Cluster file encryption does not protect against unauthorized
    file system writes.  Such writes can allow data decryption if
    used to weaken the system's security and the weakened system is
    later supplied with the externally-stored cluster encryption key.
    This also does not always detect if users with write access remove
    or modify database files.

If this needs more text, let me know.

> As the patch stands that seems trivially doable, because as I read it,
> most of the page header is not encrypted, and checksums are computed
> over the already-encrypted data. But even if that weren't the case,
> brute forcing 16 bits worth of checksum isn't too bad, even though it
> would obviously make an attack a lot more noisy.
> 
>
https://github.com/bmomjian/postgres/commit/7b43d37a5edb91c29ab6b4bb00def05def502c33#diff-0dcb5b2f36c573e2a7787994690b8fe585001591105f78e58ae3accec8f998e0R92
>     /*
>      * Check if the page has a special size == GISTPageOpaqueData, a valid
>      * GIST_PAGE_ID, no invalid GiST flag bits are set, and a valid LSN.  This
>      * is true for all GiST pages, and perhaps a few pages that are not.  The
>      * only downside of guessing wrong is that we might not update the LSN for
>      * some non-permanent relation page changes, and therefore reuse the IV,
>      * which seems acceptable.
>      */
> 
> Huh?

Are you asking about this C comment in relation to the discussion
above, or is it an independent question?  Are you asking what it means?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, May 26, 2021 at 04:46:29PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2021-05-25 22:23:46 -0400, Stephen Frost wrote:
> > Andres mentioned other possible cases where the LSN doesn’t change even
> > though we change the page and, as he’s probably right, we would have to
> > figure out a solution in those cases too (potentially including cases like
> > crash recovery or replay on a replica where we can’t really just go around
> > creating dummy WAL records to get new LSNs..).
> 
> Yea, I think there's quite a few of those. For one, we don't guarantee
> that the hole between pd_lower/upper is all zeroes. It e.g. contains
> old tuple data after deleted tuples are pruned away. But when logging an
> FPI, we omit that range. Which means that after crash recovery the area
> is zeroed out. There's several cases where padding can result in the
> same.
>
> Just look at checkXLogConsistency(), heap_mask() et al for all the
> differences that can occur and that need to be ignored for the recovery
> consistency checking to work.
> 
> Particularly the hole issue seems trivial to exploit, because we know
> the plaintext of the hole after crash recovery (0s).
> 
> 
> I don't see how using the LSN alone is salvageable.

OK, so you are saying the replica would have all zeros because of crash
recovery, so XOR'ing that with the encryption stream makes the
encryption stream visible, and you could use that to decrypt the dead
data on the primary.  That is an interesting case that we would need to
fix.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> > If we used a block cipher instead of a streaming one (CTR), this might
> > not work because the earlier blocks can be based on the output of
> > later blocks.
> 
> What made us choose CTR for WAL & data file encryption? I checked the
> README in the patchset and the wiki page, and neither seem to discuss
> that.
> 
> The dangers around nonce reuse, the space overhead of storing the nonce,
> the fact that single bit changes in the encrypted data don't propagate
> seem not great?  Why aren't we using something like XTS? It has obvious
> issues as well, but CTR's weaknesses seem at least as great. And if we
> want a MAC, then we don't want CTR either.

We chose CTR because it was fast, and we could use the same method for
WAL, which needs a streaming, not block, cipher.
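
For reference, a minimal standalone OpenSSL sketch (illustrative only,
not the patch's code) of the property we wanted: CTR encrypts input of
any length with no block padding:

    #include <stdio.h>
    #include <string.h>
    #include <openssl/evp.h>

    /*
     * Standalone sketch: CTR turns AES into a stream cipher, so any
     * input length encrypts without padding -- the property that made
     * it attractive for WAL.  Demo key/IV only; the IV must never
     * repeat for a given key.
     */
    int
    main(void)
    {
        unsigned char key[32] = {0};
        unsigned char iv[16] = {0};
        unsigned char in[] = "wal bytes of arbitrary length";
        unsigned char out[sizeof(in)];
        int         len = 0, fin = 0;
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

        EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv);
        EVP_EncryptUpdate(ctx, out, &len, in, sizeof(in));
        EVP_EncryptFinal_ex(ctx, out + len, &fin);  /* no padding emitted */
        printf("encrypted %d bytes\n", len + fin);
        EVP_CIPHER_CTX_free(ctx);
        return 0;
    }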

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, May 27, 2021 at 05:45:21PM +0800, Neil Chen wrote:
> Greetings,
> 
> On Thu, May 27, 2021 at 4:52 PM Bruce Momjian <bruce@momjian.us> wrote:
> 
>     >
>     > I am confused why checksums, which are widely used, acceptably require
>     > wal_log_hints, but there is concern that file encryption, which is
>     > heavier, cannot acceptably require wal_log_hints.  I must be missing
>     > something.
>     >
>     > Why can't checksums also throw away hint bit changes like you want to do
>     > for file encryption and not require wal_log_hints?
> 
> 
> 
> I'm really confused about it, too. I have read the discussion above, though I
> am not sure my understanding is correct... What we are facing is not only
> changes to flags such as *pd_flags*, but also others, like the btree
> pointer-array changes Robert mentioned, and we don't want those to require
> writing a WAL record.

Well, the code now does write full page images for hint bit changes, so
it should work fine.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, May 27, 2021 at 10:47:13AM -0400, Robert Haas wrote:
> On Wed, May 26, 2021 at 4:40 PM Bruce Momjian <bruce@momjian.us> wrote:
> > You are saying that by using a non-LSN nonce, you can write out the page
> > with a new nonce, but the same LSN, and also discard the page during
> > crash recovery and use the WAL copy?
> 
> I don't know what "discard the page during crash recovery and use the
> WAL copy" means.

I was asking how decoupling the nonce from the LSN allows us to avoid
full page writes for hint bit changes.  I am guessing you are
saying that on recovery, if we see a hint-bit-only change in the WAL
(with a new nonce), we just throw away the page because it could be torn
and use the WAL full page write version.

> > I am confused why checksums, which are widely used, acceptably require
> > wal_log_hints, but there is concern that file encryption, which is
> > heavier, cannot acceptably require wal_log_hints.  I must be missing
> > something.
> 
> I explained this in the first complete paragraph of my first email
> with this subject line: "For example, right now, we only need to WAL
> log hints for the first write to each page after a checkpoint, but in
> this approach, if the same page is written multiple times per
> checkpoint cycle, we'd need to log hints every time." That's a huge
> difference. Page eviction in some workloads can push the same pages
> out of shared buffers every few seconds, whereas something that has to
> be done once per checkpoint cycle cannot affect each page nearly so
> often. A checkpoint is only going to occur every 5 minutes by default,
> or more realistically every 10-15 minutes in a well-tuned production
> system. In other words, we're not holding up some kind of double
> standard, where the existing feature is allowed to depend on doing a
> certain thing but your feature isn't allowed to depend on the same
> thing. Your design depends on doing something which is potentially
> 100x+ more expensive than the existing thing. It's not always going to
> be that expensive, but it can be.

Yes, it might be 1e100+++ more expensive too, but we don't know, and I
am not ready to add a lot of complexity for such an unknown.

> > Why can't checksums also throw away hint bit changes like you want to do
> > for file encryption and not require wal_log_hints?
> 
> Well, I don't want to throw away hint bit changes, just like we don't
> throw them away right now. And I want to do that by making sure that
> each time the page is written, we use a different nonce, but without
> the expense of having to advance the LSN.
> 
> Now, another option is to do what you suggest here. We could say that
> if a dirty page is evicted, but the page is only dirty because of
> hint-type changes, we don't actually write it out. That does avoid
> using the same nonce for multiple writes, because now there's only one
> write. It also fixes the problem on standbys that Andres was
> complaining about, because on a standby, the only way a page can
> possibly be dirtied without an associated WAL record is through a
> hint-type change. However, I think we'd find that this, too, is pretty
> expensive in certain workloads. It's useful to write hint bits -
> that's why we do it.

Oh, that does sound nice.  It is kind of an exit hatch if we are
evicting pages often for hint bit changes.  I like it.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
"Andres Freund"
Date:
Hi,

On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote:
> On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> > > If we used a block cipher instead of a streaming one (CTR), this might
> > > not work because the earlier blocks can be based on the output of
> > > later blocks.
> > 
> > What made us choose CTR for WAL & data file encryption? I checked the
> > README in the patchset and the wiki page, and neither seem to discuss
> > that.
> > 
> > The dangers around nonce reuse, the space overhead of storing the nonce,
> > the fact that single bit changes in the encrypted data don't propagate
> > seem not great?  Why aren't we using something like XTS? It has obvious
> > issues as well, but CTR's weaknesses seem at least as great. And if we
> > want a MAC, then we don't want CTR either.
> 
> We chose CTR because it was fast, and we could use the same method for
> WAL, which needs a streaming, not block, cipher.

The WAL is block oriented too.

Andres



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote:
> > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote:
> > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> > > > If we used a block cipher instead of a streaming one (CTR), this might
> > > > not work because the earlier blocks can be based on the output of
> > > > later blocks.
> > >
> > > What made us choose CTR for WAL & data file encryption? I checked the
> > > README in the patchset and the wiki page, and neither seem to discuss
> > > that.
> > >
> > > The dangers around nonce reuse, the space overhead of storing the nonce,
> > > the fact that single bit changes in the encrypted data don't propagate
> > > seem not great?  Why aren't we using something like XTS? It has obvious
> > > issues as well, but CTR's weaknesses seem at least as great. And if we
> > > want a MAC, then we don't want CTR either.
> >
> > We chose CTR because it was fast, and we could use the same method for
> > WAL, which needs a streaming, not block, cipher.
>
> The WAL is block oriented too.

I'm curious what you'd suggest for the heap where we wouldn't be able to
have block chaining (at least, I presume we aren't talking about
rewriting entire segments whenever we change something in a heap).

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote:
> Rather than surprise anyone, I might as well just come out and say some
> things.  First, I have always admitted this feature has limited
> usefulness.  
> 
> I think a non-LSN nonce adds a lot of code complexity, which adds a code
> and maintenance burden.  It also prevents the creation of an encrypted
> replica from a non-encrypted primary using binary replication, which
> makes deployment harder.
> 
> Take a feature of limited usefulness, add code complexity and deployment
> difficulty, and the feature becomes even less useful.
> 
> For these reasons, if we decide to go in the direction of using a
> non-LSN nonce, I no longer plan to continue working on this feature. I
> would rather work on things that have a more positive impact.  Maybe a
> non-LSN nonce is a better long-term plan, but there are too many
> unknowns and complexity for me to feel comfortable with it.

I had some more time to think about this.  The big struggle for this
feature has not been writing it, but rather keeping it lean enough that
its code complexity will be acceptable for a feature of limited
usefulness.  (The Windows port and pg_upgrade took similar approaches.)

Thinking about the feature to add checksums online, it seems to have
failed because we over-complicated it.  If we had avoided the restart
requirement, the patch would probably be part of Postgres today.
However, a few people asked for
restart-ability, and since we don't really have much infrastructure to
do online whole-cluster changes, it added a lot of code.  Once the patch
was done, we looked at the code size and the benefits of the feature,
and decided it wasn't worth it.

I suspect that if we start adding a non-LSN nonce and malicious write
detection, we will end up with the same problem --- a complex patch for
a feature that has limited usefulness, and requires dump/restore or
logical replication to add it to a cluster.  I think such a patch would
be rejected, and I would probably even vote against it myself.

I don't want this to sound like I only want to do this my way, but I
also don't want to be silent when I smell failure, and if the
probability of failure gets too high, I am willing to abandon a feature
rather than continue.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 11:49:33 -0400, Stephen Frost wrote:
> * Andres Freund (andres@anarazel.de) wrote:
> > On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote:
> > > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote:
> > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> > > > > If we used a block cipher instead of a streaming one (CTR), this might
> > > > > not work because the earlier blocks can be based on the output of
> > > > > later blocks.
> > > > 
> > > > What made us choose CTR for WAL & data file encryption? I checked the
> > > > README in the patchset and the wiki page, and neither seem to discuss
> > > > that.
> > > > 
> > > > The dangers around nonce reuse, the space overhead of storing the nonce,
> > > > the fact that single bit changes in the encrypted data don't propagate
> > > > seem not great?  Why aren't we using something like XTS? It has obvious
> > > > issues as well, but CTR's weaknesses seem at least as great. And if we
> > > > want a MAC, then we don't want CTR either.
> > > 
> > > We chose CTR because it was fast, and we could use the same method for
> > > WAL, which needs a streaming, not block, cipher.
> > 
> > The WAL is block oriented too.
> 
> I'm curious what you'd suggest for the heap where we wouldn't be able to
> have block chaining (at least, I presume we aren't talking about
> rewriting entire segments whenever we change something in a heap).

What prevents us from using something like XTS? I'm not saying that that
is the right approach, due to the fact that it leaks information about a
block being the same as an earlier version of the same block. But right
now we are talking about using CTR without addressing the weaknesses CTR
has, where a failure to increase the nonce is fatal (the code even
documents known cases where that could happen!), and where there's no
error propagation within a block.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, May 27, 2021 at 08:34:51AM -0700, Andres Freund wrote:
> Hi,
> 
> On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote:
> > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> > > > If we used a block cipher instead of a streaming one (CTR), this might
> > > > not work because the earlier blocks can be based on the output of
> > > > later blocks.
> > > 
> > > What made us choose CTR for WAL & data file encryption? I checked the
> > > README in the patchset and the wiki page, and neither seem to discuss
> > > that.
> > > 
> > > The dangers around nonce reuse, the space overhead of storing the nonce,
> > > the fact that single bit changes in the encrypted data don't propagate
> > > seem not great?  Why aren't we using something like XTS? It has obvious
> > > issues as well, but CTR's weaknesses seem at least as great. And if we
> > > want a MAC, then we don't want CTR either.
> > 
> > We chose CTR because it was fast, and we could use the same method for
> > WAL, which needs a streaming, not block, cipher.
> 
> The WAL is block oriented too.

Well, AES block mode only does 16-byte blocks, as far as I know, and I
assume WAL is more granular than that.  Also, you need to know the bytes
_before_ the WAL to write a new 16-byte block, so it seems overly
complex for our usage too.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, May 27, 2021 at 11:19 AM Bruce Momjian <bruce@momjian.us> wrote:
> On Thu, May 27, 2021 at 10:47:13AM -0400, Robert Haas wrote:
> > On Wed, May 26, 2021 at 4:40 PM Bruce Momjian <bruce@momjian.us> wrote:
> > > You are saying that by using a non-LSN nonce, you can write out the page
> > > with a new nonce, but the same LSN, and also discard the page during
> > > crash recovery and use the WAL copy?
> >
> > I don't know what "discard the page during crash recovery and use the
> > WAL copy" means.
>
> I was asking how decoupling the nonce from the LSN allows us to avoid
> full page writes for hint bit changes.  I am guessing you are
> saying that on recovery, if we see a hint-bit-only change in the WAL
> (with a new nonce), we just throw away the page because it could be torn
> and use the WAL full page write version.

Well, in the design where the nonce is stored in the page, there is no
need for every hint-type change to appear in the WAL at all. Once per
checkpoint cycle, you need to write a full page image, as we do for
checksums or wal_log_hints. The rest of the time, you can just bump
the nonce and rewrite the page, same as we do today.
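
To see the shape of that write path, a stub sketch (every name here is
hypothetical, nothing is from the posted patches):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Hypothetical write path for a stored-nonce design, sketching
     * the paragraph above.  All names are illustrative stubs.
     */
    typedef struct
    {
        uint64_t    nonce;
        char        data[8192];
    } Page;

    static uint64_t next_nonce = 1;

    static bool first_write_since_checkpoint(Page *p) { (void) p; return false; }
    static void log_full_page_image(Page *p) { (void) p; }
    static void encrypt_and_write(Page *p) { (void) p; }

    static void
    write_page(Page *p)
    {
        if (first_write_since_checkpoint(p))
            log_full_page_image(p);     /* once per checkpoint cycle */
        p->nonce = next_nonce++;        /* new nonce on every write */
        encrypt_and_write(p);
    }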

> Yes, it might be 1e100+++ more expensive too, but we don't know, and I
> am not ready to add a lot of complexity for such an unknown.

No, it can't be 1e100+++ more expensive, because it's not
realistically possible for a page to be written to disk 1e100+++ times
per checkpoint cycle. It is however entirely possible for it to be
written 100 times per checkpoint cycle. That is not something unknown
about which we need to speculate; it is easy to see that this can
happen, even on a simple test like pgbench with a data set larger than
shared buffers.

It is not right to confuse "we have no idea whether this will be
expensive" with "how expensive this will be is workload-dependent,"
which is what you seem to be doing here. If we had no idea whether
something would be expensive, then I agree that it might not be worth
adding complexity for it, or maybe some testing should be done first
to find out. But if we know for certain that in some workloads
something can be very expensive, then we had better at least talk
about whether it is worth adding complexity in order to resolve the
problem. And that is the situation here.

I am not even convinced that storing the nonce in the block is going
to be more complex, because it seems to me that the patches I posted
upthread worked out pretty cleanly. There are some things to discuss
and think about there, for sure, but it is not like we are talking
about inventing warp drive.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 12:01:16 -0400, Bruce Momjian wrote:
> On Thu, May 27, 2021 at 08:34:51AM -0700, Andres Freund wrote:
> > On Thu, May 27, 2021, at 08:10, Bruce Momjian wrote:
> > > On Wed, May 26, 2021 at 05:11:24PM -0700, Andres Freund wrote:
> > > > On 2021-05-25 17:12:05 -0400, Bruce Momjian wrote:
> > > > > If we used a block cipher instead of a streaming one (CTR), this might
> > > > > not work because the earlier blocks can be based on the output of
> > > > > later blocks.
> > > > 
> > > > What made us choose CTR for WAL & data file encryption? I checked the
> > > > README in the patchset and the wiki page, and neither seem to discuss
> > > > that.
> > > > 
> > > > The dangers around nonce reuse, the space overhead of storing the nonce,
> > > > the fact that single bit changes in the encrypted data don't propagate
> > > > seem not great?  Why aren't we using something like XTS? It has obvious
> > > > issues as well, but CTR's weaknesses seem at least as great. And if we
> > > > want a MAC, then we don't want CTR either.
> > > 
> > > We chose CTR because it was fast, and we could use the same method for
> > > WAL, which needs a streaming, not block, cipher.
> > 
> > The WAL is block oriented too.
> 
> Well, AES block mode only does 16-byte blocks, as far as I know, and I
> assume WAL is more granular than that.

WAL is 8kB blocks by default. We only ever write it out with at least
that granularity.


> Also, you need to know the bytes _before_ the WAL to write a new
> 16-byte block, so it seems overly complex for our usage too.

See the XTS reference. Yes, it needs the previous 16 bytes, but only
within the 8kB page.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, May 27, 2021 at 12:03:00PM -0400, Robert Haas wrote:
> On Thu, May 27, 2021 at 11:19 AM Bruce Momjian <bruce@momjian.us> wrote:
> > I was asking how decoupling the nonce from the LSN allows us to
> > avoid full page writes for hint bit changes.  I am guessing you are
> > saying that on recovery, if we see a hint-bit-only change in the WAL
> > (with a new nonce), we just throw away the page because it could be torn
> > and use the WAL full page write version.
> 
> Well, in the design where the nonce is stored in the page, there is no
> need for every hint-type change to appear in the WAL at all. Once per
> checkpoint cycle, you need to write a full page image, as we do for
> checksums or wal_log_hints. The rest of the time, you can just bump
> the nonce and rewrite the page, same as we do today.

What is it about having the nonce be the LSN that doesn't allow that to
happen?  Could we just create a dummy LSN record and assign that to the
page and use that as a nonce?

> > Yes, it might be 1e100+++ more expensive too, but we don't know, and I
> > am not ready to add a lot of complexity for such an unknown.
> 
> No, it can't be 1e100+++ more expensive, because it's not
> realistically possible for a page to be written to disk 1e100+++ times
> per checkpoint cycle. It is however entirely possible for it to be
> written 100 times per checkpoint cycle. That is not something unknown
> about which we need to speculate; it is easy to see that this can
> happen, even on a simple test like pgbench with a data set larger than
> shared buffers.

I guess you didn't get my joke on that one.  ;-)

> It is not right to confuse "we have no idea whether this will be
> expensive" with "how expensive this will be is workload-dependent,"
> which is what you seem to be doing here. If we had no idea whether
> something would be expensive, then I agree that it might not be worth
> adding complexity for it, or maybe some testing should be done first
> to find out. But if we know for certain that in some workloads
> something can be very expensive, then we had better at least talk
> about whether it is worth adding complexity in order to resolve the
> problem. And that is the situation here.

Sure, but the downsides of avoiding it seem very high to me, not only in
code complexity but in requiring dump/reload or logical replication to
deploy.

> I am not even convinced that storing the nonce in the block is going
> to be more complex, because it seems to me that the patches I posted
> upthread worked out pretty cleanly. There are some things to discuss
> and think about there, for sure, but it is not like we are talking
> about inventing warp drive.

See above.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 11:10:00 -0400, Bruce Momjian wrote:
> On Wed, May 26, 2021 at 04:46:29PM -0700, Andres Freund wrote:
> > On 2021-05-25 22:23:46 -0400, Stephen Frost wrote:
> > > Andres mentioned other possible cases where the LSN doesn’t change even
> > > though we change the page and, as he’s probably right, we would have to
> > > figure out a solution in those cases too (potentially including cases like
> > > crash recovery or replay on a replica where we can’t really just go around
> > > creating dummy WAL records to get new LSNs..).
> > 
> > Yea, I think there's quite a few of those. For one, we don't guarantee
> > that the hole between pd_lower/upper is all zeroes. It e.g. contains
> > old tuple data after deleted tuples are pruned away. But when logging an
> > FPI, we omit that range. Which means that after crash recovery the area
> > is zeroed out. There's several cases where padding can result in the
> > same.
> >
> > Just look at checkXLogConsistency(), heap_mask() et al for all the
> > differences that can occur and that need to be ignored for the recovery
> > consistency checking to work.
> > 
> > Particularly the hole issue seems trivial to exploit, because we know
> > the plaintext of the hole after crash recovery (0s).
> > 
> > 
> > I don't see how using the LSN alone is salvageable.
> 
> OK, so you are saying the replica would have all zeros because of crash
> recovery, so XOR'ing that with the encryption steam makes the encryption
> stream visible, and you could use that to decrypt the dead data on the
> primary.  That is an interesting case that would need to fix.

I don't see how it's a viable security model to assume that you can
ensure that we never write different data with the same LSN. Yes, you
can fix a few cases, but how can we be confident that we're actually
doing a good job, when the consequences are pretty dramatic?

Nor do I think it's architecturally OK to impose a significant new
hurdle against doing any sort of "changing" writes on standbys.

It's time to move on from the idea of using the LSN as the nonce.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 10:57:24 -0400, Bruce Momjian wrote:
> On Wed, May 26, 2021 at 04:26:01PM -0700, Andres Freund wrote:
> > I suspect that if we try to not disclose data if an attacker has write
> > access, this still leaves us with issues around nonce reuse, unless we
> > also employ integrity measures. Particularly due to CTR mode, which
> > makes it easy to manipulate individual parts of the encrypted page
> > without causing the decrypted page to be invalid. E.g. the attacker can
> > just update pd_upper on the page by a small offset, and suddenly the
> > replay will insert the tuple at a slightly shifted offset - which then
> > seems to leak enough data to actually analyze things?
> 
> Yes, I don't think protecting from write access is a realistic goal at
> this point, and frankly ever.  I think write access protection needs
> all-cluster-file encryption.  This is documented:
> 
>     https://github.com/postgres/postgres/compare/master..bmomjian:_cfe-01-doc.patch
> 
>     Cluster file encryption does not protect against unauthorized
>     file system writes.  Such writes can allow data decryption if
>     used to weaken the system's security and the weakened system is
>     later supplied with the externally-stored cluster encryption key.
>     This also does not always detect if users with write access remove
>     or modify database files.
> 
> If this needs more text, let me know.

Well, it's one thing to say that it's not a complete protection, and
another that a few byte-sized writes to a single page are sufficient to
get access to encrypted data. And "all-cluster-file" encryption won't
help against the type of scenario I outlined.
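
The malleability is easy to show in miniature (a standalone toy with
made-up values, mirroring the XOR structure of CTR):

    #include <stdio.h>
    #include <stdint.h>

    /*
     * Hypothetical illustration: flipping bits in a CTR ciphertext
     * flips exactly the same bits in the decrypted plaintext, so an
     * attacker can nudge a field like pd_upper by a known offset
     * without touching anything else on the page.
     */
    int
    main(void)
    {
        uint16_t    pd_upper = 5120;        /* plaintext field value */
        uint16_t    keystream = 0xbeef;     /* whatever CTR produced here */
        uint16_t    ct = pd_upper ^ keystream;

        ct ^= 64;                           /* attacker XORs the ciphertext */
        printf("%u\n", (unsigned) (ct ^ keystream));    /* decrypts to 5184 */
        return 0;
    }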


> >
https://github.com/bmomjian/postgres/commit/7b43d37a5edb91c29ab6b4bb00def05def502c33#diff-0dcb5b2f36c573e2a7787994690b8fe585001591105f78e58ae3accec8f998e0R92
> >     /*
> >      * Check if the page has a special size == GISTPageOpaqueData, a valid
> >      * GIST_PAGE_ID, no invalid GiST flag bits are set, and a valid LSN.  This
> >      * is true for all GiST pages, and perhaps a few pages that are not.  The
> >      * only downside of guessing wrong is that we might not update the LSN for
> >      * some non-permanent relation page changes, and therefore reuse the IV,
> >      * which seems acceptable.
> >      */
> > 
> > Huh?
> 
> Are you asking about this C commention in relation to the discussion
> above, or is it an independent question?  Are asking what it means?

The comment is blithely waving away a fundamental no-no (reusing nonces)
when using CTR mode as "acceptable".

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, May 27, 2021 at 12:01 PM Andres Freund <andres@anarazel.de> wrote:
> What prevents us from using something like XTS? I'm not saying that that
> is the right approach, due to the fact that it leaks information about a
> block being the same as an earlier version of the same block. But right
> now we are talking about using CTR without addressing the weaknesses CTR
> has, where a failure to increase the nonce is fatal (the code even
> documents known cases where that could happen!), and where there's no
> error propagation within a block.

I spent some time this morning reading up on XTS in general and also
on previous discussions on this list on the list. It seems like XTS is
considered state-of-the-art for full disk encryption, and what we're
doing seems to me to be similar in concept. The most useful on-list
discussion that I found was on this thread:


https://www.postgresql.org/message-id/flat/c878de71-a0c3-96b2-3e11-9ac2c35357c3%40joeconway.com#19d3b7c37b9f84798f899360393584df

There are a lot of things that people said on that thread, but then
Bruce basically proposes CBC and/or CTR and I couldn't clearly
understand the reasons for that choice. Maybe there was some off-list
discussion of this that wasn't captured in the email traffic?

All that having been said, I am pretty sure I don't fully understand
what any of these modes involve. I gather that XTS requires two keys,
but it seems like it doesn't require a nonce. It seems to use a
"tweak" that is generated from the block number and the position
within the block (since an e.g. 8kB database block is being encrypted
as a bunch of 16-byte AES blocks) but apparently there's no problem
with the tweak being the same every time the block is encrypted? If no
nonce is required, that seems like a massive advantage, since then we
don't need to worry about how to get one or about how to make sure
it's never reused.
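
For what it's worth, OpenSSL exposes this mode directly; a minimal
standalone sketch (the blkno-based tweak construction is an assumption
for illustration, not a proposal) of encrypting one 8kB block:

    #include <stdint.h>
    #include <string.h>
    #include <openssl/evp.h>

    #define BLCKSZ 8192

    /*
     * Illustrative sketch: XTS takes a 64-byte key (two AES-256 keys;
     * the halves must differ) and a 16-byte tweak.  Deriving the
     * tweak from the block number means no stored nonce, at the cost
     * of revealing when a block is rewritten with identical contents.
     */
    static void
    encrypt_block_xts(const unsigned char *key64, uint32_t blkno,
                      const unsigned char *plain, unsigned char *cipher)
    {
        unsigned char tweak[16] = {0};
        int         len = 0, fin = 0;
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

        memcpy(tweak, &blkno, sizeof(blkno));   /* tweak = block number */
        EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key64, tweak);
        EVP_EncryptUpdate(ctx, cipher, &len, plain, BLCKSZ);
        EVP_EncryptFinal_ex(ctx, cipher + len, &fin);
        EVP_CIPHER_CTX_free(ctx);
    }

    int
    main(void)
    {
        unsigned char key64[64];
        unsigned char plain[BLCKSZ] = {0};
        unsigned char cipher[BLCKSZ];

        for (int i = 0; i < 64; i++)
            key64[i] = (unsigned char) i;   /* demo key; halves differ */
        encrypt_block_xts(key64, 42, plain, cipher);
        return 0;
    }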

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 10:47:13 -0400, Robert Haas wrote:
> Now, another option is to do what you suggest here. We could say that
> if a dirty page is evicted, but the page is only dirty because of
> hint-type changes, we don't actually write it out. That does avoid
> using the same nonce for multiple writes, because now there's only one
> write. It also fixes the problem on standbys that Andres was
> complaining about, because on a standby, the only way a page can
> possibly be dirtied without an associated WAL record is through a
> hint-type change.

What does that protect against that I was concerned about? That still
allows hint bits to be leaked, via

1) replay WAL record with FPI
2) hint bit change during read
3) incremental page change

i.e., comparing the page as written in 1) against the page as written
in 3). Even if we declare that OK, it doesn't actually address the
whole issue of WAL replay not necessarily re-creating bit-identical
page contents.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 12:28:39 -0400, Robert Haas wrote:
> All that having been said, I am pretty sure I don't fully understand
> what any of these modes involve. I gather that XTS requires two keys,
> but it seems like it doesn't require a nonce.

It needs a second secret, but that second secret can - as far as I
understand it - be generated using a strong prng and encrypted with the
"main" key, and stored in a central location.


> It seems to use a "tweak" that is generated from the block number and
> the position within the block (since an e.g. 8kB database block is
> being encrypted as a bunch of 16-byte AES blocks) but apparently
> there's no problem with the tweak being the same every time the block
> is encrypted?

Right. That comes with a price however: It leaks the information that a
block "version" is identical to an earlier version of the block. That's
obviously better than leaking information that allows decryption like
with the nonce reuse issue.

Nor does it provide integrity - which does seem like a significant issue
going forward. Which does require storing additional per-page data...

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, May 27, 2021 at 12:28:39PM -0400, Robert Haas wrote:
> On Thu, May 27, 2021 at 12:01 PM Andres Freund <andres@anarazel.de> wrote:
> > What prevents us from using something like XTS? I'm not saying that that
> > is the right approach, due to the fact that it leaks information about a
> > block being the same as an earlier version of the same block. But right
> > now we are talking about using CTR without addressing the weaknesses CTR
> > has, where a failure to increase the nonce is fatal (the code even
> > documents known cases where that could happen!), and where there's no
> > error propagation within a block.
> 
> I spent some time this morning reading up on XTS in general and also
> on previous discussions on this list on the list. It seems like XTS is
> considered state-of-the-art for full disk encryption, and what we're
> doing seems to me to be similar in concept. The most useful on-list
> discussion that I found was on this thread:
> 
>
https://www.postgresql.org/message-id/flat/c878de71-a0c3-96b2-3e11-9ac2c35357c3%40joeconway.com#19d3b7c37b9f84798f899360393584df
> 
> There are a lot of things that people said on that thread, but then
> Bruce basically proposes CBC and/or CTR and I couldn't clearly
> understand the reasons for that choice. Maybe there was some off-list
> discussion of this that wasn't captured in the email traffic?

There was no other discussion about XTS that I know of.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, May 27, 2021 at 12:31 PM Andres Freund <andres@anarazel.de> wrote:
> What does that protect against that I was concerned about? That still
> allows hint bits to be leaked, via
>
> 1) replay WAL record with FPI
> 2) hint bit change during read
> 3) incremental page change
>
> vs 1) 3). Even if we declare that OK, it doesn't actually address the
> whole issue of WAL replay not necessarily re-creating bit identical page
> contents.

You're right. That seems fatal, as it would lead to encrypting the
different versions of the page with the same IV on the master and the
standby, and the differences would consist of old data that could be
recovered by XORing the two encrypted page versions. To be clear, it
is tuple data that would be recovered, not just hint bits.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2021-05-27 12:28:39 -0400, Robert Haas wrote:
> > All that having been said, I am pretty sure I don't fully understand
> > what any of these modes involve. I gather that XTS requires two keys,
> > but it seems like it doesn't require a nonce.
>
> It needs a second secret, but that second secret can - as far as I
> understand it - be generated using a strong prng and encrypted with the
> "main" key, and stored in a central location.

Yes, I'm fairly confident this is the case.

> > It seems to use a "tweak" that is generated from the block number and
> > the position within the block (since an e.g. 8kB database block is
> > being encrypted as a bunch of 16-byte AES blocks) but apparently
> > there's no problem with the tweak being the same every time the block
> > is encrypted?
>
> Right. That comes with a price however: It leaks the information that a
> block "version" is identical to an earlier version of the block. That's
> obviously better than leaking information that allows decryption like
> with the nonce reuse issue.

Right, if we simply can't solve the nonce-reuse concern then that would
be better.

> Nor does it provide integrity - which does seem like a significant issue
> going forward. Which does require storing additional per-page data...

Yeah, this is one of the reasons that I hadn't really been thrilled with
XTS - I've really been looking down the road at eventually having GCM and
having actual integrity validation included.

That's not really a reason to rule it out though and Bruce's point about
having a way to get to an encrypted cluster from an unencrypted one is
certainly worth consideration.  Naturally, we'd need to document
everything appropriately but there isn't anything saying that we
couldn't, say, have XTS in v15 without any adjustments to the page
layout, accepting that there's no data integrity validation and focusing
just on encryption, and then returning to the question about adding in
data integrity validation for a future version, perhaps using the
special area for a nonce+tag with GCM or maybe something else.  Users
who wish to move to a cluster with encryption and data integrity
validation would have to get there through some other means than
replication, but that's going to always be the case because we have to
have space to store the tag, even if we can figure out some other
solution for the nonce.
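
If we do get there, the special-space addition might look roughly like
this (12-byte IV and 16-byte tag are the conventional GCM parameters;
everything else is an assumption, not a settled proposal):

    #include <stdint.h>

    /*
     * Hypothetical per-page trailer for a future GCM mode: a per-write
     * nonce and an authentication tag, stored in the page special
     * area.  Illustrative only.
     */
    typedef struct
    {
        uint8_t     iv[12];     /* per-write nonce */
        uint8_t     tag[16];    /* GCM tag authenticating the page */
    } tde_page_trailer;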

Thanks,

Stephen


Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 12:49:15 -0400, Stephen Frost wrote:
> That's not really a reason to rule it out though and Bruce's point about
> having a way to get to an encrypted cluster from an unencrypted one is
> certainly worth consideration.  Naturally, we'd need to document
> everything appropriately but there isn't anything saying that we
> couldn't, say, have XTS in v15 without any adjustments to the page
> layout, accepting that there's no data integrity validation and focusing
> just on encryption, and then returning to the question about adding in
> data integrity validation for a future version, perhaps using the
> special area for a nonce+tag with GCM or maybe something else.  Users
> who wish to move to a cluster with encryption and data integrity
> validation would have to get there through some other means than
> replication, but that's going to always be the case because we have to
> have space to store the tag, even if we can figure out some other
> solution for the nonce.

But won't we then end up with a different set of requirements around
nonce assignment durability when introducing GCM support? That's not
actually entirely trivial to do correctly on a standby. I guess we can
use AES-GCM-SIV and be ok with living with edge cases leading to nonce
reuse, but ...

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2021-05-27 12:49:15 -0400, Stephen Frost wrote:
> > That's not really a reason to rule it out though and Bruce's point about
> > having a way to get to an encrypted cluster from an unencrypted one is
> > certainly worth consideration.  Naturally, we'd need to document
> > everything appropriately but there isn't anything saying that we
> > couldn't, say, have XTS in v15 without any adjustments to the page
> > layout, accepting that there's no data integrity validation and focusing
> > just on encryption, and then returning to the question about adding in
> > data integrity validation for a future version, perhaps using the
> > special area for a nonce+tag with GCM or maybe something else.  Users
> > who wish to move to a cluster with encryption and data integrity
> > validation would have to get there through some other means than
> > replication, but that's going to always be the case because we have to
> > have space to store the tag, even if we can figure out some other
> > solution for the nonce.
>
> But won't we then end up with a different set of requirements around
> nonce assignment durability when introducing GCM support? That's not
> actually entirely trivial to do correctly on a standby. I guess we can
> use AES-GCM-SIV and be ok with living with edge cases leading to nonce
> reuse, but ...

Not sure if I'm entirely following the question, but I would have
thought the up-thread idea of generating a random part of the nonce for
each startup and then a global counter for the rest, which would be
written
whenever the page is updated (meaning it wouldn't have anything to do
with the LSN and would be stored in the special area as Robert
contemplated) would work for both primaries and replicas.
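
A minimal sketch of that construction (names, widths, and the lack of
locking are all illustrative simplifications):

    #include <stdint.h>
    #include <stdio.h>
    #include <openssl/rand.h>

    /*
     * Hypothetical sketch of the idea above: a nonce built from a
     * random per-startup prefix plus a monotonically increasing
     * counter.  The real thing would need atomics/locking and some
     * thought about durability.
     */
    typedef struct
    {
        uint64_t    startup_rand;   /* drawn once at postmaster start */
        uint64_t    counter;        /* bumped on every page write */
    } page_nonce;

    static uint64_t startup_rand;
    static uint64_t nonce_counter;

    static page_nonce
    nonce_next(void)
    {
        page_nonce  n;

        n.startup_rand = startup_rand;
        n.counter = ++nonce_counter;
        return n;
    }

    int
    main(void)
    {
        RAND_bytes((unsigned char *) &startup_rand, sizeof(startup_rand));
        page_nonce  n = nonce_next();

        printf("%016llx%016llx\n",
               (unsigned long long) n.startup_rand,
               (unsigned long long) n.counter);
        return 0;
    }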

Taking a step back, while I like the idea of trying to think through
these complications in a future world where we add GCM support, if we're
actually agreed on seriously looking at XTS for v15 then maybe we should
focus on that for the moment.  As Bruce says, there's a lot of moving
parts in this patch that likely need discussion and agreement in order
for us to be able to move forward with it.  For one, we'd probably want
to get agreement on what we'd use to construct the tweak, for starters.

Thanks,

Stephen


Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, May 27, 2021 at 12:15 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Well, in the design where the nonce is stored in the page, there is no
> > need for every hint-type change to appear in the WAL at all. Once per
> > checkpoint cycle, you need to write a full page image, as we do for
> > checksums or wal_log_hints. The rest of the time, you can just bump
> > the nonce and rewrite the page, same as we do today.
>
> What is it about having the nonce be the LSN that doesn't allow that to
> happen?  Could we just create a dummy LSN record and assign that to the
> page and use that as a nonce?

I can't tell which of two possible proposals you are describing here.
If the LSN is used to derive the nonce, then one option is to just log
a WAL record every time we need a new nonce. As I understand it,
that's basically what you've already implemented, and we've discussed
the disadvantages of that approach at some length already. The basic
problems seem to be:

- It's potentially very expensive if page evictions are frequent,
which they will be whenever the workload is write-heavy and the
working set is larger than shared_buffers.
- If there's ever a situation where we need to write a page image
different from any page image written previously and we cannot at that
time write a WAL record to generate a new LSN for use as the nonce,
then the algorithm is broken entirely. Andres's latest post points out
- I think correctly - that this happens on standbys, because WAL
replay does not generate byte-identical results on standbys even if
you ignore hint bits.

The first point strikes me as a sufficiently serious performance
problem to justify giving up on this design, but that's a judgement
call. The second one seems like it breaks it entirely.

Now, there's another possible direction that is also suggested by your
remarks here: maybe you meant using a fake LSN in cases where we can't
use a real one. For example, suppose you decide to reserve half of the
LSN space - all LSNs with the high bit set, for example - for this
purpose. Well, you somehow need to ensure that you never use one of
those values more than once, so you might think of putting a counter
in shared memory. But now imagine a master with two standbys. How
would you avoid having the same counter value used on one standby and
also on the other standby? Even if they use the same counter for
different pages, it's a critical security flaw. And since those
standbys don't even need to know that the other one exists, that seems
pretty well impossible to avoid.

Now you might ask why we don't have the same problem if we store the
nonce in the special space. One difference is that if you store the
nonce explicitly, you can allow however much bit space you need in
order to guarantee uniqueness, whereas reserving half the LSN space
only gives you 63 bits. That's not enough to achieve uniqueness
without tight coordination. With 128 bits, you can do things like just
generate random values and assume they're vanishingly unlikely to
collide, or randomly generate half the value and use the other half as
a counter and be pretty safe. With 63 bits you just don't have enough
bit space available to reliably avoid collisions using algorithms of
that type, due to the birthday paradox. I think it would be adequate
for uniqueness if there were a single shared counter and every
allocation came from it, but again, as soon as you imagine a master
and a bunch of standbys, that breaks down.
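
To put rough numbers on the birthday bound (using the standard
approximation for k values drawn uniformly at random from an n-bit
space):

    p(collision) ~= k^2 / 2^(n+1)

    n = 63 bits,  k = 2^32 writes:  p ~= 2^64 / 2^64  ~= 1     (near certain)
    n = 128 bits, k = 2^32 writes:  p  = 2^64 / 2^129  = 2^-65 (negligible)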

Also, it's not entirely clear to me that you can avoid needing the LSN
space on the page for a real LSN at the same time you also need it for
a fake-LSN-being-used-as-a-nonce. We rely on the LSN field containing
the LSN of the last WAL record for the page in order to obey the
WAL-before-data rule, without which crash recovery will not work
reliably. Now, if you sometimes have to use that field for a nonce
that is a fake LSN, that means you no longer always have a place to
store the real LSN. I can't convince myself off-hand that it's
completely impossible to work around that problem, but it seems like
any attempt to do so would be complicated and fragile at best. I don't
think that's a direction that we want to go. Making crash recovery
work reliably is a hard problem where we've had lots of bugs despite
years of dedicated effort. TDE is also complex and has lots of
pitfalls of its own. If we take two things which are individually
complicated and hard to get right and intertwine them by making them
share bit-space, I think it drives the complexity up to a level where
we don't have much hope of getting things right.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 12:00:03 -0400, Bruce Momjian wrote:
> On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote:
> > Rather than surprise anyone, I might as well just come out and say some
> > things.  First, I have always admitted this feature has limited
> > usefulness.  
> > 
> > I think a non-LSN nonce adds a lot of code complexity, which adds a code
> > and maintenance burden.  It also prevents the creation of an encrypted
> > replica from a non-encrypted primary using binary replication, which
> > makes deployment harder.
> > 
> > Take a feature of limited usefulness, add code complexity and deployment
> > difficulty, and the feature becomes even less useful.
> > 
> > For these reasons, if we decide to go in the direction of using a
> > non-LSN nonce, I no longer plan to continue working on this feature. I
> > would rather work on things that have a more positive impact.  Maybe a
> > non-LSN nonce is a better long-term plan, but there are too many
> > unknowns and complexity for me to feel comfortable with it.
> 
> [...]
> I suspect that if we start adding a non-LSN nonce and malicious write
> detection, we will end up with the same problem --- a complex patch for
> a feature that has limited usefulness, and requires dump/restore or
> logical replication to add it to a cluster.  I think such a patch would
> be rejected, and I would probably even vote against it myself.

I think it's diametrically the opposite. Using the LSN as the nonce
requires that all code modifying pages be audited (which
clearly hasn't been done yet), whereas an independent nonce can be
maintained in a few central places. And that's not just a one-off issue,
it's a forevermore issue.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, May 27, 2021 at 12:49 PM Stephen Frost <sfrost@snowman.net> wrote:
> Right, if we simply can't solve the nonce-reuse concern then that would
> be better.

Given the issues that Andres raised about standbys and the treatment
of the "hole," I see using the LSN for the nonce as a dead-end. I
think it's pretty bad on performance grounds too, for reasons already
discussed, but you could always hypothesize that people care so much
about security that they will ignore any amount of trouble with
performance. You can hardly hypothesize that those same people also
won't mind security vulnerabilities that expose tuple data, though.

I don't think the idea of storing the nonce at the end of the page is
dead. There seem to be some difficulties there, but I think there are
reasonable prospects of solving them. At the very least there's the
brute-force approach of generating a ton of cryptographically strong
random numbers, and there seems to be some possibility of doing better
than that.

However, I'm pretty excited by this idea of using XTS. Now granted I
didn't have the foggiest idea what XTS was before today, but I hear
you and Andres saying that we can use that approach without needing a
nonce at all. That seems to make a lot of the problems we're talking
about here just go away.

> > Nor does it provide integrity - which does seem like a significant issue
> > going forward. Which does require storing additional per-page data...
>
> Yeah, this is one of the reasons that I hadn't really been thrilled with
> XTS - I've really been looking down the road at eventually having GCM and
> having actual integrity validation included.
>
> That's not really a reason to rule it out though and Bruce's point about
> having a way to get to an encrypted cluster from an unencrypted one is
> certainly worth consideration.  Naturally, we'd need to document
> everything appropriately but there isn't anything saying that we
> couldn't, say, have XTS in v15 without any adjustments to the page
> layout, accepting that there's no data integrity validation and focusing
> just on encryption, and then returning to the question about adding in
> data integrity validation for a future version, perhaps using the
> special area for a nonce+tag with GCM or maybe something else.  Users
> who wish to move to a cluster with encryption and data integrity
> validation would have to get there through some other means than
> replication, but that's going to always be the case because we have to
> have space to store the tag, even if we can figure out some other
> solution for the nonce.

+1 from me to all of this except the idea of foreclosing present
discussion on how data-integrity validation could be made to work. I
think it would be great to have more discussion of that problem now, in
case it informs our decisions about anything else, especially because,
based on your earlier remarks, it seems like there is some coupling
between the two problems.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 13:26:11 -0400, Stephen Frost wrote:
> * Andres Freund (andres@anarazel.de) wrote:
> > On 2021-05-27 12:49:15 -0400, Stephen Frost wrote:
> > > That's not really a reason to rule it out though and Bruce's point about
> > > having a way to get to an encrypted cluster from an unencrypted one is
> > > certainly worth consideration.  Naturally, we'd need to document
> > > everything appropriately but there isn't anything saying that we
> > > couldn't, say, have XTS in v15 without any adjustments to the page
> > > layout, accepting that there's no data integrity validation and focusing
> > > just on encryption, and then returning to the question about adding in
> > > data integrity validation for a future version, perhaps using the
> > > special area for a nonce+tag with GCM or maybe something else.  Users
> > > who wish to move to a cluster with encryption and data integrity
> > > validation would have to get there through some other means than
> > > replication, but that's going to always be the case because we have to
> > > have space to store the tag, even if we can figure out some other
> > > solution for the nonce.
> > 
> > But won't we then end up with a different set of requirements around
> > nonce assignment durability when introducing GCM support? That's not
> > actually entirely trivial to do correctly on a standby. I guess we can
use AES-GCM-SIV and be ok with living with edge cases leading to nonce
> > reuse, but ...
> 
> Not sure if I'm entirely following the question

It seems like it might end up with lots of duplicated effort to go for
XTS in the short term, if we medium term then have to solve all the
issues around how to maintain nonces efficiently and correctly anyway,
because we want integrity support.


> but I would have thought the up-thread idea of generating a random
> part of the nonce for each start up and then a global counter for the
> rest, which would be written whenever the page is updated (meaning it
> wouldn't have anything to do with the LSN and would be stored in the
> special area as Robert contemplated) would work for both primaries and
> replicas.

Yea, it's not a bad approach. Particularly because it removes the need
to ensure that "global nonce counter" increments are guaranteed to be
durable.
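
To make that concrete, a minimal sketch (names invented here; in a real
patch the prefix and counter would live in shared memory):

    #include "postgres.h"
    #include "port/atomics.h"

    static uint8 nonce_prefix[8];           /* drawn once per startup */
    static pg_atomic_uint64 nonce_counter;  /* needs no durability */

    void
    nonce_startup_init(void)
    {
        if (!pg_strong_random(nonce_prefix, sizeof(nonce_prefix)))
            elog(ERROR, "could not generate nonce prefix");
        pg_atomic_init_u64(&nonce_counter, 0);
    }

    void
    nonce_next(uint8 *nonce)                /* fills 16 bytes */
    {
        uint64      ctr = pg_atomic_fetch_add_u64(&nonce_counter, 1);

        memcpy(nonce, nonce_prefix, 8);
        memcpy(nonce + 8, &ctr, sizeof(ctr));
    }

A restart re-randomizes the prefix, so counter values can repeat across
restarts without repeating the full nonce.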


> For one, we'd probably want to get agreement on what we'd use to
> construct the tweak, for starters.

Hm, isn't that just a pg_strong_random() and storing it encrypted?

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2021-05-27 13:26:11 -0400, Stephen Frost wrote:
> > * Andres Freund (andres@anarazel.de) wrote:
> > > On 2021-05-27 12:49:15 -0400, Stephen Frost wrote:
> > > > That's not really a reason to rule it out though and Bruce's point about
> > > > having a way to get to an encrypted cluster from an unencrypted one is
> > > > certainly worth consideration.  Naturally, we'd need to document
> > > > everything appropriately but there isn't anything saying that we
> > > > couldn't, say, have XTS in v15 without any adjustments to the page
> > > > layout, accepting that there's no data integrity validation and focusing
> > > > just on encryption, and then returning to the question about adding in
> > > > data integrity validation for a future version, perhaps using the
> > > > special area for a nonce+tag with GCM or maybe something else.  Users
> > > > who wish to move to a cluster with encryption and data integrity
> > > > validation would have to get there through some other means than
> > > > replication, but that's going to always be the case because we have to
> > > > have space to store the tag, even if we can figure out some other
> > > > solution for the nonce.
> > >
> > > But won't we then end up with a different set of requirements around
> > > nonce assignment durability when introducing GCM support? That's not
> > > actually entirely trivial to do correctly on a standby. I guess we can
> > use AES-GCM-SIV and be ok with living with edge cases leading to nonce
> > > reuse, but ...
> >
> > Not sure if I'm entirely following the question
>
> It seems like it might end up with lots of duplicated effort to go for
> XTS in the short term, if we medium term then have to solve all the
> issues around how to maintain nonces efficiently and correctly anyway,
> because we want integrity support.

You and Robert both seem to be going in that direction, one which I tend
to share, while Bruce is very hard set against it from the perspective
that he doesn't view integrity as important (I disagree quite strongly
with that, even if we can't protect everything, I see it as certainly
valuable to protect the primary data) and that this approach adds
complexity (the amount of which doesn't seem to be agreed upon).

I'm also not sure how much of the effort would really be duplicated.

Were we to start with XTS, that's almost drop-in with what Bruce has
(actually, it should simplify some parts since we no longer need to deal
with making sure we always increase the LSN, etc), and gives users more
flexibility in terms of getting to an encrypted cluster and solves
certain use-cases.  Very little of that seems like it would be ripped
out if we were to (also) provide a GCM option.

Now, if we were to *only* provide a GCM option then maybe we wouldn't
need to think about the XTS case of having to come up with a tweak
(though that seems like a rather small amount of code) but that would
also mean we need to change the page format and we can't do any kind of
binary/page-level transition to an encrypted cluster, like we could
with XTS.

Trying to break it down, the end-goal states look like:

GCM-only: no binary upgrade path due to having to store the tag
XTS-only: no data integrity option
GCM+XTS: binary upgrade path for XTS, data integrity with GCM

If we want both a binary upgrade path, and a data integrity option, then
it seems like the only end state which provides both is GCM+XTS, in
which case I don't think there's a lot of actual duplication.

Perhaps there's an "XTS + some other data integrity approach" option
where we could preserve the page format by stuffing information into
another fork or maybe telling users to hash their data and store that
hash as another column which would allow us to avoid implementing GCM,
but I don't see a way to avoid having XTS if we are going to provide a
binary upgrade path.

Perhaps AES-GCM-SIV would be interesting to consider in general, but
that still means we need to find space for the tag and that still
precludes a binary upgrade path.

> > but I would have thought the up-thread idea of generating a random
> > part of the nonce for each start up and then a global counter for the
> > rest, which would be written whenever the page is updated (meaning it
> > wouldn't have anything to do with the LSN and would be stored in the
> > special area as Robert contemplated) would work for both primaries and
> > replicas.
>
> Yea, it's not a bad approach. Particularly because it removes the need
> to ensure that "global nonce counter" increments are guaranteed to be
> durable.

Right.

> > For one, we'd probably want to get agreement on what we'd use to
> > construct the tweak, for starters.
>
> Hm, isn't that just a pg_strong_random() and storing it encrypted?

Perhaps it is, but at least in some other cases it's generated based on
sector and block (which maybe could be relfilenode and block for us?):

https://medium.com/asecuritysite-when-bob-met-alice/who-needs-a-tweak-meet-full-disk-encryption-437e720879ac

Thanks,

Stephen

Attachment

Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, May 27, 2021 at 1:07 PM Andres Freund <andres@anarazel.de> wrote:
> But won't we then end up with a different set of requirements around
> nonce assignment durability when introducing GCM support? That's not
> actually entirely trivial to do correctly on a standby. I guess we can
> use AES-GCM-SIV and be ok with living with edge cases leading to nonce
> reuse, but ...

All these different encryption modes are hard for me to grok.

That said, I want to mention a point which I think may be relevant
here. As far as I know, in the case of a permanent table page, we
never write X then X' then X again. If the change is WAL-logged, then
the LSN advances, and it will never thereafter go backward. Otherwise,
it's something that uses MarkBufferDirtyHint(). As far as I know, all
of those changes are one-way. For example, we set hint bits without
logging the change, but anything that clears hint bits is logged. We
mark btree index items dead as a type of hint, but they never come
back to life again; instead, they get cleaned out of the page entirely
as a WAL-logged operation. So I don't know that an adversary seeing
the same exact ciphertext multiple times is really likely to occur.

Well, it could certainly occur for temporary or unlogged tables, since
those have LSN = 0. And in cases where we currently copy pages around,
like creating a new database, it could happen. I suspect those cases
could be fixed, if we cared enough, and there are independent reasons
to want to fix the create-new-database case. It would be fairly easy
to put fake LSNs in temporary buffers, since they're in a separate
pool of buffers in backend-private memory with a separate buffer
manager. And it could probably even be done for unlogged tables,
though not as easily. Or we could use the special-space technique to
put some unpredictable garbage into each page and then change the
garbage every time we write the page. I read the discussion so far to
say that maybe these kinds of measures aren't even needed, and if so,
great. But even without doing anything, I don't think it's going to
happen very much.

Another case where this sort of thing might happen is a standby doing
whatever the master did. I suppose that could be avoided if the
standby always has its own encryption keys, but that forces a key
rotation when you create a standby, and it doesn't seem like a lot of
fun to insist on that. But the information leak seems minor. If we get
to a point where an adversary with full filesystem access on all our
systems can't do better than assessing our replication lag, we'll be a
lot better off then than we are now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 15:22:21 -0400, Stephen Frost wrote:
> I'm also not sure how much of the effort would really be duplicated.
>
> Were we to start with XTS, that's almost drop-in with what Bruce has
> (actually, it should simplify some parts since we no longer need to deal
> with making sure we always increase the LSN, etc) gives users more
> flexibility in terms of getting to an encrypted cluster and solves
> certain use-cases.  Very little of that seems like it would be ripped
> out if we were to (also) provide a GCM option.

> Now, if we were to *only* provide a GCM option then maybe we wouldn't
> need to think about the XTS case of having to come up with a tweak
> (though that seems like a rather small amount of code) but that would
> also mean we need to change the page format and we can't do any kind of
> binary/page-level transition to an encrypted cluster, like we could
> with XTS.

> Trying to break it down, the end-goal states look like:
>
> GCM-only: no binary upgrade path due to having to store the tag
> XTS-only: no data integrity option
> GCM+XTS: binary upgrade path for XTS, data integrity with GCM

Why would GCM + XTS make sense? Especially if we were to go with
AES-GCM-SIV or something, drastically reducing the danger of nonce
reuse?

And I don't think there's an easy way to do both using openssl, without
double encrypting, which we'd obviously not want for performance
reasons. And I don't think we'd want to implement either ourselves -
leaving other dangers aside, I don't think we want to do the
optimization work necessary to get good performance.


> If we want both a binary upgrade path, and a data integrity option, then
> it seems like the only end state which provides both is GCM+XTS, in
> which case I don't think there's a lot of actual duplication.

I honestly feel that Bruce's point about trying to shoot for the moon,
and thus not getting the basic feature done, applies much more to the
binary upgrade path than anything else. I think we should just stop
aiming for that for now. If we later want to add code that goes through
the cluster to ensure that there's enough space on each page for
integrity data, to provide a migration path, fine. But we shouldn't make
the binary upgrade path for TED a hard requirement.


> > > For one, we'd probably want to get agreement on what we'd use to
> > > construct the tweak, for starters.
> >
> > Hm, isn't that just a pg_strong_random() and storing it encrypted?
>
> Perhaps it is, but at least in some other cases it's generated based on
> sector and block (which maybe could be relfilenode and block for us?):
>
> https://medium.com/asecuritysite-when-bob-met-alice/who-needs-a-tweak-meet-full-disk-encryption-437e720879ac

My understanding is that you'd use
    tweak_secret + block_offset
or
    someop(tweak_secret, relfilenode) + block_offset

to generate the actual per-block (in the 8192 byte, not 128bit sense) tweak.
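
Roughly, in code (the layout and the XOR mixing are illustrative
assumptions only, not a proposal):

    /*
     * Sketch: derive a 16-byte per-block tweak from a secret, the
     * relfilenode, and the block number.
     */
    static void
    build_tweak(uint8 tweak[16], const uint8 secret[16],
                Oid relfilenode, BlockNumber blkno)
    {
        uint64      rel = (uint64) relfilenode;
        uint64      blk = (uint64) blkno;
        int         i;

        memcpy(tweak, secret, 16);
        for (i = 0; i < 8; i++)
        {
            tweak[i] ^= (uint8) (rel >> (8 * i));
            tweak[8 + i] ^= (uint8) (blk >> (8 * i));
        }
    }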

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, May 27, 2021 at 3:22 PM Stephen Frost <sfrost@snowman.net> wrote:
> Trying to break it down, the end-goal states look like:
>
> GCM-only: no binary upgrade path due to having to store the tag
> XTS-only: no data integrity option
> GCM+XTS: binary upgrade path for XTS, data integrity with GCM
>
> If we want both a binary upgrade path, and a data integrity option, then
> it seems like the only end state which provides both is GCM+XTS, in
> which case I don't think there's a lot of actual duplication.
>
> Perhaps there's an "XTS + some other data integrity approach" option
> where we could preserve the page format by stuffing information into
> another fork or maybe telling users to hash their data and store that
> hash as another column which would allow us to avoid implementing GCM,
> but I don't see a way to avoid having XTS if we are going to provide a
> binary upgrade path.
>
> Perhaps AES-GCM-SIV would be interesting to consider in general, but
> that still means we need to find space for the tag and that still
> precludes a binary upgrade path.

Anything that decouples features without otherwise losing ground is a
win. If there are things A and B, such that A does encryption and B
does integrity validation, and A and B can be turned on and off
independently of each other, that is better than some
otherwise-comparable C that provides both features.

But I'm going to have to defer to you and Andres and whoever else on
whether that's true for any encryption methods/modes in particular.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 15:48:09 -0400, Robert Haas wrote:
> That said, I want to mention a point which I think may be relevant
> here. As far as I know, in the case of a permanent table page, we
> never write X then X' then X again.

Well, there's crash recovery / restarts. And as previously explained
they can end up with different page contents than before.


> And in cases where we currently copy pages around, like creating a new
> database, it could happen.

As long as it's identical data, that should be fine, except leaking that
the data is identical. Which doesn't make me really concerned in case of
template databases.


> I suspect those cases could be fixed, if we cared enough, and there
> are independent reasons to want to fix the create-new-database
> case. It would be fairly easy to put fake LSNs in temporary buffers,
> since they're in a separate pool of buffers in backend-private memory
> with a separate buffer manager. And it could probably even be done for
> unlogged tables, though not as easily. [...] I read
> the discussion so far to say that maybe these kinds of measures aren't
> even needed, and if so, great. But even without doing anything, I
> don't think it's going to happen very much.

What precisely are you referring to with "aren't even needed"?

I don't see how the fake LSN approach can work for the crash recovery
issues?


> Or we could use the special-space technique to put some unpredictable
> garbage into each page and then change the garbage every time we write
> the page

Unfortunately with CTR mode that doesn't provide much protection, if
it's part of the encrypted data (vs IV/nonce). A one bit change in the
encrypted data only changes one bit in the unencrypted data, as the data
is just XORd with the cipher stream. So random changes in one place
don't prevent disclosure in other parts of the data if the nonce
doesn't also change.  And one can easily predict the effect of flipping
certain bits.
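
(That malleability is easy to demonstrate with plain XOR, which is all
CTR adds on top of the keystream; toy example with a made-up keystream:)

    #include <stdio.h>

    int
    main(void)
    {
        unsigned char ks[4] = {0xde, 0xad, 0xbe, 0xef}; /* stand-in keystream */
        unsigned char pt[4] = {'A', 'B', 'C', 'D'};
        unsigned char ct[4];
        int         i;

        for (i = 0; i < 4; i++)
            ct[i] = pt[i] ^ ks[i];  /* "encrypt" */

        ct[0] ^= 0x01;              /* attacker flips one ciphertext bit */

        for (i = 0; i < 4; i++)
            putchar(ct[i] ^ ks[i]); /* prints "@BCD": exactly the flipped
                                     * bit changed, predictably */
        putchar('\n');
        return 0;
    }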


> Another case where this sort of thing might happen is a standby doing
> whatever the master did. I suppose that could be avoided if the
> standby always has its own encryption keys, but that forces a key
> rotation when you create a standby, and it doesn't seem like a lot of
> fun to insist on that. But the information leak seems minor.

Which leaks seem minor? The "hole" issues leak all the prior contents of
the hole, without needing any complicated analysis of the data, because
one plain text is known (zeroes).

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2021-05-27 15:22:21 -0400, Stephen Frost wrote:
> > I'm also not sure how much of the effort would really be duplicated.
> >
> > Were we to start with XTS, that's almost drop-in with what Bruce has
> > (actually, it should simplify some parts since we no longer need to deal
> > with making sure we always increase the LSN, etc), and gives users more
> > flexibility in terms of getting to an encrypted cluster and solves
> > certain use-cases.  Very little of that seems like it would be ripped
> > out if we were to (also) provide a GCM option.
>
> > Now, if we were to *only* provide a GCM option then maybe we wouldn't
> > need to think about the XTS case of having to come up with a tweak
> > (though that seems like a rather small amount of code) but that would
> > also mean we need to change the page format and we can't do any kind of
> > binary/page-level transition to an encrypted cluster, like we could
> > with XTS.
>
> > Trying to break it down, the end-goal states look like:
> >
> > GCM-only: no binary upgrade path due to having to store the tag
> > XTS-only: no data integrity option
> > GCM+XTS: binary upgrade path for XTS, data integrity with GCM
>
> Why would GCM + XTS make sense? Especially if we were to go with
> AES-GCM-SIV or something, drastically reducing the danger of nonce
> reuse?

You can't get to a GCM-based solution without changing the page format
and therefore you can't get there using streaming replication or a
pg_upgrade that does an encrypt step along with the copy.

> And I don't think there's an easy way to do both using openssl, without
> double encrypting, which we'd obviously not want for performance
> reasons. And I don't think we'd want to implement either ourselves -
> leaving other dangers aside, I don't think we want to do the
> optimization work necessary to get good performance.

Errrr, clearly a misunderstanding here- what I'm suggesting is that we'd
have initdb options where someone could initdb and say they want XTS, OR
they could initdb and say they want AES-GCM (or maybe AES-GCM-SIV).  I'm
not talking about doing both in the cluster at the same time..

Or, with XTS, we could have an option to pg_basebackup + encrypt into
XTS to build an encrypted replica from an unencrypted cluster.  There
isn't any way we could do that with GCM though since we wouldn't have
any place to put the tag.

> > If we want both a binary upgrade path, and a data integrity option, then
> > it seems like the only end state which provides both is GCM+XTS, in
> > which case I don't think there's a lot of actual duplication.
>
> I honestly feel that Bruce's point about trying to shoot for the moon,
> and thus not getting the basic feature done, applies much more to the
> binary upgrade path than anything else. I think we should just stop
> aiming for that for now. If we later want to add code that goes through
> the cluster to ensure that there's enough space on each page for
> integrity data, to provide a migration path, fine. But we shouldn't make
> the binary upgrade path for TED a hard requirement.

Ok, that's a pretty clear fundamental disagreement between you and
Bruce.  For my 2c, I tend to agree with you that the binary upgrade path
isn't that critical.

If we agree to forgo the binary upgrade requirement and are willing to
accept Robert's approach to use the special area for the nonce+tag, or
similar, then we could perhaps avoid the work of supporting XTS.

> > > > For one, we'd probably want to get agreement on what we'd use to
> > > > construct the tweak, for starters.
> > >
> > > Hm, isn't that just a pg_strong_random() and storing it encrypted?
> >
> > Perhaps it is, but at least in some other cases it's generated based on
> > sector and block (which maybe could be relfilenode and block for us?):
> >
> > https://medium.com/asecuritysite-when-bob-met-alice/who-needs-a-tweak-meet-full-disk-encryption-437e720879ac
>
> My understanding is that you'd use
>     tweak_secret + block_offset
> or
>     someop(tweak_secret, relfilenode) + block_offset
>
> to generate the actual per-block (in the 8192 byte, not 128bit sense) tweak.

The above article, at least, suggested encrypting the sector number
using the second key and then multiplying that times 2^(block number),
where those blocks were actually AES 128bit blocks.  The article further
claims that this is what's used in things like Bitlocker, TrueCrypt,
VeraCrypt and OpenSSL.

While the documentation isn't super clear, I'm taking that to mean that
when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it
with a 256-bit key (twice the size of the AES key length function), and
you give it a 'tweak', that what you would actually be passing in would
be the "sector number" in the above method, or for us perhaps it would
be relfilenode+block number, or maybe just block number but it seems
like it'd be better to include the relfilenode to me.

OpenSSL docs:

https://www.openssl.org/docs/man1.1.1/man3/EVP_aes_256_cbc.html

Naturally, we would implement testing and use the NIST AES-XTS test
vectors to verify that we're getting the correct results from OpenSSL
based on this understanding.  Still leaves us with the question of what
exactly we should pass into OpenSSL as the 'tweak', if it should be the
block offset inside the file only, or the block offset + relfilenode, or
something else.
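
As a concrete sketch of the EVP call sequence (error handling trimmed;
using the block number as the low 8 bytes of the tweak is just one of
the possibilities under discussion, not a decision):

    #include <stdint.h>
    #include <string.h>
    #include <openssl/evp.h>

    /*
     * EVP_aes_128_xts() takes a 256-bit key (two 128-bit keys), and the
     * tweak is passed where the IV normally goes.
     */
    static int
    encrypt_block_xts(const unsigned char key[32], uint64_t blkno,
                      const unsigned char *in, unsigned char *out, int len)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        unsigned char tweak[16] = {0};
        int         outl = 0,
                    tmpl = 0,
                    ok;

        memcpy(tweak, &blkno, sizeof(blkno));   /* high 8 bytes stay zero */

        ok = ctx != NULL &&
            EVP_EncryptInit_ex(ctx, EVP_aes_128_xts(), NULL, key, tweak) == 1 &&
            EVP_EncryptUpdate(ctx, out, &outl, in, len) == 1 &&
            EVP_EncryptFinal_ex(ctx, out + outl, &tmpl) == 1;

        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }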

Thanks,

Stephen

Attachment

Re: storing an explicit nonce

From
Alvaro Herrera
Date:
On 2021-May-27, Andres Freund wrote:

> On 2021-05-27 15:48:09 -0400, Robert Haas wrote:

> > Another case where this sort of thing might happen is a standby doing
> > whatever the master did. I suppose that could be avoided if the
> > standby always has its own encryption keys, but that forces a key
> > rotation when you create a standby, and it doesn't seem like a lot of
> > fun to insist on that. But the information leak seems minor.
> 
> Which leaks seem minor? The "hole" issues leak all the prior contents of
> the hole, without needing any complicated analysis of the data, because
> one plain text is known (zeroes).

Maybe that problem could be solved by having PageRepairFragmentation,
compactify_tuples et al always fill the hole with zeroes, in encrypted
databases.
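
(Concretely, something like this right after compaction; pd_lower and
pd_upper as in bufpage.h, exact placement just a sketch:)

    /* Clear the hole so prior tuple data never reaches the page image. */
    PageHeader  ph = (PageHeader) page;

    memset((char *) page + ph->pd_lower, 0,
           ph->pd_upper - ph->pd_lower);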

-- 
Álvaro Herrera       Valdivia, Chile



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 16:09:13 -0400, Stephen Frost wrote:
> * Andres Freund (andres@anarazel.de) wrote:
> > On 2021-05-27 15:22:21 -0400, Stephen Frost wrote:
> > > I'm also not sure how much of the effort would really be duplicated.
> > >
> > > Were we to start with XTS, that's almost drop-in with what Bruce has
> > > (actually, it should simplify some parts since we no longer need to deal
> > > with making sure we always increase the LSN, etc), and gives users more
> > > flexibility in terms of getting to an encrypted cluster and solves
> > > certain use-cases.  Very little of that seems like it would be ripped
> > > out if we were to (also) provide a GCM option.
> > 
> > > Now, if we were to *only* provide a GCM option then maybe we wouldn't
> > > need to think about the XTS case of having to come up with a tweak
> > > (though that seems like a rather small amount of code) but that would
> > > also mean we need to change the page format and we can't do any kind of
> > > binary/page-level transition to an encrypted cluster, like we could
> > > with XTS.
> > 
> > > Trying to break it down, the end-goal states look like:
> > >
> > > GCM-only: no binary upgrade path due to having to store the tag
> > > XTS-only: no data integrity option
> > > GCM+XTS: binary upgrade path for XTS, data integrity with GCM
> >
> [...]
> > And I don't think there's an easy way to do both using openssl, without
> > double encrypting, which we'd obviously not want for performance
> > reasons. And I don't think we'd want to implement either ourselves -
> > leaving other dangers aside, I don't think we want to do the
> > optimization work necessary to get good performance.
> 
> Errrr, clearly a misunderstanding here- what I'm suggesting is that we'd
> have initdb options where someone could initdb and say they want XTS, OR
> they could initdb and say they want AES-GCM (or maybe AES-GCM-SIV).  I'm
> not talking about doing both in the cluster at the same time..

Ah, that makes more sense ;). So the end goal states are the different
paths we could take?


> Still leaves us with the question of what exactly we should pass into
> OpenSSL as the 'tweak', if it should be the block offset inside the
> file only, or the block offset + relfilenode, or something else.

I think it has to include the relfilenode as a minimum. It'd not be
great if you could identify equivalent blocks in different tables. It
might even be worth complicating createdb() a bit and including the
dboid as well.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 16:13:44 -0400, Alvaro Herrera wrote:
> Maybe that problem could be solved by having PageRepairFragmentation,
> compactify_tuples et al always fill the hole with zeroes, in encrypted
> databases.

If that were the only issue, maybe. But there's plenty other places were
similar things happen. Look at all the stuff that needs to be masked out
for wal consistency checking (checkXLogConsistency() + all the things it
calls). And there's no way proposed to actually have a maintainable way
of detecting omissions around this.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2021-05-27 16:09:13 -0400, Stephen Frost wrote:
> > * Andres Freund (andres@anarazel.de) wrote:
> > > On 2021-05-27 15:22:21 -0400, Stephen Frost wrote:
> > > > I'm also not sure how much of the effort would really be duplicated.
> > > >
> > > > Were we to start with XTS, that's almost drop-in with what Bruce has
> > > > (actually, it should simplify some parts since we no longer need to deal
> > > > with making sure we always increase the LSN, etc), and gives users more
> > > > flexibility in terms of getting to an encrypted cluster and solves
> > > > certain use-cases.  Very little of that seems like it would be ripped
> > > > out if we were to (also) provide a GCM option.
> > >
> > > > Now, if we were to *only* provide a GCM option then maybe we wouldn't
> > > > need to think about the XTS case of having to come up with a tweak
> > > > (though that seems like a rather small amount of code) but that would
> > > > also mean we need to change the page format and we can't do any kind of
> > > > binary/page-level transition to an encrypted cluster, like we could
> > > > with XTS.
> > >
> > > > Trying to break it down, the end-goal states look like:
> > > >
> > > > GCM-only: no binary upgrade path due to having to store the tag
> > > > XTS-only: no data integrity option
> > > > GCM+XTS: binary upgrade path for XTS, data integrity with GCM
> > >
> > [...]
> > > And I don't think there's an easy way to do both using openssl, without
> > > double encrypting, which we'd obviously not want for performance
> > > reasons. And I don't think we'd want to implement either ourselves -
> > > leaving other dangers aside, I don't think we want to do the
> > > optimization work necessary to get good performance.
> >
> > Errrr, clearly a misunderstanding here- what I'm suggesting is that we'd
> > have initdb options where someone could initdb and say they want XTS, OR
> > they could initdb and say they want AES-GCM (or maybe AES-GCM-SIV).  I'm
> > not talking about doing both in the cluster at the same time..
>
> Ah, that makes more sense ;). So the end goal states are the different
> paths we could take?

The end goals are different possible things we could provide support
for, not in one cluster, but in one build of PG.  That is, we could
add support in v15 (or whatever) for:

initdb --encryption-type=AES-XTS

and then in v16 add support for:

initdb --encryption-type=AES-GCM (or AES-GCM-SIV, whatever)

while keeping support for AES-XTS.

Users who just want encryption could go do a binary upgrade of some kind
to a cluster which has AES-XTS encryption, but to get GCM they'd have to
initialize a new cluster and migrate data to it using logical
replication or pg_dump/restore.

There have also been requests for other possible encryption options, so I
don't think these would even be the only options eventually, though I do
think we'd probably have them broken down into "just encryption" or
"encryption + data integrity" with the same resulting limitations
regarding the ability to do binary upgrades.

> > Still leaves us with the question of what exactly we should pass into
> > OpenSSL as the 'tweak', if it should be the block offset inside the
> > file only, or the block offset + relfilenode, or something else.
>
> I think it has to include the relfilenode as a minimum. It'd not be
> great if you could identify equivalent blocks in different tables. It
> might even be worth complicating createdb() a bit and including the
> dboid as well.

At this point I'm wondering if it's just:

dboid/relfilenode:block-offset

and then we hash it to whatever size EVP_CIPHER_iv_length(AES-XTS-128)
(or -256, whatever we're using based on what was passed to initdb)
returns.
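
In code form that might look roughly like this (the identifier format
and the choice of SHA-256 are assumptions for illustration only):

    #include <stdio.h>
    #include <string.h>
    #include <openssl/evp.h>

    /*
     * Sketch: hash "dboid/relfilenode:block" down to the cipher's
     * IV (tweak) length.
     */
    static int
    derive_tweak(unsigned char *tweak,
                 unsigned int dboid, unsigned int relfilenode,
                 unsigned int blkno)
    {
        char        ident[64];
        unsigned char hash[EVP_MAX_MD_SIZE];
        unsigned int hashlen = 0;
        int         ivlen = EVP_CIPHER_iv_length(EVP_aes_128_xts());    /* 16 */

        snprintf(ident, sizeof(ident), "%u/%u:%u", dboid, relfilenode, blkno);
        if (EVP_Digest(ident, strlen(ident), hash, &hashlen,
                       EVP_sha256(), NULL) != 1)
            return 0;
        memcpy(tweak, hash, ivlen);
        return 1;
    }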

Thanks,

Stephen

Attachment

Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, May 27, 2021 at 4:04 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-05-27 15:48:09 -0400, Robert Haas wrote:
> > That said, I want to mention a point which I think may be relevant
> > here. As far as I know, in the case of a permanent table page, we
> > never write X then X' then X again.
>
> Well, there's crash recovery / restarts. And as previously explained
> they can end up with different page contents than before.

Right, I'm not trying to oversell this point ... if in system XYZ
there's a serious security exposure from ever repeating a page write,
we should not use system XYZ unless we do some work to make sure that
every page write is different. But if we just think it would be nicer
if page writes didn't repeat, that's probably *mostly* true today
already.

> I don't see how the fake LSN approach can work for the crash recovery
> issues?

I wasn't trying to say it could. You've convinced me on that point.

> > Or we could use the special-space technique to put some unpredictable
> > garbage into each page and then change the garbage every time we write
> > the page
>
> Unfortunately with CTR mode that doesn't provide much protection, if
> it's part of the encrypted data (vs IV/nonce). A one bit change in the
> encrypted data only changes one bit in the unencrypted data, as the data
> is just XORd with the cipher stream. So random changes in one place
> don't prevent disclosure in other parts of the data if the nonce
> doesn't also change.  And one can easily predict the effect of flipping
> certain bits.

Yeah, I wasn't talking about CTR mode there. I was just saying if we
wanted to avoid ever repeating a write.

> > Another case where this sort of thing might happen is a standby doing
> > whatever the master did. I suppose that could be avoided if the
> > standby always has its own encryption keys, but that forces a key
> > rotation when you create a standby, and it doesn't seem like a lot of
> > fun to insist on that. But the information leak seems minor.
>
> Which leaks seem minor? The "hole" issues leak all the prior contents of
> the hole, without needing any complicated analysis of the data, because
> one plain text is known (zeroes).

No. You're confusing what I was saying here, in the context of your
comments about the limitations of AES-GCM-SIV, with the discussion
with Bruce about nonce generation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, May 27, 2021 at 04:09:13PM -0400, Stephen Frost wrote:
> The above article, at least, suggested encrypting the sector number
> using the second key and then multiplying that times 2^(block number),
> where those blocks were actually AES 128bit blocks.  The article further
> claims that this is what's used in things like Bitlocker, TrueCrypt,
> VeraCrypt and OpenSSL.
> 
> While the documentation isn't super clear, I'm taking that to mean that
> when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it
> with a 256-bit key (twice the size of the AES key length function), and
> you give it a 'tweak', that what you would actually be passing in would
> be the "sector number" in the above method, or for us perhaps it would
> be relfilenode+block number, or maybe just block number but it seems
> like it'd be better to include the relfilenode to me.

If you go in that direction, you should make sure pg_upgrade preserves
what you use (it does not preserve relfilenode, just pg_class.oid), and
CREATE DATABASE still works with a simple file copy.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Thu, May 27, 2021 at 04:09:13PM -0400, Stephen Frost wrote:
> > The above article, at least, suggested encrypting the sector number
> > using the second key and then multiplying that times 2^(block number),
> > where those blocks were actually AES 128bit blocks.  The article further
> > claims that this is what's used in things like Bitlocker, TrueCrypt,
> > VeraCrypt and OpenSSL.
> >
> > While the documentation isn't super clear, I'm taking that to mean that
> > when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it
> > with a 256-bit key (twice the size of the AES key length function), and
> > you give it a 'tweak', that what you would actually be passing in would
> > be the "sector number" in the above method, or for us perhaps it would
> > be relfilenode+block number, or maybe just block number but it seems
> > like it'd be better to include the relfilenode to me.
>
> If you go in that direction, you should make sure pg_upgrade preserves
> what you use (it does not preserve relfilenode, just pg_class.oid), and
> CREATE DATABASE still works with a simple file copy.

Ah, yes, good point, if we support in-place pg_upgrade of an encrypted
cluster then the tweak has to be consistent between the old and new.

I tend to agree with Andres that it'd be reasonable to make CREATE
DATABASE do a bit more work for an encrypted cluster though, so I'm less
concerned about that.

Using pg_class.oid instead of relfilenode seems likely to complicate
things like crash recovery though, wouldn't it?  I wonder if there's
something else we could use.

Thanks,

Stephen

Attachment

Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 16:55:29 -0400, Robert Haas wrote:
> No. You're confusing what I was saying here, in the context of your
> comments about the limitations of AES-GCM-SIV, with the discussion
> with Bruce about nonce generation.

Ah. I think the focus on LSNs confused me a bit.

FWIW:
Nist guidance on IVs for AES GCM (surprisingly readable):
https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf
AES-GCM-SIV (harder to read):
https://eprint.iacr.org/2017/168.pdf

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Neil Chen
Date:


On Thu, May 27, 2021 at 11:12 PM Bruce Momjian <bruce@momjian.us> wrote:

> Well, the code now does write full page images for hint bit changes, so
> it should work fine.


Yes, indeed it works well, and I have tested it. But here I want to make my understanding of the argument clear; if there is any problem, please help me correct it.

1. Why couldn't we just throw away the hint bit changes, i.e. just not encrypt them?
Maybe we could leave *pd_flags* unencrypted; we wouldn't need to re-encrypt when it changes, and there would be no security risk. But there are many other changes that call *MarkBufferDirtyHint*, and we don't WAL-log those either. We couldn't expose all of them, so "throw them away, don't encrypt them" is not feasible.

2. Why can we accept the performance degradation caused by checksums working this way, but not for TDE?
Checksums must be implemented this way, but for TDE maybe we can find another way that avoids this cost.

3. Another benefit of using the special space is that it can also be used for AES-GCM to support integrity.

I'm just a beginner with PG and may not have considered some obvious problems, but please let me put forward my rough idea again -- why can't we simply use LSN+blockNum+checksum as the nonce?
When checksums are enabled, every call to *MarkBufferDirtyHint* will generate a new LSN, so we can simply use LSN+blockNum+0000 as the nonce.
When checksums are disabled, we can use the unused checksum field as a counter, to make sure we have a different nonce even when we don't write a new WAL record.

--
There is no royal road to learning.
HighGo Software Co.

Re: storing an explicit nonce

From
Neil Chen
Date:


On Fri, May 28, 2021 at 2:12 PM Neil Chen <carpenter.nail.cz@gmail.com> wrote:

> When checksums are disabled, we can use the unused checksum field as a
> counter, to make sure we have a different nonce even when we don't write
> a new WAL record.


Ah, well, I think I've figured it out for myself. In this way, we can't protect against torn pages...

--
There is no royal road to learning.
HighGo Software Co.

Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, May 27, 2021 at 04:36:23PM -0400, Stephen Frost wrote:
> At this point I'm wondering if it's just:
> 
> dboid/relfilenode:block-offset
> 
> and then we hash it to whatever size EVP_CIPHER_iv_length(AES-XTS-128)
> (or -256, whatever we're using based on what was passed to initdb)
> returns.

FYI, the dboid is not preserved by pg_upgrade.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On 2021-05-27 17:00:23 -0400, Bruce Momjian wrote:
> If you go in that direction, you should make sure pg_upgrade preserves
> what you use (it does not preserve relfilenode, just pg_class.oid)

Is there a reason for pg_upgrade not to maintain relfilenode, aside from
implementation simplicity (which is a good reason!). The fact that the old and
new clusters have different relfilenodes does make inspecting some things a
bit harder.

It'd be harder to adjust the relfilenode to match between old/new cluster if
pg_upgrade needed to deal with relmapper-using relations (i.e. ones where
pg_class.relfilenode isn't used because they need to be accessed to read
pg_class, or because they're shared), but it doesn't need to.

Greetings,

Andres Freund



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2021-05-27 17:00:23 -0400, Bruce Momjian wrote:
> > If you go in that direction, you should make sure pg_upgrade preserves
> > what you use (it does not preserve relfilenode, just pg_class.oid)
>
> Is there a reason for pg_upgrade not to maintain relfilenode, aside from
> implementation simplicity (which is a good reason!). The fact that the old and
> new clusters have different relfilenodes does make inspecting some things a
> bit harder.

This was discussed for a bit during the Unconference (though it was
related to backups and major upgrades which involves replicas) and the
general consensus seemed to be that, no, it wasn't for any specific
reason beyond that pg_upgrade didn't need to preserve relfilenode and
therefore didn't.

There was a discussion around if there were possibly any pitfalls that
we might run into, should we try to have pg_upgrade preserve
relfilenodes but I don't *think* there were any actual show stoppers
that came up.  The simplest approach, I would think, would be to have it
do the same thing that it does for OIDs today- basically have pg_dump in
binary mode emit a function call to inform the backend of what
relfilenode to use for the next CREATE statement.  We would also need to
pass into that function whether the table should have a TOAST table and,
if so, what the relfilenode for that TOAST table should be.  We'd need
to also handle indexes, mat views, etc, of course.
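
A sketch of the backend side, modeled on the existing binary_upgrade_*
support functions (the function and variable names here are invented):

    /*
     * pg_dump in binary-upgrade mode would emit a SELECT of this just
     * before each CREATE, so the next relation gets the preserved
     * relfilenode.
     */
    Datum
    binary_upgrade_set_next_heap_relfilenode(PG_FUNCTION_ARGS)
    {
        Oid         filenode = PG_GETARG_OID(0);

        if (!IsBinaryUpgrade)
            ereport(ERROR,
                    (errcode(ERRCODE_CANT_CHANGE_RUNTIME_PARAM),
                     errmsg("function can only be called when server is in binary upgrade mode")));

        binary_upgrade_next_heap_relfilenode = filenode;

        PG_RETURN_VOID();
    }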

> It'd be harder to adjust the relfilenode to match between old/new cluster if
> pg_upgrade needed to deal with relmapper-using relations (i.e. ones where
> pg_class.relfilenode isn't used because they need to be accessed to read
> pg_class, or because they're shared), but it doesn't need to.

Right, and we generally shouldn't need to worry about conflicts arising
from relfilenodes used by catalog tables since the new cluster should be
a freshly initdb'd cluster and everything in the fresh catalog should be
below the relfilenode values we use for user relations.

There did seem to generally be some usefulness to having relfilenodes
preserved across major version upgrades beyond TDE and that's a pretty
independent project that could be tackled independently of TDE efforts.

Thanks,

Stephen

Attachment

Re: storing an explicit nonce

From
Robert Haas
Date:
On Mon, May 31, 2021 at 4:16 PM Stephen Frost <sfrost@snowman.net> wrote:
> There did seem to generally be some usefulness to having relfilenodes
> preserved across major version upgrades beyond TDE and that's a pretty
> independent project that could be tackled independently of TDE efforts.

+1.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Mon, May 31, 2021 at 04:16:52PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Andres Freund (andres@anarazel.de) wrote:
> > On 2021-05-27 17:00:23 -0400, Bruce Momjian wrote:
> > > If you go in that direction, you should make sure pg_upgrade preserves
> > > what you use (it does not preserve relfilenode, just pg_class.oid)
> > 
> > Is there a reason for pg_upgrade not to maintain relfilenode, aside from
> > implementation simplicity (which is a good reason!). The fact that the old and
> > new clusters have different relfilenodes does make inspecting some things a
> > bit harder.
> 
> This was discussed for a bit during the Unconference (though it was
> related to backups and major upgrades which involves replicas) and the
> general consensus seemed to be that, no, it wasn't for any specific
> reason beyond that pg_upgrade didn't need to preserve relfilenode and
> therefore didn't.

Yes, David Steele wanted it so incremental backups after pg_upgrade were
smaller, which makes sense.

> There was a discussion around if there were possibly any pitfalls that
> we might run into, should we try to have pg_upgrade preserve
> relfilenodes but I don't *think* there were any actual show stoppers
> that came up.  The simplest approach, I would think, would be to have it
> do the same thing that it does for OIDs today- basically have pg_dump in
> binary mode emit a function call to inform the backend of what
> relfilenode to use for the next CREATE statement.  We would need to also
> pass into that function if the table should have a TOAST table and what
> the relfilenode for that should be too, for the base table.  We'd need
> to also handle indexes, mat views, etc, of course.

Yes, exactly.  The pg_upgrade.c paragraph says:

     *  We control all assignments of pg_class.oid (and relfilenode) so toast
     *  oids are the same between old and new clusters.  This is important
     *  because toast oids are stored as toast pointers in user tables.
     *
     *  While pg_class.oid and pg_class.relfilenode are initially the same
     *  in a cluster, they can diverge due to CLUSTER, REINDEX, or VACUUM
     *  FULL.  In the new cluster, pg_class.oid and pg_class.relfilenode will
     *  be the same and will match the old pg_class.oid value.  Because of
     *  this, old/new pg_class.relfilenode values will not match if CLUSTER,
     *  REINDEX, or VACUUM FULL have been performed in the old cluster.

One tricky case is pg_largeobject, which is copied from the old to new
cluster since it has user data.  To preserve that relfilenode, you would
need to have pg_upgrade perform cluster surgery in each database to
renumber its relfilenode to match since it is created by initdb.  I
can't think of a case where pg_upgrade already does something like that.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote:
> For these reasons, if we decide to go in the direction of using a
> non-LSN nonce, I no longer plan to continue working on this feature. I
> would rather work on things that have a more positive impact.  Maybe a
> non-LSN nonce is a better long-term plan, but there are too many
> unknowns and complexity for me to feel comfortable with it.

As stated above, I have no plans to continue working on this feature.  I
am attaching my final patches here in case anyone wants to make use of
them;  it passes check-world and all my private tests.  I have removed
my patches from the feature wiki page:

    https://wiki.postgresql.org/wiki/Transparent_Data_Encryption

and replaced it with a link to this email.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.


Attachment

Re: storing an explicit nonce

From
vignesh C
Date:
On Sat, Jun 26, 2021 at 2:52 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote:
> > For these reasons, if we decide to go in the direction of using a
> > non-LSN nonce, I no longer plan to continue working on this feature. I
> > would rather work on things that have a more positive impact.  Maybe a
> > non-LSN nonce is a better long-term plan, but there are too many
> > unknowns and complexity for me to feel comfortable with it.
>
> As stated above, I have no plans to continue working on this feature.  I
> am attaching my final patches here in case anyone wants to make use of
> them;  it passes check-world and all my private tests.  I have removed
> my patches from the feature wiki page:
>
>         https://wiki.postgresql.org/wiki/Transparent_Data_Encryption
>
> and replaced it with a link to this email.

The patch does not apply on HEAD anymore; could you rebase and post a
new patch? I'm changing the status to "Waiting for Author".

Regards,
Vignesh



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, Jul 14, 2021 at 09:45:12PM +0530, vignesh C wrote:
> On Sat, Jun 26, 2021 at 2:52 AM Bruce Momjian <bruce@momjian.us> wrote:
> >
> > On Wed, May 26, 2021 at 05:02:01PM -0400, Bruce Momjian wrote:
> > > For these reasons, if we decide to go in the direction of using a
> > > non-LSN nonce, I no longer plan to continue working on this feature. I
> > > would rather work on things that have a more positive impact.  Maybe a
> > > non-LSN nonce is a better long-term plan, but there are too many
> > > unknowns and complexity for me to feel comfortable with it.
> >
> > As stated above, I have no plans to continue working on this feature.  I
> > am attaching my final patches here in case anyone wants to make use of
> > them;  it passes check-world and all my private tests.  I have removed
> > my patches from the feature wiki page:
> >
> >         https://wiki.postgresql.org/wiki/Transparent_Data_Encryption
> >
> > and replaced it with a link to this email.
> 
> The patch does not apply on HEAD anymore; could you rebase and post a
> new patch? I'm changing the status to "Waiting for Author".

Oh, I forgot this was in the commitfest.  I have marked it as Withdrawn.
Sorry for the confusion.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Shruthi Gowda
Date:
On Fri, May 28, 2021 at 2:39 AM Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Thu, May 27, 2021 at 04:09:13PM -0400, Stephen Frost wrote:
> > > The above article, at least, suggested encrypting the sector number
> > > using the second key and then multiplying that times 2^(block number),
> > > where those blocks were actually AES 128bit blocks.  The article further
> > > claims that this is what's used in things like Bitlocker, TrueCrypt,
> > > VeraCrypt and OpenSSL.
> > >
> > > While the documentation isn't super clear, I'm taking that to mean that
> > > when you actually use EVP_aes_128_xts() in OpenSSL, and you provide it
> > > with a 256-bit key (twice the size of the AES key length function), and
> > > you give it a 'tweak', that what you would actually be passing in would
> > > be the "sector number" in the above method, or for us perhaps it would
> > > be relfilenode+block number, or maybe just block number but it seems
> > > like it'd be better to include the relfilenode to me.
> >
> > If you go in that direction, you should make sure pg_upgrade preserves
> > what you use (it does not preserve relfilenode, just pg_class.oid), and
> > CREATE DATABASE still works with a simple file copy.
>
> Ah, yes, good point, if we support in-place pg_upgrade of an encrypted
> cluster then the tweak has to be consistent between the old and new.
>
> I tend to agree with Andres that it'd be reasonable to make CREATE
> DATABASE do a bit more work for an encrypted cluster though, so I'm less
> concerned about that.
>
> Using pg_class.oid instead of relfilenode seems likely to complicate
> things like crash recovery though, wouldn't it?  I wonder if there's
> something else we could use.
>
Hi,
I have extracted the relfilenode- and dboid-preserving changes from [1]
and rebased them on the current head. While testing, I found a few issues.

- The variable 'dbDumpId' was not initialized before being passed to
ArchiveEntry() in the dumpDatabase() function, due to which pg_upgrade was
failing with a 'bad dumpId' error
- The 'create_storage' flag was set to TRUE irrespective of relkind, which
resulted in hitting an assert when the source cluster had a TYPE in it.
- In the createdb() flow, 'dboid' was set to the preserved dboid in the
wrong place. It was eventually overwritten and caused problems while
restoring the DB
- Removed the restriction on dumping the postgres DB OID

I have fixed all the issues and now the patch is working as expected.

[1] https://www.postgresql.org/message-id/7082.1562337694@localhost


Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Wed, Aug 11, 2021 at 3:41 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> I have fixed all the issues and now the patch is working as expected.

Hi,

I'm changing the subject line since the patch does something which was
discussed on that thread but isn't really related to the old email
subject. In general, I think this patch is uncontroversial and in
reasonably good shape. However, there's one part that I'm not too sure
about. If Tom Lane happens to be paying attention to this thread, I
think his feedback would be particularly useful, since he knows a lot
about the inner workings of pg_dump. Opinions from anybody else would
be great, too. Anyway, here's the hunk that worries me:

+
+               /*
+                * Need a separate entry, otherwise the command will be run in
+                * the same transaction as the CREATE DATABASE command, which is
+                * not allowed.
+                */
+               ArchiveEntry(fout,
+                            dbCatId,       /* catalog ID */
+                            dbDumpId,      /* dump ID */
+                            ARCHIVE_OPTS(.tag = datname,
+                                         .owner = dba,
+                                         .description = "SET_DB_OID",
+                                         .section = SECTION_PRE_DATA,
+                                         .createStmt = setDBIdQry->data,
+                                         .dropStmt = NULL));
+

To me, adding a separate TOC entry for a thing that is not really a
separate object seems like a scary hack that might come back to bite
us. Unfortunately, I don't know enough about pg_dump to say exactly
how it might come back to bite us, which leaves wide open the
possibility that I am completely wrong.... I just think it's the
intention that archive entries correspond to actual objects in the
database, not commands that we want executed in some particular order.
If that criticism is indeed correct, then my proposal would be to
instead add a WITH OID = nnn option to CREATE DATABASE and allow it to
be used only in binary upgrade mode. That has the disadvantage of
being inconsistent with the way that we preserve OIDs everywhere else,
but the only other alternatives are (1) do something like the above,
(2) remove the requirement that CREATE DATABASE run in its own
transaction, and (3) give up. (2) sounds hard and (3) is unappealing.
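
Very roughly, the createdb() side might look like this (the option
plumbing and the 'doid' variable are invented for the sketch):

    /*
     * Hypothetical handling of CREATE DATABASE ... WITH OID = nnn,
     * honored only in binary upgrade mode.
     */
    if (doid != InvalidOid)
    {
        if (!IsBinaryUpgrade)
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("OID can only be specified in binary upgrade mode")));
        dboid = doid;
    }
    else
        dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId,
                                   Anum_pg_database_oid);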

The rest of this email will be detailed review comments on the patch
as presented, and thus probably only interesting to someone actually
working on the patch. Feel free to skip if that's not you.

- I suggest splitting the patch into one portion that deals with
database OID and another portion that deals with tablespace OID and
relfilenode OID, or maybe splitting it all the way into three separate
patches, one for each. This could allow the uncontroversial parts to
get committed first while we're wondering what to do about the problem
described above.

- There are two places in the patch, one in dumpDatabase() and one in
generate_old_dump() where blank lines are removed with no other
changes. Please avoid whitespace-only hunks.

- If possible, please try to pgindent the new code. What you did is
pretty good, but e.g. the declaration of
binary_upgrade_next_pg_tablespace_oid probably has less whitespace
than pgindent is going to want.

- The comments in dumpDatabase() claim that "postgres" and "template1"
are handled specially in some way, but there seems to be no code that
matches those comments.

- heap_create()'s logic around setting create_storage looks slightly
redundant. I'm not positive what would be better, but ... suppose you
just took the part that's currently gated by if (!IsBinaryUpgrade) and
did it unconditionally. Then put if (IsBinaryUpgrade) around the else
clause, but delete the last bit from there that sets create_storage.
Maybe we'd still want a comment saying that it's intentional that
create_storage = true even though it will be overwritten later, but
then, I think, we wouldn't need to set create_storage in two different
places. Maybe I'm wrong.

- If we're not going to do that, then I think you should swap the if
and else clauses and reverse the sense of the test. In createdb(),
CreateTableSpace(), and a bunch of existing places, we do if
(IsBinaryUpgrade) { ... } else { ... } so I don't think it makes sense
for this one to instead do if (!IsBinaryUpgrade) { ... } else { ... }.

- I'm not sure that I'd bother renaming
binary_upgrade_set_pg_class_oids_and_relfilenodes(). It's such a long
name, and a relfilenode is kind of an OID, so the current name isn't
even really wrong. I'd probably drop the header comment too, since it
seems rather obvious. But both of these things are judgement calls.

- Inside that function, there is a comment that says "Indexes cannot
have toast tables, so we need not make this probe in the index code
path." However, you have moved the code from someplace where it didn't
happen for indexes to someplace where it happens for both tables and
indexes. Therefore the comment, which was true when the code was where
it was before, is now false. So you need to update it.

- It is not clear to me why pg_upgrade's Makefile needs to be changed
to include -DFRONTEND in CPPFLAGS. All of the .c files in this
directory include postgres_fe.h rather than postgres.h, and that file
has #define FRONTEND 1. Moreover, there are no actual code changes in
this directory, so why should the Makefile need any change?

- A couple of comment changes - and the commit message - mention data
encryption, but that's not a feature that this patch implements, nor
are we committed to adding it in the immediate future (or ever,
really). So I think those places should be revised to say that we do
this because we want the filenames to match between the old and new
clusters, and leave the reasons why that might be a good thing up to
the reader's imagination.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> To me, adding a separate TOC entry for a thing that is not really a
> separate object seems like a scary hack that might come back to bite
> us. Unfortunately, I don't know enough about pg_dump to say exactly
> how it might come back to bite us, which leaves wide open the
> possibility that I am completely wrong.... I just think it's the
> intention that archive entries correspond to actual objects in the
> database, not commands that we want executed in some particular order.

I agree, this seems like a moderately bad idea.  It could get broken
either by executing only one of the TOC entries during restore, or
by executing them in the wrong order.  The latter possibility could
be forestalled by adding a dependency, which I do not see this hunk
doing, which is clearly a bug.  The former possibility would require
user intervention, so maybe it's in the category of "if you break
this you get to keep both pieces".  Still, it's ugly.

> If that criticism is indeed correct, then my proposal would be to
> instead add a WITH OID = nnn option to CREATE DATABASE and allow it to
> be used only in binary upgrade mode.

If it's not too complicated to implement, that seems like an OK idea
from here.  I don't have any great love for the way we handle OID
preservation in binary upgrade mode, so not doing it exactly the same
way for databases doesn't seem like a disadvantage.

            regards, tom lane



Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > To me, adding a separate TOC entry for a thing that is not really a
> > separate object seems like a scary hack that might come back to bite
> > us. Unfortunately, I don't know enough about pg_dump to say exactly
> > how it might come back to bite us, which leaves wide open the
> > possibility that I am completely wrong.... I just think it's the
> > intention that archive entries correspond to actual objects in the
> > database, not commands that we want executed in some particular order.
>
> I agree, this seems like a moderately bad idea.  It could get broken
> either by executing only one of the TOC entries during restore, or
> by executing them in the wrong order.  The latter possibility could
> be forestalled by adding a dependency, which I do not see this hunk
> doing, which is clearly a bug.  The former possibility would require
> user intervention, so maybe it's in the category of "if you break
> this you get to keep both pieces".  Still, it's ugly.

Yeah, agreed.

> > If that criticism is indeed correct, then my proposal would be to
> > instead add a WITH OID = nnn option to CREATE DATABASE and allow it to
> > be used only in binary upgrade mode.
>
> If it's not too complicated to implement, that seems like an OK idea
> from here.  I don't have any great love for the way we handle OID
> preservation in binary upgrade mode, so not doing it exactly the same
> way for databases doesn't seem like a disadvantage.

Also agreed on this, though I wonder- do we actually need to explicitly
make CREATE DATABASE q WITH OID = 1234; only work during binary upgrade
mode in the backend?  That strikes me as perhaps doing more work than we
really need to while also preventing something that users might actually
like to do.

Either way, we'll need to check that the OID given to us can be used for
the database, I'd think.
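
Something along these lines, that is (purely illustrative, including
the OID value and the failure):

    -- a second database must not be able to claim an OID already in use
    CREATE DATABASE q WITH OID = 1234;   -- fine if 1234 is free
    CREATE DATABASE r WITH OID = 1234;   -- must fail: OID already taken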

Having pg_dump only include it in binary upgrade mode is fine though.

Thanks,

Stephen

Stephen Frost <sfrost@snowman.net> writes:
> Also agreed on this, though I wonder- do we actually need to explicitly
> make CREATE DATABASE q WITH OID = 1234; only work during binary upgrade
> mode in the backend?  That strikes me as perhaps doing more work than we
> really need to while also preventing something that users might actually
> like to do.

There should be adequate defenses against a duplicate OID already,
so +1 --- no reason to insist this only be used during binary upgrade.

Actually though ... I've not read the patch, but what does it do about
the fact that the postgres and template0 DBs do not have stable OIDs?
I cannot imagine any way to force those to match across PG versions
that would not be an unsustainable crock.

            regards, tom lane



On Tue, Aug 17, 2021 at 12:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Actually though ... I've not read the patch, but what does it do about
> the fact that the postgres and template0 DBs do not have stable OIDs?
> I cannot imagine any way to force those to match across PG versions
> that would not be an unsustainable crock.

Well, it's interesting that you mention that, because there's a
comment in the patch that probably has to do with this:

+    /*
+     * Make sure that pg_upgrade does not change database OID. Don't care
+     * about "postgres" database, backend will assign it fixed OID anyway.
+     * ("template1" has fixed OID too but the value 1 should not collide with
+     * any other OID so backend pays no attention to it.)
+     */

I wasn't able to properly understand that comment, and to be honest
I'm not sure I precisely understand your concern either. I don't quite
see why the template0 database matters. I think that database isn't
going to be dumped, or restored, so as far as pg_upgrade is concerned
it might as well not exist in either cluster, and I don't see why
pg_upgrade can't therefore just ignore it completely. But template1
and postgres are another matter. If I understand correctly, those
databases are going to be created in the new cluster by initdb, but
then pg_upgrade is going to populate them with data - including
relation files - from the old cluster. And, yeah, I don't see how we
could make those database OIDs match, which is not great.

To be honest, what I'd be inclined to do about that is just nail down
those OIDs for future releases. In fact, I'd probably go so far as to
hardcode that in such a way that even if you drop those databases and
recreate them, they get recreated with the same hard-coded OID. Now
that doesn't do anything to create stability when people upgrade from
an old release to a current one, but I don't really see that as an
enormous problem. The only hard requirement for this feature is if we
use the database OID for some kind of encryption or integrity checking
or checksum type feature. Then, you want to avoid having the database
OID change when you upgrade, so that the encryption or integrity check
or checksum in question does not have to be recomputed for every page
as part of pg_upgrade. But, that only matters if you're going between
two releases that support that feature, which will not be the case if
you're upgrading from some old release. Apart from that kind of
feature, it still seems like a nice-to-have to keep database OIDs the
same, but if those cases end up as exceptions, oh well.

Does that seem reasonable, or am I missing something big?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> I wasn't able to properly understand that comment, and to be honest
> I'm not sure I precisely understand your concern either. I don't quite
> see why the template0 database matters. I think that database isn't
> going to be dumped, or restored, so as far as pg_upgrade is concerned
> it might as well not exist in either cluster, and I don't see why
> pg_upgrade can't therefore just ignore it completely. But template1
> and postgres are another matter. If I understand correctly, those
> databases are going to be created in the new cluster by initdb, but
> then pg_upgrade is going to populate them with data - including
> relation files - from the old cluster.

Right.  If pg_upgrade explicitly ignores template0 then its OID
need not be stable ... at least, not unless there's a chance it
could conflict with some other database OID, which would become
a live possibility if we let users get at "WITH OID = n".

(Having said that, I'm not sure that pg_upgrade special-cases
template0, or that it should do so.)

> To be honest, what I'd be inclined to do about that is just nail down
> those OIDs for future releases.

Yeah, I was thinking along similar lines.

> In fact, I'd probably go so far as to
> hardcode that in such a way that even if you drop those databases and
> recreate them, they get recreated with the same hard-coded OID.

Less sure that this is a good idea, though.  In particular, I do not
think that you can make it work in the face of
    alter database template1 rename to oops;
    create database template1;

> The only hard requirement for this feature is if we
> use the database OID for some kind of encryption or integrity checking
> or checksum type feature.

It's fairly unclear to me why that is so important as to justify the
amount of klugery that this line of thought seems to be bringing.

            regards, tom lane



I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> The only hard requirement for this feature is if we
>> use the database OID for some kind of encryption or integrity checking
>> or checksum type feature.

> It's fairly unclear to me why that is so important as to justify the
> amount of klugery that this line of thought seems to be bringing.

And, not to put too fine a point on it, how will you possibly do
that without entirely breaking CREATE DATABASE?

            regards, tom lane



On Tue, Aug 17, 2021 at 11:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Aug 17, 2021 at 12:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Actually though ... I've not read the patch, but what does it do about
> > the fact that the postgres and template0 DBs do not have stable OIDs?
> > I cannot imagine any way to force those to match across PG versions
> > that would not be an unsustainable crock.
>
> Well, it's interesting that you mention that, because there's a
> comment in the patch that probably has to do with this:
>
> +    /*
> +     * Make sure that pg_upgrade does not change database OID. Don't care
> +     * about "postgres" database, backend will assign it fixed OID anyway.
> +     * ("template1" has fixed OID too but the value 1 should not collide with
> +     * any other OID so backend pays no attention to it.)
> +     */
>
In the original patch, the author intended to avoid dumping the
postgres DB OID, like below:
+ if (dopt->binary_upgrade && dbCatId.oid != PostgresDbOid)

Since the postgres OID is not hardcoded/fixed, I removed the check.
My bad, I missed updating the comment. Sorry for the confusion.

Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com



On Tue, Aug 17, 2021 at 11:56:30AM -0400, Robert Haas wrote:
> On Wed, Aug 11, 2021 at 3:41 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > I have fixed all the issues and now the patch is working as expected.
> 
> Hi,
> 
> I'm changing the subject line since the patch does something which was
> discussed on that thread but isn't really related to the old email
> subject. In general, I think this patch is uncontroversial and in
> reasonably good shape. However, there's one part that I'm not too sure
> about. If Tom Lane happens to be paying attention to this thread, I
> think his feedback would be particularly useful, since he knows a lot
> about the inner workings of pg_dump. Opinions from anybody else would
> be great, too. Anyway, here's the hunk that worries me:

What is the value of preserving db/ts/relfilenode OIDs?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Tue, Aug 17, 2021 at 1:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Right.  If pg_upgrade explicitly ignores template0 then its OID
> need not be stable ... at least, not unless there's a chance it
> could conflict with some other database OID, which would become
> a live possibility if we let users get at "WITH OID = n".

Well, that might be a good reason not to let them do that, then, at
least for n<64k.

> > In fact, I'd probably go so far as to
> > hardcode that in such a way that even if you drop those databases and
> > recreate them, they get recreated with the same hard-coded OID.
>
> Less sure that this is a good idea, though.  In particular, I do not
> think that you can make it work in the face of
>         alter database template1 rename to oops;
>         create database template1;

That is a really good point. If we can't categorically force the OID
of those databases to have a particular, fixed value, and based on
this example that seems to be impossible, then there's always a
possibility that we might find a value in the old cluster that doesn't
happen to match what is present in the new cluster. Seen from that
angle, the problem is really with databases that are pre-existent in
the new cluster but whose contents still need to be dumped. Maybe we
could (optionally? conditionally?) drop those databases from the new
cluster and then recreate them with the OID that we want them to have.

> > The only hard requirement for this feature is if we
> > use the database OID for some kind of encryption or integrity checking
> > or checksum type feature.
>
> It's fairly unclear to me why that is so important as to justify the
> amount of klugery that this line of thought seems to be bringing.

Well, I think it would make sense to figure out how small we can make
the kludge first, and then decide whether it's larger than we can
tolerate. From my point of view, I completely understand why people to
whom those kinds of features are important want to include all the
fields that make up a buffer tag in the checksum or other integrity
check. Right now, if somebody copies a page from one place to another,
or if the operating system fumbles things and switches some pages
around, we have no direct way of detecting that anything bad has
happened. This is not the only problem that would need to be solved in
order to fix that, but it's one of them, and I don't particularly see
why it's not a valid goal. It's not as if a 16-bit checksum that is
computed in exactly the same way for every page in the cluster is such
state-of-the-art technology that only fools question its surpassing
excellence.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



> The rest of this email will be detailed review comments on the patch
> as presented, and thus probably only interesting to someone actually
> working on the patch. Feel free to skip if that's not you.
>
> - I suggest splitting the patch into one portion that deals with
> database OID and another portion that deals with tablespace OID and
> relfilenode OID, or maybe splitting it all the way into three separate
> patches, one for each. This could allow the uncontroversial parts to
> get committed first while we're wondering what to do about the problem
> described above.

Thanks Robert for your comments.
I have split the patch into two portions. One that handles DB OID and
the other that
handles tablespace OID and relfilenode OID.

> - There are two places in the patch, one in dumpDatabase() and one in
> generate_old_dump() where blank lines are removed with no other
> changes. Please avoid whitespace-only hunks.

These changes are avoided.

> - If possible, please try to pgindent the new code. It's pretty good
> what you did, but e.g. the declaration of
> binary_upgrade_next_pg_tablespace_oid probably has less whitespace
> than pgindent is going to want.

Taken care of in the latest patches.

> - The comments in dumpDatabase() claim that "postgres" and "template1"
> are handled specially in some way, but there seems to be no code that
> matches those comments.

The comment is removed.

> - heap_create()'s logic around setting create_storage looks slightly
> redundant. I'm not positive what would be better, but ... suppose you
> just took the part that's currently gated by if (!IsBinaryUpgrade) and
> did it unconditionally. Then put if (IsBinaryUpgrade) around the else
> clause, but delete the last bit from there that sets create_storage.
> Maybe we'd still want a comment saying that it's intentional that
> create_storage = true even though it will be overwritten later, but
> then, I think, we wouldn't need to set create_storage in two different
> places. Maybe I'm wrong.
>
> - If we're not going to do that, then I think you should swap the if
> and else clauses and reverse the sense of the test. In createdb(),
> CreateTableSpace(), and a bunch of existing places, we do if
> (IsBinaryUpgrade) { ... } else { ... } so I don't think it makes sense
> for this one to instead do if (!IsBinaryUpgrade) { ... } else { ... }.

I have avoided the redundant code and removed the comment, as it does
not make sense now that we are setting create_storage conditionally.
(In the original patch, create_storage was set to TRUE by default for
the binary upgrade case, which was wrong and was hitting an assert in
the following flow.)

> - I'm not sure that I'd bother renaming
> binary_upgrade_set_pg_class_oids_and_relfilenodes(). It's such a long
> name, and a relfilenode is kind of an OID, so the current name isn't
> even really wrong. I'd probably drop the header comment too, since it
> seems rather obvious. But both of these things are judgement calls.

I agree. I have retained the old function name.

> - Inside that function, there is a comment that says "Indexes cannot
> have toast tables, so we need not make this probe in the index code
> path." However, you have moved the code from someplace where it didn't
> happen for indexes to someplace where it happens for both tables and
> indexes. Therefore the comment, which was true when the code was where
> it was before, is now false. So you need to update it.

The comment is updated.

> - It is not clear to me why pg_upgrade's Makefile needs to be changed
> to include -DFRONTEND in CPPFLAGS. All of the .c files in this
> directory include postgres_fe.h rather than postgres.h, and that file
> has #define FRONTEND 1. Moreover, there are no actual code changes in
> this directory, so why should the Makefile need any change?

Makefile change is removed.

> - A couple of comment changes - and the commit message - mention data
> encryption, but that's not a feature that this patch implements, nor
> are we committed to adding it in the immediate future (or ever,
> really). So I think those places should be revised to say that we do
> this because we want the filenames to match between the old and new
> clusters, and leave the reasons why that might be a good thing up to
> the reader's imagination.

Taken care.

Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com

On Fri, Aug 20, 2021 at 1:36 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> Thanks Robert for your comments.
> I have split the patch into two portions. One that handles DB OID and
> the other that
> handles tablespace OID and relfilenode OID.

It's pretty clear from the discussion, I think, that the database OID
one is going to need rework to be considered.

Regarding the other one:

- The comment in binary_upgrade_set_pg_class_oids() is still not
accurate. You removed the sentence which says "Indexes cannot have
toast tables, so we need not make this probe in the index code path"
but the immediately preceding sentence is still inaccurate in at least
two ways. First, it only talks about tables, but the code now applies
to indexes. Second, it only talks about OIDs, but now also deals with
relfilenodes. It's really important to fully update every comment that
might be affected by your changes!

- The SQL query in that function isn't completely correct. There is a
left join from pg_class to pg_index whose ON clause includes
"c.reltoastrelid = i.indrelid AND i.indisvalid." The reason it's
like that is likely because it is possible, in corner cases, for a TOAST
table to have multiple TOAST indexes. I forget exactly how that
happens, but I think it might be like if a REINDEX CONCURRENTLY on the
TOAST table fails midway through, or something of that sort. Now if
that happens, the LEFT JOIN you added is going to cause the output to
contain multiple rows, because you didn't replicate the i.indisvalid
condition into that ON clause. And then it will fail. Apparently we
don't have a pg_upgrade test case for this scenario; we probably
should. Actually what I think would be even better than putting
i.indisvalid into that ON clause would be to join off of i.indrelid
rather than c.reltoastrelid.
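
Untested, but roughly the query shape I have in mind, with the column
aliases following the naming suggestion below ('mytable' stands in for
the OID parameter the real query interpolates):

    SELECT c.relkind, c.relfilenode,
           c.reltoastrelid AS toast_oid,
           ct.relfilenode AS toast_relfilenode,
           i.indexrelid AS toast_index_oid,
           ci.relfilenode AS toast_index_relfilenode
      FROM pg_catalog.pg_class c
           LEFT JOIN pg_catalog.pg_index i
                  ON (c.reltoastrelid = i.indrelid AND i.indisvalid)
           LEFT JOIN pg_catalog.pg_class ct
                  ON (c.reltoastrelid = ct.oid)
           -- the toast index's own pg_class row hangs off the pg_index
           -- row, which is already restricted to valid indexes
           LEFT JOIN pg_catalog.pg_class ci
                  ON (i.indexrelid = ci.oid)
     WHERE c.oid = 'mytable'::pg_catalog.regclass;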

- The code that decodes the various columns of this query does so in a
slightly different order than the query itself. It would be better to
make it match. Perhaps put relkind first in both cases. I might also
think about trying to make the column naming a bit more consistent,
e.g. relkind, relfilenode, toast_oid, toast_relfilenode,
toast_index_oid, toast_index_relfilenode.

- In heap_create(), the wording of the error messages is not quite
consistent. You have "relfilenode value not set when in binary upgrade
mode", "toast relfilenode value not set when in binary upgrade mode",
and "pg_class index relfilenode value not set when in binary upgrade
mode". Why does the last one mention pg_class when the other two
don't?

- The code in heap_create() now has no comments whatsoever, which is a
shame, because it's actually kind of a tricky bit of logic. Someone
might wonder why we override the relfilenode inside that function
instead of doing it at the same places where we absorb
binary_upgrade_next_{heap,index,toast}_pg_class_oid and then passing
down the relfilenode. I think the answer is that passing down the
relfilenode from the caller would result in storage not actually being
created, whereas in this case we want it to be created but just with
the value we specify, and the reason we want that is because we need
later DDL that happens after these statements but before the old
cluster's relations are moved over to execute successfully, which it
won't if the storage is altogether absent.

However, that raises the question of whether this patch has even got
the basic design right. Maybe we ought to actually be absorbing the
relfilenode setting at the same places where we're doing so for the
OID, and then passing an additional parameter to heap_create() like
bool suppress_storage or something like that. Maybe, taking it even
further, we ought to be changing the signatures of
binary_upgrade_next_heap_pg_class_oid and friends to be two-argument
functions, and pass down the OID and the relfilenode in the same call,
rather than calling two separate functions. I'm not so much concerned
about the cost of calling two functions as the potential for
confusion. I'm not honestly sure that either of these changes are the
right thing to do, but I am pretty strongly inclined to do at least
the first part - trying to absorb reloid and relfilenode in the same
places. If we're not going to do that, we certainly need to explain why
we're doing it the way we are in the comments.
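
For example, instead of the existing one-argument calls, the dump could
emit something like this (hypothetical signature, purely illustrative;
no such two-argument function exists today):

    SELECT pg_catalog.binary_upgrade_next_heap_pg_class_oid(
               '16384'::pg_catalog.oid,    -- pg_class OID
               '16384'::pg_catalog.oid);   -- relfilenode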

It's not really this patch's fault, but it would sure be nice if we
had some better testing for this area. Suppose this patch somehow
changed nothing from the present behavior. How would we know? Or
suppose it managed to somehow set all the relfilenodes in the new
cluster to random values rather than the intended ones? There's no
automated testing that would catch any of that, and it's not obvious
how it could be added to test.sh. I suppose what we really need to do
at some point is rewrite that as a TAP test, but that seems like a
separate project from this patch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Aug 20, 2021 at 1:36 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > Thanks Robert for your comments.
> > I have split the patch into two portions. One that handles DB OID and
> > the other that
> > handles tablespace OID and relfilenode OID.
>
> It's pretty clear from the discussion, I think, that the database OID
> one is going to need rework to be considered.

Regarding that ... I have to wonder just what promises we feel we've
made when it comes to what a user is expected to be able to do with the
new cluster *before* pg_upgrade is run on it.  For my part, I sure feel
like it's "nothing", in which case it seems like we can do things that
we can't do with a running system, like literally just DROP and recreate
with the correct OID of any databases we need to, or even push that back
to the user to do that at initdb time with some kind of error thrown by
pg_upgrade during the --check phase.  "Initial databases have
non-standard OIDs, recreate destination cluster with initdb
--with-oid=12341" or something along those lines.

Also open to the idea of simply forcing 'template1' to always being
OID=1 even if it's dropped/recreated and then just dropping/recreating
the template0 and postgres databases if they've got different OIDs than
what the old cluster did- after all, they should be getting entirely
re-populated as part of the pg_upgrade process itself.
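
To be concrete about the kind of fixup I mean (illustrative only; the
OID is made up and the WITH OID clause is the proposed extension), run
against the new cluster before restoring:

    -- template0 must stop being a template before it can be dropped
    UPDATE pg_database SET datistemplate = false
     WHERE datname = 'template0';
    DROP DATABASE template0;
    CREATE DATABASE template0 WITH OID = 13000
        IS_TEMPLATE = true ALLOW_CONNECTIONS = false;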

Thanks,

Stephen

On Mon, Aug 23, 2021 at 04:57:31PM -0400, Robert Haas wrote:
> On Fri, Aug 20, 2021 at 1:36 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > Thanks Robert for your comments.
> > I have split the patch into two portions. One that handles DB OID and
> > the other that
> > handles tablespace OID and relfilenode OID.
> 
> It's pretty clear from the discussion, I think, that the database OID
> one is going to need rework to be considered.

I assume this patch is not going to be applied until there is an actual
use case for preserving these values.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Tue, Aug 24, 2021 at 5:59 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Mon, Aug 23, 2021 at 04:57:31PM -0400, Robert Haas wrote:
> > On Fri, Aug 20, 2021 at 1:36 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > > Thanks Robert for your comments.
> > > I have split the patch into two portions. One that handles DB OID and
> > > the other that
> > > handles tablespace OID and relfilenode OID.
> >
> > It's pretty clear from the discussion, I think, that the database OID
> > one is going to need rework to be considered.
>
> I assume this patch is not going to be applied until there is an actual
> use case for preserving these values.

JFI, I added an entry to the commitfest for this patch.
link: https://commitfest.postgresql.org/34/3296/



On Mon, Aug 23, 2021 at 5:12 PM Stephen Frost <sfrost@snowman.net> wrote:
> Regarding that ... I have to wonder just what promises we feel we've
> made when it comes to what a user is expected to be able to do with the
> new cluster *before* pg_upgrade is run on it.  For my part, I sure feel
> like it's "nothing", in which case it seems like we can do things that
> we can't do with a running system, like literally just DROP and recreate
> with the correct OID of any databases we need to, or even push that back
> to the user to do that at initdb time with some kind of error thrown by
> pg_upgrade during the --check phase.  "Initial databases have
> non-standard OIDs, recreate destination cluster with initdb
> --with-oid=12341" or something along those lines.

Yeah, possibly. Honestly, I find it weird that pg_upgrade expects the
new cluster to already exist. It seems like it would be more sensible
if it created the cluster itself. That's not entirely trivial, because
for example you have to create it with the correct locale settings and
stuff. But if you require the cluster to exist already, then you run
into the kinds of questions that you're asking here, and whether the
answer is "nothing" as you propose here or something more than that,
it's clearly not "whatever you want" nor anything close to that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Mon, Aug 23, 2021 at 8:29 PM Bruce Momjian <bruce@momjian.us> wrote:
> I assume this patch is not going to be applied until there is an actual
> use case for preserving these values.

My interpretation of the preceding discussion was that several people
thought this change was a good idea regardless of whether anything
ever happens with TDE, so I wasn't seeing a reason to wait.
Personally, I've always thought that it was quite odd that pg_upgrade
didn't preserve the relfilenode values, so I'm in favor of the change.
I bet we could even make some simplifications to that code if we got
all of this sorted out, which seems like it would be nice.

I think it was also mentioned that this might be nice for pgBackRest,
which apparently permits incremental backups across major version
upgrades but likes filenames to match.

That being said, if you or somebody else thinks that this is a bad
idea or that the reasons offered up until now are insufficient, feel
free to make that argument. I just work here...

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Aug 23, 2021 at 8:29 PM Bruce Momjian <bruce@momjian.us> wrote:
>> I assume this patch is not going to be applied until there is an actual
>> use case for preserving these values.

> ...

> That being said, if you or somebody else thinks that this is a bad
> idea or that the reasons offered up until now are insufficient, feel
> free to make that argument. I just work here...

Per upthread discussion, it seems impractical to fully guarantee
that database OIDs match, which seems to mean that the whole premise
collapses.  Like Bruce, I want to see a plausible use case justifying
any partial-guarantee scenario before we add more complication (= bugs)
to pg_upgrade.

            regards, tom lane



On Tue, Aug 24, 2021 at 12:04:00PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Mon, Aug 23, 2021 at 8:29 PM Bruce Momjian <bruce@momjian.us> wrote:
> >> I assume this patch is not going to be applied until there is an actual
> >> use case for preserving these values.
> 
> > ...
> 
> > That being said, if you or somebody else thinks that this is a bad
> > idea or that the reasons offered up until now are insufficient, feel
> > free to make that argument. I just work here...
> 
> Per upthread discussion, it seems impractical to fully guarantee
> that database OIDs match, which seems to mean that the whole premise
> collapses.  Like Bruce, I want to see a plausible use case justifying
> any partial-guarantee scenario before we add more complication (= bugs)
> to pg_upgrade.

Yes, pg_upgrade is already complex enough, so why add more complexity
for some cosmetic value?  (I think "cosmetic" flew out the window with
pg_upgrade long ago.  ;-)  )

I know that pgBackRest has asked for stable relfilenodes to make
incremental file system backups after pg_upgrade smaller, but if we want
to make relfilenodes stable, we had better understand that that is _why_ we
are adding this complexity.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Tue, Aug 24, 2021 at 11:28:37AM -0400, Robert Haas wrote:
> On Mon, Aug 23, 2021 at 8:29 PM Bruce Momjian <bruce@momjian.us> wrote:
> > I assume this patch is not going to be applied until there is an actual
> > use case for preserving these values.
> 
> My interpretation of the preceding discussion was that several people
> thought this change was a good idea regardless of whether anything
> ever happens with TDE, so I wasn't seeing a reason to wait.
> Personally, I've always thought that it was quite odd that pg_upgrade
> didn't preserve the relfilenode values, so I'm in favor of the change.
> I bet we could even make some simplifications to that code if we got
> all of this sorted out, which seems like it would be nice.

Yes, if this ends up being a cleanup with no added complexity, that
would be nice, but I had not seen how that was possible in the past.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Tue, Aug 24, 2021 at 11:24:21AM -0400, Robert Haas wrote:
> On Mon, Aug 23, 2021 at 5:12 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Regarding that ... I have to wonder just what promises we feel we've
> > made when it comes to what a user is expected to be able to do with the
> > new cluster *before* pg_upgrade is run on it.  For my part, I sure feel
> > like it's "nothing", in which case it seems like we can do things that
> > we can't do with a running system, like literally just DROP and recreate
> > with the correct OID of any databases we need to, or even push that back
> > to the user to do that at initdb time with some kind of error thrown by
> > pg_upgrade during the --check phase.  "Initial databases have
> > non-standard OIDs, recreate destination cluster with initdb
> > --with-oid=12341" or something along those lines.
> 
> Yeah, possibly. Honestly, I find it weird that pg_upgrade expects the
> new cluster to already exist. It seems like it would be more sensible
> if it created the cluster itself. That's not entirely trivial, because
> for example you have to create it with the correct locale settings and
> stuff. But if you require the cluster to exist already, then you run
> into the kinds of questions that you're asking here, and whether the
> answer is "nothing" as you propose here or something more than that,
> it's clearly not "whatever you want" nor anything close to that.

Yes, it is a trade-off.  If we had pg_upgrade create the new cluster,
the pg_upgrade instructions would be simpler, but pg_upgrade would be
more complex since it has to adjust _everything_ properly so pg_upgrade
works --- I never got to that point, but I am willing to explore what
would be required.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Tue, Aug 24, 2021 at 12:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Per upthread discussion, it seems impractical to fully guarantee
> that database OIDs match, which seems to mean that the whole premise
> collapses.  Like Bruce, I want to see a plausible use case justifying
> any partial-guarantee scenario before we add more complication (= bugs)
> to pg_upgrade.

I think you might be overlooking the emails from Stephen and I where
we suggested how that could be made to work?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Aug 23, 2021 at 5:12 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Regarding that ... I have to wonder just what promises we feel we've
> > made when it comes to what a user is expected to be able to do with the
> > new cluster *before* pg_upgrade is run on it.  For my part, I sure feel
> > like it's "nothing", in which case it seems like we can do things that
> > we can't do with a running system, like literally just DROP and recreate
> > with the correct OID of any databases we need to, or even push that back
> > to the user to do that at initdb time with some kind of error thrown by
> > pg_upgrade during the --check phase.  "Initial databases have
> > non-standard OIDs, recreate destination cluster with initdb
> > --with-oid=12341" or something along those lines.
>
> Yeah, possibly. Honestly, I find it weird that pg_upgrade expects the
> new cluster to already exist. It seems like it would be more sensible
> if it created the cluster itself. That's not entirely trivial, because
> for example you have to create it with the correct locale settings and
> stuff. But if you require the cluster to exist already, then you run
> into the kinds of questions that you're asking here, and whether the
> answer is "nothing" as you propose here or something more than that,
> it's clearly not "whatever you want" nor anything close to that.

Yeah, I'd had a similar thought and also tend to agree that it'd make
more sense for pg_upgrade to set up the new cluster too, and doing so in
a way that makes sure that it matches the old cluster as that's rather
important.  Having the user do it also implies that there is some
freedom for the user to mess around with the new cluster before running
pg_upgrade, it seems to me anyway, and that's certainly not something
that we've built anything into pg_upgrade to deal with cleanly.

It isn't like initdb takes all *that* long to run either, and reducing
the number of steps that the user has to take to perform an upgrade sure
seems like a good thing to do.  Anyhow, just wanted to throw that out
there as another way we might approach this.

Thanks,

Stephen

On Tue, Aug 24, 2021 at 12:43 PM Bruce Momjian <bruce@momjian.us> wrote:
> Yes, it is a trade-off.  If we had pg_upgrade create the new cluster,
> the pg_upgrade instructions would be simpler, but pg_upgrade would be
> more complex since it has to adjust _everything_ properly so pg_upgrade
> works --- I never got to that point, but I am willing to explore what
> would be required.

It's probably a topic for another thread, rather than this one, but I
think that would be very cool.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, Aug 24, 2021 at 12:43 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Yes, it is a trade-off.  If we had pg_upgrade create the new cluster,
> > the pg_upgrade instructions would be simpler, but pg_upgrade would be
> > more complex since it has to adjust _everything_ properly so pg_upgrade
> > works --- I never got to that point, but I am willing to explore what
> > would be required.
>
> It's probably a topic for another thread, rather than this one, but I
> think that would be very cool.

Yes, definite +1 on this.

Thanks,

Stephen

On Tue, Aug 24, 2021 at 12:43:20PM -0400, Bruce Momjian wrote:
> Yes, it is a trade-off.  If we had pg_upgrade create the new cluster,
> the pg_upgrade instructions would be simpler, but pg_upgrade would be
> more complex since it has to adjust _everything_ properly so pg_upgrade
> works --- I never got to that point, but I am willing to explore what
> would be required.

One other issue --- the more that pg_upgrade preserves, the more likely
pg_upgrade will break when some internal changes happen in Postgres. 
Therefore, if you want pg_upgrade to preserve something, you have to
have a good reason --- even code simplicity might not be a sufficient
reason.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Tue, Aug 24, 2021 at 2:16 PM Bruce Momjian <bruce@momjian.us> wrote:
> One other issue --- the more that pg_upgrade preserves, the more likely
> pg_upgrade will break when some internal changes happen in Postgres.
> Therefore, if you want pg_upgrade to preserve something, you have to
> have a good reason --- even code simplicity might not be a sufficient
> reason.

While I accept that as a general principle, I don't think it's really
applicable in this case. pg_upgrade already knows all about
relfilenodes; it has a source file called relfilenode.c. I don't see
that a pg_upgrade that preserves relfilenodes is any more or less
likely to break in the future than a pg_upgrade that renumbers all the
files so that the relation OID and the relfilenode are equal. You've
got about the same amount of reliance on the on-disk layout either
way.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Tue, Aug 24, 2021 at 02:34:26PM -0400, Robert Haas wrote:
> On Tue, Aug 24, 2021 at 2:16 PM Bruce Momjian <bruce@momjian.us> wrote:
> > One other issue --- the more that pg_upgrade preserves, the more likely
> > pg_upgrade will break when some internal changes happen in Postgres.
> > Therefore, if you want pg_upgrade to preserve something, you have to
> > have a good reason --- even code simplicity might not be a sufficient
> > reason.
> 
> While I accept that as a general principle, I don't think it's really
> applicable in this case. pg_upgrade already knows all about
> relfilenodes; it has a source file called relfilenode.c. I don't see
> that a pg_upgrade that preserves relfilenodes is any more or less
> likely to break in the future than a pg_upgrade that renumbers all the
> files so that the relation OID and the relfilenode are equal. You've
> got about the same amount of reliance on the on-disk layout either
> way.

I was making more of a general statement that preservation can be
problematic and its impact must be researched.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Tue, Aug 17, 2021 at 2:50 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Less sure that this is a good idea, though.  In particular, I do not
> > think that you can make it work in the face of
> >         alter database template1 rename to oops;
> >         create database template1;
>
> That is a really good point. If we can't categorically force the OID
> of those databases to have a particular, fixed value, and based on
> this example that seems to be impossible, then there's always a
> possibility that we might find a value in the old cluster that doesn't
> happen to match what is present in the new cluster. Seen from that
> angle, the problem is really with databases that are pre-existent in
> the new cluster but whose contents still need to be dumped. Maybe we
> could (optionally? conditionally?) drop those databases from the new
> cluster and then recreate them with the OID that we want them to have.

Actually, we do that already. create_new_objects() runs pg_restore
with --create for most databases, but with --clean --create for
template1 and postgres. This means that template1 and postgres will
always be recreated in the new cluster, and other databases are
assumed not to exist in the new cluster and the upgrade will fail if
they unexpectedly do. And the reason why pg_upgrade does that is that
it wants to "propagate [the] database-level properties" of postgres
and template1. So suppose we just make the database OID one of the
database-level properties that we want to propagate. That should
mostly just work, but where can things go wrong?

The only real failure mode is we try to create a database in the new
cluster and find out that the OID is already in use. If the new OID
that collides is >64k, then the user has messed with the new cluster
before doing that. And since pg_upgrade is pretty clearly already
assuming that you shouldn't do that, it's fine to also make that
assumption in this case. We can disregard such cases as user error.

If the new OID that collides is <64k, then it must be colliding with
template0, template1, or postgres in the new cluster, because those
are the only databases that can have such OIDs since, currently, we
don't allow users to specify an OID for a new database. And the
problem cannot be with template1, because we hard-code its OID to 1.
If there is a database with OID 1 in either cluster, it must be
template1, and if there is a database with OID 1 in both clusters, it
must be template1 in both cases, and we'll just drop and recreate it
with OID 1 and everything is fine. So we need only consider template0
and postgres, which are created with system-generated OIDs. And, it
would be no issue if either of those databases had the same OID in the
old and new cluster, so the only possible OID collision is one where
the same system-generated OID was assigned to one of those databases
in the old cluster and to the other in the new cluster.

First consider the case where template0 has OID, say, 13000, in the
old cluster, and postgres has that OID in the new cluster. No problem
occurs, because template0 isn't transferred anyway. The reverse
direction is a problem, though. If postgres had been assigned OID
13000 in the old cluster and, by sheer chance, template0 had that OID
in the new cluster, then the upgrade would fail, because it wouldn't
be able to recreate the postgres database with the correct OID.

But that doesn't seem very difficult to fix. I think all we need to do
is have initdb assign a fixed OID to template0 at creation time. Then,
in any new release to which someone might be trying to upgrade, the
system-generated OID assigned to postgres in the old release can't
match the fixed OID assigned to template0 in the new release, so the
one problem case is ruled out. We do need, however, to make sure that
the assign-my-database-a-fixed-OID syntax is either entirely
restricted to initdb & pg_upgrade or at least that OIDs < 64k can only
be assigned in one of those modes. Otherwise, some creative person
could manufacture new problem cases by setting up the source database
so that the OID of one of their databases matches the fixed OID we
gave to template0 or template1, or the system-generated OID for
postgres in the new cluster.

In short, as far as I can see, all we need to do to preserve database
OIDs across pg_upgrade is:

1. Add a new syntax for creating a database with a given OID, and use
it in pg_dump --binary-upgrade.
2. Don't let users use it at least for OIDs <64k, or maybe just don't
let them use it at all.
3. But let initdb use it, and have initdb set the initial OID for
template0 to a fixed value < 10000. If the user changes it later, no
problem; the cluster into which they are upgrading won't contain any
databases with high-numbered OIDs.
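
In other words, the user-visible rule might look like this (behavior
illustrative; neither the syntax nor the cutoff exists today):

    CREATE DATABASE mydb WITH OID = 100000;  -- OK, above the reserved range
    CREATE DATABASE mydb2 WITH OID = 13000;  -- rejected outside binary
                                             -- upgrade mode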

Anyone see a flaw in that analysis?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote:
> Anyone see a flaw in that analysis?

I am still waiting to hear the purpose of this preservation.  As long as
you don't apply the patch, I guess I will just stop asking.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Thu, Aug 26, 2021 at 11:24 AM Bruce Momjian <bruce@momjian.us> wrote:
> On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote:
> > Anyone see a flaw in that analysis?
>
> I am still waiting to hear the purpose of this preservation.  As long as
> you don't apply the patch, I guess I will just stop asking.

You make it sound like I didn't answer that question the last time you
asked it, but I did.[1] I went back to the previous thread and found
that, in fact, there's at least one email *from you* appearing to
endorse that concept for reasons unrelated to TDE[2] and another where
you appear to agree that it would be useful for TDE to do it.[3]
Stephen Frost also wrote up his discussion during the Unconference and
some of his reasons for liking the idea.[4]

If you've changed your mind about this being a good idea, or if you no
longer think it's useful without TDE, that's fine. Everyone is
entitled to change their opinion. But then please say that straight
out. It baffles me why you're now acting as if it hasn't been
discussed when it clearly has been, and both you and I were
participants in that discussion.

[1] https://www.postgresql.org/message-id/CA+Tgmob7msyh3VRaY87USr22UakvvSyy4zBaQw2AO2CfoUD3rA@mail.gmail.com
[2] https://www.postgresql.org/message-id/20210601140949.GC22012@momjian.us
[3] https://www.postgresql.org/message-id/20210527210023.GJ5646@momjian.us
[4] https://www.postgresql.org/message-id/20210531201652.GY20766@tamriel.snowman.net

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote:
> > Anyone see a flaw in that analysis?
>
> I am still waiting to hear the purpose of this preservation.  As long as
> you don't apply the patch, I guess I will just stop asking.

I'm a bit confused why this question keeps coming up as we've discussed
multiple reasons (incremental backups, possible use for TDE which would
make this required, general improved sanity when working with pg_upgrade
is frankly a benefit in its own right too...).  If the additional code
was a huge burden or even a moderate one then that might be an argument
against, but it hardly sounds like it will be given Robert's thorough
analysis so far and the (admittedly not complete, but not that far from
it based on the DB OID review) proposed patch.

Thanks,

Stephen

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, Aug 17, 2021 at 2:50 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > Less sure that this is a good idea, though.  In particular, I do not
> > > think that you can make it work in the face of
> > >         alter database template1 rename to oops;
> > >         create database template1;
> >
> > That is a really good point. If we can't categorically force the OID
> > of those databases to have a particular, fixed value, and based on
> > this example that seems to be impossible, then there's always a
> > possibility that we might find a value in the old cluster that doesn't
> > happen to match what is present in the new cluster. Seen from that
> > angle, the problem is really with databases that are pre-existent in
> > the new cluster but whose contents still need to be dumped. Maybe we
> > could (optionally? conditionally?) drop those databases from the new
> > cluster and then recreate them with the OID that we want them to have.
>
> Actually, we do that already. create_new_objects() runs pg_restore
> with --create for most databases, but with --clean --create for
> template1 and postgres. This means that template1 and postgres will
> always be recreated in the new cluster, and other databases are
> assumed not to exist in the new cluster and the upgrade will fail if
> they unexpectedly do. And the reason why pg_upgrade does that is that
> it wants to "propagate [the] database-level properties" of postgres
> and template1. So suppose we just make the database OID one of the
> database-level properties that we want to propagate. That should
> mostly just work, but where can things go wrong?
>
> The only real failure mode is we try to create a database in the new
> cluster and find out that the OID is already in use. If the new OID
> that collides is >64k, then the user has messed with the new cluster
> before doing that. And since pg_upgrade is pretty clearly already
> assuming that you shouldn't do that, it's fine to also make that
> assumption in this case. We can disregard such cases as user error.
>
> If the new OID that collides is <64k, then it must be colliding with
> template0, template1, or postgres in the new cluster, because those
> are the only databases that can have such OIDs since, currently, we
> don't allow users to specify an OID for a new database. And the
> problem cannot be with template1, because we hard-code its OID to 1.
> If there is a database with OID 1 in either cluster, it must be
> template1, and if there is a database with OID 1 in both clusters, it
> must be template1 in both cases, and we'll just drop and recreate it
> with OID 1 and everything is fine. So we need only consider template0
> and postgres, which are created with system-generated OIDs. And, it
> would be no issue if either of those databases had the same OID in the
> old and new cluster, so the only possible OID collision is one where
> the same system-generated OID was assigned to one of those databases
> in the old cluster and to the other in the new cluster.
>
> First consider the case where template0 has OID, say, 13000, in the
> old cluster, and postgres has that OID in the new cluster. No problem
> occurs, because template0 isn't transferred anyway. The reverse
> direction is a problem, though. If postgres had been assigned OID
> 13000 in the old cluster and, by sheer chance, template0 had that OID
> in the new cluster, then the upgrade would fail, because it wouldn't
> be able to recreate the postgres database with the correct OID.
>
> But that doesn't seem very difficult to fix. I think all we need to do
> is have initdb assign a fixed OID to template0 at creation time. Then,
> in any new release to which someone might be trying to upgrade, the
> system-generated OID assigned to postgres in the old release can't
> match the fixed OID assigned to template0 in the new release, so the
> one problem case is ruled out. We do need, however, to make sure that
> the assign-my-database-a-fixed-OID syntax is either entirely
> restricted to initdb & pg_upgrade or at least that OIDs < 64k can only
> be assigned in one of those modes. Otherwise, some creative person
> could manufacture new problem cases by setting up the source database
> so that the OID of one of their databases matches the fixed OID we
> gave to template0 or template1, or the system-generated OID for
> postgres in the new cluster.
>
> In short, as far as I can see, all we need to do to preserve database
> OIDs across pg_upgrade is:
>
> 1. Add a new syntax for creating a database with a given OID, and use
> it in pg_dump --binary-upgrade.
> 2. Don't let users use it at least for OIDs <64k, or maybe just don't
> let them use it at all.
> 3. But let initdb use it, and have initdb set the initial OID for
> template0 to a fixed value < 10000. If the user changes it later, no
> problem; the cluster into which they are upgrading won't contain any
> databases with high-numbered OIDs.
>
> Anyone see a flaw in that analysis?

This looks like a pretty good analysis to me.  As it relates to the
question about allowing users to specify an OID, I'd be inclined to
allow it but only for OIDs >64k.  We've certainly reserved things in the
past and I don't see any issue with having that reservation here, but if
we're going to build the capability to specify the OID into CREATE
DATABASE then it seems a bit odd to disallow users from using it, as
long as we're preventing them from causing problems with it.

Are there issues that you see with allowing users to specify the OID
even with the >64k restriction..?  I can't think of one offhand but
perhaps I'm missing something.

Thanks,

Stephen

On Thu, Aug 26, 2021 at 11:35:01AM -0400, Robert Haas wrote:
> On Thu, Aug 26, 2021 at 11:24 AM Bruce Momjian <bruce@momjian.us> wrote:
> > On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote:
> > > Anyone see a flaw in that analysis?
> >
> > I am still waiting to hear the purpose of this preservation.  As long as
> > you don't apply the patch, I guess I will just stop asking.
> 
> You make it sound like I didn't answer that question the last time you
> asked it, but I did.[1] I went back to the previous thread and found
> that, in fact, there's at least one email *from you* appearing to
> endorse that concept for reasons unrelated to TDE[2] and another where
> you appear to agree that it would be useful for TDE to do it.[3]
> Stephen Frost also wrote up his discussion during the Unconference and
> some of his reasons for liking the idea.[4]
> 
> If you've changed your mind about this being a good idea, or if you no
> longer think it's useful without TDE, that's fine. Everyone is
> entitled to change their opinion. But then please say that straight
> out. It baffles me why you're now acting as if it hasn't been
> discussed when it clearly has been, and both you and I were
> participants in that discussion.
> 
> [1] https://www.postgresql.org/message-id/CA+Tgmob7msyh3VRaY87USr22UakvvSyy4zBaQw2AO2CfoUD3rA@mail.gmail.com
> [2] https://www.postgresql.org/message-id/20210601140949.GC22012@momjian.us
> [3] https://www.postgresql.org/message-id/20210527210023.GJ5646@momjian.us
> [4] https://www.postgresql.org/message-id/20210531201652.GY20766@tamriel.snowman.net

Yes, it would help pgBackRest's incremental backup, as reported by its
developers.  However, I have seen no discussion of whether that is a
useful enough reason to add the complexity of preserving this.  The
TODO list shows "Desirability" as the first item to be discussed, so I
expected that to be discussed first.  Also, with TDE not progressing
(and my approach not even needing this), I have not seen a full
discussion of whether this item is desirable given its complexity.

What I did see is this patch appear with no context for why it is
useful given our current plans, except for pgBackRest, which I think I
mentioned.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Thu, Aug 26, 2021 at 11:36:51AM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote:
> > > Anyone see a flaw in that analysis?
> > 
> > I am still waiting to hear the purpose of this preservation.  As long as
> > you don't apply the patch, I guess I will just stop asking.
> 
> I'm a bit confused why this question keeps coming up as we've discussed
> multiple reasons (incremental backups, possible use for TDE which would

I have not seen much explanation on pgBackRest, except me mentioning
it.  Is this really useful?

As far as TDE, I haven't seen any concrete plan for that, so why add
this code for that reason?

> make this required, general improved sanity when working with pg_upgrade
> is frankly a benefit in its own right too...).  If the additional code

How?  I am not aware of any advantage except cosmetic.

> was a huge burden or even a moderate one then that might be an argument
> against, but it hardly sounds like it will be given Robert's thorough
> analysis so far and the (admittedly not complete, but not that far from
> it based on the DB OID review) proposed patch.

I am fine to add it if it is minor, but I want to see the calculus of
its value vs complexity, which I have not seen spelled out.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Thu, Aug 26, 2021 at 11:39 AM Stephen Frost <sfrost@snowman.net> wrote:
> This looks like a pretty good analysis to me.  As it relates to the
> question about allowing users to specify an OID, I'd be inclined to
> allow it but only for OIDs >64k.  We've certainly reserved things in the
> past and I don't see any issue with having that reservation here, but if
> we're going to build the capability to specify the OID into CREATE
> DATABASE then it seems a bit odd to disallow users from using it, as
> long as we're preventing them from causing problems with it.
>
> Are there issues that you see with allowing users to specify the OID
> even with the >64k restriction..?  I can't think of one offhand but
> perhaps I'm missing something.

So I actually should have said 16k here, not 64k, as somebody already
pointed out to me off-list. Whee!

I don't know of a reason not to let people do that, other than that it
seems like an attractive nuisance. People will do it and it will fail
because they chose a duplicate OID, or they'll complain that a regular
dump and restore didn't preserve their database OIDs, or maybe they'll
expect that they can copy a database from one cluster to another
because they gave it the same OID! That said, I don't see a great harm
in it. It just seems to me like exposing knobs to users that don't
seem to have any legitimate use may be borrowing trouble.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Thu, Aug 26, 2021 at 11:36:51AM -0400, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote:
> > > > Anyone see a flaw in that analysis?
> > >
> > > I am still waiting to hear the purpose of this preservation.  As long as
> > > you don't apply the patch, I guess I will just stop asking.
> >
> > I'm a bit confused why this question keeps coming up as we've discussed
> > multiple reasons (incremental backups, possible use for TDE which would
>
> I have not seen much explanation on pgBackRest, except me mentioning
> it.  Is this really useful?

Being able to quickly perform a backup on a newly upgraded cluster would
certainly be valuable and that's definitely not possible today due to
all of the filenames changing.

> As far as TDE, I haven't seen any concrete plan for that, so why add
> this code for that reason?

That this would help with TDE (of which there seems little doubt...) is
an additional benefit to this.  Specifically, taking the existing work
that's already been done to allow block-by-block encryption and
adjusting it for AES-XTS and then using the db-dir+relfileno+block
number as the IV, just like many disk encryption systems do, avoids the
concerns that were brought up about using LSN for the IV with CTR and
it's certainly not difficult to do, but it does depend on this change.
This was all discussed previously and it sure looks like a sensible
approach to use that mirrors what many other systems already do
successfully.
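
To illustrate the idea, here is a minimal sketch of how such a tweak
could be assembled (the function name and byte layout are hypothetical,
not from any posted patch):

    #include <stdint.h>
    #include <string.h>

    /*
     * Build a 16-byte AES-XTS tweak from the database OID, the
     * relfilenode, and the block number.  Layout is illustrative only.
     */
    void
    build_xts_tweak(uint8_t tweak[16], uint32_t dboid,
                    uint32_t relfilenode, uint32_t blkno)
    {
        memset(tweak, 0, 16);
        memcpy(tweak, &dboid, sizeof(uint32_t));
        memcpy(tweak + 4, &relfilenode, sizeof(uint32_t));
        memcpy(tweak + 8, &blkno, sizeof(uint32_t));
        /* the remaining four bytes stay zero; they could carry a fork
         * number or similar if needed */
    }

Since the tweak is derived entirely from where the block lives, the
same block always gets the same tweak, which is exactly the property
that requires relfilenodes to be preserved across pg_upgrade.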

> > make this required, general improved sanity when working with pg_upgrade
> > is frankly a benefit in its own right too...).  If the additional code
>
> How?  I am not aware of any advantage except cosmetic.

Having to resort to matching up inode numbers between the two clusters
after a pg_upgrade to figure out what files are actually the same
underneath is a pain that goes beyond just cosmetics imv.  Removing that
additional level that admins, and developers for that matter, have to go
through would be a nice improvement on its own.

> > was a huge burden or even a moderate one then that might be an argument
> > against, but it hardly sounds like it will be given Robert's thorough
> > analysis so far and the (admittedly not complete, but not that far from
> > it based on the DB OID review) proposed patch.
>
> I am find to add it if it is minor, but I want to see the calculus of
> its value vs complexity, which I have not seen spelled out.

I feel that this, along with the prior discussions, spells it out
sufficiently given the patch's complexity looks to be reasonably minor
and very similar to the existing things that pg_upgrade already does.
Had pg_upgrade done this in the first place, I don't think there would
have been nearly this amount of discussion about it.

Thanks,

Stephen

On Thu, Aug 26, 2021 at 11:48 AM Bruce Momjian <bruce@momjian.us> wrote:
> I am fine to add it if it is minor, but I want to see the calculus of
> its value vs complexity, which I have not seen spelled out.

I don't think it's going to be all that complicated, but we're going
to have to wait until we have something closer to a final patch before
we can really evaluate that. I am honestly a little puzzled about why
you think complexity is such a big issue for this patch in particular.
I feel we do probably several hundred things every release cycle that
are more complicated than this, so it doesn't seem like this is
particularly extraordinary or needs a lot of extra scrutiny. I do
think there is some risk that there are messy cases we can't handle
cleanly, but if that becomes an issue then I'll abandon the effort
until a solution can be found. I'm not trying to relentlessly drive
something through that is a bad idea on principle.

I agree with all Stephen's comments, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Aug 26, 2021 at 11:39 AM Stephen Frost <sfrost@snowman.net> wrote:
> > This looks like a pretty good analysis to me.  As it relates to the
> > question about allowing users to specify an OID, I'd be inclined to
> > allow it but only for OIDs >64k.  We've certainly reserved things in the
> > past and I don't see any issue with having that reservation here, but if
> > we're going to build the capability to specify the OID into CREATE
> > DATABASE then it seems a bit odd to disallow users from using it, as
> > long as we're preventing them from causing problems with it.
> >
> > Are there issues that you see with allowing users to specify the OID
> > even with the >64k restriction..?  I can't think of one offhand but
> > perhaps I'm missing something.
>
> So I actually should have said 16k here, not 64k, as somebody already
> pointed out to me off-list. Whee!

Hah, yes, of course.

> I don't know of a reason not to let people do that, other than that it
> seems like an attractive nuisance. People will do it and it will fail
> because they chose a duplicate OID, or they'll complain that a regular
> dump and restore didn't preserve their database OIDs, or maybe they'll
> expect that they can copy a database from one cluster to another
> because they gave it the same OID! That said, I don't see a great harm
> in it. It just seems to me like exposing knobs to users that don't
> seem to have any legitimate use may be borrowing trouble.

We're going to have to gate this somehow to allow the OIDs under 16k to
be used, so it seems like what you're suggesting is that we have that
gate in place but then allow any OID to be used if you've crossed that
gate?

That is, if we do something like:

SELECT pg_catalog.binary_upgrade_allow_setting_db_oid();
CREATE DATABASE blah WITH OID 1234;

for pg_upgrade, well, users who are interested may well figure out how
to do that themselves if they decide they want to set the OID, whereas
if it 'just works' provided they don't try to use an OID too low then
maybe they won't try to bypass the restriction against using system
OIDs..?
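
To make that concrete, here is a sketch of what the server-side check
behind such a gate might look like (the flag and function names are
made up for illustration):

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define FirstNormalObjectId 16384    /* as in access/transam.h */

    /* flipped by something like binary_upgrade_allow_setting_db_oid();
     * the flag name is made up for illustration */
    static bool allow_system_db_oid = false;

    /* reject low, reserved OIDs unless the gate has been opened */
    void
    check_user_db_oid(unsigned int dboid)
    {
        if (dboid < FirstNormalObjectId && !allow_system_db_oid)
        {
            fprintf(stderr,
                    "database OIDs below %d are reserved\n",
                    FirstNormalObjectId);
            exit(1);
        }
    }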

Ok, I'll give you that this is a stretch and I'm on the fence about if
it's worthwhile or not to include and document and if, as you say, it's
inviting trouble to allow users to set it.  Users do seem to have a
knack for finding things even when they aren't documented and then we
get to deal with those complaints too. :)

Perhaps others have some stronger feelings one way or another.

Thanks,

Stephen

On Thu, Aug 26, 2021 at 12:34:56PM -0400, Stephen Frost wrote:
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Thu, Aug 26, 2021 at 11:36:51AM -0400, Stephen Frost wrote:
> > > * Bruce Momjian (bruce@momjian.us) wrote:
> > > > On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote:
> > > > > Anyone see a flaw in that analysis?
> > > > 
> > > > I am still waiting to hear the purpose of this preservation.  As long as
> > > > you don't apply the patch, I guess I will just stop asking.
> > > 
> > > I'm a bit confused why this question keeps coming up as we've discussed
> > > multiple reasons (incremental backups, possible use for TDE which would
> > 
> > I have not seen much explanation on pgBackRest, except me mentioning
> > it.  Is this really useful?
> 
> Being able to quickly perform a backup on a newly upgraded cluster would
> certainly be valuable and that's definitely not possible today due to
> all of the filenames changing.

You mean incremental backup, right?  I was told this by the pgBackRest
developers during PGCon, but I hate to go just on what I heard in
private rather than on something stated publicly.

> > As far as TDE, I haven't seen any concrete plan for that, so why add
> > this code for that reason?
> 
> That this would help with TDE (of which there seems little doubt...) is
> an additional benefit to this.  Specifically, taking the existing work
> that's already been done to allow block-by-block encryption and
> adjusting it for AES-XTS and then using the db-dir+relfileno+block
> number as the IV, just like many disk encryption systems do, avoids the
> concerns that were brought up about using LSN for the IV with CTR and
> it's certainly not difficult to do, but it does depend on this change.
> This was all discussed previously and it sure looks like a sensible
> approach to use that mirrors what many other systems already do
> successfully.

Well, I would think we would not add this for TDE until we were sure
someone was working on adding TDE.

> > > make this required, general improved sanity when working with pg_upgrade
> > > is frankly a benefit in its own right too...).  If the additional code
> > 
> > How?  I am not aware of any advantage except cosmetic.
> 
> Having to resort to matching up inode numbers between the two clusters
> after a pg_upgrade to figure out what files are actually the same
> underneath is a pain that goes beyond just cosmetics imv.  Removing that
> additional level that admins, and developers for that matter, have to go
> through would be a nice improvement on its own.

OK, I was just not aware anyone did that, since I have never heard anyone
complain about it before.
 
> > > was a huge burden or even a moderate one then that might be an argument
> > > against, but it hardly sounds like it will be given Robert's thorough
> > > analysis so far and the (admittedly not complete, but not that far from
> > > it based on the DB OID review) proposed patch.
> > 
> > I am fine to add it if it is minor, but I want to see the calculus of
> > its value vs complexity, which I have not seen spelled out.
> 
> I feel that this, along with the prior discussions, spells it out
> sufficiently given the patch's complexity looks to be reasonably minor
> and very similar to the existing things that pg_upgrade already does.
> Had pg_upgrade done this in the first place, I don't think there would
> have been nearly this amount of discussion about it.

Well, there is a reason pg_upgrade didn't initially do this --- because
it adds complexity, and potentially makes future changes to pg_upgrade
necessary if the server behavior changes.

I am not saying this change is wrong, but I think the reasons need to be
stated in this thread, rather than just moving forward.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Thu, Aug 26, 2021 at 12:37:19PM -0400, Robert Haas wrote:
> On Thu, Aug 26, 2021 at 11:48 AM Bruce Momjian <bruce@momjian.us> wrote:
> > I am fine to add it if it is minor, but I want to see the calculus of
> > its value vs complexity, which I have not seen spelled out.
> 
> I don't think it's going to be all that complicated, but we're going
> to have to wait until we have something closer to a final patch before
> we can really evaluate that. I am honestly a little puzzled about why
> you think complexity is such a big issue for this patch in particular.
> I feel we do probably several hundred things every release cycle that
> are more complicated than this, so it doesn't seem like this is
> particularly extraordinary or needs a lot of extra scrutiny. I do
> think there is some risk that there are messy cases we can't handle
> cleanly, but if that becomes an issue then I'll abandon the effort
> until a solution can be found. I'm not trying to relentlessly drive
> something through that is a bad idea on principle.
> 
> I agree with all Stephen's comments, too.

I just don't want to add requirements/complexity to pg_upgrade without
clearly stated reasons because future database changes will need to
honor this new preservation behavior.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Thu, Aug 26, 2021 at 12:34:56PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > On Thu, Aug 26, 2021 at 11:36:51AM -0400, Stephen Frost wrote:
> > > > * Bruce Momjian (bruce@momjian.us) wrote:
> > > > > On Thu, Aug 26, 2021 at 11:00:47AM -0400, Robert Haas wrote:
> > > > > > Anyone see a flaw in that analysis?
> > > > >
> > > > > I am still waiting to hear the purpose of this preservation.  As long as
> > > > > you don't apply the patch, I guess I will just stop asking.
> > > >
> > > > I'm a bit confused why this question keeps coming up as we've discussed
> > > > multiple reasons (incremental backups, possible use for TDE which would
> > >
> > > I have not seen much explanation on pgBackRest, except me mentioning
> > > it.  Is this really useful?
> >
> > Being able to quickly perform a backup on a newly upgraded cluster would
> > certainly be valuable and that's definitely not possible today due to
> > all of the filenames changing.
>
> You mean incremental backup, right?  I was told this by the pgBackRest
> developers during PGCon, but I hate to go just on what I heard in
> private rather than on something stated publicly.

Yes, we're talking about either incremental (or perhaps differential)
backup where only the files which are actually different would be backed
up.  Just like with PG, I can't provide any complete guarantees that
we'd actually be able to make this possible after a major version
upgrade with pgBackRest, but it definitely isn't possible *without*
this change.  I can't see any reason why we wouldn't be able to do a
checksum-based incremental backup though (which would be *much* faster
than a regular backup) once this change is made and have that be a
reliable and trustworthy backup.  I'd want to think about it more and
discuss it with David in some detail before saying if we could maybe
perform a timestamp-based incremental backup (without checksum'ing the
files, as we do in normal situations), but that would really just be a
bonus.

> > > As far as TDE, I haven't seen any concrete plan for that, so why add
> > > this code for that reason?
> >
> > That this would help with TDE (of which there seems little doubt...) is
> > an additional benefit to this.  Specifically, taking the existing work
> > that's already been done to allow block-by-block encryption and
> > adjusting it for AES-XTS and then using the db-dir+relfileno+block
> > number as the IV, just like many disk encryption systems do, avoids the
> > concerns that were brought up about using LSN for the IV with CTR and
> > it's certainly not difficult to do, but it does depend on this change.
> > This was all discussed previously and it sure looks like a sensible
> > approach to use that mirrors what many other systems already do
> > successfully.
>
> Well, I would think we would not add this for TDE until we were sure
> someone was working on adding TDE.

That this would help with TDE is what I'd consider an added bonus.

> > > > make this required, general improved sanity when working with pg_upgrade
> > > > is frankly a benefit in its own right too...).  If the additional code
> > >
> > > How?  I am not aware of any advantage except cosmetic.
> >
> > Having to resort to matching up inode numbers between the two clusters
> > after a pg_upgrade to figure out what files are actually the same
> > underneath is a pain that goes beyond just cosmetics imv.  Removing that
> > additional level that admins, and developers for that matter, have to go
> > through would be a nice improvement on its own.
>
> OK, I was just not aware anyone did that, since I have never heard anyone
> complain about it before.

I've certainly done it and I'd be kind of surprised if others haven't,
but I've also played a lot with pg_dump in various modes, so perhaps
that's not a great representation.  I've definitely had to explain to
clients why there's a whole different set of filenames after a
pg_upgrade and why that is the case for an 'in place' upgrade before
too.

> > > > was a huge burden or even a moderate one then that might be an argument
> > > > against, but it hardly sounds like it will be given Robert's thorough
> > > > analysis so far and the (admittedly not complete, but not that far from
> > > > it based on the DB OID review) proposed patch.
> > >
> > > I am fine to add it if it is minor, but I want to see the calculus of
> > > its value vs complexity, which I have not seen spelled out.
> >
> > I feel that this, along with the prior discussions, spells it out
> > sufficiently given the patch's complexity looks to be reasonably minor
> > and very similar to the existing things that pg_upgrade already does.
> > Had pg_upgrade done this in the first place, I don't think there would
> > have been nearly this amount of discussion about it.
>
> Well, there is a reason pg_upgrade didn't initially do this --- because
> it adds complexity, and potentially makes future changes to pg_upgrade
> necessary if the server behavior changes.

I have a very hard time seeing what changes might happen in the server
in this space that wouldn't have an impact on pg_upgrade, with or
without this.

> I am not saying this change is wrong, but I think the reasons need to be
> stated in this thread, rather than just moving forward.

Ok, they've been stated and it seems to at least Robert and myself that
this is worthwhile to at least continue through to a concluded patch,
after which we can contemplate that patch's complexity against these
reasons.

Thanks,

Stephen

On Thu, Aug 26, 2021 at 01:03:54PM -0400, Stephen Frost wrote:
> Yes, we're talking about either incremental (or perhaps differential)
> backup where only the files which are actually different would be backed
> up.  Just like with PG, I can't provide any complete guarantees that
> we'd actually be able to make this possible after a major version
> upgrade with pgBackRest, but it definitely isn't possible *without*
> this change.  I can't see any reason why we wouldn't be able to do a
> checksum-based incremental backup though (which would be *much* faster
> than a regular backup) once this change is made and have that be a
> reliable and trustworthy backup.  I'd want to think about it more and
> discuss it with David in some detail before saying if we could maybe
> perform a timestamp-based incremental backup (without checksum'ing the
> files, as we do in normal situations), but that would really just be a
> bonus.

Well, it would be nice to know exactly how it would help pgBackRest if
that is one of the reasons we are adding this feature.

> > > > As far as TDE, I haven't seen any concrete plan for that, so why add
> > > > this code for that reason?
> > > 
> > > That this would help with TDE (of which there seems little doubt...) is
> > > an additional benefit to this.  Specifically, taking the existing work
> > > that's already been done to allow block-by-block encryption and
> > > adjusting it for AES-XTS and then using the db-dir+relfileno+block
> > > number as the IV, just like many disk encryption systems do, avoids the
> > > concerns that were brought up about using LSN for the IV with CTR and
> > > it's certainly not difficult to do, but it does depend on this change.
> > > This was all discussed previously and it sure looks like a sensible
> > > approach to use that mirrors what many other systems already do
> > > successfully.
> > 
> > Well, I would think we would not add this for TDE until we were sure
> > someone was working on adding TDE.
> 
> That this would help with TDE is what I'd consider an added bonus.

Not if we have no plans to implement TDE, which was my point.  Why not
wait to see if we are actually going to implement TDE rather than
adding it now?  It is just so obvious; why do I have to state this?

> > > > > make this required, general improved sanity when working with pg_upgrade
> > > > > is frankly a benefit in its own right too...).  If the additional code
> > > > 
> > > > How?  I am not aware of any advantage except cosmetic.
> > > 
> > > Having to resort to matching up inode numbers between the two clusters
> > > after a pg_upgrade to figure out what files are actually the same
> > > underneath is a pain that goes beyond just cosmetics imv.  Removing that
> > > additional level that admins, and developers for that matter, have to go
> > > through would be a nice improvement on its own.
> > 
> > OK, I was just not aware anyone did that, since I have never heard anyone
> > complain about it before.
> 
> I've certainly done it and I'd be kind of surprised if others haven't,
> but I've also played a lot with pg_dump in various modes, so perhaps
> that's not a great representation.  I've definitely had to explain to
> clients why there's a whole different set of filenames after a
> pg_upgrade and why that is the case for an 'in place' upgrade before
> too.

Uh, so I guess I am right that few people have mentioned this in the
past.  Why did users care about the file names?

> > > > > was a huge burden or even a moderate one then that might be an argument
> > > > > against, but it hardly sounds like it will be given Robert's thorough
> > > > > analysis so far and the (admittedly not complete, but not that far from
> > > > > it based on the DB OID review) proposed patch.
> > > > 
> > > > I am fine to add it if it is minor, but I want to see the calculus of
> > > > its value vs complexity, which I have not seen spelled out.
> > > 
> > > I feel that this, along with the prior discussions, spells it out
> > > sufficiently given the patch's complexity looks to be reasonably minor
> > > and very similar to the existing things that pg_upgrade already does.
> > > Had pg_upgrade done this in the first place, I don't think there would
> > > have been nearly this amount of discussion about it.
> > 
> > Well, there is a reason pg_upgrade didn't initially do this --- because
> > it adds complexity, and potentially makes future changes to pg_upgrade
> > necessary if the server behavior changes.
> 
> I have a very hard time seeing what changes might happen in the server
> in this space that wouldn't have an impact on pg_upgrade, with or
> without this.

I don't know, but I have to ask since I can't know the future, so any
"preseration" has to be studied.

> > I am not saying this change is wrong, but I think the reasons need to be
> > stated in this thread, rather than just moving forward.
> 
> Ok, they've been stated and it seems to at least Robert and myself that
> this is worthwhile to at least continue through to a concluded patch,
> after which we can contemplate that patch's complexity against these
> reasons.

OK, that works for me.  What bothers me is that the Desirability of this
change has not been clearly stated in this thread.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Thu, Aug 26, 2021 at 12:51 PM Bruce Momjian <bruce@momjian.us> wrote:
> I just don't want to add requirements/complexity to pg_upgrade without
> clearly stated reasons because future database changes will need to
> honor this new preservation behavior.

Well, I agree that it's good to have reasons clearly stated and I hope
that at this point you agree that they have been. Whether you agree
with them is another question, but I hope you at least agree that they
have been stated.

As far as the other part of your concern, what I think makes this
change pretty safe is that we are preserving more things rather than
fewer. I can imagine some server behavior depending on something being
the same between the old and the new clusters, but it is harder to
imagine a dependency on something not being preserved. For example, we
know that the OIDs of pg_type rows have to be the same in the old and
new cluster because arrays are stored on disk with the type OIDs
included. Therefore those need to be preserved. If in the future we
changed things so that arrays - and other container types - did not
include the type OIDs in the on-disk representation, then perhaps it
would no longer be necessary to preserve the OIDs of pg_type rows
across a pg_upgrade. However, it would not be harmful to do so. It
just might not be required.
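
For reference, here is roughly why that is, condensed from
PostgreSQL's src/include/utils/array.h (simplified, but the real struct
has these fields):

    #include <stdint.h>

    typedef uint32_t Oid;           /* as in PostgreSQL */

    /*
     * Every on-disk array value embeds the element type's pg_type OID,
     * which is why those OIDs must stay stable across pg_upgrade.
     */
    typedef struct ArrayType
    {
        int32_t     vl_len_;        /* varlena header */
        int         ndim;           /* number of dimensions */
        int32_t     dataoffset;     /* offset to data, or 0 if no nulls */
        Oid         elemtype;       /* element type's pg_type OID */
    } ArrayType;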

So I think this proposed change is in the safe direction. If
relfilenodes were currently preserved and we wanted to make them not
be preserved, then I think you would be quite right to say "whoa,
whoa, that could be a problem." Indeed it could. If anyone then in the
future wanted to introduce a dependency on them staying the same, they
would have a problem. However, nothing in the server itself can care
about relfilenodes - or anything else - being *different* across a
pg_upgrade. The whole point of pg_upgrade is to make it feel like you
have the same database after you run it as you did before you ran it,
even though under the hood a lot of surgery has been done. Barring
bugs, you can never be sad about there being too LITTLE difference
between the post-upgrade database and the pre-upgrade database.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Thu, Aug 26, 2021 at 01:03:54PM -0400, Stephen Frost wrote:
> > Yes, we're talking about either incremental (or perhaps differential)
> > backup where only the files which are actually different would be backed
> > up.  Just like with PG, I can't provide any complete guarantees that
> > we'd actually be able to make this possible after a major version
> > upgrade with pgBackRest, but it definitely isn't possible *without*
> > this change.  I can't see any reason why we wouldn't be able to do a
> > checksum-based incremental backup though (which would be *much* faster
> > than a regular backup) once this change is made and have that be a
> > reliable and trustworthy backup.  I'd want to think about it more and
> > discuss it with David in some detail before saying if we could maybe
> > perform a timestamp-based incremental backup (without checksum'ing the
> > files, as we do in normal situations), but that would really just be a
> > bonus.
>
> Well, it would be nice to know exactly how it would help pgBackRest if
> that is one of the reasons we are adding this feature.

pgBackRest keeps a manifest entry for every file in the PG data
directory that is backed up, and we identify each file by its filename.
Further, we
calculate a checksum for every file.  If the filenames didn't change
then we'd be able to compare the file in the new cluster against the
file and checksum in the manifest in order to be able to perform the
incremental/differential backup.  We don't store the inodes in the
manifest though, and we don't have any concept of looking at multiple
data directories at the same time or anything like that (which would
also mean that the old data directory would have to be kept around for
that to even work, which seems like a good bit of additional
complication and risk that someone might start up the old cluster by
accident..).

That's how it'd be very helpful to pgBackRest for the filenames to be
preserved across pg_upgrade's.
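
As a rough sketch of the per-file decision (hypothetical types and
names, not pgBackRest's actual code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* one manifest entry from the previous backup */
    typedef struct ManifestEntry
    {
        const char *filename;
        const char *checksum;    /* checksum of the file's contents */
    } ManifestEntry;

    /*
     * With filenames stable across pg_upgrade, a file only needs to be
     * copied when it is new or its checksum no longer matches the
     * manifest entry recorded under the same name.
     */
    bool
    needs_backup(const ManifestEntry *entry,
                 const char *filename, const char *checksum)
    {
        if (entry == NULL || strcmp(entry->filename, filename) != 0)
            return true;                 /* new (or renamed) file */
        return strcmp(entry->checksum, checksum) != 0;
    }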

> > > > > As far as TDE, I haven't seen any concrete plan for that, so why add
> > > > > this code for that reason?
> > > >
> > > > That this would help with TDE (of which there seems little doubt...) is
> > > > an additional benefit to this.  Specifically, taking the existing work
> > > > that's already been done to allow block-by-block encryption and
> > > > adjusting it for AES-XTS and then using the db-dir+relfileno+block
> > > > number as the IV, just like many disk encryption systems do, avoids the
> > > > concerns that were brought up about using LSN for the IV with CTR and
> > > > it's certainly not difficult to do, but it does depend on this change.
> > > > This was all discussed previously and it sure looks like a sensible
> > > > approach to use that mirrors what many other systems already do
> > > > successfully.
> > >
> > > Well, I would think we would not add this for TDE until we were sure
> > > someone was working on adding TDE.
> >
> > That this would help with TDE is what I'd consider an added bonus.
>
> Not if we have no plans to implement TDE, which was my point.  Why not
> wait to see if we are actually going to implement TDE rather than
> adding it now?  It is just so obvious; why do I have to state this?

There's been multiple years of effort put into implementing TDE and I'm
sure hopeful that it continues as I'm trying to put effort into moving
it forward myself.  I'm a bit baffled by the idea that we're just
suddenly going to stop putting effort into TDE as it is brought up time
and time again by clients that I've talked to as one of the few reasons
they haven't moved to PG yet- I can't believe that hasn't been
experienced by folks at other organizations too, I mean, there's people
maintaining forks of PG specifically for TDE ...

Seems like maybe we were both seeing something as obvious to the other
that wasn't actually the case.

> > > > > > make this required, general improved sanity when working with pg_upgrade
> > > > > > is frankly a benefit in its own right too...).  If the additional code
> > > > >
> > > > > How?  I am not aware of any advantage except cosmetic.
> > > >
> > > > Having to resort to matching up inode numbers between the two clusters
> > > > after a pg_upgrade to figure out what files are actually the same
> > > > underneath is a pain that goes beyond just cosmetics imv.  Removing that
> > > > additional level that admins, and developers for that matter, have to go
> > > > through would be a nice improvement on its own.
> > >
> > > OK, I was just not aware anyone did that, since I have never heard anyone
> > > complain about it before.
> >
> > I've certainly done it and I'd be kind of surprised if others haven't,
> > but I've also played a lot with pg_dump in various modes, so perhaps
> > that's not a great representation.  I've definitely had to explain to
> > clients why there's a whole different set of filenames after a
> > pg_upgrade and why that is the case for an 'in place' upgrade before
> > too.
>
> Uh, so I guess I am right that few people have mentioned this in the
> past.  Why did users care about the file names?

This is a bit baffling to me.  Users and admins certainly care about
what files their data is stored in and how to find them.  Covering
the data directory structure is a commonly requested part of
the training that I regularly do for clients.

> > > > > > was a huge burden or even a moderate one then that might be an argument
> > > > > > against, but it hardly sounds like it will be given Robert's thorough
> > > > > > analysis so far and the (admittedly not complete, but not that far from
> > > > > > it based on the DB OID review) proposed patch.
> > > > >
> > > > > I am fine to add it if it is minor, but I want to see the calculus of
> > > > > its value vs complexity, which I have not seen spelled out.
> > > >
> > > > I feel that this, along with the prior discussions, spells it out
> > > > sufficiently given the patch's complexity looks to be reasonably minor
> > > > and very similar to the existing things that pg_upgrade already does.
> > > > Had pg_upgrade done this in the first place, I don't think there would
> > > > have been nearly this amount of discussion about it.
> > >
> > > Well, there is a reason pg_upgrade didn't initially do this --- because
> > > it adds complexity, and potentially makes future changes to pg_upgrade
> > > necessary if the server behavior changes.
> >
> > I have a very hard time seeing what changes might happen in the server
> > in this space that wouldn't have an impact on pg_upgrade, with or
> > without this.
>
> I don't know, but I have to ask since I can't know the future, so any
> "preseration" has to be studied.

We can gain, perhaps, some insight looking into the past and that seems
to indicate that this is certainly a very stable part of the server code
in the first place, which would imply that it's unlikely that there'll
be much need to adjust this code in the future.

> > > I am not saying this change is wrong, but I think the reasons need to be
> > > stated in this thread, rather than just moving forward.
> >
> > Ok, they've been stated and it seems to at least Robert and myself that
> > this is worthwhile to at least continue through to a concluded patch,
> > after which we can contemplate that patch's complexity against these
> > reasons.
>
> OK, that works for me.  What bothers me is that the Desirability of this
> change has not been clearly stated in this thread.

I hope that this email and the many many prior ones have gotten across
the desirability of the change.

Thanks,

Stephen

On Thu, Aug 26, 2021 at 01:20:38PM -0400, Robert Haas wrote:
> So I think this proposed change is in the safe direction. If
> relfilenodes were currently preserved and we wanted to make them not
> be preserved, then I think you would be quite right to say "whoa,
> whoa, that could be a problem." Indeed it could. If anyone then in the
> future wanted to introduce a dependency on them staying the same, they
> would have a problem. However, nothing in the server itself can care
> about relfilenodes - or anything else - being *different* across a
> pg_upgrade. The whole point of pg_upgrade is to make it feel like you
> have the same database after you run it as you did before you ran it,
> even though under the hood a lot of surgery has been done. Barring
> bugs, you can never be sad about there being too LITTLE difference
> between the post-upgrade database and the pre-upgrade database.

Yes, this makes sense, and it is good we have stated the possible
benefits now:

*  pgBackRest
*  pg_upgrade diagnostics
*  TDE (maybe)

We can eventually evaluate the value of this based on those items.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Thu, Aug 26, 2021 at 01:24:46PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Thu, Aug 26, 2021 at 01:03:54PM -0400, Stephen Frost wrote:
> > > Yes, we're talking about either incremental (or perhaps differential)
> > > backup where only the files which are actually different would be backed
> > > up.  Just like with PG, I can't provide any complete guarantees that
> > > we'd actually be able to make this possible after a major version
> > > upgrade with pgBackRest, but it definitely isn't possible *without*
> > > this change.  I can't see any reason why we wouldn't be able to do a
> > > checksum-based incremental backup though (which would be *much* faster
> > > than a regular backup) once this change is made and have that be a
> > > reliable and trustworthy backup.  I'd want to think about it more and
> > > discuss it with David in some detail before saying if we could maybe
> > > perform a timestamp-based incremental backup (without checksum'ing the
> > > files, as we do in normal situations), but that would really just be a
> > > bonus.
> > 
> > Well, it would be nice to know exactly how it would help pgBackRest if
> > that is one of the reasons we are adding this feature.
> 
> pgBackRest keeps a manifest entry for every file in the PG data
> directory that is backed up, and we identify each file by its filename.
> Further, we
> calculate a checksum for every file.  If the filenames didn't change
> then we'd be able to compare the file in the new cluster against the
> file and checksum in the manifest in order to be able to perform the
> incremental/differential backup.  We don't store the inodes in the
> manifest though, and we don't have any concept of looking at multiple
> data directories at the same time or anything like that (which would
> also mean that the old data directory would have to be kept around for
> that to even work, which seems like a good bit of additional
> complication and risk that someone might start up the old cluster by
> accident..).
> 
> That's how it'd be very helpful to pgBackRest for the filenames to be
> preserved across pg_upgrade's.

OK, that is clear.

> > > > > > As far as TDE, I haven't seen any concrete plan for that, so why add
> > > > > > this code for that reason?
> > > > > 
> > > > > That this would help with TDE (of which there seems little doubt...) is
> > > > > an additional benefit to this.  Specifically, taking the existing work
> > > > > that's already been done to allow block-by-block encryption and
> > > > > adjusting it for AES-XTS and then using the db-dir+relfileno+block
> > > > > number as the IV, just like many disk encryption systems do, avoids the
> > > > > concerns that were brought up about using LSN for the IV with CTR and
> > > > > it's certainly not difficult to do, but it does depend on this change.
> > > > > This was all discussed previously and it sure looks like a sensible
> > > > > approach to use that mirrors what many other systems already do
> > > > > successfully.
> > > > 
> > > > Well, I would think we would not add this for TDE until we were sure
> > > > someone was working on adding TDE.
> > > 
> > > That this would help with TDE is what I'd consider an added bonus.
> > 
> > Not if we have no plans to implement TDE, which was my point.  Why not
> > wait to see if we are actually going to implement TDE rather than
> > adding it now?  It is just so obvious; why do I have to state this?
> 
> There's been multiple years of effort put into implementing TDE and I'm
> sure hopeful that it continues as I'm trying to put effort into moving
> it forward myself.  I'm a bit baffled by the idea that we're just

Well, this is the first time I am hearing this publicly.

> suddenly going to stop putting effort into TDE as it is brought up time
> and time again by clients that I've talked to as one of the few reasons
> they haven't moved to PG yet- I can't believe that hasn't been
> experienced by folks at other organizations too, I mean, there's people
> maintaining forks of PG specifically for TDE ...

Agreed.

> > > I've certainly done it and I'd be kind of surprised if others haven't,
> > > but I've also played a lot with pg_dump in various modes, so perhaps
> > > that's not a great representation.  I've definitely had to explain to
> > > clients why there's a whole different set of filenames after a
> > > pg_upgrade and why that is the case for an 'in place' upgrade before
> > > too.
> > 
> > Uh, so I guess I am right that few people have mentioned this in the
> > past.  Why did users care about the file names?
> 
> This is a bit baffling to me.  Users and admins certainly care about
> what files their data is stored in and how to find them.  Covering
> the data directory structure is a commonly requested part of
> the training that I regularly do for clients.

I just never thought people cared about the file names, since I have
never heard a complaint about how pg_upgrade works all these years.

> > > I have a very hard time seeing what changes might happen in the server
> > > in this space that wouldn't have an impact on pg_upgrade, with or
> > > without this.
> > 
> > I don't know, but I have to ask since I can't know the future, so any
> > "preseration" has to be studied.
> 
> We can gain, perhaps, some insight looking into the past and that seems
> to indicate that this is certainly a very stable part of the server code
> in the first place, which would imply that it's unlikely that there'll
> be much need to adjust this code in the future.

Good, I had to ask.

> > > > I am not saying this change is wrong, but I think the reasons need to be
> > > > stated in this thread, rather than just moving forward.
> > > 
> > > Ok, they've been stated and it seems to at least Robert and myself that
> > > this is worthwhile to at least continue through to a concluded patch,
> > > after which we can contemplate that patch's complexity against these
> > > reasons.
> > 
> > OK, that works for me.  What bothers me is that the Desirability of this
> > change has not been clearly stated in this thread.
> 
> I hope that this email and the many many prior ones have gotten across
> the desirability of the change.

Yes, I think we are in a better position now to evaluate this.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Sasasu
Date:
Hi, community,

It looks like we are still considering AES-CBC, AES-XTS, and
AES-GCM(-SIV). I want to mention something we haven't thought about.

For AES-CBC, the IV must not be predictable. I think the LSN, or
HASH(LSN, block number, or similar), is predictable. There are many
CVEs related to AES-CBC with a predictable IV.

    https://cwe.mitre.org/data/definitions/329.html

For AES-XTS, using the block number or any other fixed value as the
tweak still has weaknesses similar to IV reuse (as in CBC, not GCM):
the attacker can decrypt a block if he knows some plaintext of that
block. In LUKS/BitLocker/hardware-based solutions, the physical
location is not available to the user: the filesystem runs in kernel
space, and those systems do not encrypt while the filesystem is
allocating a data block. But in PostgreSQL, the attacker can capture
an encrypted all-zero page in `mdextend`, and with this, the attacker
can decode the ciphertext of all data subsequently written to that
block.

For AES-GCM, a predictable IV is fine. I think we can decrypt and
re-encrypt the user data in pg_upgrade; this would allow us to use the
relfilenode OID + block number as the nonce.


Re: storing an explicit nonce

From
Sasasu
Date:

On 2021/9/5 at 10:51 PM, Sasasu wrote:
>
> For AES-GCM, a predictable IV is fine. I think we can decrypt and
> re-encrypt the user data in pg_upgrade; this would allow us to use the
> relfilenode OID + block number as the nonce.

I mean relfilenode OID + block number + some counter, for the heap
table IV.

On Tue, Aug 24, 2021 at 2:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
> It's pretty clear from the discussion, I think, that the database OID
> one is going to need rework to be considered.
>
> Regarding the other one:
>
> - The comment in binary_upgrade_set_pg_class_oids() is still not
> accurate. You removed the sentence which says "Indexes cannot have
> toast tables, so we need not make this probe in the index code path"
> but the immediately preceding sentence is still inaccurate in at least
> two ways. First, it only talks about tables, but the code now applies
> to indexes. Second, it only talks about OIDs, but now also deals with
> relfilenodes. It's really important to fully update every comment that
> might be affected by your changes!

The comment is updated.

> - The SQL query in that function isn't completely correct. There is a
> left join from pg_class to pg_index whose ON clause includes
> "c.reltoastrelid = i.indrelid AND i.indisvalid." The reason it's
> like that is because it is possible, in corner cases, for a TOAST
> table to have multiple TOAST indexes. I forget exactly how that
> happens, but I think it might be like if a REINDEX CONCURRENTLY on the
> TOAST table fails midway through, or something of that sort. Now if
> that happens, the LEFT JOIN you added is going to cause the output to
> contain multiple rows, because you didn't replicate the i.indisvalid
> condition into that ON clause. And then it will fail. Apparently we
> don't have a pg_upgrade test case for this scenario; we probably
> should. Actually what I think would be even better than putting
> i.indisvalid into that ON clause would be to join off of i.indrelid
> rather than c.reltoastrelid.

The SQL query will not result in duplicate rows because the first join
filters out any duplicates via the 'i.indisvalid' condition in its ON
clause. The result of the first join is then left-joined with pg_class,
and pg_class will not have duplicate rows for a given OID.

> - The code that decodes the various columns of this query does so in a
> slightly different order than the query itself. It would be better to
> make it match. Perhaps put relkind first in both cases. I might also
> think about trying to make the column naming a bit more consistent,
> e.g. relkind, relfilenode, toast_oid, toast_relfilenode,
> toast_index_oid, toast_index_relfilenode.

Fixed.

> - In heap_create(), the wording of the error messages is not quite
> consistent. You have "relfilenode value not set when in binary upgrade
> mode", "toast relfilenode value not set when in binary upgrade mode",
> and "pg_class index relfilenode value not set when in binary upgrade
> mode". Why does the last one mention pg_class when the other two
> don't?

The error messages are now consistent. This code chunk was moved to a
different place as part of another review comment fix.

> - The code in heap_create() now has no comments whatsoever, which is a
> shame, because it's actually kind of a tricky bit of logic. Someone
> might wonder why we override the relfilenode inside that function
> instead of doing it at the same places where we absorb
> binary_upgrade_next_{heap,index,toast}_pg_class_oid and the passing
> down the relfilenode. I think the answer is that passing down the
> relfilenode from the caller would result in storage not actually being
> created, whereas in this case we want it to be created but just with
> the value we specify, and the reason we want that is because we need
> later DDL that happens after these statements but before the old
> cluster's relations are moved over to execute successfully, which it
> won't if the storage is altogether absent.

> However, that raises the question of whether this patch has even got
> the basic design right. Maybe we ought to actually be absorbing the
> relfilenode setting at the same places where we're doing so for the
> OID, and then passing an additional parameter to heap_create() like
> bool suppress_storage or something like that. Maybe, taking it even
> further, we ought to be changing the signatures of
> binary_upgrade_next_heap_pg_class_oid and friends to be two-argument
> functions, and pass down the OID and the relfilenode in the same call,
> rather than calling two separate functions. I'm not so much concerned
> about the cost of calling two functions as the potential for
> confusion. I'm not honestly sure that either of these changes are the
> right thing to do, but I am pretty strongly inclined to do at least
> the first part - trying to absorb reloid and relfilenode in the same
> places. If we're not going to do that we certainly need to explain why
> we're doing it the way we are in the comments.

As per your suggestion, the reloid and relfilenode are absorbed in the
same place. An additional parameter called 'suppress_storage' is passed
to heap_create(), which indicates whether or not to create the storage
when the caller has passed a valid relfilenode.

I did not make the change to set the OID and relfilenode in the same
call. I feel the uniformity w.r.t. the other function signatures in
pg_upgrade_support.c would be lost, because currently each function
sets only one attribute.
Also, renaming the applicable functions to convey that they set both
the OID and the relfilenode would make the names even longer. We could
opt to leave the relfilenode out of the function name and instead use
a generic name like binary_upgrade_set_next_xxx_pg_class_oid(), but
then we would end up with some functions that set two attributes and
some that set one.

> It's not really this patch's fault, but it would sure be nice if we
> had some better testing for this area. Suppose this patch somehow
> changed nothing from the present behavior. How would we know? Or
> suppose it managed to somehow set all the relfilenodes in the new
> cluster to random values rather than the intended one? There's no
> automated testing that would catch any of that, and it's not obvious
> how it could be added to test.sh. I suppose what we really need to do
> at some point is rewrite that as a TAP test, but that seems like a
> separate project from this patch.

I have manually verified the table, index, TOAST table, and TOAST
index relfilenodes and the DB OIDs in the old and new clusters, and it
is working as expected.

I have also attached the patch to preserve the DB OID. As discussed,
template0 will be created with a fixed OID during initdb. I am using
OID 2 for template0. Even though OID 2 is already in use for the
'pg_am' catalog, I see no harm in using it for the template0 DB,
because an OID doesn't have to be unique across the database - it only
has to be unique within a particular catalog table. Kindly let me know
if I am missing something.
Apparently, if we did decide to pick an unused OID for template0, I
see a challenge in removing that OID from the unused-OID list. I could
not come up with a feasible solution for handling that.

Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com

On Wed, Sep 22, 2021 at 3:07 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > - The comment in binary_upgrade_set_pg_class_oids() is still not
> > accurate. You removed the sentence which says "Indexes cannot have
> > toast tables, so we need not make this probe in the index code path"
> > but the immediately preceding sentence is still inaccurate in at least
> > two ways. First, it only talks about tables, but the code now applies
> > to indexes. Second, it only talks about OIDs, but now also deals with
> > relfilenodes. It's really important to fully update every comment that
> > might be affected by your changes!
>
> The comment is updated.

Looks good.

> The SQL query will not result in duplicate rows because the first join
> filters out any duplicates via the 'i.indisvalid' condition in its ON
> clause. The result of the first join is then left-joined with pg_class,
> and pg_class will not have duplicate rows for a given OID.

Oh, you're right. My mistake.

> As per your suggestion, the reloid and relfilenode are absorbed in the
> same place. An additional parameter called 'suppress_storage' is passed
> to heap_create(), which indicates whether or not to create the storage
> when the caller has passed a valid relfilenode.

I find it confusing to have both suppress_storage and create_storage
with one basically as the negation of the other. To avoid that sort of
thing I generally have a policy that variables and options should say
whether something should happen, rather than whether it should be
prevented from happening. Sometimes there are good reasons - such as
strong existing precedent - to deviate from this practice but I think
it's good to follow when possible. So my proposal is to always have
create_storage and never suppress_storage, and if some function needs
to adjust the value of create_storage that was passed to it then OK.
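
To illustrate with made-up names, the difference shows up at the call
sites:

    #include <stdbool.h>
    #include <stdio.h>

    /* positive flag: the parameter states what should happen */
    void
    make_relation(bool create_storage)
    {
        printf(create_storage ? "storage created\n"
                              : "storage skipped\n");
    }

    int
    main(void)
    {
        make_relation(true);     /* reads directly */
        make_relation(false);    /* still clear; with suppress_storage
                                  * the caller would have to negate */
        return 0;
    }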

> I did not make the change to set the OID and relfilenode in the same
> call. I feel the uniformity w.r.t. the other function signatures in
> pg_upgrade_support.c would be lost, because currently each function
> sets only one attribute.
> Also, renaming the applicable functions to convey that they set both
> the OID and the relfilenode would make the names even longer. We could
> opt to leave the relfilenode out of the function name and instead use
> a generic name like binary_upgrade_set_next_xxx_pg_class_oid(), but
> then we would end up with some functions that set two attributes and
> some that set one.

OK.

> I have also attached the patch to preserve the DB OID. As discussed,
> template0 will be created with a fixed OID during initdb. I am using
> OID 2 for template0. Even though OID 2 is already in use for the
> 'pg_am' catalog, I see no harm in using it for the template0 DB,
> because an OID doesn't have to be unique across the database - it only
> has to be unique within a particular catalog table. Kindly let me know
> if I am missing something.
> Apparently, if we did decide to pick an unused OID for template0, I
> see a challenge in removing that OID from the unused-OID list. I could
> not come up with a feasible solution for handling that.

You are correct that there is no intrinsic reason why the same OID
can't be used in various different catalogs. We have a policy of not
doing that, though; I'm not clear on the reason. Maybe it'd be OK to
deviate from that policy here, but another option would be to simply
change the unused_oids script (and maybe some of the others). It
already has:

my $FirstGenbkiObjectId =
  Catalog::FindDefinedSymbol('access/transam.h', '..', 'FirstGenbkiObjectId');
push @{$oids}, $FirstGenbkiObjectId;

Presumably it could be easily adapted to push the value of some other
defined symbol into @{$oids} also, thus making that OID in effect
used.
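
For instance, something like this (a sketch only - 'Template0ObjectId' is a
hypothetical symbol that the patch would have to define in transam.h):

my $Template0ObjectId =
  Catalog::FindDefinedSymbol('access/transam.h', '..', 'Template0ObjectId');
push @{$oids}, $Template0ObjectId;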

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Sun, Sep  5, 2021 at 10:51:42PM +0800, Sasasu wrote:
> Hi, community,
> 
> It looks like we are still considering AES-CBC, AES-XTS, and AES-GCM(-SIV).
> I want to say something that we haven't thought about.
> 
> For AES-CBC, the IV should not be predictable. I think LSN or HASH(LSN,
> block number or something) is predictable. There are many CVEs related to
> AES-CBC with a predictable IV.

The LSN would change every time the page is modified, so while the LSN
could be predicted, it would not be reused.  However, there is currently
no work being done on page-level encryption of Postgres.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Ants Aasma
Date:
On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote:
On Sun, Sep  5, 2021 at 10:51:42PM +0800, Sasasu wrote:
> Hi, community,
>
> It looks like we are still considering AES-CBC, AES-XTS, and AES-GCM(-SIV).
> I want to say something that we haven't thought about.
>
> For AES-CBC, the IV should not be predictable. I think LSN or HASH(LSN,
> block number or something) is predictable. There are many CVEs related to
> AES-CBC with a predictable IV.

The LSN would change every time the page is modified, so while the LSN
could be predicted, it would not be reused.  However, there is currently
no work being done on page-level encryption of Postgres.

We are still working on our TDE patch. Right now the focus is on
refactoring temporary file access to make the TDE patch itself smaller.
Reconsidering encryption mode choices given concerns expressed is next.
Currently a viable option seems to be AES-XTS with LSN added into the
IV. XTS doesn't have an issue with predictable IV and isn't totally
broken in case of IV reuse.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
On Fri, Sep 24, 2021 at 12:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Sep 22, 2021 at 3:07 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > > - The comment in binary_upgrade_set_pg_class_oids() is still not
> > > accurate. You removed the sentence which says "Indexes cannot have
> > > toast tables, so we need not make this probe in the index code path"
> > > but the immediately preceding sentence is still inaccurate in at least
> > > two ways. First, it only talks about tables, but the code now applies
> > > to indexes. Second, it only talks about OIDs, but now also deals with
> > > relfilenodes. It's really important to fully update every comment that
> > > might be affected by your changes!
> >
> > The comment is updated.
>
> Looks good.
>
> > The SQL query will not result in duplicate rows because the first join
> > filters out any duplicate rows with the ON clause condition 'i.indisvalid'.
> > The result of the first join is further left joined with pg_class, and
> > pg_class will not have duplicate rows for a given oid.
>
> Oh, you're right. My mistake.
>
> > As per your suggestion, reloid and relfilenode are absorbed in the same place.
> > An additional parameter called 'suppress_storage' is passed to heap_create()
> > which indicates whether or not to create the storage when the caller
> > passed a valid relfilenode.
>
> I find it confusing to have both suppress_storage and create_storage
> with one basically as the negation of the other. To avoid that sort of
> thing I generally have a policy that variables and options should say
> whether something should happen, rather than whether it should be
> prevented from happening. Sometimes there are good reasons - such as
> strong existing precedent - to deviate from this practice but I think
> it's good to follow when possible. So my proposal is to always have
> create_storage and never suppress_storage, and if some function needs
> to adjust the value of create_storage that was passed to it then OK.

Sure, I agree. In the latest patch, only 'create_storage' is used.

> > I did not make the changes to set the oid and relfilenode in the same call.
> > I feel the uniformity w.r.t the other function signatures in
> > pg_upgrade_support.c will be lost because currently each function sets
> > only one attribute.
> > Also, renaming the applicable function names to represent that they
> > set both oid and relfilenode will make the function name even longer.
> > We may opt to not include the relfilenode in the function name instead
> > use a generic name like binary_upgrade_set_next_xxx_pg_class_oid() but
> > then
> > we will end up with some functions that set two attributes and some
> > functions that set one attribute.
>
> OK.
>
> > I have also attached the patch to preserve the DB oid. As discussed,
> > template0 will be created with a fixed oid during initdb. I am using
> > OID 2 for template0. Even though oid 2 is already in use for the
> > 'pg_am' catalog I see no harm in using it for template0 DB because oid
> > doesn’t have to be unique across the database - it has to be unique
> > for the particular catalog table. Kindly let me know if I am missing
> > something?
> > Apparently, if we did decide to pick an unused oid for template0 then
> > I see a challenge in removing that oid from the unused oid list. I
> > could not come up with a feasible solution for handling it.
>
> You are correct that there is no intrinsic reason why the same OID
> can't be used in various different catalogs. We have a policy of not
> doing that, though; I'm not clear on the reason. Maybe it'd be OK to
> deviate from that policy here, but another option would be to simply
> change the unused_oids script (and maybe some of the others). It
> already has:
>
> my $FirstGenbkiObjectId =
>   Catalog::FindDefinedSymbol('access/transam.h', '..', 'FirstGenbkiObjectId');
> push @{$oids}, $FirstGenbkiObjectId;
>
> Presumably it could be easily adapted to push the value of some other
> defined symbol into @{$oids} also, thus making that OID in effect
> used.

Thanks for the inputs, Robert. In the v4 patch, an unused OID (i.e., 4)
is fixed for template0, and the same is removed from the unused OID
list.

In addition to the review comment fixes, I have removed some code that
is no longer needed/doesn't make sense since we preserve the OIDs.

Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote:
> 
>     On Sun, Sep  5, 2021 at 10:51:42PM +0800, Sasasu wrote:
>     > Hi, community,
>     >
>     > It looks like we are still considering AES-CBC, AES-XTS, and AES-GCM(-SIV).
>     > I want to say something that we haven't thought about.
>     >
>     > For AES-CBC, the IV should not be predictable. I think LSN or HASH(LSN,
>     > block number or something) is predictable. There are many CVEs related to
>     > AES-CBC with a predictable IV.
> 
>     The LSN would change every time the page is modified, so while the LSN
>     could be predicted, it would not be reused.  However, there is currently
>     no work being done on page-level encryption of Postgres.
> 
> 
> We are still working on our TDE patch. Right now the focus is on refactoring
> temporary file access to make the TDE patch itself smaller. Reconsidering
> encryption mode choices given concerns expressed is next. Currently a viable
> option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
> issue with predictable IV and isn't totally broken in case of IV reuse.

Sounds great, thanks!

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Ants Aasma (ants@cybertec.at) wrote:
> On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote:
> > On Sun, Sep  5, 2021 at 10:51:42PM +0800, Sasasu wrote:
> > > It looks like we are still considering AES-CBC, AES-XTS, and AES-GCM(-SIV).
> > > I want to say something that we haven't thought about.
> > >
> > > For AES-CBC, the IV should not be predictable. I think LSN or HASH(LSN,
> > > block number or something) is predictable. There are many CVEs related to
> > > AES-CBC with a predictable IV.
> >
> > The LSN would change every time the page is modified, so while the LSN
> > could be predicted, it would not be reused.  However, there is currently
> > no work being done on page-level encryption of Postgres.
> >
>
> We are still working on our TDE patch. Right now the focus is on
> refactoring temporary file access to make the TDE patch itself smaller.
> Reconsidering encryption mode choices given concerns expressed is next.
> Currently a viable option seems to be AES-XTS with LSN added into the IV.
> XTS doesn't have an issue with predictable IV and isn't totally broken in
> case of IV reuse.

Probably worth a distinct thread to discuss this, just to be clear.

I do want to point out, as I think I did when we discussed this but want
to be sure it's also captured here- I don't think that temporary file
access should be forced to be block-oriented when it's naturally (in
very many cases) sequential.  To that point, I'm thinking that we need a
temp file access API through which various systems work that's
sequential and therefore relatively similar to the existing glibc, et
al, APIs, but by going through our own internal API (which more
consistently works with the glibc APIs and provides better error
reporting in the event of issues, etc) we can then extend it to work as
an encrypted stream instead.
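
To make that a bit more concrete, the rough shape I have in mind is
something like this (every name below is invented for illustration;
nothing like it exists today):

    #include <stdbool.h>
    #include <stddef.h>

    /* opaque handle for a sequential, optionally encrypted, temp file */
    typedef struct TempStream TempStream;

    extern TempStream *temp_stream_open(const char *path, bool encrypt);
    extern void temp_stream_close(TempStream *ts);

    /* sequential I/O; errors are reported promptly instead of dropped */
    extern void temp_stream_write(TempStream *ts, const void *data, size_t len);
    extern size_t temp_stream_read(TempStream *ts, void *buf, size_t len);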

Happy to discuss in more detail if you'd like but wanted to just bring
up this particular point, in case it got lost.

Thanks!

Stephen

Attachment

Re: storing an explicit nonce

From
Robert Haas
Date:
On Mon, Oct 4, 2021 at 10:00 PM Stephen Frost <sfrost@snowman.net> wrote:
> I do want to point out, as I think I did when we discussed this but want
> to be sure it's also captured here- I don't think that temporary file
> access should be forced to be block-oriented when it's naturally (in
> very many cases) sequential.  To that point, I'm thinking that we need a
> temp file access API through which various systems work that's
> sequential and therefore relatively similar to the existing glibc, et
> al, APIs, but by going through our own internal API (which more
> consistently works with the glibc APIs and provides better error
> reporting in the event of issues, etc) we can then extend it to work as
> an encrypted stream instead.

Regarding this, would it use block-oriented access on the backend?

I agree that we need a better API layer through which all filesystem
access is routed. One of the notable weaknesses of the Cybertec patch
is that it has too large a code footprint,

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Robert Haas
Date:
On Tue, Oct 5, 2021 at 1:24 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Oct 4, 2021 at 10:00 PM Stephen Frost <sfrost@snowman.net> wrote:
> > I do want to point out, as I think I did when we discussed this but want
> > to be sure it's also captured here- I don't think that temporary file
> > access should be forced to be block-oriented when it's naturally (in
> > very many cases) sequential.  To that point, I'm thinking that we need a
> > temp file access API through which various systems work that's
> > sequential and therefore relatively similar to the existing glibc, et
> > al, APIs, but by going through our own internal API (which more
> > consistently works with the glibc APIs and provides better error
> > reporting in the event of issues, etc) we can then extend it to work as
> > an encrypted stream instead.
>
> Regarding this, would it use block-oriented access on the backend?
>
> I agree that we need a better API layer through which all filesystem
> access is routed. One of the notable weaknesses of the Cybertec patch
> is that it has too large a code footprint,

(sent too soon)

...precisely because PostgreSQL doesn't have such a layer.

But I think ultimately we do want to encrypt and decrypt in blocks, so
if we create such a layer, it should expose byte-oriented APIs but
combine the actual I/Os somehow. That's also good for cutting down the
number of system calls, which is a benefit unto itself.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Antonin Houska
Date:
Robert Haas <robertmhaas@gmail.com> wrote:

> On Tue, Oct 5, 2021 at 1:24 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Mon, Oct 4, 2021 at 10:00 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > I do want to point out, as I think I did when we discussed this but want
> > > to be sure it's also captured here- I don't think that temporary file
> > > access should be forced to be block-oriented when it's naturally (in
> > > very many cases) sequential.  To that point, I'm thinking that we need a
> > > temp file access API through which various systems work that's
> > > sequential and therefore relatively similar to the existing glibc, et
> > > al, APIs, but by going through our own internal API (which more
> > > consistently works with the glibc APIs and provides better error
> > > reporting in the event of issues, etc) we can then extend it to work as
> > > an encrypted stream instead.
> >
> > Regarding this, would it use block-oriented access on the backend?
> >
> > I agree that we need a better API layer through which all filesystem
> > access is routed. One of the notable weaknesses of the Cybertec patch
> > is that it has too large a code footprint,
> 
> (sent too soon)
> 
> ...precisely because PostgreSQL doesn't have such a layer.

I'm just trying to make our changes to buffile.c less invasive. Or do you mean
that this module should be reworked regardless of the encryption?

> But I think ultimately we do want to encrypt and decrypt in blocks, so
> if we create such a layer, it should expose byte-oriented APIs but
> combine the actual I/Os somehow. That's also good for cutting down the
> number of system calls, which is a benefit unto itself.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote:
> We are still working on our TDE patch. Right now the focus is on refactoring
> temporary file access to make the TDE patch itself smaller. Reconsidering
> encryption mode choices given concerns expressed is next. Currently a viable
> option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
> issue with predictable IV and isn't totally broken in case of IV reuse.

Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
16-byte blocks affect later blocks, meaning that hint bit changes would
also affect later blocks.  I think this means we would need to write WAL
full page images for hint bit changes to avoid torn pages.  Right now
hint bit (single bit) changes can be lost without causing torn pages. 
This was another of the advantages of using a stream cipher like CTR.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, Oct  5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote:
> On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote:
> > We are still working on our TDE patch. Right now the focus is on refactoring
> > temporary file access to make the TDE patch itself smaller. Reconsidering
> > encryption mode choices given concerns expressed is next. Currently a viable
> > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
> > issue with predictable IV and isn't totally broken in case of IV reuse.
> 
> Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> 16-byte blocks affect later blocks, meaning that hint bit changes would
> also affect later blocks.  I think this means we would need to write WAL
> full page images for hint bit changes to avoid torn pages.  Right now
> hint bit (single bit) changes can be lost without causing torn pages. 
> This was another of the advantages of using a stream cipher like CTR.

Another problem caused by block mode ciphers is that to use the LSN as
part of the nonce, the LSN must not be encrypted, but you then have to
find a 16-byte block in the page that you don't need to encrypt.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, Oct  5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote:
> On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote:
> > We are still working on our TDE patch. Right now the focus is on refactoring
> > temporary file access to make the TDE patch itself smaller. Reconsidering
> > encryption mode choices given concerns expressed is next. Currently a viable
> > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
> > issue with predictable IV and isn't totally broken in case of IV reuse.
> 
> Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> 16-byte blocks affect later blocks, meaning that hint bit changes would
> also affect later blocks.  I think this means we would need to write WAL
> full page images for hint bit changes to avoid torn pages.  Right now
> hint bit (single bit) changes can be lost without causing torn pages. 
> This was another of the advantages of using a stream cipher like CTR.

The above text isn't very clear.  What I am saying is that currently
torn pages can be tolerated by hint bit writes because only a single
byte is changing.  If we use a block cipher like AES-XTS, later 16-byte
encrypted blocks would be changed by hint bit changes, meaning torn
pages could not be tolerated.  This means we would have to use full page
writes for hint bit changes, perhaps making this feature have
unacceptable performance overhead.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Robert Haas
Date:
On Tue, Oct 5, 2021 at 1:55 PM Antonin Houska <ah@cybertec.at> wrote:
> I'm just trying to make our changes to buffile.c less invasive. Or do you mean
> that this module should be reworked regardless of the encryption?

I wasn't thinking of buffile.c specifically. I think improving that
might be a really good idea, although I'm not 100% sure I know what
that would look like. I was thinking that it's unfortunate that there
are so many different ways that I/O happens overall. Like, there are
direct write() and pg_pwrite() calls in various places, for example.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Robert Haas
Date:
On Tue, Oct 5, 2021 at 4:29 PM Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote:
> > We are still working on our TDE patch. Right now the focus is on refactoring
> > temporary file access to make the TDE patch itself smaller. Reconsidering
> > encryption mode choices given concerns expressed is next. Currently a viable
> > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
> > issue with predictable IV and isn't totally broken in case of IV reuse.
>
> Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> 16-byte blocks affect later blocks, meaning that hint bit changes would
> also affect later blocks.  I think this means we would need to write WAL
> full page images for hint bit changes to avoid torn pages.  Right now
> hint bit (single bit) changes can be lost without causing torn pages.
> This was another of the advantages of using a stream cipher like CTR.

This seems wrong to me. CTR requires that you not reuse the IV. If you
re-encrypt the page with a different IV, torn pages are a problem. If
you re-encrypt it with the same IV, then it's not secure any more.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Robert Haas
Date:
On Wed, Oct 6, 2021 at 9:35 AM Bruce Momjian <bruce@momjian.us> wrote:
> The above text isn't very clear.  What I am saying is that currently
> torn pages can be tolerated by hint bit writes because only a single
> byte is changing.  If we use a block cipher like AES-XTS, later 16-byte
> encrypted blocks would be changed by hint bit changes, meaning torn
> pages could not be tolerated.  This means we would have to use full page
> writes for hint bit changes, perhaps making this feature have
> unacceptable performance overhead.

Actually, I think this would have *less* performance overhead than your patch.

If you enable checksums or set wal_log_hints=on, then you might incur
some write-ahead log records that would otherwise be avoided, and
those records will include full page images. This can happen once per
page per checkpoint cycle. However, if the first modification to a
particular page within a given checkpoint cycle is a regular
WAL-logged operation rather than a hint bit change, then the extra WAL
record and full-page image are not needed so the overhead is zero.
Also, if the first modification is a hint bit change, and then the
page is evicted, prompting a full page write, but a regular WAL-logged
operation occurs later within the same checkpoint, the later operation
no longer needs a full page write. So you still paid the cost of an
extra WAL record, but you didn't pay the cost of an extra full page
write. In other words, enabling checksums or turning wal_log_hints=on
has a relatively low cost except when you have pages that incur only
hint-type changes, and no regular changes, within the course of a
single checkpoint cycle.

On the other hand, in order to avoid IV reuse, your patch needed to
bump the page LSN for every change, or at least for every eviction.
That means you could potentially incur the overhead of an extra full
page write multiple times per checkpoint cycle, and even if there were
non-hint changes to that page in the same checkpoint cycle. Now you
could say, well, let's not bump the page LSN for every hint-type
change, and then your patch would have lower overhead than an approach
based on XTS, but I think that also loses a ton of security, because
now you're reusing IVs with an encryption system that is documented
not to tolerate the reuse of IVs.

I'm not here to try to pretend that encryption is going to be cheap. I
just don't believe this particular argument about why AES-XTS should
be more expensive.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, Oct 5, 2021 at 1:24 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Mon, Oct 4, 2021 at 10:00 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > I do want to point out, as I think I did when we discussed this but want
> > > to be sure it's also captured here- I don't think that temporary file
> > > access should be forced to be block-oriented when it's naturally (in
> > > very many cases) sequential.  To that point, I'm thinking that we need a
> > > temp file access API through which various systems work that's
> > > sequential and therefore relatively similar to the existing glibc, et
> > > al, APIs, but by going through our own internal API (which more
> > > consistently works with the glibc APIs and provides better error
> > > reporting in the event of issues, etc) we can then extend it to work as
> > > an encrypted stream instead.
> >
> > Regarding this, would it use block-oriented access on the backend?
> >
> > I agree that we need a better API layer through which all filesystem
> > access is routed. One of the notable weaknesses of the Cybertec patch
> > is that it has too large a code footprint,
>
> (sent too soon)
>
> ...precisely because PostgreSQL doesn't have such a layer.
>
> But I think ultimately we do want to encrypt and decrypt in blocks, so
> if we create such a layer, it should expose byte-oriented APIs but
> combine the actual I/Os somehow. That's also good for cutting down the
> number of system calls, which is a benefit unto itself.

I have to say that this seems to be moving the goalposts quite far down
the road from just developing a layer to allow for sequential reading
and writing to files that allows us to get away from bare write() calls.
While I agree that we want to encrypt/decrypt in blocks when we're
working with our own blocks, I don't know that it's really necessary to
do for these kinds of use cases.  I appreciate the goal of reducing the
number of syscalls though.

Part of my concern here is that a patch which changes all of our
existing sequential access using write() and friends to work in a block
manner instead ends up probably being just as big and invasive as those
parts of the TDE patch which did the same, and it isn't actually
necessary as there are stream ciphers which we could certainly use for,
well, stream-based access patterns.  No, that doesn't improve the
situation around the number of syscalls, but it also doesn't make that
any worse than it is today.

Perhaps this is all too meta and we need to work through some specific
ideas around just what this would look like.  In particular, thinking
about what this API would look like and how it would be used by
reorderbuffer.c, which builds up changes in memory and then does a bare
write() call, seems like a main use-case to consider.  The gist there
being "can we come up with an API to do all these things that doesn't
require entirely rewriting ReorderBufferSerializeChange()?"

Seems like it'd be easier to achieve that by having something that looks
very close to how write() looks, but just happens to have the option to
run the data through a stream cipher and maybe does better error
handling for us.  Making that layer also do block-based access to the
files underneath seems like a much larger effort that, sure, may make
some things better too but if we could do that with the same API then it
could also be done later if someone's interested in that.
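
As a minimal sketch of what I'm describing (the names here are hypothetical,
and I'm using OpenSSL's EVP interface with AES-CTR standing in for "a stream
cipher"; error handling is reduced to returning -1):

    #include <unistd.h>
    #include <openssl/evp.h>

    typedef struct EncFile
    {
        int         fd;
        EVP_CIPHER_CTX *ctx;    /* NULL means plaintext pass-through */
    } EncFile;

    /* sequential write; returns 0 on success, -1 with errno set */
    static int
    enc_write(EncFile *f, const void *buf, int len)
    {
        const unsigned char *src = buf;
        unsigned char tmp[8192];

        while (len > 0)
        {
            int         chunk = len > (int) sizeof(tmp) ? (int) sizeof(tmp) : len;
            int         outlen = chunk;
            const unsigned char *out = src;

            if (f->ctx)
            {
                /* CTR is a stream mode: output length equals input length */
                if (EVP_EncryptUpdate(f->ctx, tmp, &outlen, src, chunk) != 1)
                    return -1;
                out = tmp;
            }
            for (int done = 0; done < outlen;)
            {
                ssize_t     n = write(f->fd, out + done, outlen - done);

                if (n < 0)
                    return -1;  /* caller can report errno right away */
                done += n;
            }
            src += chunk;
            len -= chunk;
        }
        return 0;
    }

The caller's code looks just like a write() call whether or not encryption
is enabled, which is the property I'm after.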

Thanks,

Stephen

Attachment

Re: storing an explicit nonce

From
Robert Haas
Date:
On Wed, Oct 6, 2021 at 11:22 AM Stephen Frost <sfrost@snowman.net> wrote:
> Seems like it'd be easier to achieve that by having something that looks
> very close to how write() looks, but just happens to have the option to
> run the data through a stream cipher and maybe does better error
> handling for us.  Making that layer also do block-based access to the
> files underneath seems like a much larger effort that, sure, may make
> some things better too but if we could do that with the same API then it
> could also be done later if someone's interested in that.

Yeah, it's possible that is the best option, but I'm not really
convinced. I think the places that are doing I/O in small chunks are
pretty questionable. Like look at this code from pgstat.c, with block
comments omitted for brevity:

    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
    (void) rc;                  /* we'll check for error with ferror */
    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
    (void) rc;                  /* we'll check for error with ferror */
    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
    (void) rc;                  /* we'll check for error with ferror */
    rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
    (void) rc;                  /* we'll check for error with ferror */
    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
    (void) rc;                  /* we'll check for error with ferror */

I don't know exactly what the best way to write this code is, but I'm
fairly sure this isn't it. I suppose that whoever wrote this code
chose to use fwrite() rather than write() to get buffering, but that
had the effect of delaying the error checking by an amount that I
would consider unacceptable in new code -- we do all the fwrite()
calls to generate the entire file and then only check ferror() once at
the very end! If we did our own buffering, we could do this a lot
better. And if we used that same buffering layer everywhere, it might
not be too hard to make it use a block cipher rather than a stream
cipher. Now I don't intrinsically have strong feelings about whether
block ciphers or stream ciphers are better, but I think it's going to
be easier to analyze the security of the system and to maintain it
across future developments in cryptography if we can use the same kind
of cipher everywhere. If we use block ciphers for some things and
stream ciphers for other things, it is more complicated.

Perhaps that is unavoidable and I just shouldn't worry about it. It
may work out that we'll end up needing to do that anyway for one
reason or another. But all things being equal, I think it's nice if we
make all the places where we do I/O look more like each other, not
specifically because of TDE but because that's just better in general.
For example Andres is working on async I/O. Maybe this particular
piece of code is moot in terms of that project because I think Andres
is hoping to get the shared memory stats collector patch committed.
But, say that doesn't happen. The more all of the I/O looks the same,
the easier it will be to make all of it use whatever async I/O
infrastructure he creates. The more every module does things in its
own way, the harder it is. And batching work together into
reasonable-sized blocks is probably necessary for async I/O too.

I just can't look at code like that shown above and think anything
other than "blech".
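
Just to sketch the sort of buffering layer I mean (names invented here, not
a proposal for an actual API):

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct WriteBuf
    {
        FILE       *fp;
        char        data[8192];
        size_t      used;
        bool        failed;
    } WriteBuf;

    static void
    wbuf_flush(WriteBuf *b)
    {
        if (!b->failed && b->used > 0 &&
            fwrite(b->data, 1, b->used, b->fp) != b->used)
            b->failed = true;   /* failure is visible at the next check */
        b->used = 0;
    }

    static void
    wbuf_write(WriteBuf *b, const void *src, size_t len)
    {
        while (len > 0)
        {
            size_t      room = sizeof(b->data) - b->used;
            size_t      chunk = len < room ? len : room;

            memcpy(b->data + b->used, src, chunk);
            b->used += chunk;
            src = (const char *) src + chunk;
            len -= chunk;
            if (b->used == sizeof(b->data))
                wbuf_flush(b);  /* flush - and check - on every full buffer */
        }
    }

Each fwrite() failure then surfaces at the next flush rather than at the end
of the file, and the fixed-size flush is also a natural unit at which to
apply a block cipher or to batch I/O for something like async I/O.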

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, Oct  6, 2021 at 11:01:25AM -0400, Robert Haas wrote:
> On Tue, Oct 5, 2021 at 4:29 PM Bruce Momjian <bruce@momjian.us> wrote:
> > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote:
> > > We are still working on our TDE patch. Right now the focus is on refactoring
> > > temporary file access to make the TDE patch itself smaller. Reconsidering
> > > encryption mode choices given concerns expressed is next. Currently a viable
> > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
> > > issue with predictable IV and isn't totally broken in case of IV reuse.
> >
> > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> > 16-byte blocks affect later blocks, meaning that hint bit changes would
> > also affect later blocks.  I think this means we would need to write WAL
> > full page images for hint bit changes to avoid torn pages.  Right now
> > hint bit (single bit) changes can be lost without causing torn pages.
> > This was another of the advantages of using a stream cipher like CTR.
> 
> This seems wrong to me. CTR requires that you not reuse the IV. If you
> re-encrypt the page with a different IV, torn pages are a problem. If
> you re-encrypt it with the same IV, then it's not secure any more.

We were not changing the IV for hint bit changes, meaning the hint bit
changes were visible if you compared the blocks.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, Oct  6, 2021 at 12:54:49PM -0400, Bruce Momjian wrote:
> On Wed, Oct  6, 2021 at 11:01:25AM -0400, Robert Haas wrote:
> > On Tue, Oct 5, 2021 at 4:29 PM Bruce Momjian <bruce@momjian.us> wrote:
> > > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote:
> > > > We are still working on our TDE patch. Right now the focus is on refactoring
> > > > temporary file access to make the TDE patch itself smaller. Reconsidering
> > > > encryption mode choices given concerns expressed is next. Currently a viable
> > > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
> > > > issue with predictable IV and isn't totally broken in case of IV reuse.
> > >
> > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> > > 16-byte blocks affect later blocks, meaning that hint bit changes would
> > > also affect later blocks.  I think this means we would need to write WAL
> > > full page images for hint bit changes to avoid torn pages.  Right now
> > > hint bit (single bit) changes can be lost without causing torn pages.
> > > This was another of the advantages of using a stream cipher like CTR.
> > 
> > This seems wrong to me. CTR requires that you not reuse the IV. If you
> > re-encrypt the page with a different IV, torn pages are a problem. If
> > you re-encrypt it with the same IV, then it's not secure any more.
> 
> We were not changing the IV for hint bit changes, meaning the hint bit
> changes were visible if you compared the blocks.

Oops, I was wrong above, and my patch docs prove it:

    Hint Bits
    - - - - -
    
    For hint bit changes, the LSN normally doesn't change, which is a
    problem.  By enabling wal_log_hints, you get full page writes to the WAL
    after the first hint bit change of the checkpoint.  This is useful for
    two reasons.  First, it generates a new LSN, which is needed for the IV
    to be secure.  Second, full page images protect against torn pages,
    which is an even bigger requirement for encryption because the new LSN
    is re-encrypting the entire page, not just the hint bit changes.  You
    can safely lose the hint bit changes, but you need to use the same LSN
    to decrypt the entire page, so a torn page with an LSN change cannot be
    decrypted.  To prevent this, wal_log_hints guarantees that the
    pre-hint-bit version (and previous LSN version) of the page is restored.
    
    However, if a hint-bit-modified page is written to the file system
    during a checkpoint, and there is a later hint bit change switching the
    same page from clean to dirty during the same checkpoint, we need a new
    LSN, and wal_log_hints doesn't give us a new LSN here.  The fix for this
    is to update the page LSN by writing a dummy WAL record via
    xloginsert.c::LSNForEncryption() in such cases.

Seems my performance concerns were unfounded.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, Oct  6, 2021 at 11:17:59AM -0400, Robert Haas wrote:
> If you enable checksums or set wal_log_hints=on, then you might incur
> some write-ahead log records that would otherwise be avoided, and
> those records will include full page images. This can happen once per
> page per checkpoint cycle. However, if the first modification to a
> particular page within a given checkpoint cycle is a regular
> WAL-logged operation rather than a hint bit change, then the extra WAL
> record and full-page image are not needed so the overhead is zero.
> Also, if the first modification is a hint bit change, and then the
> page is evicted, prompting a full page write, but a regular WAL-logged
> operation occurs later within the same checkpoint, the later operation
> no longer needs a full page write. So you still paid the cost of an
> extra WAL record, but you didn't pay the cost of an extra full page
> write. In other words, enabling checksums or turning wal_log_hints=on
> has a relatively low cost except when you have pages that incur only
> hint-type changes, and no regular changes, within the course of a
> single checkpoint cycle.
> 
> On the other hand, in order to avoid IV reuse, your patch needed to
> bump the page LSN for every change, or at least for every eviction.
> That means you could potentially incur the overhead of an extra full
> page write multiple times per checkpoint cycle, and even if there were
> non-hint changes to that page in the same checkpoint cycle. Now you
> could say, well, let's not bump the page LSN for every hint-type
> change, and then your patch would have lower overhead than an approach
> based on XTS, but I think that also loses a ton of security, because
> now you're reusing IVs with an encryption system that is documented
> not to tolerate the reuse of IVs.
> 
> I'm not here to try to pretend that encryption is going to be cheap. I
> just don't believe this particular argument about why AES-XTS should
> be more expensive.

OK, good to know.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, Oct  5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote:
> > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote:
> > > We are still working on our TDE patch. Right now the focus is on refactoring
> > > temporary file access to make the TDE patch itself smaller. Reconsidering
> > > encryption mode choices given concerns expressed is next. Currently a viable
> > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
> > > issue with predictable IV and isn't totally broken in case of IV reuse.
> >
> > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> > 16-byte blocks affect later blocks, meaning that hint bit changes would
> > also affect later blocks.  I think this means we would need to write WAL
> > full page images for hint bit changes to avoid torn pages.  Right now
> > hint bit (single bit) changes can be lost without causing torn pages.
> > This was another of the advantages of using a stream cipher like CTR.
>
> Another problem caused by block mode ciphers is that to use the LSN as
> part of the nonce, the LSN must not be encrypted, but you then have to
> find a 16-byte block in the page that you don't need to encrypt.

With AES-XTS, we don't need to use the LSN as part of the nonce though,
so I don't think this argument is actually valid..?  As discussed
previously regarding AES-XTS, the general idea was to use the path to
the file and the filename itself plus the block number as the IV, and
that works fine for XTS because it's ok to reuse it (unlike with CTR).
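
As a sketch of that construction (the FNV-1a hash and the byte layout here
are illustrative assumptions, not a settled design):

    #include <stdint.h>
    #include <string.h>

    static uint64_t
    fnv1a64(const char *s)
    {
        uint64_t    h = 1469598103934665603ULL;     /* FNV-1a offset basis */

        for (; *s; s++)
        {
            h ^= (unsigned char) *s;
            h *= 1099511628211ULL;                  /* FNV-1a prime */
        }
        return h;
    }

    /* build a 16-byte XTS tweak from the relation file path and block number */
    static void
    make_xts_tweak(uint8_t tweak[16], const char *relpath, uint32_t blkno)
    {
        uint64_t    h = fnv1a64(relpath);           /* e.g. "base/13395/16384" */

        memcpy(tweak, &h, sizeof(h));               /* per-file component */
        memset(tweak + 8, 0, 8);
        memcpy(tweak + 8, &blkno, sizeof(blkno));   /* per-block component */
    }

The same block always gets the same tweak, which is fine for XTS but would
be unacceptable for CTR - which is exactly the distinction above.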

Thanks,

Stephen

Attachment

Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, Oct  6, 2021 at 03:17:00PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Tue, Oct  5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote:
> > > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote:
> > > > We are still working on our TDE patch. Right now the focus is on refactoring
> > > > temporary file access to make the TDE patch itself smaller. Reconsidering
> > > > encryption mode choices given concerns expressed is next. Currently a viable
> > > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
        -----------------------------------------------------
> > > > issue with predictable IV and isn't totally broken in case of IV reuse.
> > > 
> > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> > > 16-byte blocks affect later blocks, meaning that hint bit changes would
> > > also affect later blocks.  I think this means we would need to write WAL
> > > full page images for hint bit changes to avoid torn pages.  Right now
> > > hint bit (single bit) changes can be lost without causing torn pages. 
> > > This was another of the advantages of using a stream cipher like CTR.
> > 
> > Another problem caused by block mode ciphers is that to use the LSN as
> > part of the nonce, the LSN must not be encrypted, but you then have to
> > find a 16-byte block in the page that you don't need to encrypt.
> 
> With AES-XTS, we don't need to use the LSN as part of the nonce though,
> so I don't think this argument is actually valid..?  As discussed
> previously regarding AES-XTS, the general idea was to use the path to
> the file and the filename itself plus the block number as the IV, and
> that works fine for XTS because it's ok to reuse it (unlike with CTR).

Yes, I would prefer we don't use the LSN.  I only mentioned it since
Ants Aasma mentioned LSN use above.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Mon, Oct 4, 2021 at 12:44 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> Thanks for the inputs, Robert. In the v4 patch, an unused OID (i.e., 4)
> is fixed for template0, and the same is removed from the unused OID
> list.
>
> In addition to the review comment fixes, I have removed some code that
> is no longer needed/doesn't make sense since we preserve the OIDs.

This is not a full review, but I'm wondering about this bit of code:

-       if (!RELKIND_HAS_STORAGE(relkind) || OidIsValid(relfilenode))
+       if (!RELKIND_HAS_STORAGE(relkind) ||
+           (OidIsValid(relfilenode) && !create_storage))
                create_storage = false;
        else
        {
                create_storage = true;
-               relfilenode = relid;
+
+               /*
+                * Create the storage with oid same as relid if relfilenode is
+                * unspecified by the caller
+                */
+               if (!OidIsValid(relfilenode))
+                       relfilenode = relid;
        }

This seems hard to understand, and I wonder if perhaps it can be
simplified. If !RELKIND_HAS_STORAGE(relkind), then we're going to set
create_storage to false if it was previously true, and otherwise just
do nothing. Otherwise, if !create_storage, we'll enter the
create_storage = false branch which effectively does nothing.
Otherwise, if !OidIsValid(relfilenode), we'll set relfilenode = relid.
So couldn't we express that like this?

if (!RELKIND_HAS_STORAGE(relkind))
    create_storage = false;
else if (create_storage && !OidIsValid(relfilenode))
    relfilenode = relid;

If so, that seems more clear.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Thu, Oct 7, 2021 at 2:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Oct 4, 2021 at 12:44 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > Thanks for the inputs, Robert. In the v4 patch, an unused OID (i.e., 4)
> > is fixed for template0, and the same is removed from the unused OID
> > list.
> >
> > In addition to the review comment fixes, I have removed some code that
> > is no longer needed/doesn't make sense since we preserve the OIDs.
>
> This is not a full review, but I'm wondering about this bit of code:
>
> -       if (!RELKIND_HAS_STORAGE(relkind) || OidIsValid(relfilenode))
> +       if (!RELKIND_HAS_STORAGE(relkind) ||
> +           (OidIsValid(relfilenode) && !create_storage))
>                 create_storage = false;
>         else
>         {
>                 create_storage = true;
> -               relfilenode = relid;
> +
> +               /*
> +                * Create the storage with oid same as relid if relfilenode is
> +                * unspecified by the caller
> +                */
> +               if (!OidIsValid(relfilenode))
> +                       relfilenode = relid;
>         }
>
> This seems hard to understand, and I wonder if perhaps it can be
> simplified. If !RELKIND_HAS_STORAGE(relkind), then we're going to set
> create_storage to false if it was previously true, and otherwise just
> do nothing. Otherwise, if !create_storage, we'll enter the
> create_storage = false branch which effectively does nothing.
> Otherwise, if !OidIsValid(relfilenode), we'll set relfilenode = relid.
> So couldn't we express that like this?
>
> if (!RELKIND_HAS_STORAGE(relkind))
>     create_storage = false;
> else if (create_storage && !OidIsValid(relfilenode))
>     relfilenode = relid;
>
> If so, that seems more clear.

The 'create_storage' flag says whether or not to create the storage
when a valid relfilenode is passed; the flag alone cannot make the
storage creation decision in heap_create().
Only the binary upgrade flow sets 'create_storage' to true and expects
storage to be created with the specified relfilenode. Every other
caller/flow passes false for 'create_storage', yet we still need to
create storage in heap_create() if the relkind has storage. That's why
I have explicitly set 'create_storage = true' in the else branch and
initialize relfilenode only when needed.

Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Antonin Houska
Date:
Bruce Momjian <bruce@momjian.us> wrote:

> On Tue, Oct  5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote:
> > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote:
> > > We are still working on our TDE patch. Right now the focus is on refactoring
> > > temporary file access to make the TDE patch itself smaller. Reconsidering
> > > encryption mode choices given concerns expressed is next. Currently a viable
> > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
> > > issue with predictable IV and isn't totally broken in case of IV reuse.
> >
> > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> > 16-byte blocks affect later blocks, meaning that hint bit changes would
> > also affect later blocks.  I think this means we would need to write WAL
> > full page images for hint bit changes to avoid torn pages.  Right now
> > hint bit (single bit) changes can be lost without causing torn pages.
> > This was another of the advantages of using a stream cipher like CTR.
>
> The above text isn't very clear.  What I am saying is that currently
> torn pages can be tolerated by hint bit writes because only a single
> byte is changing.  If we use a block cipher like AES-XTS, later 16-byte
> encrypted blocks would be changed by hint bit changes, meaning torn
> pages could not be tolerated.  This means we would have to use full page
> writes for hint bit changes, perhaps making this feature have
> unacceptable performance overhead.

IIRC, in the XTS scheme, a change of a single byte in the 16-byte block causes
the whole encrypted block to be different after the next encryption, however
the following blocks are not affected. CBC (cipher-block chaining) is the mode
where the change in one block does affect the encryption of the following
block.

I'm not sure if this fact is important from the hint bit perspective
though. It would be an important difference if there were a guarantee that the
16-byte blocks are consistent even on a torn page - does e.g. proper alignment
of pages guarantee that? Nevertheless, the absence of the chaining may be a
reason to prefer CBC to XTS anyway.
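
A quick way to check this property (a sketch using OpenSSL's EVP interface;
error handling omitted): flipping one byte in the second 16-byte block
should change only that ciphertext block under XTS, but that block and all
following ones under CBC.

    #include <stdio.h>
    #include <string.h>
    #include <openssl/evp.h>

    static void
    enc(const EVP_CIPHER *cipher, const unsigned char *key,
        const unsigned char *pt, unsigned char *ct, int len)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        unsigned char iv[16] = {0};
        int         outlen;

        EVP_EncryptInit_ex(ctx, cipher, NULL, key, iv);
        EVP_CIPHER_CTX_set_padding(ctx, 0);
        EVP_EncryptUpdate(ctx, ct, &outlen, pt, len);
        EVP_CIPHER_CTX_free(ctx);
    }

    int
    main(void)
    {
        unsigned char key[64], pt[64] = {0}, a[64], b[64];

        for (int i = 0; i < 64; i++)
            key[i] = i;         /* toy key; XTS consumes both 32-byte halves */

        enc(EVP_aes_256_xts(), key, pt, a, 64);
        pt[17] ^= 1;            /* flip one byte in 16-byte block 1 */
        enc(EVP_aes_256_xts(), key, pt, b, 64);
        for (int i = 0; i < 4; i++)
            printf("XTS block %d: %s\n", i,
                   memcmp(a + i * 16, b + i * 16, 16) ? "differs" : "same");

        enc(EVP_aes_256_cbc(), key, pt, a, 64);
        pt[17] ^= 1;            /* flip it back */
        enc(EVP_aes_256_cbc(), key, pt, b, 64);
        for (int i = 0; i < 4; i++)
            printf("CBC block %d: %s\n", i,
                   memcmp(a + i * 16, b + i * 16, 16) ? "differs" : "same");
        return 0;
    }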

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



On Thu, Oct 7, 2021 at 3:24 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> Every other
> caller/flow passes false for 'create_storage' and we still need to
> create storage in heap_create() if relkind has storage.

That seems surprising.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Wed, Oct  6, 2021 at 03:17:00PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > On Tue, Oct  5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote:
> > > > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > > > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian <bruce@momjian.us> wrote:
> > > > > We are still working on our TDE patch. Right now the focus is on refactoring
> > > > > temporary file access to make the TDE patch itself smaller. Reconsidering
> > > > > encryption mode choices given concerns expressed is next. Currently a viable
> > > > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
>         -----------------------------------------------------
> > > > > issue with predictable IV and isn't totally broken in case of IV reuse.
> > > >
> > > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> > > > 16-byte blocks affect later blocks, meaning that hint bit changes would
> > > > also affect later blocks.  I think this means we would need to write WAL
> > > > full page images for hint bit changes to avoid torn pages.  Right now
> > > > hint bit (single bit) changes can be lost without causing torn pages.
> > > > This was another of the advantages of using a stream cipher like CTR.
> > >
> > > Another problem caused by block mode ciphers is that to use the LSN as
> > > part of the nonce, the LSN must not be encrypted, but you then have to
> > > find a 16-byte block in the page that you don't need to encrypt.
> >
> > With AES-XTS, we don't need to use the LSN as part of the nonce though,
> > so I don't think this argument is actually valid..?  As discussed
> > previously regarding AES-XTS, the general idea was to use the path to
> > the file and the filename itself plus the block number as the IV, and
> > that works fine for XTS because it's ok to reuse it (unlike with CTR).
>
> Yes, I would prefer we don't use the LSN.  I only mentioned it since
> Ants Aasma mentioned LSN use above.

Ohhh, apologies for missing that, makes more sense now.

Thanks!

Stephen

Attachment

Re: storing an explicit nonce

From
Robert Haas
Date:
On Wed, Oct 6, 2021 at 3:17 PM Stephen Frost <sfrost@snowman.net> wrote:
> With AES-XTS, we don't need to use the LSN as part of the nonce though,
> so I don't think this argument is actually valid..?  As discussed
> previously regarding AES-XTS, the general idea was to use the path to
> the file and the filename itself plus the block number as the IV, and
> that works fine for XTS because it's ok to reuse it (unlike with CTR).

However, there's also the option of storing a nonce in each page, as
suggested by the subject of this thread. I think that's probably a
pretty workable approach, as demonstrated by the patch that started
this thread. We'd need to think a bit carefully about whether any of
the compile-time calculations the patch moves to runtime are expensive
enough to matter and whether any such impacts can be mitigated, but I
think there is a good chance that such issues are manageable.

I'm a little concerned by the email from "Sasasu" saying that even in
XTS reusing the IV is not cryptographically weak. I don't know enough
about these different encryption modes to know if he's right, but if
he is then perhaps we need to consider his suggestion of using
AES-GCM. Or, uh, something else.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, Oct  7, 2021 at 10:28:55AM -0400, Robert Haas wrote:
> However, there's also the option of storing a nonce in each page, as
> suggested by the subject of this thread. I think that's probably a
> pretty workable approach, as demonstrated by the patch that started
> this thread. We'd need to think a bit carefully about whether any of
> the compile-time calculations the patch moves to runtime are expensive
> enough to matter and whether any such impacts can be mitigated, but I
> think there is a good chance that such issues are manageable.
> 
> I'm a little concerned by the email from "Sasasu" saying that even in
> XTS reusing the IV is not cryptographically weak. I don't know enough
> about these different encryption modes to know if he's right, but if
> he is then perhaps we need to consider his suggestion of using
> AES-GCM. Or, uh, something else.

I continue to be concerned that a page format change will decrease the
desirability of this feature by making migration complex and increasing
its code complexity.  I am unclear if it is necessary.

I think the big question is whether XTS with db/relfilenode/blocknumber
is sufficient as an IV without a nonce that changes for updates.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Oct 6, 2021 at 3:17 PM Stephen Frost <sfrost@snowman.net> wrote:
> > With AES-XTS, we don't need to use the LSN as part of the nonce though,
> > so I don't think this argument is actually valid..?  As discussed
> > previously regarding AES-XTS, the general idea was to use the path to
> > the file and the filename itself plus the block number as the IV, and
> > that works fine for XTS because it's ok to reuse it (unlike with CTR).
>
> However, there's also the option of storing a nonce in each page, as
> suggested by the subject of this thread. I think that's probably a
> pretty workable approach, as demonstrated by the patch that started
> this thread. We'd need to think a bit carefully about whether any of
> the compile-time calculations the patch moves to runtime are expensive
> enough to matter and whether any such impacts can be mitigated, but I
> think there is a good chance that such issues are manageable.

I agree with this in general, though I would think we'd use that for GCM
or another authenticated encryption mode (perhaps GCM-SIV with the LSN
as the IV) at some point off in the future.  Alternatively, we could use
that technique to just provide a better per-page checksum than what we
have today.  Maybe we could figure out how to leverage that to move to
64bit transaction IDs with some page-level epoch.  Definitely a lot of
possibilities.  Ultimately though, regarding TDE at least, I would think
we'd rather start with something that's block level and doesn't require
a page format change.

> I'm a little concerned by the email from "Sasasu" saying that even in
> XTS reusing the IV is not cryptographically weak. I don't know enough
> about these different encryption modes to know if he's right, but if
> he is then perhaps we need to consider his suggestion of using
> AES-GCM. Or, uh, something else.

Think you meant 'strong' above (or maybe omit the 'not'; either way, the
opposite of the double-negative that seems to be what was written).

As I understand it, XTS isn't great, just in general, for dealing with
someone who has ongoing access to watch writes over time, but that
isn't what it is generally used to address (and isn't what we would
be looking for it to address either).  Perhaps there are other modes besides
XTS which don't require that we change the page format to support them
(in particular, as our pages are multiples of 16 bytes, it's possible we
don't really need XTS since there aren't any partial blocks and we could
simply use XEX instead..)

Thanks,

Stephen

Attachment

Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, Oct  7, 2021 at 10:27:15AM +0200, Antonin Houska wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
> > The above text isn't very clear.  What I am saying is that currently
> > torn pages can be tolerated by hint bit writes because only a single
> > byte is changing.  If we use a block cipher like AES-XTS, later 16-byte
> > encrypted blocks would be changed by hint bit changes, meaning torn
> > pages could not be tolerated.  This means we would have to use full page
> > writes for hint bit changes, perhaps making this feature have
> > unacceptable performance overhead.
> 
> IIRC, in the XTS scheme, a change of a single byte in the 16-byte block causes
> the whole encrypted block to be different after the next encryption, however
> the following blocks are not affected. CBC (cipher-block chaining) is the mode
> where the change in one block does affect the encryption of the following
> block.

Oh, good point.  I was not aware of that.  It means XTS does not feed
the previous block as part of the nonce to the next block.
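
To make that concrete, here is a small standalone sketch (purely
illustrative; assumes OpenSSL 1.1+, compile with -lcrypto) showing
that flipping one byte of a page changes only the containing 16-byte
block of the XTS ciphertext:

    #include <stdio.h>
    #include <string.h>
    #include <openssl/evp.h>

    #define PAGESZ 8192

    static void
    xts_encrypt(const unsigned char *key, const unsigned char *tweak,
                const unsigned char *in, unsigned char *out)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int         len;

        EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak);
        /* OpenSSL's XTS wants the whole buffer in a single update call */
        EVP_EncryptUpdate(ctx, out, &len, in, PAGESZ);
        EVP_EncryptFinal_ex(ctx, out + len, &len);
        EVP_CIPHER_CTX_free(ctx);
    }

    int
    main(void)
    {
        unsigned char key[64], tweak[16] = {0};
        unsigned char page[PAGESZ] = {0};
        unsigned char c1[PAGESZ], c2[PAGESZ];
        int         i;

        for (i = 0; i < 64; i++)    /* keep the two key halves distinct */
            key[i] = (unsigned char) i;

        xts_encrypt(key, tweak, page, c1);
        page[100] ^= 0x01;          /* flip one bit, like a hint bit */
        xts_encrypt(key, tweak, page, c2);

        for (i = 0; i < PAGESZ / 16; i++)
            if (memcmp(c1 + i * 16, c2 + i * 16, 16) != 0)
                printf("block %d differs\n", i);    /* only block 6 */
        return 0;
    }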

> I'm not sure if this fact is important from the hint bit perspective
> though. It would be an important difference if there was a guarantee that the
> 16-byte blocks are consistent even on a torn page - does e.g. proper alignment of
> pages guarantee that? Nevertheless, the absence of the chaining may be a
> reason to prefer CBC to XTS anyway.

Uh, technically most drives use 512-byte sectors, but I don't know if
there is any guarantee that 512-byte sectors will not be torn --- I have
a feeling there isn't.  I think we get away with the hint bit case
because you can't tear a single bit.  ;-)  However, my patch created a
full page write for hint bit changes.  If we don't use the LSN, those
full page writes will only happen once per checkpoint, which seems
acceptable, at least to Robert.

Interesting on the CBC idea which would force the rest of the page to
change --- not sure if that is valuable.

I know stream ciphers can be diff'ed to see data because they are
xor'ing the data --- I don't remember if block ciphers have similar
weaknesses.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Andres Freund
Date:
Hi,

On October 7, 2021 8:54:54 AM PDT, Bruce Momjian <bruce@momjian.us> wrote:

>Uh, technically most drives use 512-byte sectors, but I don't know if
>there is any guarantee that 512-byte sectors will not be torn --- I have
>a feeling there isn't.  I think we get away with the hint bit case
>because you can't tear a single bit.  ;-)

We rely on it today, e.g. for the control file.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, Oct 7, 2021 at 11:45 AM Bruce Momjian <bruce@momjian.us> wrote:
> I continue to be concerned that a page format change will decrease the
> desirability of this feature by making migration complex and increasing
> its code complexity.  I am unclear if it is necessary.
>
> I think the big question is whether XTS with db/relfilenode/blocknumber
> is sufficient as an IV without a nonce that changes for updates.

Those are fair concerns. I think I agree with everything you say here.

There was some discussion earlier (not sure if it was on this thread)
about integrity verification. And I don't think that there's any way
we can do that without storing some kind of integrity verifier in each
page. And if we're doing that anyway to support that feature, then
there's no problem if it also includes the IV. I had read Stephen's
previous comments to indicate that he thought we should go this way,
and it sounded cool to me, too. However, it does make migrations
somewhat more complex, because you would then have to actually
dump-and-reload, rather than, perhaps, just encrypting all the
existing pages while the cluster was offline. Personally, I'm not that
fussed about that problem, but I'm also rarely the one who has to help
people migrate to new releases, so I may not be as sympathetic to
those problems there as I should be.

If we don't care about the integrity verification features, then as
you say the next question is whether it's acceptable to use a
predictable nonce that is computed from values that can be known
without looking at the block contents. If so, we can forget about
$SUBJECT and save ourselves some engineering work. If not, then I
think we need to do $SUBJECT anyway. And so far I am not really
convinced that we know which of those two things is the case. I don't,
anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, Oct 7, 2021 at 12:26 PM Andres Freund <andres@anarazel.de> wrote:
> We rely on it today, e.g. for the control file.

I think that's the only place, though. We can't rely on it for data
files because base backups don't go through shared buffers, so reads
and writes can get torn in memory and not just on sector boundaries.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, Oct  7, 2021 at 12:29:04PM -0400, Robert Haas wrote:
> On Thu, Oct 7, 2021 at 11:45 AM Bruce Momjian <bruce@momjian.us> wrote:
> > I continue to be concerned that a page format change will decrease the
> > desirability of this feature by making migration complex and increasing
> > its code complexity.  I am unclear if it is necessary.
> >
> > I think the big question is whether XTS with db/relfilenode/blocknumber
> > is sufficient as an IV without a nonce that changes for updates.
> 
> Those are fair concerns. I think I agree with everything you say here.
> 
> There was some discussion earlier (not sure if it was on this thread)
> about integrity verification. And I don't think that there's any way
> we can do that without storing some kind of integrity verifier in each
> page. And if we're doing that anyway to support that feature, then
> there's no problem if it also includes the IV. I had read Stephen's

Agreed.

> previous comments to indicate that he thought we should go this way,
> and it sounded cool to me, too. However, it does make migrations

Uh, what has not been publicly stated yet is that there was a meeting,
prompted by Stephen, with him, Cybertec staff, and myself on September
16 at the Cybertec office in Vienna to discuss this.  After vigorous
discussion, it was agreed that a simplified version of this feature would
be implemented that would not have tamper detection (beyond encrypted
checksums) and would use XTS so that the LSN would not need to be used.

> If we don't care about the integrity verification features, then as
> you say the next question is whether it's acceptable to use a
> predictable nonce that is computed from values that can be known
> without looking at the block contents. If so, we can forget about
> $SUBJECT and save ourselves some engineering work. If not, then I

Yes, that is now the question.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, Oct  7, 2021 at 09:26:26AM -0700, Andres Freund wrote:
> Hi, 
> 
> On October 7, 2021 8:54:54 AM PDT, Bruce Momjian <bruce@momjian.us> wrote:
> 
> >Uh, technically most drives use 512-byte sectors, but I don't know if
> >there is any guarantee that 512-byte sectors will not be torn --- I have
> >a feeling there isn't.  I think we get away with the hint bit case
> >because you can't tear a single bit.  ;-) 
> 
> We rely on it today, e.g. for the control file.

OK, good to know, and we can be sure the 16-byte blocks will terminate
on 512-byte boundaries.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, Oct  7, 2021 at 12:32:16PM -0400, Robert Haas wrote:
> On Thu, Oct 7, 2021 at 12:26 PM Andres Freund <andres@anarazel.de> wrote:
> > We rely on it today, e.g. for the control file.
> 
> I think that's the only place, though. We can't rely on it for data
> files because base backups don't go through shared buffers, so reads
> and writes can get torn in memory and not just on sector boundaries.

Uh, do backups get torn and later used?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Oct 7, 2021 at 11:45 AM Bruce Momjian <bruce@momjian.us> wrote:
> > I continue to be concerned that a page format change will decrease the
> > desirability of this feature by making migration complex and increasing
> > its code complexity.  I am unclear if it is necessary.
> >
> > I think the big question is whether XTS with db/relfilenode/blocknumber
> > is sufficient as an IV without a nonce that changes for updates.
>
> Those are fair concerns. I think I agree with everything you say here.
>
> There was some discussion earlier (not sure if it was on this thread)
> about integrity verification. And I don't think that there's any way
> we can do that without storing some kind of integrity verifier in each
> page. And if we're doing that anyway to support that feature, then
> there's no problem if it also includes the IV. I had read Stephen's
> previous comments to indicate that he thought we should go this way,
> and it sounded cool to me, too. However, it does make migrations
> somewhat more complex, because you would then have to actually
> dump-and-reload, rather than, perhaps, just encrypting all the
> existing pages while the cluster was offline. Personally, I'm not that
> fussed about that problem, but I'm also rarely the one who has to help
> people migrate to new releases, so I may not be as sympathetic to
> those problems there as I should be.

Yes, for integrity verification (also known as 'authenticated
encryption') we'd definitely need to store a larger nonce value.  In the
very, very long term, I think it'd be great to have that, and the patch
proposed on this thread seems really cool as a way to get us there.

> If we don't care about the integrity verification features, then as
> you say the next question is whether it's acceptable to use a
> predictable nonce that is computed from values that can be known
> without looking at the block contents. If so, we can forget about
> $SUBJECT and save ourselves some engineering work. If not, then I
> think we need to do $SUBJECT anyway. And so far I am not really
> convinced that we know which of those two things is the case. I don't,
> anyway.

Having TDE, even without authenticated encryption, is certainly
valuable.  Reducing the amount of engineering required to get there is
worthwhile.  Implementing TDE w/ XTS or similar, provided we do agree
that we can do so with an IV that we don't need to find additional space
for, would avoid that page-level format change.  I agree we should do
some research to make sure we at least have a reasonable answer to that
question.  I've spent a bit of time on that and haven't gotten to a sure
answer one way or the other as yet, but will continue to look.

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, Oct  7, 2021 at 12:56:22PM -0400, Bruce Momjian wrote:
> On Thu, Oct  7, 2021 at 12:32:16PM -0400, Robert Haas wrote:
> > On Thu, Oct 7, 2021 at 12:26 PM Andres Freund <andres@anarazel.de> wrote:
> > > We rely on it today, e.g. for the control file.
> > 
> > I think that's the only place, though. We can't rely on it for data
> > files because base backups don't go through shared buffers, so reads
> > and writes can get torn in memory and not just on sector boundaries.
> 
> Uh, do backups get torn and later used?

Are you saying a base backup could read a page from the file system and
see a partial write, even though the write is written as 8k?  I had not
thought about that.

I think this whole discussion is about whether we need full page images
for hint bit changes.  I think we do if we use the LSN for the nonce (in
the old patch), and probably need it for hint bit changes when using
block cipher modes (XTS) if we feel basebackup could read only part of a
16-byte page change.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, Oct 7, 2021 at 12:56 PM Bruce Momjian <bruce@momjian.us> wrote:
> Uh, do backups get torn and later used?

Yep. That's why base backup mode forces full_page_writes on
temporarily even if it's off in general.

Crazy, right?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Ants Aasma
Date:
On Wed, 6 Oct 2021 at 23:08, Bruce Momjian <bruce@momjian.us> wrote:
    Yes, I would prefer we don't use the LSN.  I only mentioned it since
    Ants Aasma mentioned LSN use above.

Is there a particular reason why you would prefer not to use LSN? I suggested it because in my view having a variable tweak is still better than not having it even if we deem the risks of XTS tweak reuse not important for our use case. The comment was made under the assumption that requiring wal_log_hints for encryption is acceptable.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Oct 7, 2021 at 12:26 PM Andres Freund <andres@anarazel.de> wrote:
> > We rely on it today, e.g. for the control file.
>
> I think that's the only place, though. We can't rely on it for data
> files because base backups don't go through shared buffers, so reads
> and writes can get torn in memory and not just on sector boundaries.

There was a recent discussion with Munro, as I recall, that actually
points out how we probably shouldn't be relying on that even for the
control file and proposed having multiple control files (something which
I generally agree with as a good idea), particularly due to SSD
technology.

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, Oct  7, 2021 at 09:38:45PM +0300, Ants Aasma wrote:
> On Wed, 6 Oct 2021 at 23:08, Bruce Momjian <bruce@momjian.us> wrote:
> 
>     Yes, I would prefer we don't use the LSN.  I only mentioned it since
>     Ants Aasma mentioned LSN use above.
> 
> 
> Is there a particular reason why you would prefer not to use LSN? I suggested
> it because in my view having a variable tweak is still better than not having
> it even if we deem the risks of XTS tweak reuse not important for our use case.
> The comment was made under the assumption that requiring wal_log_hints for
> encryption is acceptable.

Well, using the LSN means we have to store the LSN unencrypted, and that
means we have to carve out a 16-byte block on the page that is not
encrypted.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, Oct 7, 2021 at 1:09 PM Bruce Momjian <bruce@momjian.us> wrote:
> Are you saying a base backup could read a page from the file system and
> see a partial write, even though the write is written as 8k?  I had not
> thought about that.

Yes; see my other response.

> I think this whole discussion is about whether we need full page images
> for hint bit changes.  I think we do if we use the LSN for the nonce (in
> the old patch), and probably need it for hint bit changes when using
> block cipher modes (XTS) if we feel basebackup could read only part of a
> 16-byte page change.

I think all the encryption modes that we're still considering have the
(very desirable) property that changing a single bit of the
unencrypted page perturbs the entire output. But that just means that
encrypted clusters will have to run in the same mode as clusters with
checksums, or clusters with wal_log_hints=on, features which the
community has already accepted as having reasonable overhead. I have
in the past expressed skepticism about whether that overhead is really
small enough to be considered acceptable, but if I recall correctly,
the test results posted to the list suggest that you need a working
set just a little bit larger than shared_buffers to make it really
sting. And that's not a super-common thing to do. Anyway, if people
aren't screaming about the overhead of that system now, they're not
likely to complain about applying it to some new situation either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Oct 7, 2021 at 12:56 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Uh, do backups get torn and later used?
>
> Yep. That's why base backup mode forces full_page_writes on
> temporarily even if it's off in general.

Right, so this shouldn't be an issue as any such torn pages will have
an FPI in the WAL that will be replayed as part of restoring that
backup.

Thanks,

Stephen


Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Thu, Oct  7, 2021 at 09:38:45PM +0300, Ants Aasma wrote:
> > On Wed, 6 Oct 2021 at 23:08, Bruce Momjian <bruce@momjian.us> wrote:
> >
> >     Yes, I would prefer we don't use the LSN.  I only mentioned it since
> >     Ants Aasma mentioned LSN use above.
> >
> >
> > Is there a particular reason why you would prefer not to use LSN? I suggested
> > it because in my view having a variable tweak is still better than not having
> > it even if we deem the risks of XTS tweak reuse not important for our use case.
> > The comment was made under the assumption that requiring wal_log_hints for
> > encryption is acceptable.
>
> Well, using the LSN means we have to store the LSN unencrypted, and that
> means we have to carve out a 16-byte block on the page that is not
> encrypted.

With XTS this isn't actually the case though, is it..?  Part of the
point of XTS is that the last block doesn't have to be a full 16 bytes.
What you're saying is true for XEX, but that's also why XEX isn't used
for FDE in a lot of cases, because disk sectors aren't typically
divisible by 16.

https://en.wikipedia.org/wiki/Disk_encryption_theory

Assuming that's correct, and I don't see any reason to doubt it, then
perhaps it would make sense to have the LSN be unencrypted and include
it in the tweak as that would limit the risk from re-use of the same
tweak over time.
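
As a purely illustrative sketch of that idea (the names here are
invented, not from any posted patch), the tweak could reserve its
last 8 bytes for the LSN:

    #include <stdint.h>
    #include <string.h>

    /*
     * Hypothetical: 4 + 4 + 8 = 16 bytes, so the tweak changes on
     * every WAL-logged write of the page because the LSN advances.
     * The LSN would stay unencrypted on the page so it is readable
     * before decryption.
     */
    static void
    build_tweak_with_lsn(unsigned char out[16],
                         uint32_t relfilenode, uint32_t blkno,
                         uint64_t lsn)
    {
        memcpy(out, &relfilenode, 4);
        memcpy(out + 4, &blkno, 4);
        memcpy(out + 8, &lsn, 8);
    }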

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, Oct  7, 2021 at 02:44:43PM -0400, Robert Haas wrote:
> > I think this whole discussion is about whether we need full page images
> > for hint bit changes.  I think we do if we use the LSN for the nonce (in
> > the old patch), and probably need it for hint bit changes when using
> > block cipher modes (XTS) if we feel basebackup could read only part of a
> > 16-byte page change.
> 
> I think all the encryption modes that we're still considering have the
> (very desirable) property that changing a single bit of the
> unencrypted page perturbs the entire output. But that just means that

Well, XTS perturbs the 16-byte block, while CBC changes the rest of the
page.

> encrypted clusters will have to run in the same mode as clusters with
> checksums, or clusters with wal_log_hints=on, features which the
> community has already accepted as having reasonable overhead. I have
> in the past expressed skepticism about whether that overhead is really
> small enough to be considered acceptable, but if I recall correctly,
> the test results posted to the list suggest that you need a working
> set just a little bit larger than shared_buffers to make it really
> sting. And that's not a super-common thing to do. Anyway, if people
> aren't screaming about the overhead of that system now, they're not
> likely to complain about applying it to some new situation either.

Yes, agreed, good conclusions.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Oct 7, 2021 at 1:09 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Are you saying a base backup could read a page from the file system and
> > see a partial write, even though the write is written as 8k?  I had not
> > thought about that.
>
> Yes; see my other response.

Yes, that is something that has been seen before.

> > I think this whole discussion is about whether we need full page images
> > for hint bit changes.  I think we do if we use the LSN for the nonce (in
> > the old patch), and probably need it for hint bit changes when using
> > block cipher modes (XTS) if we feel basebackup could read only part of a
> > 16-byte page change.
>
> I think all the encryption modes that we're still considering have the
> (very desirable) property that changing a single bit of the
> unencrypted page perturbs the entire output. But that just means that
> encrypted clusters will have to run in the same mode as clusters with
> checksums, or clusters with wal_log_hints=on, features which the
> community has already accepted as having reasonable overhead. I have
> in the past expressed skepticism about whether that overhead is really
> small enough to be considered acceptable, but if I recall correctly,
> the test results posted to the list suggest that you need a working
> set just a little bit larger than shared_buffers to make it really
> sting. And that's not a super-common thing to do. Anyway, if people
> aren't screaming about the overhead of that system now, they're not
> likely to complain about applying it to some new situation either.

Agreed.

Thanks,

Stephen


Re: storing an explicit nonce

From
Ants Aasma
Date:
On Thu, 7 Oct 2021 at 21:52, Stephen Frost <sfrost@snowman.net> wrote:
    With XTS this isn't actually the case though, is it..?  Part of the
    point of XTS is that the last block doesn't have to be a full 16 bytes.
    What you're saying is true for XEX, but that's also why XEX isn't used
    for FDE in a lot of cases, because disk sectors aren't typically
    divisible by 16.

    https://en.wikipedia.org/wiki/Disk_encryption_theory

    Assuming that's correct, and I don't see any reason to doubt it, then
    perhaps it would make sense to have the LSN be unencrypted and include
    it in the tweak as that would limit the risk from re-use of the same
    tweak over time.

Right, my thought was to leave the first 8 bytes of pages, the LSN, unencrypted and include the value in the tweak. Just tested that OpenSSL aes-256-xts handles non-multiple-of-16 messages just fine.
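
In the same spirit as that test, a minimal standalone check
(illustrative only; assumes OpenSSL and -lcrypto) that aes-256-xts
accepts a length that is not a multiple of 16, such as a page minus
an 8-byte unencrypted LSN:

    #include <stdio.h>
    #include <openssl/evp.h>

    int
    main(void)
    {
        unsigned char key[64], tweak[16] = {0};
        unsigned char in[8184] = {0};   /* 8192-byte page minus 8-byte LSN */
        unsigned char out[8184];
        int         len, i;
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

        for (i = 0; i < 64; i++)        /* keep the two key halves distinct */
            key[i] = (unsigned char) i;

        if (!EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak) ||
            !EVP_EncryptUpdate(ctx, out, &len, in, sizeof(in)) ||
            !EVP_EncryptFinal_ex(ctx, out + len, &len))
        {
            fprintf(stderr, "aes-256-xts failed\n");
            return 1;
        }
        /* ciphertext stealing covers the trailing 8 bytes */
        printf("encrypted %zu bytes\n", sizeof(in));
        EVP_CIPHER_CTX_free(ctx);
        return 0;
    }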

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, Oct 7, 2021 at 12:57 PM Stephen Frost <sfrost@snowman.net> wrote:
> Yes, for integrity verification (also known as 'authenticated
> encryption') we'd definitely need to store a larger nonce value.  In the
> very, very long term, I think it'd be great to have that, and the patch
> proposed on this thread seems really cool as a way to get us there.

OK. I'm not sure why that has to be relegated to the very, very long
term, but I'm really very happy to hear that you think the approach is
cool.

> Having TDE, even without authenticated encryption, is certainly
> valuable.  Reducing the amount of engineering required to get there is
> worthwhile.  Implementing TDE w/ XTS or similar, provided we do agree
> that we can do so with an IV that we don't need to find additional space
> for, would avoid that page-level format change.  I agree we should do
> some research to make sure we at least have a reasonable answer to that
> question.  I've spent a bit of time on that and haven't gotten to a sure
> answer one way or the other as yet, but will continue to look.

I mean, I feel like this meeting that Bruce was talking about was
perhaps making decisions in the wrong order. We have to decide which
encryption mode is secure enough for our needs FIRST, and then AFTER
that we can decide whether we need to store a nonce in the page. Now
if it turns out that we can do either with or without a nonce in the
page, then I'm just as happy as anyone else to start with the method
that works without a nonce in the page, because like you say, that's
less work. But unless we've established that such a method is actually
going to survive scrutiny by smart cryptographers, we can't really
decide that storing the nonce is off the table. And it doesn't seem
like we've established that yet.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, Oct  7, 2021 at 02:52:07PM -0400, Stephen Frost wrote:
> > > Is there a particular reason why you would prefer not to use LSN? I suggested
> > > it because in my view having a variable tweak is still better than not having
> > > it even if we deem the risks of XTS tweak reuse not important for our use case.
> > > The comment was made under the assumption that requiring wal_log_hints for
> > > encryption is acceptable.
> > 
> > Well, using the LSN means we have to store the LSN unencrypted, and that
> > means we have to carve out a 16-byte block on the page that is not
> > encrypted.
> 
> With XTS this isn't actually the case though, is it..?  Part of the
> point of XTS is that the last block doesn't have to be a full 16 bytes.

> What you're saying is true for XEX, but that's also why XEX isn't used
> for FDE in a lot of cases, because disk sectors aren't typically
> divisible by 16.

Oh, I was not aware of that XTS feature.  Nice.

> https://en.wikipedia.org/wiki/Disk_encryption_theory
> 
> Assuming that's correct, and I don't see any reason to doubt it, then
> perhaps it would make sense to have the LSN be unencrypted and include
> it in the tweak as that would limit the risk from re-use of the same
> tweak over time.

Yes, seems like a plan.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, Oct  7, 2021 at 09:59:31PM +0300, Ants Aasma wrote:
> On Thu, 7 Oct 2021 at 21:52, Stephen Frost <sfrost@snowman.net> wrote:
> 
>     With XTS this isn't actually the case though, is it..?  Part of the
>     point of XTS is that the last block doesn't have to be a full 16 bytes.
>     What you're saying is true for XEX, but that's also why XEX isn't used
>     for FDE in a lot of cases, because disk sectors aren't typically
>     divisible by 16.
> 
>     https://en.wikipedia.org/wiki/Disk_encryption_theory
> 
>     Assuming that's correct, and I don't see any reason to doubt it, then
>     perhaps it would make sense to have the LSN be unencrypted and include
>     it in the tweak as that would limit the risk from re-use of the same
>     tweak over time.
> 
> 
> Right, my thought was to leave the first 8 bytes of pages, the LSN, unencrypted
> and include the value in the tweak. Just tested that OpenSSL aes-256-xts
> handles non-multiple-of-16 messages just fine.

Great.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, Oct 7, 2021 at 2:52 PM Stephen Frost <sfrost@snowman.net> wrote:
> Assuming that's correct, and I don't see any reason to doubt it, then
> perhaps it would make sense to have the LSN be unencrypted and include
> it in the tweak as that would limit the risk from re-use of the same
> tweak over time.

Talking about things like "limiting the risk" makes me super-nervous.

Maybe we're all on the same page here, but just to make my assumptions
explicit: I think we have to approach this feature with the idea in
mind that there are going to be very smart people actively attacking
any TDE implementation we ship. I expect that if you are lucky enough
to get your hands on a PostgreSQL cluster's data files and they happen
to be encrypted, your best option for handling that situation is not
going to be attacking the encryption, but rather something like
calling the person who has the password and pretending to be someone
to whom they ought to disclose it. However, I also believe that
PostgreSQL is a sufficiently high-profile project that security
researchers will find it a tempting target. And if they manage to
write a shell script or tool that breaks our encryption without too
much difficulty, it will generate a ton of negative PR for the
project. This will be especially true if the problem can't be fixed
without re-engineering the whole thing, because we're not
realistically going to be able to re-engineer the whole thing in a
minor release, and thus will be saddled with the defective
implementation for many years.

Now none of that is to say that we shouldn't limit risk - I mean less
risk is always better than more. But we need to be sure this is not
like a 90% thing, where we're pretty sure it works. We can get by with
that for a lot of things, but I think here we had better try
extra-hard to make sure that we don't have any exposures. We probably
will anyway, but at least if they're just bugs and not architectural
deficiencies, we can hope to be able to patch them as they are
discovered.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Ashwin Agrawal
Date:
On Thu, Oct 7, 2021 at 12:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
    On Thu, Oct 7, 2021 at 2:52 PM Stephen Frost <sfrost@snowman.net> wrote:
    > Assuming that's correct, and I don't see any reason to doubt it, then
    > perhaps it would make sense to have the LSN be unencrypted and include
    > it in the tweak as that would limit the risk from re-use of the same
    > tweak over time.

    Talking about things like "limiting the risk" makes me super-nervous.

    Maybe we're all on the same page here, but just to make my assumptions
    explicit: I think we have to approach this feature with the idea in
    mind that there are going to be very smart people actively attacking
    any TDE implementation we ship. I expect that if you are lucky enough
    to get your hands on a PostgreSQL cluster's data files and they happen
    to be encrypted, your best option for handling that situation is not
    going to be attacking the encryption, but rather something like
    calling the person who has the password and pretending to be someone
    to whom they ought to disclose it. However, I also believe that
    PostgreSQL is a sufficiently high-profile project that security
    researchers will find it a tempting target. And if they manage to
    write a shell script or tool that breaks our encryption without too
    much difficulty, it will generate a ton of negative PR for the
    project. This will be especially true if the problem can't be fixed
    without re-engineering the whole thing, because we're not
    realistically going to be able to re-engineer the whole thing in a
    minor release, and thus will be saddled with the defective
    implementation for many years.

    Now none of that is to say that we shouldn't limit risk - I mean less
    risk is always better than more. But we need to be sure this is not
    like a 90% thing, where we're pretty sure it works. We can get by with
    that for a lot of things, but I think here we had better try
    extra-hard to make sure that we don't have any exposures. We probably
    will anyway, but at least if they're just bugs and not architectural
    deficiencies, we can hope to be able to patch them as they are
    discovered.

Not at all knowledgeable on security topics (bravely using terms and recommendation), can we approach decisions like AES-XTS vs AES-GCM (which in turn decides whether we need to store nonce or not) based on which compliance it can achieve or not. Like can using AES-XTS make it FIPS 140-2 compliant or not?

Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, Oct 7, 2021 at 3:31 PM Ashwin Agrawal <ashwinstar@gmail.com> wrote:
> Not at all knowledgeable on security topics (bravely using terms and recommendation), can we approach decisions like
> AES-XTS vs AES-GCM (which in turn decides whether we need to store nonce or not) based on which compliance it can
> achieve or not. Like can using AES-XTS make it FIPS 140-2 compliant or not?

To the best of my knowledge, the encryption mode doesn't have much to
do with whether such compliance can be achieved. The encryption
algorithm could matter, but I assume everyone still thinks AES is
acceptable. (We should assume that will eventually change.) The
encryption mode is, at least as I understand, more of an internal
thing that you have to get right to avoid having people break your
encryption and write papers about how they did it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Oct 7, 2021 at 2:52 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Assuming that's correct, and I don't see any reason to doubt it, then
> > perhaps it would make sense to have the LSN be unencrypted and include
> > it in the tweak as that would limit the risk from re-use of the same
> > tweak over time.
>
> Talking about things like "limiting the risk" makes me super-nervous.

All of this is about limiting risks. :)

> Maybe we're all on the same page here, but just to make my assumptions
> explicit: I think we have to approach this feature with the idea in
> mind that there are going to be very smart people actively attacking
> any TDE implementation we ship. I expect that if you are lucky enough
> to get your hands on a PostgreSQL cluster's data files and they happen
> to be encrypted, your best option for handling that situation is not
> going to be attacking the encryption, but rather something like
> calling the person who has the password and pretending to be someone
> to whom they ought to disclose it. However, I also believe that
> PostgreSQL is a sufficiently high-profile project that security
> researchers will find it a tempting target. And if they manage to
> write a shell script or tool that breaks our encryption without too
> much difficulty, it will generate a ton of negative PR for the
> project. This will be especially true if the problem can't be fixed
> without re-engineering the whole thing, because we're not
> realistically going to be able to re-engineer the whole thing in a
> minor release, and thus will be saddled with the defective
> implementation for many years.

While I certainly also appreciate that we want to get this as right as
we possibly can from the start, I strongly suspect we'll have one of two
reactions- either we'll be more-or-less ignored and it'll be crickets
from the security folks, or we're going to get beat up by them for
$reasons, almost regardless of what we actually do.  Best bet to
limit the risk ( ;) ) of the latter happening would be to try our best
to do what existing solutions already do- such as by using XTS.
There's things we can do to limit the risk of known-plaintext attacks,
like simply not encrypting empty pages, or about possible known-IV
risks, like using the LSN as part of the IV/tweak.  Will we get
everything?  Probably not, but I don't think that we're going to really
go wrong by using XTS as it's quite popularly used today and it's
explicitly used for cases where you haven't got a place to store the
extra nonce that you would need for AEAD encryption schemes.
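
As for not encrypting empty pages, a trivial check along these lines
(hypothetical, not from any patch) would be enough to avoid handing
an attacker a page-sized known plaintext:

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical: an all-zero page is empty, so leave it as-is. */
    static bool
    page_is_all_zero(const unsigned char *page, size_t len)
    {
        size_t      i;

        for (i = 0; i < len; i++)
            if (page[i] != 0)
                return false;
        return true;
    }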

> Now none of that is to say that we shouldn't limit risk - I mean less
> risk is always better than more. But we need to be sure this is not
> like a 90% thing, where we're pretty sure it works. We can get by with
> that for a lot of things, but I think here we had better try
> extra-hard to make sure that we don't have any exposures. We probably
> will anyway, but at least if they're just bugs and not architectural
> deficiencies, we can hope to be able to patch them as they are
> discovered.

As long as we're clear that this initial version of TDE is with XTS then
I really don't think we'll end up with anyone showing up and saying we
screwed up by not generating a per-page nonce to store with it- the point
of XTS is that you don't need that.

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, Oct  7, 2021 at 03:38:58PM -0400, Stephen Frost wrote:
> > Now none of that is to say that we shouldn't limit risk - I mean less
> > risk is always better than more. But we need to be sure this is not
> > like a 90% thing, where we're pretty sure it works. We can get by with
> > that for a lot of things, but I think here we had better try
> > extra-hard to make sure that we don't have any exposures. We probably
> > will anyway, but at least if they're just bugs and not architectural
> > deficiencies, we can hope to be able to patch them as they are
> > discovered.
> 
> As long as we're clear that this initial version of TDE is with XTS then
> I really don't think we'll end up with anyone showing up and saying we
> screwed up by not generating a per-page nonce to store with it- the point
> of XTS is that you don't need that.

I am sold.  ;-)

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, Oct 7, 2021 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote:
> While I certainly also appreciate that we want to get this as right as
> we possibly can from the start, I strongly suspect we'll have one of two
> reactions- either we'll be more-or-less ignored and it'll be crickets
> from the security folks, or we're going to get beat up by them for
> $reasons, almost regardless of what we actually do.  Best bet to
> limit the risk ( ;) ) of the latter happening would be to try our best
> to do what existing solutions already do- such as by using XTS.
> There's things we can do to limit the risk of known-plaintext attacks,
> like simply not encrypting empty pages, or about possible known-IV
> risks, like using the LSN as part of the IV/tweak.  Will we get
> everything?  Probably not, but I don't think that we're going to really
> go wrong by using XTS as it's quite popularly used today and it's
> explicitly used for cases where you haven't got a place to store the
> extra nonce that you would need for AEAD encryption schemes.

I agree that using a popular approach is a good way to go. If we do
what other people do, then hopefully our stuff won't be significantly
more broken than their stuff, and whatever is can be fixed.

> As long as we're clear that this initial version of TDE is with XTS then
> I really don't think we'll end up with anyone showing up and saying we
> screwed up by not generating a per-page nonce to store with it- the point
> of XTS is that you don't need that.

I agree that we shouldn't really catch flack for any weaknesses of the
underlying algorithm: if XTS turns out to be insecure even when used
properly, and we use it properly, the resulting weakness is somebody
else's fault. On the other hand, if we use it improperly, that's our
fault, so we need to be really sure that we understand what guarantees
we need to provide from our end, and that we are providing them. Like
if we pick an encryption mode that requires nonces to be unique, we
will be at fault if they aren't; if it requires nonces to be
unpredictable, we will be at fault if they aren't; and so on.

So that's what is making me nervous here ... it doesn't seem likely we
have complete unanimity about whether XTS is the right thing, though
that does seem to be the majority position certainly, and it is not
really clear to me that any of us can speak with authority about what
the requirements are around the nonces in particular.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Oct 7, 2021 at 3:31 PM Ashwin Agrawal <ashwinstar@gmail.com> wrote:
> > Not at all knowledgeable on security topics (bravely using terms and recommendation), can we approach decisions
> > like AES-XTS vs AES-GCM (which in turn decides whether we need to store nonce or not) based on which compliance it can
> > achieve or not. Like can using AES-XTS make it FIPS 140-2 compliant or not?
>
> To the best of my knowledge, the encryption mode doesn't have much to
> do with whether such compliance can be achieved. The encryption
> algorithm could matter, but I assume everyone still thinks AES is
> acceptable. (We should assume that will eventually change.) The
> encryption mode is, at least as I understand, more of an internal
> thing that you have to get right to avoid having people break your
> encryption and write papers about how they did it.

The issue regarding FIPS 140-2 specifically is actually about the
encryption used (AES-XTS is approved) *and* about the actual library
which is doing the encryption, which isn't really anything to do with us
but rather is OpenSSL (or perhaps NSS if we can get that finished and
included), or maybe some third party that implements one of those APIs
that you decide to use (of which there's a few, some of which have FIPS
140-2 certification).

So, can you have a FIPS 140-2 compliant system with AES-XTS?  Yes, as
it's approved:

https://csrc.nist.gov/csrc/media/projects/cryptographic-module-validation-program/documents/fips140-2/fips1402ig.pdf

Will your system be FIPS 140-2 certified?  That's a big "it depends"
and will involve you actually taking your fully built system through a
testing lab to get it certified.  I certainly don't think we can make
any promises that taking it through such a test would be successful the
first time around, or even ever.  First step though would be to get
something implemented so that $someone can try and can provide feedback.

Thanks,

Stephen


Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Oct 7, 2021 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote:
> > While I certainly also appreciate that we want to get this as right as
> > we possibly can from the start, I strongly suspect we'll have one of two
> > reactions- either we'll be more-or-less ignored and it'll be crickets
> > from the security folks, or we're going to get beat up by them for
> > $reasons, almost regardless of what we actually do.  Best bet to
> > limit the risk ( ;) ) of the latter happening would be to try our best
> > to do what existing solutions already do- such as by using XTS.
> > There's things we can do to limit the risk of known-plaintext attacks,
> > like simply not encrypting empty pages, or about possible known-IV
> > risks, like using the LSN as part of the IV/tweak.  Will we get
> > everything?  Probably not, but I don't think that we're going to really
> > go wrong by using XTS as it's quite popularly used today and it's
> > explicitly used for cases where you haven't got a place to store the
> > extra nonce that you would need for AEAD encryption schemes.
>
> I agree that using a popular approach is a good way to go. If we do
> what other people do, then hopefully our stuff won't be significantly
> more broken than their stuff, and whatever is can be fixed.

Right.

> > As long as we're clear that this initial version of TDE is with XTS then
> > I really don't think we'll end up with anyone showing up and saying we
> > screwed up by not generating a per-page nonce to store with it- the point
> > of XTS is that you don't need that.
>
> I agree that we shouldn't really catch flack for any weaknesses of the
> underlying algorithm: if XTS turns out to be insecure even when used
> properly, and we use it properly, the resulting weakness is somebody
> else's fault. On the other hand, if we use it improperly, that's our
> fault, so we need to be really sure that we understand what guarantees
> we need to provide from our end, and that we are providing them. Like
> if we pick an encryption mode that requires nonces to be unique, we
> will be at fault if they aren't; if it requires nonces to be
> unpredictable, we will be at fault if they aren't; and so on.

Sure, I get that.  Would be awesome if all these things were clearly
documented somewhere but I've never been able to find it quite as
explicitly laid out as one would like.

> So that's what is making me nervous here ... it doesn't seem likely we
> have complete unanimity about whether XTS is the right thing, though
> that does seem to be the majority position certainly, and it is not
> really clear to me that any of us can speak with authority about what
> the requirements are around the nonces in particular.

The authority to look at, in my view anyway, is NIST publications.
Following a bit more digging, I came across something which makes sense
to me as intuitive but explains it in a way that might help everyone
understand a bit better what's going on here:

https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-38G.pdf

specifically: Appendix C: Tweaks

Quoting a couple of paragraphs from that appendix:

"""
In general, if there is information that is available and statically
associated with a plaintext, it is recommended to use that information
as a tweak for the plaintext. Ideally, the non-secret tweak associated
with a plaintext is associated only with that plaintext.

Extensive tweaking means that fewer plaintexts are encrypted under any
given tweak. This corresponds, in the security model that is described
in [1], to fewer queries to the target instance of the encryption.
"""

The gist of this being- the more diverse the tweaking being used, the
better.  That's where I was going with my "limit the risk" comment.  If
we can make the tweak vary more for a given encryption invocation,
that's going to be better, pretty much by definition, and as explained
in publications by NIST.

That isn't to say that using the same tweak for the same block over and
over "breaks" the encryption (unlike with CTR/GCM, where IV reuse leads
directly to plaintext being recoverable), but it does mean that an
observer who can see the block writes over time could see what parts are
changing (and which aren't) and may be able to derive insight from that.
Now, as I mentioned before, that particular case isn't something that
XTS is particularly good at and that's generally accepted, yet lots of
folks use XTS anyway because the concern isn't "someone has root access
on the box and is watching all block writes" but rather "laptop was
stolen" where the attacker doesn't get to see multiple writes where the
same key+tweak has been used, and the latter is really the attack vector
we're looking to address with XTS too.
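
A toy demonstration of both halves of that trade-off (illustrative
only; assumes OpenSSL and -lcrypto): with a fixed key and tweak, XTS
is deterministic, so repeated writes of the same page are
recognizable, while varying the tweak (say, with the LSN) removes
that:

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>
    #include <openssl/evp.h>

    #define PAGESZ 8192

    static void
    xts_encrypt(const unsigned char *key, const unsigned char *tweak,
                const unsigned char *in, unsigned char *out)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int         len;

        EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak);
        EVP_EncryptUpdate(ctx, out, &len, in, PAGESZ);
        EVP_EncryptFinal_ex(ctx, out + len, &len);
        EVP_CIPHER_CTX_free(ctx);
    }

    int
    main(void)
    {
        unsigned char key[64], tweak[16] = {0};
        unsigned char page[PAGESZ] = {0};
        unsigned char c1[PAGESZ], c2[PAGESZ], c3[PAGESZ];
        uint64_t    lsn = UINT64_C(0x0123456789abcdef); /* stand-in LSN */
        int         i;

        for (i = 0; i < 64; i++)    /* keep the two key halves distinct */
            key[i] = (unsigned char) i;

        xts_encrypt(key, tweak, page, c1);
        xts_encrypt(key, tweak, page, c2);      /* same key and tweak... */
        assert(memcmp(c1, c2, PAGESZ) == 0);    /* ...identical ciphertext */

        memcpy(tweak + 8, &lsn, 8);             /* fold the LSN into the tweak */
        xts_encrypt(key, tweak, page, c3);
        assert(memcmp(c1, c3, PAGESZ) != 0);    /* ciphertext now differs */
        return 0;
    }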

Thanks,

Stephen


Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Oct 7, 2021 at 12:57 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Yes, for integrity verification (also known as 'authenticated
> > encryption') we'd definitely need to store a larger nonce value.  In the
> > very, very long term, I think it'd be great to have that, and the patch
> > proposed on this thread seems really cool as a way to get us there.
>
> OK. I'm not sure why that has to be relegated to the very, very long
> term, but I'm really very happy to hear that you think the approach is
> cool.

Folks are shy about a page format change and I get that.

> > Having TDE, even without authenticated encryption, is certainly
> > valuable.  Reducing the amount of engineering required to get there is
> > worthwhile.  Implementing TDE w/ XTS or similar, provided we do agree
> > that we can do so with an IV that we don't need to find additional space
> > for, would avoid that page-level format change.  I agree we should do
> > some research to make sure we at least have a reasonable answer to that
> > question.  I've spent a bit of time on that and haven't gotten to a sure
> > answer one way or the other as yet, but will continue to look.
>
> I mean, I feel like this meeting that Bruce was talking about was
> perhaps making decisions in the wrong order. We have to decide which
> encryption mode is secure enough for our needs FIRST, and then AFTER
> that we can decide whether we need to store a nonce in the page. Now
> if it turns out that we can do either with or without a nonce in the
> page, then I'm just as happy as anyone else to start with the method
> that works without a nonce in the page, because like you say, that's
> less work. But unless we've established that such a method is actually
> going to survive scrutiny by smart cryptographers, we can't really
> decide that storing the nonce is off the table. And it doesn't seem
> like we've established that yet.

Part of the meeting was specifically about "why are we doing this?" and
there were a few different answers- first and foremost was "because
people are asking for it", from which followed that, yes, in many cases
it's to satisfy an audit or similar requirement which any of the
proposed methods would address.  There was further discussion that we
could address *more* cases by providing something better, but the page
format changes were weighed against that and the general consensus was
that we should attack the simpler problem first and, potentially, gain
a solution for 90% of the folks asking for it, and then later see if
there's enough interest and desire to attack the remaining 10%.

As such, it's just not as simple as "what is 'secure enough'" because it
depends on who you're talking to.  Based on the collective discussion at
the meeting, XTS is 'secure enough' for the needs of probably 90% of
those asking, while the other 10% want better (an AEAD method such as
GCM or GCM-SIV).  Therefore, what should we do?  Spend all of the extra
resources and engineering effort to address the 10% and maybe not get
anything because of the level of difficulty, or go the simpler route
first and get the 90%?  Through that lens, the choice seemed reasonably
clear, at least to me, hence why I agreed that we should work on an XTS
based approach first.

(Admittedly, the overall discussion wasn't quite as specific as XTS vs.
GCM-SIV, but the gist was "page format change" vs. "no page format
change" and that seems to equate, based on this subsequent discussion
to the choice between XTS and GCM/GCM-SIV.)

Thanks!

Stephen


Re: storing an explicit nonce

From
Antonin Houska
Date:
Stephen Frost <sfrost@snowman.net> wrote:

> Greetings,
>
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > On Thu, Oct 7, 2021 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > While I certainly also appreciate that we want to get this as right as
> > > we possibly can from the start, I strongly suspect we'll have one of two
> > > reactions- either we'll be more-or-less ignored and it'll be crickets
> > > from the security folks, or we're going to get beat up by them for
> > > $reasons, almost regardless of what we actually do.  Best bet to
> > > limit the risk ( ;) ) of the latter happening would be to try our best
> > > to do what existing solutions already do- such as by using XTS.
> > > There's things we can do to limit the risk of known-plaintext attacks,
> > > like simply not encrypting empty pages, or about possible known-IV
> > > risks, like using the LSN as part of the IV/tweak.  Will we get
> > > everything?  Probably not, but I don't think that we're going to really
> > > go wrong by using XTS as it's quite popularly used today and it's
> > > explicitly used for cases where you haven't got a place to store the
> > > extra nonce that you would need for AEAD encryption schemes.
> >
> > I agree that using a popular approach is a good way to go. If we do
> > what other people do, then hopefully our stuff won't be significantly
> > more broken than their stuff, and whatever is can be fixed.
>
> Right.
>
> > > As long as we're clear that this initial version of TDE is with XTS then
> > > I really don't think we'll end up with anyone showing up and saying we
> > > screwed up by not generating a per-page nonce to store with it- the point
> > > of XTS is that you don't need that.
> >
> > I agree that we shouldn't really catch flack for any weaknesses of the
> > underlying algorithm: if XTS turns out to be secure even when used
> > properly, and we use it properly, the resulting weakness is somebody
> > else's fault. On the other hand, if we use it improperly, that's our
> > fault, so we need to be really sure that we understand what guarantees
> > we need to provide from our end, and that we are providing them. Like
> > if we pick an encryption mode that requires nonces to be unique, we
> > will be at fault if they aren't; if it requires nonces to be
> > unpredictable, we will be at fault if they aren't; and so on.
>
> Sure, I get that.  Would be awesome if all these things were clearly
> documented somewhere but I've never been able to find it quite as
> explicitly laid out as one would like.
>
> > So that's what is making me nervous here ... it doesn't seem likely we
> > have complete unanimity about whether XTS is the right thing, though
> > that does seem to be the majority position certainly, and it is not
> > really clear to me that any of us can speak with authority about what
> > the requirements are around the nonces in particular.
>
> The authority to look at, in my view anyway, are NIST publications.
> Following a bit more digging, I came across something which makes sense
> to me as intuitive but explains it in a way that might help everyone
> understand a bit better what's going on here:
>
> https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-38G.pdf
>
> specifically: Appendix C: Tweaks
>
> Quoting a couple of paragraphs from that appendix:
>
> """
> In general, if there is information that is available and statically
> associated with a plaintext, it is recommended to use that information
> as a tweak for the plaintext. Ideally, the non-secret tweak associated
> with a plaintext is associated only with that plaintext.
>
> Extensive tweaking means that fewer plaintexts are encrypted under any
> given tweak. This corresponds, in the security model that is described
> in [1], to fewer queries to the target instance of the encryption.
> """
>
> The gist of this being- the more diverse the tweaking being used, the
> better.  That's where I was going with my "limit the risk" comment.  If
> we can make the tweak vary more for a given encryption invocation,
> that's going to be better, pretty much by definition, and as explained
> in publications by NIST.
>
> That isn't to say that using the same tweak for the same block over and
> over "breaks" the encryption (unlike with CTR/GCM, where IV reuse leads
> directly to plaintext being recoverable), but it does mean that an
> observer who can see the block writes over time could see what parts are
> changing (and which aren't) and may be able to derive insight from that.

This reminds me of Joe Conway's response to me email earlier:

https://www.postgresql.org/message-id/50335f56-041b-1a1f-59ea-b5f7bf917352%40joeconway.com

In the document he recommended

https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf

specifically, in the Appendix C I read:

"""
For the CBC and CFB modes, the IVs must be unpredictable.  In particular, for
any given plaintext, it must not be possible to predict the IV that will be
associated to the plaintext in advance of the generation of the IV.

There are two recommended methods for generating unpredictable IVs. The first
method is to apply the forward cipher function, under the same key that is
used for the encryption of the plaintext, to a nonce.  The nonce must be a
data block that is unique to each execution of the encryption operation. For
example, the nonce may be a counter, as described in Appendix B, or a message
number. The second method is to generate a random data block using a FIPS-
approved random number generator.
"""

This is about modes that include CBC, while the document you refer to seems to
deal with some other modes. So if we want to be confident that we use the XTS
mode correctly, more research is probably needed.
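
For what it's worth, the first of the two NIST methods is easy to sketch
with OpenSSL's EVP API.  This is only an illustration of the recommendation
(the function name is made up), not code from any posted patch:

    #include <openssl/evp.h>

    /*
     * NIST SP 800-38A, Appendix C, method 1: derive an unpredictable CBC
     * IV by applying the forward cipher function (one AES block, ECB, no
     * padding) under the encryption key to a unique 16-byte nonce.
     */
    static int
    derive_cbc_iv(const unsigned char *key,   /* 32-byte AES-256 key */
                  const unsigned char *nonce, /* 16-byte unique value */
                  unsigned char *iv)          /* receives the 16-byte IV */
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int len = 0, ok = 0;

        if (ctx != NULL &&
            EVP_EncryptInit_ex(ctx, EVP_aes_256_ecb(), NULL, key, NULL) == 1 &&
            EVP_CIPHER_CTX_set_padding(ctx, 0) == 1 &&
            EVP_EncryptUpdate(ctx, iv, &len, nonce, 16) == 1 &&
            len == 16)
            ok = 1;
        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }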

> Now, as I mentioned before, that particular case isn't something that
> XTS is particularly good at and that's generally accepted, yet lots of
> folks use XTS anyway because the concern isn't "someone has root access
> on the box and is watching all block writes" but rather "laptop was
> stolen" where the attacker doesn't get to see multiple writes where the
> same key+tweak has been used, and the latter is really the attack vector
> we're looking to address with XTS too.

I've heard a few times that a database running in a cloud is also a valid use
case for TDE.  In that case I think it should be expected that "someone has
root access on the box and is watching all block writes".

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Antonin Houska (ah@cybertec.at) wrote:
> Stephen Frost <sfrost@snowman.net> wrote:
> > * Robert Haas (robertmhaas@gmail.com) wrote:
> > > On Thu, Oct 7, 2021 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > > While I certainly also appreciate that we want to get this as right as
> > > > we possibly can from the start, I strongly suspect we'll have one of two
> > > > reactions- either we'll be more-or-less ignored and it'll be crickets
> > > > from the security folks, or we're going to get beat up by them for
> > > > $reasons, almost regardless of what we actually do.  Best bet to
> > > > limit the risk ( ;) ) of the latter happening would be to try our best
> > > > to do what existing solutions already do- such as by using XTS.
> > > > There's things we can do to limit the risk of known-plaintext attacks,
> > > > like simply not encrypting empty pages, or about possible known-IV
> > > > risks, like using the LSN as part of the IV/tweak.  Will we get
> > > > everything?  Probably not, but I don't think that we're going to really
> > > > go wrong by using XTS as it's quite popularly used today and it's
> > > > explicitly used for cases where you haven't got a place to store the
> > > > extra nonce that you would need for AEAD encryption schemes.
> > >
> > > I agree that using a popular approach is a good way to go. If we do
> > > what other people do, then hopefully our stuff won't be significantly
> > > more broken than their stuff, and whatever is can be fixed.
> >
> > Right.
> >
> > > > As long as we're clear that this initial version of TDE is with XTS then
> > > > I really don't think we'll end up with anyone showing up and saying we
> > > > screwed up by not generating a per-page nonce to store with it- the point
> > > > of XTS is that you don't need that.
> > >
> > > I agree that we shouldn't really catch flack for any weaknesses of the
> > > underlying algorithm: if XTS turns out to be secure even when used
> > > properly, and we use it properly, the resulting weakness is somebody
> > > else's fault. On the other hand, if we use it improperly, that's our
> > > fault, so we need to be really sure that we understand what guarantees
> > > we need to provide from our end, and that we are providing them. Like
> > > if we pick an encryption mode that requires nonces to be unique, we
> > > will be at fault if they aren't; if it requires nonces to be
> > > unpredictable, we will be at fault if they aren't; and so on.
> >
> > Sure, I get that.  Would be awesome if all these things were clearly
> > documented somewhere but I've never been able to find it quite as
> > explicitly laid out as one would like.
> >
> > > So that's what is making me nervous here ... it doesn't seem likely we
> > > have complete unanimity about whether XTS is the right thing, though
> > > that does seem to be the majority position certainly, and it is not
> > > really clear to me that any of us can speak with authority about what
> > > the requirements are around the nonces in particular.
> >
> > The authority to look at, in my view anyway, are NIST publications.
> > Following a bit more digging, I came across something which makes sense
> > to me as intuitive but explains it in a way that might help everyone
> > understand a bit better what's going on here:
> >
> > https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-38G.pdf
> >
> > specifically: Appendix C: Tweaks
> >
> > Quoting a couple of paragraphs from that appendix:
> >
> > """
> > In general, if there is information that is available and statically
> > associated with a plaintext, it is recommended to use that information
> > as a tweak for the plaintext. Ideally, the non-secret tweak associated
> > with a plaintext is associated only with that plaintext.
> >
> > Extensive tweaking means that fewer plaintexts are encrypted under any
> > given tweak. This corresponds, in the security model that is described
> > in [1], to fewer queries to the target instance of the encryption.
> > """
> >
> > The gist of this being- the more diverse the tweaking being used, the
> > better.  That's where I was going with my "limit the risk" comment.  If
> > we can make the tweak vary more for a given encryption invocation,
> > that's going to be better, pretty much by definition, and as explained
> > in publications by NIST.
> >
> > That isn't to say that using the same tweak for the same block over and
> > over "breaks" the encryption (unlike with CTR/GCM, where IV reuse leads
> > directly to plaintext being recoverable), but it does mean that an
> > observer who can see the block writes over time could see what parts are
> > changing (and which aren't) and may be able to derive insight from that.
>
> This reminds me of Joe Conway's response to my email earlier:
>
> https://www.postgresql.org/message-id/50335f56-041b-1a1f-59ea-b5f7bf917352%40joeconway.com
>
> In the document he recommended
>
> https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf
>
> specifically, in Appendix C I read:
>
> """
> For the CBC and CFB modes, the IVs must be unpredictable.  In particular, for
> any given plaintext, it must not be possible to predict the IV that will be
> associated to the plaintext in advance of the generation of the IV.
>
> There are two recommended methods for generating unpredictable IVs. The first
> method is to apply the forward cipher function, under the same key that is
> used for the encryption of the plaintext, to a nonce.  The nonce must be a
> data block that is unique to each execution of the encryption operation. For
> example, the nonce may be a counter, as described in Appendix B, or a message
> number. The second method is to generate a random data block using a FIPS-
> approved random number generator.
> """
>
> This is about modes that include CBC, while the document you refer to seems to
> deal with some other modes. So if we want to be confident that we use the XTS
> mode correctly, more research is probably needed.

What I think is missing from this discussion is the fact that, with XTS
(and XEX, on which XTS is built), the IV *is* run through a forward
cipher function, just as suggested above needs to be done for CBC.  I
don't see any reason to doubt that OpenSSL is correctly doing that.

This article shows this pretty clearly:

https://en.wikipedia.org/wiki/Disk_encryption_theory

I don't think that changes the fact that, if we're able to, we should be
varying the tweak/IV as often as we can, and including the LSN seems
like a good way to do just that.
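
To make that concrete, a hypothetical sketch of such a tweak is below.  The
choice of fields (relfilenode, block number, LSN) and their layout are
assumptions for illustration only, not what any posted patch actually does:

    #include <stdint.h>
    #include <string.h>

    /*
     * Hypothetical 16-byte XTS tweak: identifiers that are statically
     * associated with the page, plus the page LSN so the tweak varies
     * across writes.  Layout is illustrative only.
     */
    static void
    make_xts_tweak(uint8_t tweak[16], uint32_t relfilenode,
                   uint32_t blocknum, uint64_t lsn)
    {
        memcpy(tweak, &relfilenode, 4);   /* which relation */
        memcpy(tweak + 4, &blocknum, 4);  /* which block in the relation */
        memcpy(tweak + 8, &lsn, 8);       /* varies with each logged write */
    }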

Now, all that said, I'm all for looking at what others do to inform us
as to the right way to go about things and the above article lists a
number of users of XTS which we could go look at:

XTS is supported by BestCrypt, Botan, NetBSD's cgd,[13] dm-crypt,
FreeOTFE, TrueCrypt, VeraCrypt,[14] DiskCryptor, FreeBSD's geli, OpenBSD
softraid disk encryption software, OpenSSL, Mac OS X Lion's FileVault 2,
Windows 10's BitLocker[15] and wolfCrypt.

> > Now, as I mentioned before, that particular case isn't something that
> > XTS is particularly good at and that's generally accepted, yet lots of
> > folks use XTS anyway because the concern isn't "someone has root access
> > on the box and is watching all block writes" but rather "laptop was
> > stolen" where the attacker doesn't get to see multiple writes where the
> > same key+tweak has been used, and the latter is really the attack vector
> > we're looking to address with XTS too.
>
> I've heard a few times that a database running in a cloud is also a valid use
> case for TDE.  In that case I think it should be expected that "someone has
> root access on the box and is watching all block writes".

Except that it isn't.  If you're using someone else's computer, they're
going to be able to look into shared buffers at tons of unencrypted
data, including the keys to decrypt everything.  That doesn't mean we
shouldn't try to be good about using a different IV to make it harder on
someone who has somehow gotten access to watch the writes go by, but TDE
isn't a solution to protect someone from their cloud provider gaining
access to their data.

Thanks,

Stephen


Re: storing an explicit nonce

From
Sasasu
Date:
On 2021/10/6 23:01, Robert Haas wrote:
 > This seems wrong to me. CTR requires that you not reuse the IV. If you
 > re-encrypt the page with a different IV, torn pages are a problem. If
 > you re-encrypt it with the same IV, then it's not secure any more.

For CBC, a predictable IV enables a "dictionary attack",
and for CBC and GCM, reusing an IV enables a "known plaintext attack".
XTS works like CBC but adds a tweak step.  The tweak step does not add
randomness, which means XTS is still subject to a "known plaintext attack",
for the same reason as CBC.
Many earlier mails in this thread explain this clearly; I am just repeating. :>
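
A minimal sketch of that property, assuming OpenSSL's EVP_aes_256_xts()
(illustration only, not patch code): encrypting the same data unit twice
under the same key and tweak yields identical ciphertext, which is what
lets an observer of repeated writes see which blocks changed.

    #include <openssl/evp.h>
    #include <stdio.h>
    #include <string.h>

    /* Encrypt one data unit with AES-256-XTS; the key is 64 bytes (two
     * AES-256 keys) and the tweak is passed as the 16-byte "IV". */
    static int
    xts_encrypt(const unsigned char key[64], const unsigned char tweak[16],
                const unsigned char *in, int inlen, unsigned char *out)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int len = 0, ok = 0;

        if (ctx != NULL &&
            EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak) == 1 &&
            EVP_EncryptUpdate(ctx, out, &len, in, inlen) == 1)
            ok = 1;
        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }

    int
    main(void)
    {
        unsigned char key[64] = {1};    /* toy key; the two halves differ */
        unsigned char tweak[16] = {2};  /* fixed tweak, as for one block */
        unsigned char pt[32], c1[32], c2[32];

        memset(pt, 'x', sizeof(pt));    /* same plaintext both times */
        if (xts_encrypt(key, tweak, pt, 32, c1) &&
            xts_encrypt(key, tweak, pt, 32, c2))
            printf("ciphertexts %s\n",
                   memcmp(c1, c2, 32) == 0 ? "match" : "differ");
        return 0;   /* prints "match": XTS is deterministic per tweak */
    }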

On 2021/10/7 22:28, Robert Haas wrote:
 > I'm a little concerned by the email from "Sasasu" saying that even in
 > XTS reusing the IV is not cryptographically weak. I don't know enough
 > about these different encryption modes to know if he's right, but if
 > he is then perhaps we need to consider his suggestion of using
 > AES-GCM. Or, uh, something else.

A cryptographic algorithm is specified against a list of attack methods
(its scope); if the algorithm defends against those attacks, the algorithm
is good.  If software using the algorithm is attacked by a method not on
that list, then the software is either using the algorithm incorrectly or
should not be using that algorithm at all.

On 2021/10/8 03:38, Stephen Frost wrote:
 >   I strongly suspect we'll have one of two
 > reactions- either we'll be more-or-less ignored and it'll be crickets
 > from the security folks, or we're going to get beat up by them for
 > $reasons, almost regardless of what we actually do.  Best bet to
 > limit the risk (;)  ) of the latter happening would be to try our best
 > to do what existing solutions already do- such as by using XTS.

If you use an existing, well-regarded algorithm outside its design scope,
cryptographers will laugh at you.

On 2021/10/9 02:34, Stephen Frost wrote:
> Greetings,
> 
> * Antonin Houska (ah@cybertec.at) wrote:
>> Stephen Frost <sfrost@snowman.net> wrote:
>>> * Robert Haas (robertmhaas@gmail.com) wrote:
>>>> On Thu, Oct 7, 2021 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote:
>>>>> While I certainly also appreciate that we want to get this as right as
>>>>> we possibly can from the start, I strongly suspect we'll have one of two
>>>>> reactions- either we'll be more-or-less ignored and it'll be crickets
>>>>> from the security folks, or we're going to get beat up by them for
>>>>> $reasons, almost regardless of what we actually do.  Best bet to
>>>>> limit the risk ( ;) ) of the latter happening would be to try our best
>>>>> to do what existing solutions already do- such as by using XTS.
>>>>> There's things we can do to limit the risk of known-plaintext attacks,
>>>>> like simply not encrypting empty pages, or about possible known-IV
>>>>> risks, like using the LSN as part of the IV/tweak.  Will we get
>>>>> everything?  Probably not, but I don't think that we're going to really
>>>>> go wrong by using XTS as it's quite popularly used today and it's
>>>>> explicitly used for cases where you haven't got a place to store the
>>>>> extra nonce that you would need for AEAD encryption schemes.
>>>>
>>>> I agree that using a popular approach is a good way to go. If we do
>>>> what other people do, then hopefully our stuff won't be significantly
>>>> more broken than their stuff, and whatever is can be fixed.
>>>
>>> Right.
>>>
>>>>> As long as we're clear that this initial version of TDE is with XTS then
>>>>> I really don't think we'll end up with anyone showing up and saying we
>>>>> screwed up by not generating a per-page nonce to store with it- the point
>>>>> of XTS is that you don't need that.
>>>>
>>>> I agree that we shouldn't really catch flack for any weaknesses of the
>>>> underlying algorithm: if XTS turns out to be secure even when used
>>>> properly, and we use it properly, the resulting weakness is somebody
>>>> else's fault. On the other hand, if we use it improperly, that's our
>>>> fault, so we need to be really sure that we understand what guarantees
>>>> we need to provide from our end, and that we are providing them. Like
>>>> if we pick an encryption mode that requires nonces to be unique, we
>>>> will be at fault if they aren't; if it requires nonces to be
>>>> unpredictable, we will be at fault if they aren't; and so on.
>>>
>>> Sure, I get that.  Would be awesome if all these things were clearly
>>> documented somewhere but I've never been able to find it quite as
>>> explicitly laid out as one would like.
>>>
>>>> So that's what is making me nervous here ... it doesn't seem likely we
>>>> have complete unanimity about whether XTS is the right thing, though
>>>> that does seem to be the majority position certainly, and it is not
>>>> really clear to me that any of us can speak with authority about what
>>>> the requirements are around the nonces in particular.
>>>
>>> The authority to look at, in my view anyway, are NIST publications.
>>> Following a bit more digging, I came across something which makes sense
>>> to me as intuitive but explains it in a way that might help everyone
>>> understand a bit better what's going on here:
>>>
>>> https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-38G.pdf
>>>
>>> specifically: Appendix C: Tweaks
>>>
>>> Quoting a couple of paragraphs from that appendix:
>>>
>>> """
>>> In general, if there is information that is available and statically
>>> associated with a plaintext, it is recommended to use that information
>>> as a tweak for the plaintext. Ideally, the non-secret tweak associated
>>> with a plaintext is associated only with that plaintext.
>>>
>>> Extensive tweaking means that fewer plaintexts are encrypted under any
>>> given tweak. This corresponds, in the security model that is described
>>> in [1], to fewer queries to the target instance of the encryption.
>>> """
>>>
>>> The gist of this being- the more diverse the tweaking being used, the
>>> better.  That's where I was going with my "limit the risk" comment.  If
>>> we can make the tweak vary more for a given encryption invocation,
>>> that's going to be better, pretty much by definition, and as explained
>>> in publications by NIST.
>>>
>>> That isn't to say that using the same tweak for the same block over and
>>> over "breaks" the encryption (unlike with CTR/GCM, where IV reuse leads
>>> directly to plaintext being recoverable), but it does mean that an
>>> observer who can see the block writes over time could see what parts are
>>> changing (and which aren't) and may be able to derive insight from that.
>>
>> This reminds me of Joe Conway's response to my email earlier:
>>
>> https://www.postgresql.org/message-id/50335f56-041b-1a1f-59ea-b5f7bf917352%40joeconway.com
>>
>> In the document he recommended
>>
>> https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf
>>
>> specifically, in Appendix C I read:
>>
>> """
>> For the CBC and CFB modes, the IVs must be unpredictable.  In particular, for
>> any given plaintext, it must not be possible to predict the IV that will be
>> associated to the plaintext in advance of the generation of the IV.
>>
>> There are two recommended methods for generating unpredictable IVs. The first
>> method is to apply the forward cipher function, under the same key that is
>> used for the encryption of the plaintext, to a nonce.  The nonce must be a
>> data block that is unique to each execution of the encryption operation. For
>> example, the nonce may be a counter, as described in Appendix B, or a message
>> number. The second method is to generate a random data block using a FIPS-
>> approved random number generator.
>> """
>>
>> This is about modes that include CBC, while the document you refer to seems to
>> deal with some other modes. So if we want to be confident that we use the XTS
>> mode correctly, more research is probably needed.
> 
> What I think is missing from this discussion is the fact that, with XTS
> (and XEX, on which XTS is built), the IV *is* run through a forward
> cipher function, just as suggested above needs to be done for CBC.  I
> don't see any reason to doubt that OpenSSL is correctly doing that.
> 
> This article shows this pretty clearly:
> 
> https://en.wikipedia.org/wiki/Disk_encryption_theory
> 
> I don't think that changes the fact that, if we're able to, we should be
> varying the tweak/IV as often as we can, and including the LSN seems
> like a good way to do just that.
> 
> Now, all that said, I'm all for looking at what others do to inform us
> as to the right way to go about things and the above article lists a
> number of users of XTS which we could go look at:
> 
> XTS is supported by BestCrypt, Botan, NetBSD's cgd,[13] dm-crypt,
> FreeOTFE, TrueCrypt, VeraCrypt,[14] DiskCryptor, FreeBSD's geli, OpenBSD
> softraid disk encryption software, OpenSSL, Mac OS X Lion's FileVault 2,
> Windows 10's BitLocker[15] and wolfCrypt.
> 
>>> Now, as I mentioned before, that particular case isn't something that
>>> XTS is particularly good at and that's generally accepted, yet lots of
>>> folks use XTS anyway because the concern isn't "someone has root access
>>> on the box and is watching all block writes" but rather "laptop was
>>> stolen" where the attacker doesn't get to see multiple writes where the
>>> same key+tweak has been used, and the latter is really the attack vector
>>> we're looking to address with XTS too.
>>
>> I've heard a few times that a database running in a cloud is also a valid use
>> case for TDE.  In that case I think it should be expected that "someone has
>> root access on the box and is watching all block writes".
> 
> Except that it isn't.  If you're using someone else's computer, they're
> going to be able to look into shared buffers at tons of unencrypted
> data, including the keys to decrypt everything.  That doesn't mean we
> shouldn't try to be good about using a different IV to make it harder on
> someone who has somehow gotten access to watch the writes go by, but TDE
> isn't a solution to protect someone from their cloud provider gaining
> access to their data.
> 
> Thanks,
> 
> Stephen
> 


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Thu, Oct  7, 2021 at 11:32:07PM -0400, Stephen Frost wrote:
> Part of the meeting was specifically about "why are we doing this?" and
> there were a few different answers- first and foremost was "because
> people are asking for it", from which followed that, yes, in many cases
> it's to satisfy an audit or similar requirement which any of the
> proposed methods would address.  There was further discussion that we

Yes, Cybertec's experience with their TDE patch's adoption supported
this.

> could address *more* cases by providing something better, but the page
> format changes were weighed against that and the general consensus was
> that we should attack the simpler problem first and, potentially, gain
> a solution for 90% of the folks asking for it, and then later see if
> there's enough interest and desire to attack the remaining 10%.

It is more than just the page format --- it would also be the added
code, possible performance impact, and later code maintenance to allow
for a more complex page format, or for two different page formats.

As an example, I think the online checksum patch failed because its
authors weren't happy with that 90% and went for the extra 10% of
restartability, but once you saw the 100% solution, the patch was too big
and was rejected.

> As such, it's just not so simple as "what is 'secure enough'" because it
> depends on who you're talking to.  Based on the collective discussion at
> the meeting, XTS is 'secure enough' for the needs of probably 90% of
> those asking, while the other 10% want better (an AEAD method such as
> GCM or GCM-SIV).  Therefore, what should we do?  Spend all of the extra
> resources and engineering effort to address the 10% and maybe not get
> anything because of the level of difficulty, or go the simpler route
> first and get the 90%?  Through that lens, the choice seemed reasonably
> clear, at least to me, hence why I agreed that we should work on an XTS
> based approach first.

Yes, that was the conclusion.  I think it helped to have the discussion
verbally with everyone hearing every word, rather than via email where
people jump into the discussion not hearing earlier points.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Thu, Oct  7, 2021 at 11:32:07PM -0400, Stephen Frost wrote:
> > Part of the meeting was specifically about "why are we doing this?" and
> > there were a few different answers- first and foremost was "because
> > people are asking for it", from which followed that, yes, in many cases
> > it's to satisfy an audit or similar requirement which any of the
> > proposed methods would address.  There was further discussion that we
>
> Yes, Cybertec's experience with their TDE patch's adoption supported
> this.
>
> > could address *more* cases by providing something better, but the page
> > format changes were weighed against that and the general consensus was
> > that we should attack the simpler problem first and, potentially, gain
> > a solution for 90% of the folks asking for it, and then later see if
> > there's enough interest and desire to attack the remaining 10%.
>
> It is more than just the page format --- it would also be the added
> code, possible performance impact, and later code maintenance to allow
> for a more complex page format, or for two different page formats.

Yes, there is more to it than just the page format, I agree.  I'm still
of the mind that it's something we're going to get to eventually, if for
no other reason than that our current page format is certainly not
perfect and it'd be pretty awesome if we could make improvements to it
(independently of TDE or anything else discussed currently).

> As an example, I think the online checksum patch failed because its
> authors weren't happy with that 90% and went for the extra 10% of
> restartability, but once you saw the 100% solution, the patch was too big
> and was rejected.

I'm, at least, still hopeful that we get the online checksum patch done.
I'm not sure that I agree that this was 'the' reason it didn't make it
in, but I don't think it'd be helpful to tangent this thread to
discussing some other patch.

> > As such, it's just not so simple as "what is 'secure enough'" because it
> > depends on who you're talking to.  Based on the collective discussion at
> > the meeting, XTS is 'secure enough' for the needs of probably 90% of
> > those asking, while the other 10% want better (an AEAD method such as
> > GCM or GCM-SIV).  Therefore, what should we do?  Spend all of the extra
> > resources and engineering effort to address the 10% and maybe not get
> > anything because of the level of difficulty, or go the simpler route
> > first and get the 90%?  Through that lens, the choice seemed reasonably
> > clear, at least to me, hence why I agreed that we should work on an XTS
> > based approach first.
>
> Yes, that was the conclusion.  I think it helped to have the discussion
> verbally with everyone hearing every word, rather than via email where
> people jump into the discussion not hearing earlier points.

Yes, agreed.  Certainly am hopeful that we are able to have more of
those in the (relatively) near future too!

Thanks!

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Fri, Oct  8, 2021 at 02:34:20PM -0400, Stephen Frost wrote:
> What I think is missing from this discussion is the fact that, with XTS
> (and XEX, on which XTS is built), the IV *is* run through a forward
> cipher function, just as suggested above needs to be done for CBC.  I
> don't see any reason to doubt that OpenSSL is correctly doing that.
> 
> This article shows this pretty clearly:
> 
> https://en.wikipedia.org/wiki/Disk_encryption_theory
> 
> I don't think that changes the fact that, if we're able to, we should be
> varying the tweak/IV as often as we can, and including the LSN seems
> like a good way to do just that.

Keep in mind that in our existing code (not my patch), the LSN is zero
for unlogged relations, a fixed value for some GiST index pages, and
unchanged for some hint bit changes.  Therefore, while we can include
the LSN in the IV because it _might_ help, we can't rely on it.

We probably need to have a discussion about whether the LSN and checksum
should be encrypted on the page.  I think we are currently leaning toward
not encrypting the LSN, because we can use it as part of the nonce (where
it is variable), and toward encrypting the checksum for rudimentary
integrity checking.
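
A rough sketch of that layout, assuming the standard page header where the
8-byte pd_lsn is the first field on the page (encrypt_bytes() here is a
hypothetical stand-in for the XTS call, not a real function):

    #include <stddef.h>
    #include <stdint.h>

    #define BLCKSZ 8192
    #define PD_LSN_SIZE 8   /* pd_lsn is the first 8 bytes of the header */

    /* Hypothetical stand-in for an XTS encryption call. */
    extern void encrypt_bytes(uint8_t *buf, size_t len,
                              const uint8_t tweak[16]);

    /*
     * Encrypt everything on the page except pd_lsn, so the plaintext LSN
     * stays available for use as part of the tweak; the checksum, later
     * in the header, is encrypted along with the rest of the page.
     */
    static void
    encrypt_page_except_lsn(uint8_t *page, const uint8_t tweak[16])
    {
        encrypt_bytes(page + PD_LSN_SIZE, BLCKSZ - PD_LSN_SIZE, tweak);
    }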

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Mon, Oct 11, 2021 at 01:01:08PM -0400, Stephen Frost wrote:
> > It is more than just the page format --- it would also be the added
> > code, possible performance impact, and later code maintenance to allow
> > for a more complex page format, or for two different page formats.
> 
> Yes, there is more to it than just the page format, I agree.  I'm still
> of the mind that it's something we're going to get to eventually, if for
> no other reason than that our current page format is certainly not
> perfect and it'd be pretty awesome if we could make improvements to it
> (independently of TDE or anything else discussed currently).

Yes, 100% agree on that.  The good part is that TDE would not be paying
the cost for that.  ;-)

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Fri, Oct  8, 2021 at 02:34:20PM -0400, Stephen Frost wrote:
> > What I think is missing from this discussion is the fact that, with XTS
> > (and XEX, on which XTS is built), the IV *is* run through a forward
> > cipher function, just as suggested above needs to be done for CBC.  I
> > don't see any reason to doubt that OpenSSL is correctly doing that.
> >
> > This article shows this pretty clearly:
> >
> > https://en.wikipedia.org/wiki/Disk_encryption_theory
> >
> > I don't think that changes the fact that, if we're able to, we should be
> > varying the tweak/IV as often as we can, and including the LSN seems
> > like a good way to do just that.
>
> Keep in mind that in our existing code (not my patch), the LSN is zero
> for unlogged relations, a fixed value for some GiST index pages, and
> unchanged for some hint bit changes.  Therefore, while we can include
> the LSN in the IV because it _might_ help, we can't rely on it.

Regarding unlogged LSNs at least, I would think that we'd want to
actually use GetFakeLSNForUnloggedRel() instead of just having it zero'd
out.  The fixed value for GiST index pages is just during the index
build process, as I recall, and that's perhaps less of a concern.  Part
of the point of using XTS is to avoid the issue of the LSN not being
changed when hint bits are, or more generally not being unique in
various cases.

> We probably need to have a discussion about whether the LSN and checksum
> should be encrypted on the page.  I think we are currently leaning toward
> not encrypting the LSN, because we can use it as part of the nonce (where
> it is variable), and toward encrypting the checksum for rudimentary
> integrity checking.

Yes, that's the direction that I was thinking also and specifically with
XTS as the encryption algorithm to allow us to exclude the LSN but keep
everything else, and to address the concern around the nonce/tweak/etc
being the same sometimes across multiple writes.  Another thing to
consider is if we want to encrypt zero'd page.  There was a point
brought up that if we do then we are encrypting a fair bit of very
predictable bytes and that's not great (though there's a fair bit about
our pages that someone could quite possibly predict anyway based on
table structures and such...).  I would think that if it's easy enough
to not encrypt zero'd pages that we should avoid doing so.  Don't recall
offhand which way zero'd pages were being handled already but thought it
made sense to mention that as part of this discussion.

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Mon, Oct 11, 2021 at 01:30:38PM -0400, Stephen Frost wrote:
> Greetings,
> 
> > Keep in mind that in our existing code (not my patch), the LSN is zero
> > for unlogged relations, a fixed value for some GiST index pages, and
> > unchanged for some hint bit changes.  Therefore, while we can include
> > the LSN in the IV because it _might_ help, we can't rely on it.
> 
> Regarding unlogged LSNs at least, I would think that we'd want to
> actually use GetFakeLSNForUnloggedRel() instead of just having it zero'd
> out.  The fixed value for GiST index pages is just during the index

Good idea.  For my patch I had to use a WAL-logged dummy LSN, but for
our use, re-using a fake LSN after a crash seems fine, so we can just
use the existing GetFakeLSNForUnloggedRel().  However, we might need to
use the part of my patch that removes the assumption that unlogged
relations always have zero LSNs, because right now fake LSNs are only
used for GiST indexes --- I would have to research that more.

> Yes, that's the direction that I was thinking also and specifically with
> XTS as the encryption algorithm to allow us to exclude the LSN but keep
> everything else, and to address the concern around the nonce/tweak/etc
> being the same sometimes across multiple writes.  Another thing to
> consider is if we want to encrypt zero'd page.  There was a point
> brought up that if we do then we are encrypting a fair bit of very
> predictable bytes and that's not great (though there's a fair bit about
> our pages that someone could quite possibly predict anyway based on
> table structures and such...).  I would think that if it's easy enough
> to not encrypt zero'd pages that we should avoid doing so.  Don't recall
> offhand which way zero'd pages were being handled already but thought it
> made sense to mention that as part of this discussion.

Yeah, I wanted to mention that.  I don't see any security difference
between fully-zero pages, pages with headers and no tuples, and pages
with headers and only a few tuples.  If any of those are insecure, they
all are.  Therefore, I don't see any reason to treat them differently.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Ants Aasma
Date:
On Mon, 11 Oct 2021 at 22:15, Bruce Momjian <bruce@momjian.us> wrote:
> > Yes, that's the direction that I was thinking also and specifically with
> > XTS as the encryption algorithm to allow us to exclude the LSN but keep
> > everything else, and to address the concern around the nonce/tweak/etc
> > being the same sometimes across multiple writes.  Another thing to
> > consider is if we want to encrypt zero'd page.  There was a point
> > brought up that if we do then we are encrypting a fair bit of very
> > predictable bytes and that's not great (though there's a fair bit about
> > our pages that someone could quite possibly predict anyway based on
> > table structures and such...).  I would think that if it's easy enough
> > to not encrypt zero'd pages that we should avoid doing so.  Don't recall
> > offhand which way zero'd pages were being handled already but thought it
> > made sense to mention that as part of this discussion.
>
> Yeah, I wanted to mention that.  I don't see any security difference
> between fully-zero pages, pages with headers and no tuples, and pages
> with headers and only a few tuples.  If any of those are insecure, they
> all are.  Therefore, I don't see any reason to treat them differently.

We had to special case zero pages and not encrypt them because as far as I can tell, there is no atomic way to extend a file and initialize it to Enc(zero) in the same step.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, Oct 12, 2021 at 08:40:17AM +0300, Ants Aasma wrote:
> On Mon, 11 Oct 2021 at 22:15, Bruce Momjian <bruce@momjian.us> wrote:
> 
>     > Yes, that's the direction that I was thinking also and specifically with
>     > XTS as the encryption algorithm to allow us to exclude the LSN but keep
>     > everything else, and to address the concern around the nonce/tweak/etc
>     > being the same sometimes across multiple writes.  Another thing to
>     > consider is if we want to encrypt zero'd page.  There was a point
>     > brought up that if we do then we are encrypting a fair bit of very
>     > predictable bytes and that's not great (though there's a fair bit about
>     > our pages that someone could quite possibly predict anyway based on
>     > table structures and such...).  I would think that if it's easy enough
>     > to not encrypt zero'd pages that we should avoid doing so.  Don't recall
>     > offhand which way zero'd pages were being handled already but thought it
>     > made sense to mention that as part of this discussion.
> 
>     Yeah, I wanted to mention that.  I don't see any security difference
>     between fully-zero pages, pages with headers and no tuples, and pages
>     with headers and only a few tuples.  If any of those are insecure, they
>     all are.  Therefore, I don't see any reason to treat them differently.
> 
> 
> We had to special case zero pages and not encrypt them because as far as I can
> tell, there is no atomic way to extend a file and initialize it to Enc(zero) in
> the same step.

Oh, good point.  Yeah, we will need to handle that.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, Oct 12, 2021 at 08:40:17AM +0300, Ants Aasma wrote:
> > On Mon, 11 Oct 2021 at 22:15, Bruce Momjian <bruce@momjian.us> wrote:
> >
> >     > Yes, that's the direction that I was thinking also and specifically with
> >     > XTS as the encryption algorithm to allow us to exclude the LSN but keep
> >     > everything else, and to address the concern around the nonce/tweak/etc
> >     > being the same sometimes across multiple writes.  Another thing to
> >     > consider is if we want to encrypt zero'd page.  There was a point
> >     > brought up that if we do then we are encrypting a fair bit of very
> >     > predictable bytes and that's not great (though there's a fair bit about
> >     > our pages that someone could quite possibly predict anyway based on
> >     > table structures and such...).  I would think that if it's easy enough
> >     > to not encrypt zero'd pages that we should avoid doing so.  Don't recall
> >     > offhand which way zero'd pages were being handled already but thought it
> >     > made sense to mention that as part of this discussion.
> >
> >     Yeah, I wanted to mention that.  I don't see any security difference
> >     between fully-zero pages, pages with headers and no tuples, and pages
> >     with headers and only a few tuples.  If any of those are insecure, they
> >     all are.  Therefore, I don't see any reason to treat them differently.
> >
> >
> > We had to special case zero pages and not encrypt them because as far as I can
> > tell, there is no atomic way to extend a file and initialize it to Enc(zero) in
> > the same step.
>
> Oh, good point.  Yeah, we will need to handle that.

Not sure what's meant here by 'handle that', but I don't see any
particular reason to avoid doing exactly the same for zero pages with
TDE in core..?  I don't think there's any reason we need to make things
complicated to ensure that we encrypt entirely empty pages.

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, Oct 12, 2021 at 08:25:52AM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Tue, Oct 12, 2021 at 08:40:17AM +0300, Ants Aasma wrote:
> > > On Mon, 11 Oct 2021 at 22:15, Bruce Momjian <bruce@momjian.us> wrote:
> > > 
> > >     > Yes, that's the direction that I was thinking also and specifically with
> > >     > XTS as the encryption algorithm to allow us to exclude the LSN but keep
> > >     > everything else, and to address the concern around the nonce/tweak/etc
> > >     > being the same sometimes across multiple writes.  Another thing to
> > >     > consider is if we want to encrypt zero'd page.  There was a point
> > >     > brought up that if we do then we are encrypting a fair bit of very
> > >     > predictable bytes and that's not great (though there's a fair bit about
> > >     > our pages that someone could quite possibly predict anyway based on
> > >     > table structures and such...).  I would think that if it's easy enough
> > >     > to not encrypt zero'd pages that we should avoid doing so.  Don't recall
> > >     > offhand which way zero'd pages were being handled already but thought it
> > >     > made sense to mention that as part of this discussion.
> > > 
> > >     Yeah, I wanted to mention that.  I don't see any security difference
> > >     between fully-zero pages, pages with headers and no tuples, and pages
> > >     with headers and only a few tuples.  If any of those are insecure, they
> > >     all are.  Therefore, I don't see any reason to treat them differently.
> > > 
> > > 
> > > We had to special case zero pages and not encrypt them because as far as I can
> > > tell, there is no atomic way to extend a file and initialize it to Enc(zero) in
> > > the same step.
> > 
> > Oh, good point.  Yeah, we will need to handle that.
> 
> Not sure what's meant here by 'handle that', but I don't see any
> particular reason to avoid doing exactly the same for zero pages with
> TDE in core..?  I don't think there's any reason we need to make things
> complicated to ensure that we encrypt entirely empty pages.

I thought he was saying that when you extend a file, you might have to
extend it with all zeros, rather than being able to extend it with
an actual encrypted page of zeros.  For example, I think when a page is
corrupt in storage, it reads back as a fully zero page, and we would
need to handle that.  Are you saying we already have logic to handle
that so we don't need to change anything?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, Oct 12, 2021 at 08:25:52AM -0400, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > On Tue, Oct 12, 2021 at 08:40:17AM +0300, Ants Aasma wrote:
> > > > On Mon, 11 Oct 2021 at 22:15, Bruce Momjian <bruce@momjian.us> wrote:
> > > >
> > > >     > Yes, that's the direction that I was thinking also and specifically with
> > > >     > XTS as the encryption algorithm to allow us to exclude the LSN but keep
> > > >     > everything else, and to address the concern around the nonce/tweak/etc
> > > >     > being the same sometimes across multiple writes.  Another thing to
> > > >     > consider is if we want to encrypt zero'd page.  There was a point
> > > >     > brought up that if we do then we are encrypting a fair bit of very
> > > >     > predictable bytes and that's not great (though there's a fair bit about
> > > >     > our pages that someone could quite possibly predict anyway based on
> > > >     > table structures and such...).  I would think that if it's easy enough
> > > >     > to not encrypt zero'd pages that we should avoid doing so.  Don't recall
> > > >     > offhand which way zero'd pages were being handled already but thought it
> > > >     > made sense to mention that as part of this discussion.
> > > >
> > > >     Yeah, I wanted to mention that.  I don't see any security difference
> > > >     between fully-zero pages, pages with headers and no tuples, and pages
> > > >     with headers and only a few tuples.  If any of those are insecure, they
> > > >     all are.  Therefore, I don't see any reason to treat them differently.
> > > >
> > > >
> > > > We had to special case zero pages and not encrypt them because as far as I can
> > > > tell, there is no atomic way to extend a file and initialize it to Enc(zero) in
> > > > the same step.
> > >
> > > Oh, good point.  Yeah, we will need to handle that.
> >
> > Not sure what's meant here by 'handle that', but I don't see any
> > particular reason to avoid doing exactly the same for zero pages with
> > TDE in core..?  I don't think there's any reason we need to make things
> > complicated to ensure that we encrypt entirely empty pages.
>
> I thought he was saying that when you extend a file, you might have to
> extend it with all zeros, rather than being able to extend it with
> an actual encrypted page of zeros.  For example, I think when a page is
> corrupt in storage, it reads back as a fully zero page, and we would
> need to handle that.  Are you saying we already have logic to handle
> that so we don't need to change anything?

When we extend a file, it gets extended with all zeros.  PG already
handles that case, PG w/ TDE would need to also recognize that case
(which is what Ants was saying their patch does) and handle it.  In
other words, we just need to realize when a page is all zeros and not
try to decrypt it when we're reading it.  Ants' patch does that and my
recollection is that it wasn't very complicated to do, and that seems
much simpler than trying to figure out a way to ensure we do encrypt a
zero'd page as part of extending a file.
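
A minimal sketch of that recognition step (an assumption about the shape of
the check, not the actual patch code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define BLCKSZ 8192

    /*
     * A freshly extended (never written) block reads back as all zeros;
     * such pages are left alone rather than run through decryption.
     */
    static bool
    page_is_all_zero(const uint8_t *page)
    {
        for (size_t i = 0; i < BLCKSZ; i++)
            if (page[i] != 0)
                return false;
        return true;
    }

Decryption would simply be skipped when this returns true, which parallels
how an all-zero page is already accepted as valid by the existing page
verification code.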

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, Oct 12, 2021 at 08:49:28AM -0400, Stephen Frost wrote:
> * Bruce Momjian (bruce@momjian.us) wrote:
> > I thought he was saying that when you extend a file, you might have to
> > extend it with all zeros, rather than being able to extend it with
> > an actual encrypted page of zeros.  For example, I think when a page is
> > corrupt in storage, it reads back as a fully zero page, and we would
> > need to handle that.  Are you saying we already have logic to handle
> > that so we don't need to change anything?
> 
> When we extend a file, it gets extended with all zeros.  PG already
> handles that case, PG w/ TDE would need to also recognize that case
> (which is what Ants was saying their patch does) and handle it.  In
> other words, we just need to realize when a page is all zeros and not
> try to decrypt it when we're reading it.  Ants' patch does that and my
> recollection is that it wasn't very complicated to do, and that seems
> much simpler than trying to figure out a way to ensure we do encrypt a
> zero'd page as part of extending a file.

Well, how do you detect an all-zero page vs a page that encrypted to all
zeros?  I am thinking a zero LSN (which is not encrypted) would be the
only sure way, but we then have to make sure unlogged relations always
get a fake LSN.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Robert Haas
Date:
On Thu, Oct 7, 2021 at 11:05 PM Stephen Frost <sfrost@snowman.net> wrote:
> Sure, I get that.  Would be awesome if all these things were clearly
> documented somewhere but I've never been able to find it quite as
> explicitly laid out as one would like.

:-(

> specifically: Appendix C: Tweaks
>
> Quoting a couple of paragraphs from that appendix:
>
> """
> In general, if there is information that is available and statically
> associated with a plaintext, it is recommended to use that information
> as a tweak for the plaintext. Ideally, the non-secret tweak associated
> with a plaintext is associated only with that plaintext.
>
> Extensive tweaking means that fewer plaintexts are encrypted under any
> given tweak. This corresponds, in the security model that is described
> in [1], to fewer queries to the target instance of the encryption.
> """
>
> The gist of this being- the more diverse the tweaking being used, the
> better.  That's where I was going with my "limit the risk" comment.  If
> we can make the tweak vary more for a given encryption invocation,
> that's going to be better, pretty much by definition, and as explained
> in publications by NIST.

I mean I don't have anything against that appendix, but I think we
need to understand - with confidence - what the expectations are
specifically around XTS, and that appendix seems much more general
than that.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Robert Haas
Date:
On Mon, Oct 11, 2021 at 1:30 PM Stephen Frost <sfrost@snowman.net> wrote:
> Regarding unlogged LSNs at least, I would think that we'd want to
> actually use GetFakeLSNForUnloggedRel() instead of just having it zero'd
> out.  The fixed value for GiST index pages is just during the index
> build process, as I recall, and that's perhaps less of a concern.  Part
> of the point of using XTS is to avoid the issue of the LSN not being
> changed when hint bits are, or more generally not being unique in
> various cases.

I don't believe there's anything to prevent the fake-LSN counter from
overtaking the real end-of-WAL, and if that should happen, then the
buffer manager would get confused. Maybe that can be fixed by doing
some sort of surgery on the buffer manager, but it doesn't seem to be
a trivial or ignorable problem.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Oct 11, 2021 at 1:30 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Regarding unlogged LSNs at least, I would think that we'd want to
> > actually use GetFakeLSNForUnloggedRel() instead of just having it zero'd
> > out.  The fixed value for GiST index pages is just during the index
> > build process, as I recall, and that's perhaps less of a concern.  Part
> > of the point of using XTS is to avoid the issue of the LSN not being
> > changed when hint bits are, or more generally not being unique in
> > various cases.
>
> I don't believe there's anything to prevent the fake-LSN counter from
> overtaking the real end-of-WAL, and if that should happen, then the
> buffer manager would get confused. Maybe that can be fixed by doing
> some sort of surgery on the buffer manager, but it doesn't seem to be
> a trivial or ignorable problem.

Using fake LSNs isn't new..  how is this not a concern already then?

Also wondering why the buffer manager would care about the LSN on pages
which are not BM_PERMANENT..?

I'll admit that I might certainly be missing something here.

Thanks,

Stephen


Re: storing an explicit nonce

From
Robert Haas
Date:
On Tue, Oct 12, 2021 at 10:39 AM Stephen Frost <sfrost@snowman.net> wrote:
> Using fake LSNs isn't new..  how is this not a concern already then?
>
> Also wondering why the buffer manager would care about the LSN on pages
> which are not BM_PERMANENT..?
>
> I'll admit that I might certainly be missing something here.

Oh, FlushBuffer has a guard against this case in it. I hadn't realized that.

Sorry for the noise.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: storing an explicit nonce

From
Ants Aasma
Date:
On Tue, 12 Oct 2021 at 16:14, Bruce Momjian <bruce@momjian.us> wrote:
> Well, how do you detect an all-zero page vs a page that encrypted to all
> zeros?

A page encrypting to all zeros is for all practical purposes impossible to
hit.  Basically an attacker would have to be able to arbitrarily set the
whole contents of the page, and all they would achieve is that the page
gets ignored.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
> On Tue, 12 Oct 2021 at 16:14, Bruce Momjian <bruce@momjian.us> wrote:
> 
>     Well, how do you detect an all-zero page vs a page that encrypted to all
>     zeros?
> 
> A page encrypting to all zeros is for all practical purposes impossible to hit.
> Basically an attacker would have to be able to arbitrarily set the whole
> contents of the page, and all they would achieve is that the page gets ignored.

Uh, how do we know that valid data can't produce an encrypted all-zero
page?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Ants Aasma
Date:

On Wed, 13 Oct 2021 at 00:25, Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
> > On Tue, 12 Oct 2021 at 16:14, Bruce Momjian <bruce@momjian.us> wrote:
> >
> >     Well, how do you detect an all-zero page vs a page that encrypted to all
> >     zeros?
> >
> > A page encrypting to all zeros is for all practical purposes impossible to hit.
> > Basically an attacker would have to be able to arbitrarily set the whole
> > contents of the page, and all they would achieve is that the page gets ignored.
>
> Uh, how do we know that valid data can't produce an encrypted all-zero
> page?

Because the chances of that happening by accident are equivalent to making a series of commits to postgres and ending up with the same git commit hash 400 times in a row.
 
--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

On Tue, Oct 12, 2021 at 17:49 Ants Aasma <ants@cybertec.at> wrote:

> On Wed, 13 Oct 2021 at 00:25, Bruce Momjian <bruce@momjian.us> wrote:
> > On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
> > > On Tue, 12 Oct 2021 at 16:14, Bruce Momjian <bruce@momjian.us> wrote:
> > >
> > >     Well, how do you detect an all-zero page vs a page that encrypted to all
> > >     zeros?
> > >
> > > A page encrypting to all zeros is for all practical purposes impossible to hit.
> > > Basically an attacker would have to be able to arbitrarily set the whole
> > > contents of the page, and all they would achieve is that the page gets ignored.
> >
> > Uh, how do we know that valid data can't produce an encrypted all-zero
> > page?
>
> Because the chances of that happening by accident are equivalent to making a
> series of commits to postgres and ending up with the same git commit hash 400
> times in a row.

And to then have a valid checksum … seems next to impossible. 

Thanks,

Stephen

Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, Oct 13, 2021 at 12:48:51AM +0300, Ants Aasma wrote:
> On Wed, 13 Oct 2021 at 00:25, Bruce Momjian <bruce@momjian.us> wrote:
> 
>     On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
>     > A page encrypting to all zeros is for all practical purposes
>     > impossible to hit.
>     > Basically an attacker would have to be able to arbitrarily set the whole
>     > contents of the page, and all they would achieve is that the page gets
>     > ignored.
> 
>     Uh, how do we know that valid data can't produce an encrypted all-zero
>     page?
> 
> 
> Because the chances of that happening by accident are equivalent to making a
> series of commits to postgres and ending up with the same git commit hash 400
> times in a row.

Yes, 256^8192 is 1e+19728, but why not just assume a page LSN=0 is an
empty page, and if not, an error?  Seems easier than checking if each
page contains all zeros every time.
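
For the record, the arithmetic behind that figure:

    \[
    \log_{10}\left(256^{8192}\right) = 8192 \cdot \log_{10} 256
                                     \approx 8192 \times 2.408
                                     \approx 19728,
    \]

so a uniformly random 8192-byte page is all zeros with probability on the
order of 10^-19728.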

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: storing an explicit nonce

From
Ants Aasma
Date:
On Wed, 13 Oct 2021 at 02:20, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Oct 13, 2021 at 12:48:51AM +0300, Ants Aasma wrote:
> > On Wed, 13 Oct 2021 at 00:25, Bruce Momjian <bruce@momjian.us> wrote:
> >
> >     On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
> >     > A page encrypting to all zeros is for all practical purposes
> >     > impossible to hit.
> >     > Basically an attacker would have to be able to arbitrarily set the whole
> >     > contents of the page, and all they would achieve is that the page gets
> >     > ignored.
> >
> >     Uh, how do we know that valid data can't produce an encrypted all-zero
> >     page?
> >
> > Because the chances of that happening by accident are equivalent to making a
> > series of commits to postgres and ending up with the same git commit hash 400
> > times in a row.
>
> Yes, 256^8192 is 1e+19728, but why not just assume a page LSN=0 is an
> empty page, and if not, an error?  Seems easier than checking if each
> page contains all zeros every time.

We already check it anyway, see PageIsVerifiedExtended(). 

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

Re: storing an explicit nonce

From
Stephen Frost
Date:
Greetings,

* Ants Aasma (ants@cybertec.at) wrote:
> On Wed, 13 Oct 2021 at 02:20, Bruce Momjian <bruce@momjian.us> wrote:
> > On Wed, Oct 13, 2021 at 12:48:51AM +0300, Ants Aasma wrote:
> > > On Wed, 13 Oct 2021 at 00:25, Bruce Momjian <bruce@momjian.us> wrote:
> > >
> > >     On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
> > >     > A page encrypting to all zeros is for all practical purposes
> > >     > impossible to hit.
> > >     > Basically an attacker would have to be able to arbitrarily set the
> > >     > whole contents of the page, and all they would achieve is that the
> > >     > page gets ignored.
> > >
> > >     Uh, how do we know that valid data can't produce an encrypted
> > >     all-zero page?
> > >
> > > Because the chances of that happening by accident are equivalent to
> > > making a series of commits to postgres and ending up with the same git
> > > commit hash 400 times in a row.
> >
> > Yes, 256^8192 is 1e+19728, but why not just assume a page LSN=0 is an
> > empty page, and if not, an error?  Seems easier than checking if each
> > page contains all zeros every time.
> >
>
> We already check it anyway, see PageIsVerifiedExtended().

Right- we check the LSN along with the rest of the page there.

Thanks,

Stephen


Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Wed, Oct 13, 2021 at 09:16:37AM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Ants Aasma (ants@cybertec.at) wrote:
> > On Wed, 13 Oct 2021 at 02:20, Bruce Momjian <bruce@momjian.us> wrote:
> > > On Wed, Oct 13, 2021 at 12:48:51AM +0300, Ants Aasma wrote:
> > > > On Wed, 13 Oct 2021 at 00:25, Bruce Momjian <bruce@momjian.us> wrote:
> > > >
> > > >     On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
> > > >     > Page encrypting to all zeros is for all practical purposes
> > > impossible to
> > > >     hit.
> > > >     > Basically an attacker would have to be able to arbitrarily set the
> > > whole
> > > >     > contents of the page and they would then achieve that this page
> > > gets
> > > >     ignored.
> > > >
> > > >     Uh, how do we know that valid data can't produce an encrypted
> > > all-zero
> > > >     page?
> > > >
> > > >
> > > > Because the chances of that happening by accident are equivalent to
> > > making a
> > > > series of commits to postgres and ending up with the same git commit
> > > hash 400
> > > > times in a row.
> > >
> > > Yes, 256^8192 is 1e+19728, but why not just assume a page LSN=0 is an
> > > empty page, and if not, an error?  Seems easier than checking if each
> > > page contains all zeros every time.
> > >
> > 
> > We already check it anyway, see PageIsVerifiedExtended().
> 
> Right- we check the LSN along with the rest of the page there.

Very good.  I have not looked at the Cybertec patch recently.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Thu, Oct 7, 2021 at 7:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Oct 7, 2021 at 3:24 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > Every other
> > caller/flow passes false for 'create_storage' and we still need to
> > create storage in heap_create() if relkind has storage.
>
> That seems surprising.

I have revised the patch w.r.t the way 'create_storage' is interpreted
in heap_create() along with some minor changes to preserve the DBOID
patch.

Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: storing an explicit nonce

From
Antonin Houska
Date:
Sasasu <i@sasa.su> wrote:

> On 2021/10/6 23:01, Robert Haas wrote:
> > This seems wrong to me. CTR requires that you not reuse the IV. If you
> > re-encrypt the page with a different IV, torn pages are a problem. If
> > you re-encrypt it with the same IV, then it's not secure any more.

> for CBC, if the IV is predictable it will cause a "dictionary attack".

The following sounds like IV *uniqueness* is needed to defend against "known
plaintext attack" ...

> and for CBC and GCM, reusing the IV will cause a "known plaintext attack".

... but here you seem to say that *randomness* is also necessary:

> XTS works like CBC but adds a tweak step.  The tweak step does not add
> randomness. It means XTS still has "known plaintext attack",

(I suppose you mean "XTS with incorrect (e.g. non-random) IV", rather than XTS
as such.)

> for the same reason as CBC.

According to the Appendix C of

https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf

CBC requires *unpredictability* of the IV, but that does not necessarily mean
randomness: the unpredictable IV can be obtained by applying the forward
cipher function to a unique value.
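For illustration, that construction amounts to something like the sketch
below, using OpenSSL's legacy single-block API from <openssl/aes.h>
("blockno" and the AES_KEY "iv_key" are assumed to be set up elsewhere):

    /*
     * NIST SP 800-38A, Appendix C: obtain an unpredictable CBC IV by
     * applying the forward cipher function (one AES block encryption)
     * to a unique value, here a block number.
     */
    unsigned char ivplain[AES_BLOCK_SIZE] = {0};
    unsigned char iv[AES_BLOCK_SIZE];

    memcpy(ivplain, &blockno, sizeof(blockno));     /* unique input */
    AES_encrypt(ivplain, iv, &iv_key);              /* unpredictable IV */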


Can you please try to explain once again what you consider a requirement
(uniqueness, randomness, etc.) on the IV for the XTS mode? Thanks.


-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: storing an explicit nonce

From
Bruce Momjian
Date:
On Tue, Oct 12, 2021 at 10:26:54AM -0400, Robert Haas wrote:
> > specifically: Appendix C: Tweaks
> >
> > Quoting a couple of paragraphs from that appendix:
> >
> > """
> > In general, if there is information that is available and statically
> > associated with a plaintext, it is recommended to use that information
> > as a tweak for the plaintext. Ideally, the non-secret tweak associated
> > with a plaintext is associated only with that plaintext.
> >
> > Extensive tweaking means that fewer plaintexts are encrypted under any
> > given tweak. This corresponds, in the security model that is described
> > in [1], to fewer queries to the target instance of the encryption.
> > """
> >
> > The gist of this being- the more diverse the tweaking being used, the
> > better.  That's where I was going with my "limit the risk" comment.  If
> > we can make the tweak vary more for a given encryption invokation,
> > that's going to be better, pretty much by definition, and as explained
> > in publications by NIST.
> 
> I mean I don't have anything against that appendix, but I think we
> need to understand - with confidence - what the expectations are
> specifically around XTS, and that appendix seems much more general
> than that.

Since there has not been activity on this thread for one month, I have
updated the Postgres TDE wiki to include the conclusions and discussions
from this thread:

    https://wiki.postgresql.org/wiki/Transparent_Data_Encryption

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Tue, Oct 26, 2021 at 6:55 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
>
>
> I have revised the patch w.r.t the way 'create_storage' is interpreted
> in heap_create() along with some minor changes to preserve the DBOID
> patch.
>

Hi Shruthi,

I am reviewing the attached patches and providing a few comments here
below for patch "v5-0002-Preserve-database-OIDs-in-pg_upgrade.patch"

1.
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -31,7 +31,8 @@ CREATE DATABASE <replaceable
class="parameter">name</replaceable>
-           [ IS_TEMPLATE [=] <replaceable
class="parameter">istemplate</replaceable> ] ]
+           [ IS_TEMPLATE [=] <replaceable
class="parameter">istemplate</replaceable> ]
+           [ OID [=] <replaceable
class="parameter">db_oid</replaceable> ] ]

Replace "db_oid" with 'oid'. Below in the listitem, we have mentioned 'oid'.

2.
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
+ if ((dboid < FirstNormalObjectId) &&
+ (strcmp(dbname, "template0") != 0) &&
+ (!IsBinaryUpgrade))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
+ errmsg("Invalid value for option \"%s\"", defel->defname),
+ errhint("The specified OID %u is less than the minimum OID for user
objects %u.",
+ dboid, FirstNormalObjectId));
+ }

Are we sure that 'IsBinaryUpgrade' will be set properly, before the
createdb function is called? Can we recheck once ?

3.
@@ -504,11 +525,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
  */
  pg_database_rel = table_open(DatabaseRelationId, RowExclusiveLock);

- do
+ /* Select an OID for the new database if is not explicitly configured. */
+ if (!OidIsValid(dboid))
  {
- dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId,
-    Anum_pg_database_oid);
- } while (check_db_file_conflict(dboid));

I think we need to do 'check_db_file_conflict' for the USER given OID
also.. right? It may already be present.

4.
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c

/*
+ * Create template0 database with oid Template0ObjectId i.e, 4
+ */
+

Better to mention here, why OID 4 is reserved for template0 database?.

5.
+ /*
+ * Create template0 database with oid Template0ObjectId i.e, 4
+ */
+ static const char *const template0_setup[] = {
+ "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID "
+ CppAsString2(Template0ObjectId) ";\n\n",

Can we write something like, 'OID = CppAsString2(Template0ObjectId)'?
mention "=".

6.
+
+ /*
+ * We use the OID of postgres to determine datlastsysoid
+ */
+ "UPDATE pg_database SET datlastsysoid = "
+ "    (SELECT oid FROM pg_database "
+ "    WHERE datname = 'postgres');\n\n",
+

Make the above comment a single line comment.

7.
There are some spelling mistakes in the comments as below, please
correct the same
+ /*
+ * Make sure that binary upgrade propogate the database OID to the
new                 =====> correct spelling
+ * cluster
+ */

+/* OID 4 is reserved for Templete0 database */
 ====> Correct spelling
+#define Template0ObjectId 4


I am reviewing another patch
"v5-0001-Preserve-relfilenode-and-tablespace-OID-in-pg_upg" as well
and will provide the comments soon if any...

Thanks & Regards
SadhuPrasad
EnterpriseDB: http://www.enterprisedb.com



On Sun, Dec 5, 2021 at 11:44 PM Sadhuprasad Patro <b.sadhu@gmail.com> wrote:
> 1.
> --- a/doc/src/sgml/ref/create_database.sgml
> +++ b/doc/src/sgml/ref/create_database.sgml
> @@ -31,7 +31,8 @@ CREATE DATABASE <replaceable
> class="parameter">name</replaceable>
> -           [ IS_TEMPLATE [=] <replaceable
> class="parameter">istemplate</replaceable> ] ]
> +           [ IS_TEMPLATE [=] <replaceable
> class="parameter">istemplate</replaceable> ]
> +           [ OID [=] <replaceable
> class="parameter">db_oid</replaceable> ] ]
>
> Replace "db_oid" with 'oid'. Below in the listitem, we have mentioned 'oid'.

I agree that the listitem and the synopsis need to be consistent, but
it could be made consistent either by changing that one to db_oid or
this one to oid.

> 2.
> --- a/src/backend/commands/dbcommands.c
> +++ b/src/backend/commands/dbcommands.c
> + if ((dboid < FirstNormalObjectId) &&
> + (strcmp(dbname, "template0") != 0) &&
> + (!IsBinaryUpgrade))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
> + errmsg("Invalid value for option \"%s\"", defel->defname),
> + errhint("The specified OID %u is less than the minimum OID for user
> objects %u.",
> + dboid, FirstNormalObjectId));
> + }
>
> Are we sure that 'IsBinaryUpgrade' will be set properly, before the
> createdb function is called? Can we recheck once ?

How could it be set incorrectly, and how could we recheck this?

> 3.
> @@ -504,11 +525,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
>   */
>   pg_database_rel = table_open(DatabaseRelationId, RowExclusiveLock);
>
> - do
> + /* Select an OID for the new database if is not explicitly configured. */
> + if (!OidIsValid(dboid))
>   {
> - dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId,
> -    Anum_pg_database_oid);
> - } while (check_db_file_conflict(dboid));
>
> I think we need to do 'check_db_file_conflict' for the USER given OID
> also.. right? It may already be present.

Hopefully, if that happens, we straight up fail later on.

> 4.
> --- a/src/bin/initdb/initdb.c
> +++ b/src/bin/initdb/initdb.c
>
> /*
> + * Create template0 database with oid Template0ObjectId i.e, 4
> + */
> +
>
> Better to mention here, why OID 4 is reserved for template0 database?.

I'm not sure how we would give a reason for selecting an arbitrary
constant? We could perhaps explain why we use a fixed OID. But there's
no reason it has to be 4, I think.

> 5.
> + /*
> + * Create template0 database with oid Template0ObjectId i.e, 4
> + */
> + static const char *const template0_setup[] = {
> + "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID "
> + CppAsString2(Template0ObjectId) ";\n\n",
>
> Can we write something like, 'OID = CppAsString2(Template0ObjectId)'?
> mention "=".

That seems like a good idea, because it would be more consistent.

> 6.
> +
> + /*
> + * We use the OID of postgres to determine datlastsysoid
> + */
> + "UPDATE pg_database SET datlastsysoid = "
> + "    (SELECT oid FROM pg_database "
> + "    WHERE datname = 'postgres');\n\n",
> +
>
> Make the above comment a single line comment.

I think what Shruthi did is more correct. It doesn't have to be done
as a single-line comment just because it can fit on one line. And
Shruthi didn't write this comment anyway, it's only moved slightly
from where it was before.

> 7.
> There are some spelling mistakes in the comments as below, please
> correct the same
> + /*
> + * Make sure that binary upgrade propogate the database OID to the
> new                 =====> correct spelling
> + * cluster
> + */
>
> +/* OID 4 is reserved for Templete0 database */
>  ====> Correct spelling
> +#define Template0ObjectId 4

Yes, those would be good to fix.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Mon, Dec 6, 2021 at 10:14 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote:
>
> On Tue, Oct 26, 2021 at 6:55 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> >
> >
> > I have revised the patch w.r.t the way 'create_storage' is interpreted
> > in heap_create() along with some minor changes to preserve the DBOID
> > patch.
> >
>
> Hi Shruthi,
>
> I am reviewing the attached patches and providing a few comments here
> below for patch "v5-0002-Preserve-database-OIDs-in-pg_upgrade.patch"
>
> 1.
> --- a/doc/src/sgml/ref/create_database.sgml
> +++ b/doc/src/sgml/ref/create_database.sgml
> @@ -31,7 +31,8 @@ CREATE DATABASE <replaceable
> class="parameter">name</replaceable>
> -           [ IS_TEMPLATE [=] <replaceable
> class="parameter">istemplate</replaceable> ] ]
> +           [ IS_TEMPLATE [=] <replaceable
> class="parameter">istemplate</replaceable> ]
> +           [ OID [=] <replaceable
> class="parameter">db_oid</replaceable> ] ]
>
> Replace "db_oid" with 'oid'. Below in the listitem, we have mentioned 'oid'.

Replaced "db_oid" with "oid"

>
> 2.
> --- a/src/backend/commands/dbcommands.c
> +++ b/src/backend/commands/dbcommands.c
> + if ((dboid < FirstNormalObjectId) &&
> + (strcmp(dbname, "template0") != 0) &&
> + (!IsBinaryUpgrade))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
> + errmsg("Invalid value for option \"%s\"", defel->defname),
> + errhint("The specified OID %u is less than the minimum OID for user
> objects %u.",
> + dboid, FirstNormalObjectId));
> + }
>
> Are we sure that 'IsBinaryUpgrade' will be set properly, before the
> createdb function is called? Can we recheck once ?

I believe 'IsBinaryUpgrade' will be set to true when pg_upgrade is invoked.
pg_upgrade internally does pg_dump and pg_restore for every database in
the cluster.

> 3.
> @@ -504,11 +525,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
>   */
>   pg_database_rel = table_open(DatabaseRelationId, RowExclusiveLock);
>
> - do
> + /* Select an OID for the new database if is not explicitly configured. */
> + if (!OidIsValid(dboid))
>   {
> - dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId,
> -    Anum_pg_database_oid);
> - } while (check_db_file_conflict(dboid));
>
> I think we need to do 'check_db_file_conflict' for the USER given OID
> also.. right? It may already be present.

If a datafile with user-specified OID exists, the create database
fails with the below error.
postgres=# create database d2 oid 16452;
ERROR:  could not create directory "base/16452": File exists

> 4.
> --- a/src/bin/initdb/initdb.c
> +++ b/src/bin/initdb/initdb.c
>
> /*
> + * Create template0 database with oid Template0ObjectId i.e, 4
> + */
> +
>
> Better to mention here, why OID 4 is reserved for template0 database?.

The comment is updated to explain why template0 oid is fixed.

> 5.
> + /*
> + * Create template0 database with oid Template0ObjectId i.e, 4
> + */
> + static const char *const template0_setup[] = {
> + "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID "
> + CppAsString2(Template0ObjectId) ";\n\n",
>
> Can we write something like, 'OID = CppAsString2(Template0ObjectId)'?
> mention "=".

Fixed

> 6.
> +
> + /*
> + * We use the OID of postgres to determine datlastsysoid
> + */
> + "UPDATE pg_database SET datlastsysoid = "
> + "    (SELECT oid FROM pg_database "
> + "    WHERE datname = 'postgres');\n\n",
> +
>
> Make the above comment a single line comment.

As Robert confirmed, this part of the code is moved from a different place.

> 7.
> There are some spelling mistakes in the comments as below, please
> correct the same
> + /*
> + * Make sure that binary upgrade propogate the database OID to the
> new                 =====> correct spelling
> + * cluster
> + */
>
> +/* OID 4 is reserved for Templete0 database */
>  ====> Correct spelling
> +#define Template0ObjectId 4
>

Fixed.

> I am reviewing another patch
> "v5-0001-Preserve-relfilenode-and-tablespace-OID-in-pg_upg" as well
> and will provide the comments soon if any...

Thanks. I have rebased relfilenode oid preserve patch. You may use the
rebased patch for review.

Thanks & Regards
Shruthi K C
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Mon, Dec 6, 2021 at 11:25 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Dec 5, 2021 at 11:44 PM Sadhuprasad Patro <b.sadhu@gmail.com> wrote:
> > 3.
> > @@ -504,11 +525,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
> >   */
> >   pg_database_rel = table_open(DatabaseRelationId, RowExclusiveLock);
> >
> > - do
> > + /* Select an OID for the new database if is not explicitly configured. */
> > + if (!OidIsValid(dboid))
> >   {
> > - dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId,
> > -    Anum_pg_database_oid);
> > - } while (check_db_file_conflict(dboid));
> >
> > I think we need to do 'check_db_file_conflict' for the USER given OID
> > also.. right? It may already be present.
>
> Hopefully, if that happens, we straight up fail later on.

That's right. If a database with user-specified OID exists, the
createdb fails with a "duplicate key value" error.
If just a data directory with user-specified OID exists,
MakePGDirectory() fails to create the directory and the cleanup
callback createdb_failure_callback() removes the directory that was
not created by 'createdb()' function.
The subsequent create database call with the same OID will succeed.
Should we handle the case where a data directory exists and the
corresponding DB with that oid does not exist? I presume this
situation doesn't arise unless the user tries to create directories in
the data path. Any thoughts?
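
For concreteness, an explicit pre-check might look roughly like this (a
sketch only, reusing the existing check_db_file_conflict() helper; the
error wording is just a placeholder):

    /*
     * Hypothetical guard: before using a caller-supplied database OID,
     * fail cleanly if a directory for it already exists on disk.
     */
    if (OidIsValid(dboid) && check_db_file_conflict(dboid))
        ereport(ERROR,
                (errcode(ERRCODE_DUPLICATE_DATABASE),
                 errmsg("data directory with the specified OID %u already exists",
                        dboid)));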


Thanks & Regards
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com



On Mon, Dec 13, 2021 at 9:40 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > I am reviewing another patch
> > "v5-0001-Preserve-relfilenode-and-tablespace-OID-in-pg_upg" as well
> > and will provide the comments soon if any...

I spent much of today reviewing 0001. Here's an updated version, so
far only lightly tested. Please check whether I've broken anything.
Here are the changes:

- I adjusted the function header comment for heap_create. Your
proposed comment seemed like it was pretty detailed but not 100%
correct. It also made one of the lines kind of long because you didn't
wrap the text in the surrounding style. I decided to make it simpler
and shorter instead of longer still and 100% correct.

- I removed a one-line comment that said /* Override the toast
relfilenode */ because it preceded an error check, not a line of code
that would have done what the comment claimed.

- I removed a one-line comment that said /* Override the relfilenode
*/  because the following code would only sometimes override the
relfilenode. The code didn't seem complex enough to justify a longer
and more accurate comment, so I just took it out.

- I changed a test for (relkind == RELKIND_RELATION || relkind ==
RELKIND_SEQUENCE || relkind == RELKIND_MATVIEW) to use
RELKIND_HAS_STORAGE(). It's true that not all of the storage types
that RELKIND_HAS_STORAGE() tests are possible here, but that's not a
reason to avoid using the macro. If somebody adds a new relkind
with storage in the future, they might miss the need to manually
update this place, but they will not likely miss the need to update
RELKIND_HAS_STORAGE() since, if they did, their code probably wouldn't
work at all.
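
For context, the macro in pg_class.h reads roughly as follows (paraphrased;
the header is authoritative and may gain new relkinds over time):

    #define RELKIND_HAS_STORAGE(relkind) \
        ((relkind) == RELKIND_RELATION || \
         (relkind) == RELKIND_INDEX || \
         (relkind) == RELKIND_SEQUENCE || \
         (relkind) == RELKIND_TOASTVALUE || \
         (relkind) == RELKIND_MATVIEW)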

- I changed the way that you were passing create_storage down to
heap_create. I think I said before that you should EITHER fix it so
that we set create_storage = true only when the relation actually has
storage OR ELSE have heap_create() itself override the value to false
when there is no storage. You did both. There are times when it's
reasonable to ensure the same thing in multiple places, but this
doesn't seem to be one of them. So I took that out. I chose to retain
the code in heap_create() that overrides the value to false, added a
comment explaining that it does that, and then adjusted the callers to
ignore the storage type. I then added comments, and in one place an
assertion, to make it clearer what is happening.

- In pg_dump.c, I adjusted the comment that says "Not every relation
has storage." and the test that immediately follows, to ignore the
relfilenode when relkind says it's a partitioned table. Really,
partitioned tables should never have had relfilenodes, but as it turns
out, they used to have them.

Let me know your thoughts.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment
On Tue, Dec 14, 2021 at 2:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 13, 2021 at 9:40 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > > I am reviewing another patch
> > > "v5-0001-Preserve-relfilenode-and-tablespace-OID-in-pg_upg" as well
> > > and will provide the comments soon if any...
>
> I spent much of today reviewing 0001. Here's an updated version, so
> far only lightly tested. Please check whether I've broken anything.
> Here are the changes:

Thanks, Robert for the updated version. I reviewed the changes and it
looks fine.
I also tested the patch. The patch works as expected.

> - I adjusted the function header comment for heap_create. Your
> proposed comment seemed like it was pretty detailed but not 100%
> correct. It also made one of the lines kind of long because you didn't
> wrap the text in the surrounding style. I decided to make it simpler
> and shorter instead of longer still and 100% correct.

The comment update looks fine. However, I still feel it would be good to
mention in which (rare) circumstance a valid relfilenode can get passed.

> - I removed a one-line comment that said /* Override the toast
> relfilenode */ because it preceded an error check, not a line of code
> that would have done what the comment claimed.
>
> - I removed a one-line comment that said /* Override the relfilenode
> */  because the following code would only sometimes override the
> relfilenode. The code didn't seem complex enough to justify a longer
> and more accurate comment, so I just took it out.

Fine

> - I changed a test for (relkind == RELKIND_RELATION || relkind ==
> RELKIND_SEQUENCE || relkind == RELKIND_MATVIEW) to use
> RELKIND_HAS_STORAGE(). It's true that not all of the storage types
> that RELKIND_HAS_STORAGE() tests are possible here, but that's not a
> reason to avoid using the macro. If somebody adds a new relkind
> with storage in the future, they might miss the need to manually
> update this place, but they will not likely miss the need to update
> RELKIND_HAS_STORAGE() since, if they did, their code probably wouldn't
> work at all.

I agree.

> - I changed the way that you were passing create_storage down to
> heap_create. I think I said before that you should EITHER fix it so
> that we set create_storage = true only when the relation actually has
> storage OR ELSE have heap_create() itself override the value to false
> when there is no storage. You did both. There are times when it's
> reasonable to ensure the same thing in multiple places, but this
> doesn't seem to be one of them. So I took that out. I chose to retain
> the code in heap_create() that overrides the value to false, added a
> comment explaining that it does that, and then adjusted the callers to
> ignore the storage type. I then added comments, and in one place an
> assertion, to make it clearer what is happening.

The changes are fine. Thanks for the fine-tuning.

> - In pg_dump.c, I adjusted the comment that says "Not every relation
> has storage." and the test that immediately follows, to ignore the
> relfilenode when relkind says it's a partitioned table. Really,
> partitioned tables should never have had relfilenodes, but as it turns
> out, they used to have them.
>

Fine. Understood.

Thanks & Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com



On 12/14/21 2:35 AM, Robert Haas wrote:
> I spent much of today reviewing 0001. Here's an updated version, so
> far only lightly tested. Please check whether I've broken anything.
Thanks Robert, I tested from v9.6/v12/v13/v14 -> v15 (with patch);
things are working fine, i.e. table/index relfilenode is preserved,
not changing after pg_upgrade.

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




On 12/15/21 12:09 AM, tushar wrote:
> > I spent much of today reviewing 0001. Here's an updated version, so
> > far only lightly tested. Please check whether I've broken anything.
> Thanks Robert, I tested from v9.6/v12/v13/v14 -> v15 (with patch);
> things are working fine, i.e. table/index relfilenode is preserved,
> not changing after pg_upgrade.
I covered tablespace OIDs testing scenarios and that is also preserved after pg_upgrade.
-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company 
On Mon, Dec 13, 2021 at 8:43 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
>
> On Mon, Dec 6, 2021 at 11:25 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Sun, Dec 5, 2021 at 11:44 PM Sadhuprasad Patro <b.sadhu@gmail.com> wrote:
> > > 3.
> > > @@ -504,11 +525,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
> > >   */
> > >   pg_database_rel = table_open(DatabaseRelationId, RowExclusiveLock);
> > >
> > > - do
> > > + /* Select an OID for the new database if is not explicitly configured. */
> > > + if (!OidIsValid(dboid))
> > >   {
> > > - dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId,
> > > -    Anum_pg_database_oid);
> > > - } while (check_db_file_conflict(dboid));
> > >
> > > I think we need to do 'check_db_file_conflict' for the USER given OID
> > > also.. right? It may already be present.
> >
> > Hopefully, if that happens, we straight up fail later on.
>
> That's right. If a database with user-specified OID exists, the
> createdb fails with a "duplicate key value" error.
> If just a data directory with user-specified OID exists,
> MakePGDirectory() fails to create the directory and the cleanup
> callback createdb_failure_callback() removes the directory that was
> not created by 'createdb()' function.
> The subsequent create database call with the same OID will succeed.
> Should we handle the case where a data directory exists and the
> corresponding DB with that oid does not exist? I presume this
> situation doesn't arise unless the user tries to create directories in
> the data path. Any thoughts?

I have updated the DBOID preserve patch to handle this case and
generated the latest patch on top of your v7-001-preserve-relfilenode
patch.

Thanks & Regards
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com

Attachment
Hi,

On Fri, Dec 17, 2021 at 01:03:06PM +0530, Shruthi Gowda wrote:
> 
> I have updated the DBOID preserve patch to handle this case and
> generated the latest patch on top of your v7-001-preserve-relfilenode
> patch.

The cfbot reports that the patch doesn't apply anymore:
http://cfbot.cputube.org/patch_36_3296.log
=== Applying patches on top of PostgreSQL commit ID 5513dc6a304d8bda114004a3b906cc6fde5d6274 ===
=== applying patch ./v7-0002-Preserve-database-OIDs-in-pg_upgrade.patch
[...]
patching file src/bin/pg_upgrade/info.c
Hunk #1 FAILED at 190.
Hunk #2 succeeded at 351 (offset 27 lines).
1 out of 2 hunks FAILED -- saving rejects to file src/bin/pg_upgrade/info.c.rej
patching file src/bin/pg_upgrade/pg_upgrade.h
Hunk #1 FAILED at 145.
1 out of 1 hunk FAILED -- saving rejects to file src/bin/pg_upgrade/pg_upgrade.h.rej
patching file src/bin/pg_upgrade/relfilenode.c
Hunk #1 FAILED at 193.
1 out of 1 hunk FAILED -- saving rejects to file src/bin/pg_upgrade/relfilenode.c.rej

Could you send a rebased version?  In the meantime I willl switch the cf entry
to Waiting on Author.



On Sat, Jan 15, 2022 at 11:17 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Hi,
>
> On Fri, Dec 17, 2021 at 01:03:06PM +0530, Shruthi Gowda wrote:
> >
> > I have updated the DBOID preserve patch to handle this case and
> > generated the latest patch on top of your v7-001-preserve-relfilenode
> > patch.
>
> The cfbot reports that the patch doesn't apply anymore:
> http://cfbot.cputube.org/patch_36_3296.log
> === Applying patches on top of PostgreSQL commit ID 5513dc6a304d8bda114004a3b906cc6fde5d6274 ===
> === applying patch ./v7-0002-Preserve-database-OIDs-in-pg_upgrade.patch
> [...]
> patching file src/bin/pg_upgrade/info.c
> Hunk #1 FAILED at 190.
> Hunk #2 succeeded at 351 (offset 27 lines).
> 1 out of 2 hunks FAILED -- saving rejects to file src/bin/pg_upgrade/info.c.rej
> patching file src/bin/pg_upgrade/pg_upgrade.h
> Hunk #1 FAILED at 145.
> 1 out of 1 hunk FAILED -- saving rejects to file src/bin/pg_upgrade/pg_upgrade.h.rej
> patching file src/bin/pg_upgrade/relfilenode.c
> Hunk #1 FAILED at 193.
> 1 out of 1 hunk FAILED -- saving rejects to file src/bin/pg_upgrade/relfilenode.c.rej
>
> Could you send a rebased version?  In the meantime I willl switch the cf entry
> to Waiting on Author.

I have rebased and generated the patches on top of PostgreSQL commit
ID cf925936ecc031355cd56fbd392ec3180517a110.
Kindly apply v8-0001-pg_upgrade-Preserve-relfilenodes-and-tablespace-O.patch
first and then v8-0002-Preserve-database-OIDs-in-pg_upgrade.patch.

Thanks & Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Tue, Dec 14, 2021 at 1:21 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> Thanks, Robert for the updated version. I reviewed the changes and it
> looks fine.
> I also tested the patch. The patch works as expected.

Thanks.

> > - I adjusted the function header comment for heap_create. Your
> > proposed comment seemed like it was pretty detailed but not 100%
> > correct. It also made one of the lines kind of long because you didn't
> > wrap the text in the surrounding style. I decided to make it simpler
> > and shorter instead of longer still and 100% correct.
>
> The comment update looks fine. However, I still feel it would be good to
> mention in which (rare) circumstance a valid relfilenode can get passed.

In general, I think it's the job of a function parameter comment to
describe what the parameter does, not how the callers actually use it.
One problem with describing the latter is that, if someone later adds
another caller, there is a pretty good chance that they won't notice
that the comment needs to be changed. More fundamentally, the
function parameter comments should be like an instruction manual for
how to use the function. If you are trying to figure out how to use
this function, it is not helpful to know that "most callers like to
pass false" for this parameter. What you need to know is what value
your new call site should pass, and knowing what "most callers" do or
that something is "rare" doesn't really help. If we want to make this
comment more detailed, we should approach it from the point of view of
explaining how it ought to be set.

I've committed the v8-0001 patch you attached. I'll write separately
about v8-0002.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Mon, Jan 17, 2022 at 9:57 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> I have rebased and generated the patches on top of PostgreSQL commit
> ID cf925936ecc031355cd56fbd392ec3180517a110.
> Kindly apply v8-0001-pg_upgrade-Preserve-relfilenodes-and-tablespace-O.patch
> first and then v8-0002-Preserve-database-OIDs-in-pg_upgrade.patch.

OK, so looking over 0002, I noticed a few things:

1. datlastsysoid isn't being used for anything any more. That's not a
defect in your patch, but I've separately proposed to remove it.

2. I realized that the whole idea here depends on not having initdb
create more than one database without a fixed OID. The patch solves
that by nailing down the OID of template0, which is a sufficient
solution. However, I think nailing down the (initial) OID of postgres
as well would be a good idea, just in case somebody in the future
decides to add another system-created database.

3. The changes to gram.y don't do anything. Sure, you've added a new
"OID" token, but nothing generates that token, so it has no effect.
The reason the syntax works is that createdb_opt_name accepts "IDENT",
which means any string that's not in the keyword list (see kwlist.h).
But that's there already, so you don't need to do anything in this
file.

4. I felt that the documentation and comments could be somewhat improved.

Here's an updated version in which I've reverted the changes to gram.y
and tried to improve the comments and documentation. Could you have a
look at implementing (2) above?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment
On Tue, Jan 18, 2022 at 12:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Dec 14, 2021 at 1:21 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > Thanks, Robert for the updated version. I reviewed the changes and it
> > looks fine.
> > I also tested the patch. The patch works as expected.
>
> Thanks.
>
> > > - I adjusted the function header comment for heap_create. Your
> > > proposed comment seemed like it was pretty detailed but not 100%
> > > correct. It also made one of the lines kind of long because you didn't
> > > wrap the text in the surrounding style. I decided to make it simpler
> > > and shorter instead of longer still and 100% correct.
> >
> > The comment update looks fine. However, I still feel it would be good to
> > mention in which (rare) circumstance a valid relfilenode can get passed.
>
> In general, I think it's the job of a function parameter comment to
> describe what the parameter does, not how the callers actually use it.
> One problem with describing the latter is that, if someone later adds
> another caller, there is a pretty good chance that they won't notice
> that the comment needs to be changed. More fundamentally, the
> function parameter comments should be like an instruction manual for
> how to use the function. If you are trying to figure out how to use
> this function, it is not helpful to know that "most callers like to
> pass false" for this parameter. What you need to know is what value
> your new call site should pass, and knowing what "most callers" do or
> that something is "rare" doesn't really help. If we want to make this
> comment more detailed, we should approach it from the point of view of
> explaining how it ought to be set.

It's clear now. Thanks for clarifying.

> I've committed the v8-0001 patch you attached. I'll write separately
> about v8-0002.

Sure. Thank you.



On Tue, Jan 18, 2022 at 2:34 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 17, 2022 at 9:57 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > I have rebased and generated the patches on top of PostgreSQL commit
> > ID cf925936ecc031355cd56fbd392ec3180517a110.
> > Kindly apply v8-0001-pg_upgrade-Preserve-relfilenodes-and-tablespace-O.patch
> > first and then v8-0002-Preserve-database-OIDs-in-pg_upgrade.patch.
>
> OK, so looking over 0002, I noticed a few things:
>
> 1. datlastsysoid isn't being used for anything any more. That's not a
> defect in your patch, but I've separately proposed to remove it.

okay

> 2. I realized that the whole idea here depends on not having initdb
> create more than one database without a fixed OID. The patch solves
> that by nailing down the OID of template0, which is a sufficient
> solution. However, I think nailing down the (initial) OID of postgres
> as well would be a good idea, just in case somebody in the future
> decides to add another system-created database.

I agree with your thought. In my latest patch, postgres db gets
created with a fixed OID.
I have chosen an arbitrary number 16000 as postgres OID from the
unpinned object OID range 12000 - 16383.

> 3. The changes to gram.y don't do anything. Sure, you've added a new
> "OID" token, but nothing generates that token, so it has no effect.
> The reason the syntax works is that createdb_opt_name accepts "IDENT",
> which means any string that's not in the keyword list (see kwlist.h).
> But that's there already, so you don't need to do anything in this
> file.

okay

> 4. I felt that the documentation and comments could be somewhat improved.

The documentation and comment updates are more accurate with the
required details. Thanks.

> Here's an updated version in which I've reverted the changes to gram.y
> and tried to improve the comments and documentation. Could you have a
> look at implementing (2) above?

Attached is the patch that implements comment (2).

Shruthi KC
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Thu, Jan 20, 2022 at 7:09 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > Here's an updated version in which I've reverted the changes to gram.y
> > and tried to improve the comments and documentation. Could you have a
> > look at implementing (2) above?
>
> Attached is the patch that implements comment (2).

This probably needs minor rebasing on account of the fact that I just
pushed the patch to remove datlastsysoid. I intended to do that before
you posted a new version to save you the trouble, but I was too slow
(or you were too fast, however you want to look at it).

+ errmsg("Invalid value for option \"%s\"", defel->defname),

Per the "error message style" section of the documentation, primary
error messages neither begin with a capital letter nor end with a
period, while errdetail() messages are complete sentences and thus
both begin with a capital letter and end with a period. But what I
think you should really do here is get rid of the error detail and
convey all the information in a primary error message. e.g. "OID %u is
a system OID", or maybe better, "OIDs less than %u are reserved for
system objects".

+ errmsg("database oid %u is already used by database %s",
+ errmsg("data directory exists for database oid %u", dboid));

Usually we write "OID" rather than "oid" in error messages. I think
maybe it would be best to change the text slightly too. I suggest:

database OID %u is already in use by database \"%s\"
data directory already exists for database with OID %u

+ * it would fail. To avoid that, assign a fixed OID to template0 and
+ * postgres rather than letting the server choose one.

a fixed OID -> fixed OIDs
one -> them

Or maybe put this comment back the way I had it and just talk about
postgres, and then change the comment in make_postgres to say "Assign
a fixed OID to postgres, for the same reasons as template0."

+ /*
+ * Make sure that binary upgrade propagate the database OID to the new
+ * cluster
+ */

This comment doesn't really seem necessary. It's sort of self-explanatory.

+# Push the OID that is reserved for template0 database.
+my $Template0ObjectId =
+  Catalog::FindDefinedSymbol('access/transam.h', '..', 'Template0ObjectId');
+push @{$oids}, $Template0ObjectId;

Don't you need to do this for PostgresObjectId also?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Thu, Jan 20, 2022 at 7:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jan 20, 2022 at 7:09 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > > Here's an updated version in which I've reverted the changes to gram.y
> > > and tried to improve the comments and documentation. Could you have a
> > > look at implementing (2) above?
> >
> > Attached is the patch that implements comment (2).
>
> This probably needs minor rebasing on account of the fact that I just
> pushed the patch to remove datlastsysoid. I intended to do that before
> you posted a new version to save you the trouble, but I was too slow
> (or you were too fast, however you want to look at it).

I have rebased the patch. Kindly have a look at it.

> + errmsg("Invalid value for option \"%s\"", defel->defname),
>
> Per the "error message style" section of the documentation, primary
> error messages neither begin with a capital letter nor end with a
> period, while errdetail() messages are complete sentences and thus
> both begin with a capital letter and end with a period. But what I
> think you should really do here is get rid of the error detail and
> convey all the information in a primary error message. e.g. "OID %u is
> a system OID", or maybe better, "OIDs less than %u are reserved for
> system objects".

Fixed

> + errmsg("database oid %u is already used by database %s",
> + errmsg("data directory exists for database oid %u", dboid));
>
> Usually we write "OID" rather than "oid" in error messages. I think
> maybe it would be best to change the text slightly too. I suggest:
>
> database OID %u is already in use by database \"%s\"
> data directory already exists for database with OID %u

The second error message will be reported when the data directory with
the given OID exists in the data path but the corresponding DB does
not exist. This could happen only if the user creates directories in
the data folder, which is indeed not a valid usage. I have updated
the error message to "data directory with the specified OID %u already
exists" because the error message you recommended gives a slightly
different meaning.

> + * it would fail. To avoid that, assign a fixed OID to template0 and
> + * postgres rather than letting the server choose one.
>
> a fixed OID -> fixed OIDs
> one -> them
>
> Or maybe put this comment back the way I had it and just talk about
> postgres, and then change the comment in make_postgres to say "Assign
> a fixed OID to postgres, for the same reasons as template0."

Fixed

> + /*
> + * Make sure that binary upgrade propagate the database OID to the new
> + * cluster
> + */
>
> This comment doesn't really seem necessary. It's sort of self-explanatory.

Removed

> +# Push the OID that is reserved for template0 database.
> +my $Template0ObjectId =
> +  Catalog::FindDefinedSymbol('access/transam.h', '..', 'Template0ObjectId');
> +push @{$oids}, $Template0ObjectId;
>
> Don't you need to do this for PostgresObjectId also?

It is not required for PostgresObjectId. The unused_oids script
provides a list of unused oids in the manually-assignable OIDs range
(1-9999).

Shruthi KC
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Thu, Jan 20, 2022 at 11:03 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> It is not required for PostgresObjectId. The unused_oids script
> provides a list of unused oids in the manually-assignable OIDs range
> (1-9999).

Well, so ... why are we not treating the OIDs for these two databases
the same? If there's a range from which we can assign OIDs without
risk of duplication and without needing to update this script, perhaps
we ought to assign both of them from that range and leave the script
alone.

+ * that is in use in the old cluster is also used in th new cluster - and

th -> the.

+preserves the DB, tablespace, relfilenode OIDs so TOAST and other references

Insert "and" before "relfilenode".

I think this is in pretty good shape now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Fri, Jan 21, 2022 at 1:08 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jan 20, 2022 at 11:03 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > It is not required for PostgresObjectId. The unused_oids script
> > provides a list of unused oids in the manually-assignable OIDs range
> > (1-9999).
>
> Well, so ... why are we not treating the OIDs for these two databases
> the same? If there's a range from which we can assign OIDs without
> risk of duplication and without needing to update this script, perhaps
> we ought to assign both of them from that range and leave the script
> alone.

From what I see in the code, template0 and postgres are the last
things that get created in initdb phase. The system OIDs that get
assigned to these DBs vary from release to release. At present, the
system assigned OIDs of template0 and postgres are 13679 and 13680
respectively. I feel it would be safe to assign 16000 and 16001 for
template0 and postgres respectively from the unpinned object OID range
12000 - 16383. In the future, even if the initdb unpinned objects
reach the range of 16000, issues can only arise if initdb() creates
another system-created database for which the system assigns these
reserved OIDs (16000, 16001).

> + * that is in use in the old cluster is also used in th new cluster - and
>
> th -> the.

Fixed

> +preserves the DB, tablespace, relfilenode OIDs so TOAST and other references
>
> Insert "and" before "relfilenode".

Fixed

Attached is the latest patch for review.

Shruthi KC
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Fri, Jan 21, 2022 at 8:40 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> From what I see in the code, template0 and postgres are the last
> things that get created in initdb phase. The system OIDs that get
> assigned to these DBs vary from release to release. At present, the
> system assigned OIDs of template0 and postgres are 13679 and 13680
> respectively. I feel it would be safe to assign 16000 and 16001 for
> template0 and postgres respectively from the unpinned object OID range
> 12000 - 16383. In the future, even if the initdb unpinned objects
> reach the range of 16000 issues can only arise if initdb() creates
> another system-created database for which the system assigns these
> reserved OIDs (16000, 16001).

It doesn't seem safe to me to rely on that. We don't know what could
happen in the future if the number of built-in objects increases.
Looking at the lengthy comment on this topic in transam.h, I see that
there are three ranges:

1-9999 manually assigned OIDs
10000-11999 OIDs assigned by genbki.pl
12000-16383 OIDs assigned to unpinned objects post-bootstrap

It seems to me that what this comment is saying is that OIDs in the
second and third categories are doled out by counters. Therefore, we
can't know which of those OIDs will get used, or how many of them will
get used, or which objects will get which OIDs. Therefore, I think we
should go back to the approach that you were using for template0 and
handle both that database and postgres using that method. That is,
assign an OID manually, and make sure unused_oids knows that it should
be counted as already used.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> It seems to me that what this comment is saying is that OIDs in the
> second and third categories are doled out by counters. Therefore, we
> can't know which of those OIDs will get used, or how many of them will
> get used, or which objects will get which OIDs. Therefore, I think we
> should go back to the approach that you were using for template0 and
> handle both that database and postgres using that method. That is,
> assign an OID manually, and make sure unused_oids knows that it should
> be counted as already used.

Indeed.  If you're going to manually assign OIDs to these databases,
do it honestly, and put them into the range intended for that purpose.
Trying to take short-cuts is just going to cause trouble down the road.

            regards, tom lane



On Sat, Jan 22, 2022 at 12:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > It seems to me that what this comment is saying is that OIDs in the
> > second and third categories are doled out by counters. Therefore, we
> > can't know which of those OIDs will get used, or how many of them will
> > get used, or which objects will get which OIDs. Therefore, I think we
> > should go back to the approach that you were using for template0 and
> > handle both that database and postgres using that method. That is,
> > assign an OID manually, and make sure unused_oids knows that it should
> > be counted as already used.
>
> Indeed.  If you're going to manually assign OIDs to these databases,
> do it honestly, and put them into the range intended for that purpose.
> Trying to take short-cuts is just going to cause trouble down the road.

Understood. I will rework the patch accordingly. Thanks

Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com



On Sat, Jan 22, 2022 at 12:17 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jan 21, 2022 at 8:40 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > From what I see in the code, template0 and postgres are the last
> > things that get created in initdb phase. The system OIDs that get
> > assigned to these DBs vary from release to release. At present, the
> > system assigned OIDs of template0 and postgres are 13679 and 13680
> > respectively. I feel it would be safe to assign 16000 and 16001 for
> > template0 and postgres respectively from the unpinned object OID range
> > 12000 - 16383. In the future, even if the initdb unpinned objects
> > reach the range of 16000 issues can only arise if initdb() creates
> > another system-created database for which the system assigns these
> > reserved OIDs (16000, 16001).
>
> It doesn't seem safe to me to rely on that. We don't know what could
> happen in the future if the number of built-in objects increases.
> Looking at the lengthy comment on this topic in transam.h, I see that
> there are three ranges:
>
> 1-9999 manually assigned OIDs
> 10000-11999 OIDs assigned by genbki.pl
> 12000-16384 OIDs assigned to unpinned objects post-bootstrap
>
> It seems to me that what this comment is saying is that OIDs in the
> second and third categories are doled out by counters. Therefore, we
> can't know which of those OIDs will get used, or how many of them will
> get used, or which objects will get which OIDs. Therefore, I think we
> should go back to the approach that you were using for template0 and
> handle both that database and postgres using that method. That is,
> assign an OID manually, and make sure unused_oids knows that it should
> be counted as already used.

Agree. In the latest patch, the template0 and postgres OIDs are fixed
to unused manually assigned OIDs 4 and 5 respectively. These OIDs are
no more listed as unused OIDs.
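
Concretely, the reservation boils down to something like this in
access/transam.h (names per the patch under discussion):

    /* both reserved in the manually-assignable OID range (1-9999) */
    #define Template0ObjectId 4
    #define PostgresObjectId  5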


Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Sat, Jan 22, 2022 at 12:47:35PM +0530, Shruthi Gowda wrote:
> On Sat, Jan 22, 2022 at 12:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > It seems to me that what this comment is saying is that OIDs in the
> > > second and third categories are doled out by counters. Therefore, we
> > > can't know which of those OIDs will get used, or how many of them will
> > > get used, or which objects will get which OIDs. Therefore, I think we
> > > should go back to the approach that you were using for template0 and
> > > handle both that database and postgres using that method. That is,
> > > assign an OID manually, and make sure unused_oids knows that it should
> > > be counted as already used.
> >
> > Indeed.  If you're going to manually assign OIDs to these databases,
> > do it honestly, and put them into the range intended for that purpose.
> > Trying to take short-cuts is just going to cause trouble down the road.
> 
> Understood. I will rework the patch accordingly. Thanks

Thanks.  To get the rsync update reduction we need to preserve database
oids.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Sat, Jan 22, 2022 at 2:20 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> Agree. In the latest patch, the template0 and postgres OIDs are fixed
> to unused manually assigned OIDs 4 and 5 respectively. These OIDs are
> no more listed as unused OIDs.

Thanks. Committed with a few more cosmetic changes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Tue, Jan 25, 2022 at 1:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jan 22, 2022 at 2:20 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > Agree. In the latest patch, the template0 and postgres OIDs are fixed
> > to unused manually assigned OIDs 4 and 5 respectively. These OIDs are
> > no more listed as unused OIDs.
>
> Thanks. Committed with a few more cosmetic changes.

Thanks, Robert for committing this patch.



Re: storing an explicit nonce

From
Antonin Houska
Date:
Stephen Frost <sfrost@snowman.net> wrote:

> Perhaps this is all too meta and we need to work through some specific
> ideas around just what this would look like.  In particular, thinking
> about what this API would look like and how it would be used by
> reorderbuffer.c, which builds up changes in memory and then does a bare
> write() call, seems like a main use-case to consider.  The gist there
> being "can we come up with an API to do all these things that doesn't
> require entirely rewriting ReorderBufferSerializeChange()?"
> 
> Seems like it'd be easier to achieve that by having something that looks
> very close to how write() looks, but just happens to have the option to
> run the data through a stream cipher and maybe does better error
> handling for us.  Making that layer also do block-based access to the
> files underneath seems like a much larger effort that, sure, may make
> some things better too but if we could do that with the same API then it
> could also be done later if someone's interested in that.

My initial proposal is in this new thread:

https://www.postgresql.org/message-id/4987.1644323098%40antos
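
For illustration only, the kind of wrapper described above might look like
the following sketch (every name below is hypothetical, not a concrete
proposal):

    /*
     * Hypothetical write()-like wrapper that can optionally run the
     * buffer through a stream cipher first.  EncryptionStream and
     * stream_cipher_apply() are illustrative names only.
     */
    static ssize_t
    pg_encrypted_write(int fd, const void *buf, size_t count,
                       EncryptionStream *stream)
    {
        ssize_t     ret;
        char       *tmp;

        if (stream == NULL)
            return write(fd, buf, count);   /* plain write, no encryption */

        tmp = palloc(count);
        stream_cipher_apply(stream, buf, tmp, count);   /* encrypt in one pass */
        ret = write(fd, tmp, count);
        pfree(tmp);
        return ret;
    }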

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com



On Tue, Jan 25, 2022 at 10:19:53AM +0530, Shruthi Gowda wrote:
> On Tue, Jan 25, 2022 at 1:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Sat, Jan 22, 2022 at 2:20 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > > Agree. In the latest patch, the template0 and postgres OIDs are fixed
> > > to unused manually assigned OIDs 4 and 5 respectively. These OIDs are
> > > no more listed as unused OIDs.
> >
> > Thanks. Committed with a few more cosmetic changes.
> 
> Thanks, Robert for committing this patch.

If I'm not wrong, this can be closed.
https://commitfest.postgresql.org/37/3296/



On 2022-01-24 14:44:10 -0500, Robert Haas wrote:
> On Sat, Jan 22, 2022 at 2:20 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
> > Agree. In the latest patch, the template0 and postgres OIDs are fixed
> > to unused manually assigned OIDs 4 and 5 respectively. These OIDs are
> > no more listed as unused OIDs.
> 
> Thanks. Committed with a few more cosmetic changes.

I noticed this still has an open CF entry: https://commitfest.postgresql.org/37/3296/
I assume it can be marked as committed?

- Andres



On Mon, Mar 21, 2022 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> I noticed this still has an open CF entry: https://commitfest.postgresql.org/37/3296/
> I assume it can be marked as committed?

Yeah, done now. But don't forget that we still need to do something on
the "wrong fds used for refilenodes after pg_upgrade relfilenode
changes Reply-To:" thread, and I think the ball is in your court
there.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Jan 22, 2022 at 2:20 AM Shruthi Gowda <gowdashru@gmail.com> wrote:
>> Agree. In the latest patch, the template0 and postgres OIDs are fixed
>> to unused manually assigned OIDs 4 and 5 respectively. These OIDs are
>> no more listed as unused OIDs.

> Thanks. Committed with a few more cosmetic changes.

I happened to take a closer look at this patch today, and I'm pretty
unhappy with the way that the assignment of those OIDs was managed.
There are two big problems:

1. IsPinnedObject() will now report that template0 and postgres are
pinned.  This seems not to prevent one from dropping them (I guess
dropdb() doesn't consult IsPinnedObject), but it would probably bollix
any pg_shdepend management that should happen for them.

2. The Catalog.pm infrastructure knows nothing about these OIDs.
While the unused_oids script was hack-n-slashed to claim that the
OIDs are used, other scripts won't know about them; for example
duplicate_oids won't report conflicts if someone tries to reuse
those OIDs.

The attached draft patch attempts to improve this situation.
It reserves these OIDs, and creates the associated macros, through
the normal BKI infrastructure by adding entries in pg_database.dat.
We have to delete those rows again during initdb, which is slightly
ugly but surely no more so than initdb's other direct manipulations
of pg_database.

There are a few loose ends:

* I'm a bit inclined to simplify IsPinnedObject by just teaching
it that *no* entries of pg_database are pinned, which would correspond
to the evident lack of enforcement in dropdb().  Can anyone see a
reason why we might pin some database in future?

* I had to set up the additional pg_database entries with nonzero
datfrozenxid to avoid an assertion failure during initdb's first VACUUM.
(That VACUUM will overwrite template1's datfrozenxid before computing
the global minimum frozen XID, but not these others; and it doesn't like
finding that the minimum is zero.)  This feels klugy.  An alternative is
to delete the extra pg_database rows before that VACUUM, which would
mean taking those deletes out of make_template0 and make_postgres and
putting them somewhere seemingly unrelated, so that's a bit crufty too.
Anybody have a preference?

* The new macro names seem ill-chosen.  Template0ObjectId is spelled
randomly differently from the longstanding TemplateDbOid, and surely
PostgresObjectId is about as vague a name as could possibly have
been thought of (please name an object that it couldn't apply to).
I'm a little inclined to rename TemplateDbOid to Template1DbOid and
use Template0DbOid and PostgresDbOid for the others, but I didn't pull
that trigger here.

Comments?

            regards, tom lane

diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 520f77971b..d7e5c02f95 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -339,9 +339,11 @@ IsPinnedObject(Oid classId, Oid objectId)
      * robustness.
      */

-    /* template1 is not pinned */
+    /* template1, template0, postgres are not pinned */
     if (classId == DatabaseRelationId &&
-        objectId == TemplateDbOid)
+        (objectId == TemplateDbOid ||
+         objectId == Template0ObjectId ||
+         objectId == PostgresObjectId))
         return false;

     /* the public namespace is not pinned */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 1cb4a5b0d2..04454b3d7c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -59,11 +59,11 @@
 #include "sys/mman.h"
 #endif

-#include "access/transam.h"
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
 #include "catalog/pg_collation_d.h"
+#include "catalog/pg_database_d.h"    /* pgrminclude ignore */
 #include "common/file_perm.h"
 #include "common/file_utils.h"
 #include "common/logging.h"
@@ -1806,14 +1806,24 @@ make_template0(FILE *cmdfd)
      * objects in the old cluster, the problem scenario only exists if the OID
      * that is in use in the old cluster is also used in the new cluster - and
      * the new cluster should be the result of a fresh initdb.)
-     *
-     * We use "STRATEGY = file_copy" here because checkpoints during initdb
-     * are cheap. "STRATEGY = wal_log" would generate more WAL, which would
-     * be a little bit slower and make the new cluster a little bit bigger.
      */
     static const char *const template0_setup[] = {
-        "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID = "
-        CppAsString2(Template0ObjectId)
+        /*
+         * Since pg_database.dat includes a dummy row for template0, we have
+         * to remove that before creating the database for real.
+         */
+        "DELETE FROM pg_database WHERE datname = 'template0';\n\n",
+
+        /*
+         * Clone template1 to make template0.
+         *
+         * We use "STRATEGY = file_copy" here because checkpoints during
+         * initdb are cheap.  "STRATEGY = wal_log" would generate more WAL,
+         * which would be a little bit slower and make the new cluster a
+         * little bit bigger.
+         */
+        "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false"
+        " OID = " CppAsString2(Template0ObjectId)
         " STRATEGY = file_copy;\n\n",

         /*
@@ -1836,12 +1846,11 @@ make_template0(FILE *cmdfd)
         "REVOKE CREATE,TEMPORARY ON DATABASE template1 FROM public;\n\n",
         "REVOKE CREATE,TEMPORARY ON DATABASE template0 FROM public;\n\n",

-        "COMMENT ON DATABASE template0 IS 'unmodifiable empty database';\n\n",
-
         /*
-         * Finally vacuum to clean up dead rows in pg_database
+         * Note: postgres.bki filled in a comment for template0, so we need
+         * not do that here.
          */
-        "VACUUM pg_database;\n\n",
+
         NULL
     };

@@ -1858,12 +1867,19 @@ make_postgres(FILE *cmdfd)
     const char *const *line;

     /*
-     * Just as we did for template0, and for the same reasons, assign a fixed
-     * OID to postgres and select the file_copy strategy.
+     * Comments in make_template0() mostly apply here too.
      */
     static const char *const postgres_setup[] = {
-        "CREATE DATABASE postgres OID = " CppAsString2(PostgresObjectId) " STRATEGY = file_copy;\n\n",
-        "COMMENT ON DATABASE postgres IS 'default administrative connection database';\n\n",
+        "DELETE FROM pg_database WHERE datname = 'postgres';\n\n",
+
+        "CREATE DATABASE postgres OID = " CppAsString2(PostgresObjectId)
+        " STRATEGY = file_copy;\n\n",
+
+        /*
+         * Finally vacuum to clean up dead rows in pg_database
+         */
+        "VACUUM pg_database;\n\n",
+
         NULL
     };

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 969e2a7a46..bcb81e02c4 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -2844,10 +2844,11 @@ dumpDatabase(Archive *fout)
     qdatname = pg_strdup(fmtId(datname));

     /*
-     * Prepare the CREATE DATABASE command.  We must specify encoding, locale,
-     * and tablespace since those can't be altered later.  Other DB properties
-     * are left to the DATABASE PROPERTIES entry, so that they can be applied
-     * after reconnecting to the target DB.
+     * Prepare the CREATE DATABASE command.  We must specify OID (if we want
+     * to preserve that), as well as the encoding, locale, and tablespace
+     * since those can't be altered later.  Other DB properties are left to
+     * the DATABASE PROPERTIES entry, so that they can be applied after
+     * reconnecting to the target DB.
      */
     if (dopt->binary_upgrade)
     {
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 9a2816de51..338dfca5a0 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -196,10 +196,6 @@ FullTransactionIdAdvance(FullTransactionId *dest)
 #define FirstUnpinnedObjectId    12000
 #define FirstNormalObjectId        16384

-/* OIDs of Template0 and Postgres database are fixed */
-#define Template0ObjectId        4
-#define PostgresObjectId        5
-
 /*
  * VariableCache is a data structure in shared memory that is used to track
  * OID and XID assignment state.  For largely historical reasons, there is
diff --git a/src/include/catalog/pg_database.dat b/src/include/catalog/pg_database.dat
index 5feedff7bf..6c2221a4e9 100644
--- a/src/include/catalog/pg_database.dat
+++ b/src/include/catalog/pg_database.dat
@@ -19,4 +19,24 @@
   datminmxid => '1', dattablespace => 'pg_default', datcollate => 'LC_COLLATE',
   datctype => 'LC_CTYPE', daticulocale => 'ICU_LOCALE', datacl => '_null_' },

+# The template0 and postgres entries are included here to reserve their
+# associated OIDs.  We show their correct properties for reference, but
+# note that these pg_database rows are removed and replaced by initdb.
+# Unlike the row for template1, their datfrozenxids will not be overwritten
+# during initdb's first VACUUM, so we have to provide normal-looking XIDs.
+
+{ oid => '4', oid_symbol => 'Template0ObjectId',
+  descr => 'unmodifiable empty database',
+  datname => 'template0', encoding => 'ENCODING', datlocprovider => 'LOCALE_PROVIDER', datistemplate => 't',
+  datallowconn => 'f', datconnlimit => '-1', datfrozenxid => '4',
+  datminmxid => '1', dattablespace => 'pg_default', datcollate => 'LC_COLLATE',
+  datctype => 'LC_CTYPE', daticulocale => 'ICU_LOCALE', datacl => '_null_' },
+
+{ oid => '5', oid_symbol => 'PostgresObjectId',
+  descr => 'default administrative connection database',
+  datname => 'postgres', encoding => 'ENCODING', datlocprovider => 'LOCALE_PROVIDER', datistemplate => 'f',
+  datallowconn => 't', datconnlimit => '-1', datfrozenxid => '4',
+  datminmxid => '1', dattablespace => 'pg_default', datcollate => 'LC_COLLATE',
+  datctype => 'LC_CTYPE', daticulocale => 'ICU_LOCALE', datacl => '_null_' },
+
 ]
diff --git a/src/include/catalog/unused_oids b/src/include/catalog/unused_oids
index 61d41e7561..e55bc6fa3c 100755
--- a/src/include/catalog/unused_oids
+++ b/src/include/catalog/unused_oids
@@ -32,15 +32,6 @@ my @input_files = glob("pg_*.h");

 my $oids = Catalog::FindAllOidsFromHeaders(@input_files);

-# Push the template0 and postgres database OIDs.
-my $Template0ObjectId =
-  Catalog::FindDefinedSymbol('access/transam.h', '..', 'Template0ObjectId');
-push @{$oids}, $Template0ObjectId;
-
-my $PostgresObjectId =
-  Catalog::FindDefinedSymbol('access/transam.h', '..', 'PostgresObjectId');
-push @{$oids}, $PostgresObjectId;
-
 # Also push FirstGenbkiObjectId to serve as a terminator for the last gap.
 my $FirstGenbkiObjectId =
   Catalog::FindDefinedSymbol('access/transam.h', '..', 'FirstGenbkiObjectId');

On Wed, Apr 20, 2022 at 2:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The attached draft patch attempts to improve this situation.
> It reserves these OIDs, and creates the associated macros, through
> the normal BKI infrastructure by adding entries in pg_database.dat.
> We have to delete those rows again during initdb, which is slightly
> ugly but surely no more so than initdb's other direct manipulations
> of pg_database.

I'm not sure I really like this approach, but if you're firmly
convinced that it's better than cleaning up the loose ends in some
other way, I'm not going to waste a lot of energy fighting about it.
It doesn't seem horrible.

> There are a few loose ends:
>
> * I'm a bit inclined to simplify IsPinnedObject by just teaching
> it that *no* entries of pg_database are pinned, which would correspond
> to the evident lack of enforcement in dropdb().  Can anyone see a
> reason why we might pin some database in future?

It's kind of curious that we don't pin anything now. There's kind of
nothing to stop you from hosing the system by dropping template0
and/or template1, or mutating them into some crazy form that doesn't
work. But having said that, if as a matter of policy we don't even pin
template0 or template1 or postgres, then it seems unlikely that we
would suddenly decide to pin some new database in the future.

> * I had to set up the additional pg_database entries with nonzero
> datfrozenxid to avoid an assertion failure during initdb's first VACUUM.
> (That VACUUM will overwrite template1's datfrozenxid before computing
> the global minimum frozen XID, but not these others; and it doesn't like
> finding that the minimum is zero.)  This feels klugy.  An alternative is
> to delete the extra pg_database rows before that VACUUM, which would
> mean taking those deletes out of make_template0 and make_postgres and
> putting them somewhere seemingly unrelated, so that's a bit crufty too.
> Anybody have a preference?

Not really. If anything that's an argument against this entire
approach, but I already commented on that point above.

> * The new macro names seem ill-chosen.  Template0ObjectId is spelled
> randomly differently from the longstanding TemplateDbOid, and surely
> PostgresObjectId is about as vague a name as could possibly have
> been thought of (please name an object that it couldn't apply to).
> I'm a little inclined to rename TemplateDbOid to Template1DbOid and
> use Template0DbOid and PostgresDbOid for the others, but I didn't pull
> that trigger here.

Seems totally reasonable. I don't find the current naming horrible or
I'd not have committed it that way, but putting Db in there makes
sense, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Apr 20, 2022 at 2:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The attached draft patch attempts to improve this situation.
>> It reserves these OIDs, and creates the associated macros, through
>> the normal BKI infrastructure by adding entries in pg_database.dat.
>> We have to delete those rows again during initdb, which is slightly
>> ugly but surely no more so than initdb's other direct manipulations
>> of pg_database.

> I'm not sure I really like this approach, but if you're firmly
> convinced that it's better than cleaning up the loose ends in some
> other way, I'm not going to waste a lot of energy fighting about it.

Having just had to bury my nose in renumber_oids.pl, I thought of a
different approach we could take to expose these OIDs to Catalog.pm.
That's to invent a new macro that Catalog.pm recognizes, and write
something like this in pg_database.h:

DECLARE_OID_DEFINING_MACRO(Template0ObjectId, 4);
DECLARE_OID_DEFINING_MACRO(PostgresObjectId, 5);

That would result in (a) the OIDs becoming known to Catalog.pm
as reserved, though it wouldn't have any great clue about their
semantics; and (b) this getting emitted into pg_database_d.h:

#define Template0ObjectId 4
#define PostgresObjectId 5

Then we'd not need the dummy entries in pg_database.dat, which
does seem cleaner now that I think about it.  A downside is that
with no context, Catalog.pm could not provide name translation
services during postgres.bki generation for such OIDs --- but
at least for these entries, we don't need that.

If that seems more plausible to you I'll set about preparing a patch.

            regards, tom lane



On Wed, Apr 20, 2022 at 4:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Having just had to bury my nose in renumber_oids.pl, I thought of a
> different approach we could take to expose these OIDs to Catalog.pm.
> That's to invent a new macro that Catalog.pm recognizes, and write
> something like this in pg_database.h:
>
> DECLARE_OID_DEFINING_MACRO(Template0ObjectId, 4);
> DECLARE_OID_DEFINING_MACRO(PostgresObjectId, 5);
>
> If that seems more plausible to you I'll set about preparing a patch.

I like it!

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Apr 20, 2022 at 4:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Having just had to bury my nose in renumber_oids.pl, I thought of a
>> different approach we could take to expose these OIDs to Catalog.pm.
>> That's to invent a new macro that Catalog.pm recognizes, and write
>> something like this in pg_database.h:
>> DECLARE_OID_DEFINING_MACRO(Template0ObjectId, 4);
>> DECLARE_OID_DEFINING_MACRO(PostgresObjectId, 5);

> I like it!

0001 attached is a revised patch that does it that way.  This seems
like a clearly better answer.

0002 contains the perhaps-slightly-more-controversial changes of
changing the macro names and explicitly pinning no databases.

            regards, tom lane

diff --git a/src/backend/catalog/Catalog.pm b/src/backend/catalog/Catalog.pm
index 0275795dea..ece0a934f0 100644
--- a/src/backend/catalog/Catalog.pm
+++ b/src/backend/catalog/Catalog.pm
@@ -44,6 +44,8 @@ sub ParseHeader
     $catalog{columns}     = [];
     $catalog{toasting}    = [];
     $catalog{indexing}    = [];
+    $catalog{other_oids}  = [];
+    $catalog{foreign_keys} = [];
     $catalog{client_code} = [];

     open(my $ifh, '<', $input_file) || die "$input_file: $!";
@@ -118,6 +120,14 @@ sub ParseHeader
                 index_decl => $6
               };
         }
+        elsif (/^DECLARE_OID_DEFINING_MACRO\(\s*(\w+),\s*(\d+)\)/)
+        {
+            push @{ $catalog{other_oids} },
+              {
+                other_name => $1,
+                other_oid  => $2
+              };
+        }
         elsif (
             /^DECLARE_(ARRAY_)?FOREIGN_KEY(_OPT)?\(\s*\(([^)]+)\),\s*(\w+),\s*\(([^)]+)\)\)/
           )
@@ -572,6 +582,10 @@ sub FindAllOidsFromHeaders
         {
             push @oids, $index->{index_oid};
         }
+        foreach my $other (@{ $catalog->{other_oids} })
+        {
+            push @oids, $other->{other_oid};
+        }
     }

     return \@oids;
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 520f77971b..d7e5c02f95 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -339,9 +339,11 @@ IsPinnedObject(Oid classId, Oid objectId)
      * robustness.
      */

-    /* template1 is not pinned */
+    /* template1, template0, postgres are not pinned */
     if (classId == DatabaseRelationId &&
-        objectId == TemplateDbOid)
+        (objectId == TemplateDbOid ||
+         objectId == Template0ObjectId ||
+         objectId == PostgresObjectId))
         return false;

     /* the public namespace is not pinned */
diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index 2d02b02267..f4ec6d6d40 100644
--- a/src/backend/catalog/genbki.pl
+++ b/src/backend/catalog/genbki.pl
@@ -472,7 +472,7 @@ EOM
       $catalog->{rowtype_oid_macro}, $catalog->{rowtype_oid}
       if $catalog->{rowtype_oid_macro};

-    # Likewise for macros for toast and index OIDs
+    # Likewise for macros for toast, index, and other OIDs
     foreach my $toast (@{ $catalog->{toasting} })
     {
         printf $def "#define %s %s\n",
@@ -488,6 +488,12 @@ EOM
           $index->{index_oid_macro}, $index->{index_oid}
           if $index->{index_oid_macro};
     }
+    foreach my $other (@{ $catalog->{other_oids} })
+    {
+        printf $def "#define %s %s\n",
+          $other->{other_name}, $other->{other_oid}
+          if $other->{other_name};
+    }

     print $def "\n";

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 1cb4a5b0d2..5e2eeefc4c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -59,11 +59,11 @@
 #include "sys/mman.h"
 #endif

-#include "access/transam.h"
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
 #include "catalog/pg_collation_d.h"
+#include "catalog/pg_database_d.h"    /* pgrminclude ignore */
 #include "common/file_perm.h"
 #include "common/file_utils.h"
 #include "common/logging.h"
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d3588607e7..786d592e2b 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -2901,10 +2901,11 @@ dumpDatabase(Archive *fout)
     qdatname = pg_strdup(fmtId(datname));

     /*
-     * Prepare the CREATE DATABASE command.  We must specify encoding, locale,
-     * and tablespace since those can't be altered later.  Other DB properties
-     * are left to the DATABASE PROPERTIES entry, so that they can be applied
-     * after reconnecting to the target DB.
+     * Prepare the CREATE DATABASE command.  We must specify OID (if we want
+     * to preserve that), as well as the encoding, locale, and tablespace
+     * since those can't be altered later.  Other DB properties are left to
+     * the DATABASE PROPERTIES entry, so that they can be applied after
+     * reconnecting to the target DB.
      */
     if (dopt->binary_upgrade)
     {
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 9a2816de51..338dfca5a0 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -196,10 +196,6 @@ FullTransactionIdAdvance(FullTransactionId *dest)
 #define FirstUnpinnedObjectId    12000
 #define FirstNormalObjectId        16384

-/* OIDs of Template0 and Postgres database are fixed */
-#define Template0ObjectId        4
-#define PostgresObjectId        5
-
 /*
  * VariableCache is a data structure in shared memory that is used to track
  * OID and XID assignment state.  For largely historical reasons, there is
diff --git a/src/include/catalog/genbki.h b/src/include/catalog/genbki.h
index 4ecd76f4be..992b784236 100644
--- a/src/include/catalog/genbki.h
+++ b/src/include/catalog/genbki.h
@@ -84,6 +84,14 @@
 #define DECLARE_UNIQUE_INDEX(name,oid,oidmacro,decl) extern int no_such_variable
 #define DECLARE_UNIQUE_INDEX_PKEY(name,oid,oidmacro,decl) extern int no_such_variable

+/*
+ * These lines inform genbki.pl about manually-assigned OIDs that do not
+ * correspond to any entry in the catalog *.dat files, but should be subject
+ * to uniqueness verification and renumber_oids.pl renumbering.  A C macro
+ * to #define the given name is emitted into the corresponding *_d.h file.
+ */
+#define DECLARE_OID_DEFINING_MACRO(name,oid) extern int no_such_variable
+
 /*
  * These lines are processed by genbki.pl to create a table for use
  * by the pg_get_catalog_foreign_keys() function.  We do not have any
diff --git a/src/include/catalog/pg_database.h b/src/include/catalog/pg_database.h
index e10e91c0af..96be9e9729 100644
--- a/src/include/catalog/pg_database.h
+++ b/src/include/catalog/pg_database.h
@@ -91,4 +91,13 @@ DECLARE_TOAST_WITH_MACRO(pg_database, 4177, 4178, PgDatabaseToastTable, PgDataba
 DECLARE_UNIQUE_INDEX(pg_database_datname_index, 2671, DatabaseNameIndexId, on pg_database using btree(datname
name_ops));
 DECLARE_UNIQUE_INDEX_PKEY(pg_database_oid_index, 2672, DatabaseOidIndexId, on pg_database using btree(oid oid_ops));

+/*
+ * pg_database.dat contains an entry for template1, but not for the template0
+ * or postgres databases, because those are created later in initdb.
+ * However, we still want to manually assign the OIDs for template0 and
+ * postgres, so declare those here.
+ */
+DECLARE_OID_DEFINING_MACRO(Template0ObjectId, 4);
+DECLARE_OID_DEFINING_MACRO(PostgresObjectId, 5);
+
 #endif                            /* PG_DATABASE_H */
diff --git a/src/include/catalog/renumber_oids.pl b/src/include/catalog/renumber_oids.pl
index 7de13da4bd..ba8c69c87e 100755
--- a/src/include/catalog/renumber_oids.pl
+++ b/src/include/catalog/renumber_oids.pl
@@ -170,6 +170,16 @@ foreach my $input_file (@header_files)
                 $changed = 1;
             }
         }
+        elsif (/^(DECLARE_OID_DEFINING_MACRO\(\s*\w+,\s*)(\d+)\)/)
+        {
+            if (exists $maphash{$2})
+            {
+                my $repl = $1 . $maphash{$2} . ")";
+                $line =~
+                  s/^DECLARE_OID_DEFINING_MACRO\(\s*\w+,\s*\d+\)/$repl/;
+                $changed = 1;
+            }
+        }
         elsif ($line =~ m/^CATALOG\((\w+),(\d+),(\w+)\)/)
         {
             if (exists $maphash{$2})
diff --git a/src/include/catalog/unused_oids b/src/include/catalog/unused_oids
index 61d41e7561..e55bc6fa3c 100755
--- a/src/include/catalog/unused_oids
+++ b/src/include/catalog/unused_oids
@@ -32,15 +32,6 @@ my @input_files = glob("pg_*.h");

 my $oids = Catalog::FindAllOidsFromHeaders(@input_files);

-# Push the template0 and postgres database OIDs.
-my $Template0ObjectId =
-  Catalog::FindDefinedSymbol('access/transam.h', '..', 'Template0ObjectId');
-push @{$oids}, $Template0ObjectId;
-
-my $PostgresObjectId =
-  Catalog::FindDefinedSymbol('access/transam.h', '..', 'PostgresObjectId');
-push @{$oids}, $PostgresObjectId;
-
 # Also push FirstGenbkiObjectId to serve as a terminator for the last gap.
 my $FirstGenbkiObjectId =
   Catalog::FindDefinedSymbol('access/transam.h', '..', 'FirstGenbkiObjectId');
diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml
index 33955494c6..20894baf18 100644
--- a/doc/src/sgml/bki.sgml
+++ b/doc/src/sgml/bki.sgml
@@ -180,12 +180,13 @@
 [

 # A comment could appear here.
-{ oid => '1', oid_symbol => 'TemplateDbOid',
+{ oid => '1', oid_symbol => 'Template1DbOid',
   descr => 'database\'s default template',
-  datname => 'template1', encoding => 'ENCODING', datistemplate => 't',
+  datname => 'template1', encoding => 'ENCODING',
+  datlocprovider => 'LOCALE_PROVIDER', datistemplate => 't',
   datallowconn => 't', datconnlimit => '-1', datfrozenxid => '0',
   datminmxid => '1', dattablespace => 'pg_default', datcollate => 'LC_COLLATE',
-  datctype => 'LC_CTYPE', datacl => '_null_' },
+  datctype => 'LC_CTYPE', daticulocale => 'ICU_LOCALE', datacl => '_null_' },

 ]
 ]]></programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5eabd32cf6..61cda56c6f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4540,9 +4540,9 @@ BootStrapXLOG(void)
     checkPoint.nextMulti = FirstMultiXactId;
     checkPoint.nextMultiOffset = 0;
     checkPoint.oldestXid = FirstNormalTransactionId;
-    checkPoint.oldestXidDB = TemplateDbOid;
+    checkPoint.oldestXidDB = Template1DbOid;
     checkPoint.oldestMulti = FirstMultiXactId;
-    checkPoint.oldestMultiDB = TemplateDbOid;
+    checkPoint.oldestMultiDB = Template1DbOid;
     checkPoint.oldestCommitTsXid = InvalidTransactionId;
     checkPoint.newestCommitTsXid = InvalidTransactionId;
     checkPoint.time = (pg_time_t) time(NULL);
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index d7e5c02f95..e784538aae 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -339,18 +339,20 @@ IsPinnedObject(Oid classId, Oid objectId)
      * robustness.
      */

-    /* template1, template0, postgres are not pinned */
-    if (classId == DatabaseRelationId &&
-        (objectId == TemplateDbOid ||
-         objectId == Template0ObjectId ||
-         objectId == PostgresObjectId))
-        return false;
-
     /* the public namespace is not pinned */
     if (classId == NamespaceRelationId &&
         objectId == PG_PUBLIC_NAMESPACE)
         return false;

+    /*
+     * Databases are never pinned.  It might seem that it'd be prudent to pin
+     * at least template0; but we do this intentionally so that template0 and
+     * template1 can be rebuilt from each other, thus letting them serve as
+     * mutual backups (as long as you've not modified template1, anyway).
+     */
+    if (classId == DatabaseRelationId)
+        return false;
+
     /*
      * All other initdb-created objects are pinned.  This is overkill (the
      * system doesn't really depend on having every last weird datatype, for
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 9139fe895c..5dbc7379e3 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -908,7 +908,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
      */
     if (bootstrap)
     {
-        MyDatabaseId = TemplateDbOid;
+        MyDatabaseId = Template1DbOid;
         MyDatabaseTableSpace = DEFAULTTABLESPACE_OID;
     }
     else if (in_dbname != NULL)
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 5e2eeefc4c..fcef651c2f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1812,8 +1812,8 @@ make_template0(FILE *cmdfd)
      * be a little bit slower and make the new cluster a little bit bigger.
      */
     static const char *const template0_setup[] = {
-        "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID = "
-        CppAsString2(Template0ObjectId)
+        "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false"
+        " OID = " CppAsString2(Template0DbOid)
         " STRATEGY = file_copy;\n\n",

         /*
@@ -1862,7 +1862,8 @@ make_postgres(FILE *cmdfd)
      * OID to postgres and select the file_copy strategy.
      */
     static const char *const postgres_setup[] = {
-        "CREATE DATABASE postgres OID = " CppAsString2(PostgresObjectId) " STRATEGY = file_copy;\n\n",
+        "CREATE DATABASE postgres OID = " CppAsString2(PostgresDbOid)
+        " STRATEGY = file_copy;\n\n",
         "COMMENT ON DATABASE postgres IS 'default administrative connection database';\n\n",
         NULL
     };
diff --git a/src/include/catalog/pg_database.dat b/src/include/catalog/pg_database.dat
index 5feedff7bf..05873f74f6 100644
--- a/src/include/catalog/pg_database.dat
+++ b/src/include/catalog/pg_database.dat
@@ -12,7 +12,7 @@

 [

-{ oid => '1', oid_symbol => 'TemplateDbOid',
+{ oid => '1', oid_symbol => 'Template1DbOid',
   descr => 'default template for new databases',
   datname => 'template1', encoding => 'ENCODING', datlocprovider => 'LOCALE_PROVIDER', datistemplate => 't',
   datallowconn => 't', datconnlimit => '-1', datfrozenxid => '0',
diff --git a/src/include/catalog/pg_database.h b/src/include/catalog/pg_database.h
index 96be9e9729..611c95656a 100644
--- a/src/include/catalog/pg_database.h
+++ b/src/include/catalog/pg_database.h
@@ -97,7 +97,7 @@ DECLARE_UNIQUE_INDEX_PKEY(pg_database_oid_index, 2672, DatabaseOidIndexId, on pg
  * However, we still want to manually assign the OIDs for template0 and
  * postgres, so declare those here.
  */
-DECLARE_OID_DEFINING_MACRO(Template0ObjectId, 4);
-DECLARE_OID_DEFINING_MACRO(PostgresObjectId, 5);
+DECLARE_OID_DEFINING_MACRO(Template0DbOid, 4);
+DECLARE_OID_DEFINING_MACRO(PostgresDbOid, 5);

 #endif                            /* PG_DATABASE_H */

On Thu, Apr 21, 2022 at 1:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Wed, Apr 20, 2022 at 4:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Having just had to bury my nose in renumber_oids.pl, I thought of a
> >> different approach we could take to expose these OIDs to Catalog.pm.
> >> That's to invent a new macro that Catalog.pm recognizes, and write
> >> something like this in pg_database.h:
> >> DECLARE_OID_DEFINING_MACRO(Template0ObjectId, 4);
> >> DECLARE_OID_DEFINING_MACRO(PostgresObjectId, 5);
>
> > I like it!
>
> 0001 attached is a revised patch that does it that way.  This seems
> like a clearly better answer.
>
> 0002 contains the perhaps-slightly-more-controversial changes of
> changing the macro names and explicitly pinning no databases.

Both patches look good to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Apr 21, 2022 at 1:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 0001 attached is a revised patch that does it that way.  This seems
>> like a clearly better answer.
>> 0002 contains the perhaps-slightly-more-controversial changes of
>> changing the macro names and explicitly pinning no databases.

> Both patches look good to me.

Pushed, thanks for looking.

            regards, tom lane



better page-level checksums

From
Robert Haas
Date:
On Thu, Oct 7, 2021 at 11:50 AM Stephen Frost <sfrost@snowman.net> wrote:
> Alternatively, we could use
> that technique to just provide a better per-page checksum than what we
> have today.  Maybe we could figure out how to leverage that to move to
> 64bit transaction IDs with some page-level epoch.

I'm interested in assessing the feasibility of a "better page-level
checksums" feature. I have a few questions, and a few observations.
One of my questions is what algorithm(s) we'd want to support. I did a
quick Google search and found that btrfs supports CRC-32C, XXHASH,
SHA256, and BLAKE2B. I don't know that we want to support that many
options (but maybe we do) and I don't think CRC-32C makes any sense
here, for two reasons. First, we've already got a 16-bit checksum, and
a 32-bit checksum doesn't seem like it's gaining enough to be worth
the implementation complexity. Second, we're probably going to have to
dole out per-page space in multiples of MAXALIGN, and that's usually
8. I think for this purpose we should limit ourselves to algorithms
whose output size is, at minimum, 64 bits, and ideally, a multiple of
64 bits. I'm sure there are plenty of options other than the ones that
btrfs uses; I mentioned them only as a way of jump-starting the
discussion. Note that SHA-256 and BLAKE2B apparently emit enormously
wide 32-byte checksums. That's a lot of space to consume with a
checksum, but your chances of a collision are very small indeed.

Even if we only offer one new kind of checksum, making space for a
wider checksum makes the page format variable in a way that it
currently isn't. There seem to be about 50 compile-time constants in
the source code whose values are computed based on the block size and
amount of special space in use by some particular AM (yikes!). For
example, for the heap, there's stuff like MaxHeapTuplesPerPage and
MaxHeapTupleSize. If in the future we have some pages that are just
like the ones we have today, and other clusters where we've allowed
space for a checksum, then those constants become run-time variable.
And since they're used in some very low-level functions that are
called a lot, like PageGetHeapFreeSpace(), that seems potentially
problematic. The problem is multiplied if you also think about trying
to store an epoch on each heap page, as per Stephen's proposal above,
because now every page used by any AM whatsoever might or might not
have a checksum, and every heap page might also have or not have an
epoch XID. I think it's going to be somewhat tricky to figure out a
scheme here that avoids making things slow. Originally I was thinking
that things like MaxHeapTuplesPerPage ought to become macros or static
inline functions, but now I have what I think is a better idea: make
them into global variables and initialize them with the appropriate
values for the cluster right after we read the control file. This
doesn't solve the problem if some pages are different than others,
though, and even for the case where every page in the cluster has the
same amount of reserved space, reading a global variable is not going
to be as efficient as referring to a constant compiled right into the
code. I'm hopeful that all of this is solvable somehow, but it's
hairy, for sure.
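
To make that concrete, here is a minimal sketch of the global-variable
idea, assuming a hypothetical reserved_page_size value read from the
control file (no such field exists in pg_control today):

    /*
     * Today MaxHeapTuplesPerPage is a compile-time constant; see
     * access/htup_details.h:
     *
     *   ((int) ((BLCKSZ - SizeOfPageHeaderData) /
     *           (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
     *
     * With per-page reserved space it would become a global variable,
     * initialized once right after we read the control file.
     */
    int         MaxHeapTuplesPerPage;

    void
    InitPageGeometry(int reserved_page_size)
    {
        MaxHeapTuplesPerPage =
            (int) ((BLCKSZ - SizeOfPageHeaderData - reserved_page_size) /
                   (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData)));
    }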

Another thing I realized is that we would probably end up with the
pd_checksum unused when this other feature is activated. If someone
comes up with a clever idea for how to allocate extra space without
needing things to be a multiple of MAXIMUM_ALIGNOF, they could
potentially shrink the space they need elsewhere by 2 bytes and then
use both that space and pd_checksum, but otherwise pd_checksum is
likely to be dead when an enhanced checksum feature is in use. Since
it's also dead when checksums are turned off, that's probably OK. I
suppose another possibility is to allow both to be turned on and off
independently, i.e. let someone have both a Fletcher-16 checksum in
pd_checksum, and also a wider checksum in this other chunk of space,
but I'm not sure whether that's really a useful thing to be able to
do. (Opinions?)

I'm also a little fuzzy on what the command-line interface for
selecting this functionality would look like. The existing option to
initdb is just --data-checksums, which doesn't leave any way to say
what kind of checksums you want.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Peter Geoghegan
Date:
On Thu, Jun 9, 2022 at 2:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I'm interested in assessing the feasibility of a "better page-level
> checksums" feature. I have a few questions, and a few observations.
> One of my questions is what algorithm(s) we'd want to support. I did a
> quick Google search and found that btrfs supports CRC-32C, XXHASH,
> SHA256, and BLAKE2B. I don't know that we want to support that many
> options (but maybe we do) and I don't think CRC-32C makes any sense
> here, for two reasons. First, we've already got a 16-bit checksum, and
> a 32-bit checksum doesn't seem like it's gaining enough to be worth
> the implementation complexity.

Why not? The only problems that it won't solve are all related to
crypto. Which is perfectly fine, but it seems like there is a
terminology issue here. ISTM that you're really talking about adding a
cryptographic hash function, not a checksum. These are rather
different things.

> Even if we only offer one new kind of checksum, making space for a
> wider checksum makes the page format variable in a way that it
> currently isn't.

I believe that the page special area was designed to be
variable-sized, and even anticipates dynamic resizing of the special
area. At least in index AMs, where it's not that hard to make extra
space in the special area by shifting the tuples back, and then fixing
line pointers to point to the new offsets. So you have a dynamic
variable-sized array that's a little like a second line pointer array
(though probably not added to all that often).
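
As a minimal sketch of that shifting (an assumed helper, not existing
code; it uses the usual bufpage.h/itemid.h definitions and assumes the
caller holds an exclusive lock on the buffer):

    static bool
    PageGrowSpecial(Page page, Size extra)
    {
        PageHeader  phdr = (PageHeader) page;
        Size        tupspace;
        OffsetNumber off;

        extra = MAXALIGN(extra);
        if ((Size) (phdr->pd_upper - phdr->pd_lower) < extra)
            return false;       /* not enough free space to shift into */

        /* slide the tuple data back, toward the line pointer array */
        tupspace = phdr->pd_special - phdr->pd_upper;
        memmove((char *) page + phdr->pd_upper - extra,
                (char *) page + phdr->pd_upper,
                tupspace);

        /* fix the line pointers to point to the new offsets */
        for (off = FirstOffsetNumber;
             off <= PageGetMaxOffsetNumber(page);
             off++)
        {
            ItemId      lp = PageGetItemId(page, off);

            if (ItemIdHasStorage(lp))
                lp->lp_off -= extra;
        }

        phdr->pd_upper -= extra;
        phdr->pd_special -= extra;  /* special area now starts earlier */
        return true;
    }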

My preference is for an approach that builds on that, or at least
doesn't significantly complicate it. So a cryptographic hash or nonce
can go in the special area proper (structs like BTPageOpaqueData don't
need any changes), but at a page offset before the special area proper
-- not after.

What disadvantages does that approach have, if any, from your point of view?

-- 
Peter Geoghegan



Re: better page-level checksums

From
Peter Geoghegan
Date:
On Thu, Jun 9, 2022 at 2:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> My preference is for an approach that builds on that, or at least
> doesn't significantly complicate it. So a cryptographic hash or nonce
> can go in the special area proper (structs like BTPageOpaqueData don't
> need any changes), but at a page offset before the special area proper
> -- not after.

Minor correction: I meant "before structs like BTPageOpaqueData,
earlier in the page and in the special area proper".

-- 
Peter Geoghegan



Re: better page-level checksums

From
Matthias van de Meent
Date:
On Thu, 9 Jun 2022 at 23:13, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Oct 7, 2021 at 11:50 AM Stephen Frost <sfrost@snowman.net> wrote:
> > Alternatively, we could use
> > that technique to just provide a better per-page checksum than what we
> > have today.  Maybe we could figure out how to leverage that to move to
> > 64bit transaction IDs with some page-level epoch.
>
> I'm interested in assessing the feasibility of a "better page-level
> checksums" feature. I have a few questions, and a few observations.
> One of my questions is what algorithm(s) we'd want to support. I did a
> quick Google search and found that btrfs supports CRC-32C, XXHASH,
> SHA256, and BLAKE2B. I don't know that we want to support that many
> options (but maybe we do) and I don't think CRC-32C makes any sense
> here, for two reasons. First, we've already got a 16-bit checksum, and
> a 32-bit checksum doesn't seem like it's gaining enough to be worth
> the implementation complexity. Second, we're probably going to have to
> dole out per-page space in multiples of MAXALIGN, and that's usually
> 8.

Why so? We already dole out per-page space in 4-byte increments
through pd_linp, and I see no reason why we can't reserve some line
pointers for per-page metadata if we decide that we need extra
per-page ~overhead~ metadata.

>  I think for this purpose we should limit ourselves to algorithms
> whose output size is, at minimum, 64 bits, and ideally, a multiple of
> 64 bits. I'm sure there are plenty of options other than the ones that
> btrfs uses; I mentioned them only as a way of jump-starting the
> discussion. Note that SHA-256 and BLAKE2B apparently emit enormously
> wide 32-byte checksums. That's a lot of space to consume with a
> checksum, but your chances of a collision are very small indeed.

Isn't the goal of a checksum to find - and where possible, correct -
bit flips and other broken pages? I would suggest not to use
cryptographic hash functions for that, as those are rarely
error-correcting.

> Even if we only offer one new kind of checksum, making space for a
> wider checksum makes the page format variable in a way that it
> currently isn't. There seem to be about 50 compile-time constants in
> the source code whose values are computed based on the block size and
> amount of special space in use by some particular AM (yikes!).

Isn't that expected for most of those places? With the current
bufpage.h description of Page, it seems obvious that all bytes on a
page except those in the "hole" and those in the page header are under
full control of the AM. Of course AMs will pre-calculate limits and
offsets during compilation; that saves recalculation cycles and/or
cache lines with constants to keep in L1.

>  For
> example, for the heap, there's stuff like MaxHeapTuplesPerPage and
> MaxHeapTupleSize. If in the future we have some pages that are just
> like the ones we have today, and other clusters where we've allowed
> space for a checksum, then those constants become run-time variable.
> And since they're used in some very low-level functions that are
> called a lot, like PageGetHeapFreeSpace(), that seems potentially
> problematic. The problem is multiplied if you also think about trying
> to store an epoch on each heap page, as per Stephen's proposal above,
> because now every page used by any AM whatsoever might or might not
> have a checksum, and every heap page might also have or not have an
> epoch XID. I think it's going to be somewhat tricky to figure out a
> scheme here that avoids making things slow.

Can't we add some extra fork that stores this extra per-page
information, keeping the metadata in a double-buffered format, so that
the metadata is written to disk before the actual page is written,
while the old metadata remains available for recovery purposes?  This
allows us to maintain the current format with
its low per-page overhead, and only have extra overhead (up to 2x
writes for each page, but the writes for these metadata pages need not
be BLCKSZ in size) for those that opt-in to the more computationally
expensive features of larger checksums, nonces, and/or other non-AM
per-page ~overhead~ metadata.

> Originally I was thinking
> that things like MaxHeapTuplesPerPage ought to become macros or static
> inline functions, but now I have what I think is a better idea: make
> them into global variables and initialize them with the appropriate
> values for the cluster right after we read the control file. This
> doesn't solve the problem if some pages are different than others,
> though, and even for the case where every page in the cluster has the
> same amount of reserved space, reading a global variable is not going
> to be as efficient as referring to a constant compiled right into the
> code. I'm hopeful that all of this is solvable somehow, but it's
> hairy, for sure.
>
> Another thing I realized is that we would probably end up with the
> pd_checksum unused when this other feature is activated. If someone
> comes up with a clever idea for how to allocate extra space without
> needing things to be a multiple of MAXIMUM_ALIGNOF, they could
> potentially shrink the space they need elsewhere by 2 bytes and then
> use both that space and pd_checksum, but otherwise pd_checksum is
> likely to be dead when an enhanced checksum feature is in use. Since
> it's also dead when checksums are turned off, that's probably OK. I
> suppose another possibility is to allow both to be turned on and off
> independently, i.e. let someone have both a Fletcher-16 checksum in
> pd_checksum, and also a wider checksum in this other chunk of space,
> but I'm not sure whether that's really a useful thing to be able to
> do. (Opinions?)

I'd prefer if we didn't change the way pages are presented to AMs.
Currently, it is clear what area is available to you if you write an
AM that uses the bufpage APIs. Changing the page format to have the
buffer manager also touch / reserve space in the special areas seems
like a break of abstraction: Quoting from bufpage.h:

 * AM-generic per-page information is kept in PageHeaderData.
 *
 * AM-specific per-page data (if any) is kept in the area marked "special
 * space"; each AM has an "opaque" structure defined somewhere that is
 * stored as the page trailer.  an access method should always
 * initialize its pages with PageInit and then set its own opaque
 * fields.

I'd rather we keep this contract: am-generic stuff belongs in
PageHeaderData, with the rest of the page fully available for the AM
to use (including the special area).


Kind regards,

Matthias van de Meent



Re: better page-level checksums

From
Fabien COELHO
Date:
Hello Robert,

> I think for this purpose we should limit ourselves to algorithms
> whose output size is, at minimum, 64 bits, and ideally, a multiple of
> 64 bits. I'm sure there are plenty of options other than the ones that
> btrfs uses; I mentioned them only as a way of jump-starting the
> discussion. Note that SHA-256 and BLAKE2B apparently emit enormously
> wide 32-byte checksums. That's a lot of space to consume with a
> checksum, but your chances of a collision are very small indeed.

My 0.02€ about that:

You do not have to store the whole hash algorithm output; you can
truncate or reduce it (e.g. by XORing parts) to whatever size makes
sense for your
application and security requirements. ISTM that 64 bits is more than 
enough for a page checksum, whatever the underlying hash algorithm.
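
A minimal sketch of that reduction, assuming a 32-byte digest (e.g.
SHA-256) folded down to 64 bits by XORing 8-byte words:

    #include <stdint.h>
    #include <string.h>

    /* XOR-fold a 32-byte digest into a 64-bit checksum */
    static uint64_t
    fold_digest_64(const unsigned char digest[32])
    {
        uint64_t    result = 0;

        for (int i = 0; i < 32; i += 8)
        {
            uint64_t    word;

            memcpy(&word, digest + i, sizeof(word));
            result ^= word;
        }
        return result;
    }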

Also, ISTM that a checksum algorithm does not really need to be 
cryptographically strong, which means that cheaper alternatives are ok, 
although good quality should be sought nevertheless.

-- 
Fabien.

Re: better page-level checksums

From
Andrey Borodin
Date:


On Fri, Jun 10, 2022 at 5:00 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

> Can't we add some extra fork that stores this extra per-page
> information, keeping the metadata in a double-buffered format

+1 for this approach. I have observed some painful corruption cases
where block storage simply returned a stale version of a range of
blocks. That went undetected precisely because the checksum is stored
on the page itself.

A special fork for checksums would allow us to better detect failures
in SSD firmware, MMU SEUs, the OS page cache, backup software, and
storage. It may seem that this kind of thing never happens, but the
probability of such a failure is drastically higher than the
probability of a hardware failure going undetected due to a CRC-16
collision.

Also, I'm skeptical about correcting detected errors with information
from the checksum: that approach requires a very large checksum, and
it's much easier to obtain a fresh copy of the block from an HA
standby.

Best regards, Andrey Borodin.
 

Re: better page-level checksums

From
Robert Haas
Date:
On Thu, Jun 9, 2022 at 5:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Why not? The only problems that it won't solve are all related to
> crypto. Which is perfectly fine, but it seems like there is a
> terminology issue here. ISTM that you're really talking about adding a
> cryptographic hash function, not a checksum. These are rather
> different things.

I don't think those are mutually exclusive categories. I shall cite
Wikipedia: "Cryptographic hash ... can also be used as ordinary hash
functions, to index data in hash tables, for fingerprinting, to detect
duplicate data or uniquely identify files, and as checksums to detect
accidental data corruption."[1] There is also PostgreSQL precedent in
the form of the --manifest-checksums argument to pg_basebackup, whose
legal values are SHA{224,256,384,512}|CRC32C|NONE. The man page for
the "shasum" utility says that the purpose of the command is to "Print
or Check SHA Checksums".

I'm not perfectly attached to the idea of using SHA here, but it seems
to me that's pretty much the standard thing these days. Stephen Frost
and David Steele pushed hard for SHA checksums in backup manifests,
and actually wanted it to be the default.

I think that if you're the kind of person who looks at our existing
page checksums and finds them too weak, I doubt that CRC-32C is going
to make you feel any better. You're probably the sort of person who
thinks that checksums should have a lot of bits, and you're probably
not going to be satisfied with the properties of an algorithm invented
in the 1960s. Of course if there's anyone out there who thinks that
our existing 16-bit checksums are a pile of garbage but would be much
happier if CRC-32C is an option, I am happy to have them show up here
and say so, but I find it much more likely that people who want this
kind of feature would advocate for a more modern algorithm.

> My preference is for an approach that builds on that, or at least
> doesn't significantly complicate it. So a cryptographic hash or nonce
> can go in the special area proper (structs like BTPageOpaqueData don't
> need any changes), but at a page offset before the special area proper
> -- not after.
>
> What disadvantages does that approach have, if any, from your point of view?

I think it would be an extremely good idea to store the extended
checksum at the same offset in every page. Right now, code that wants
to compute checksums, or a tool like pg_checksums that wants to verify
them, can find the checksum without needing to interpret any of the
remaining page contents. Things get sticky if you have to interpret
the page contents to locate the checksum that's going to tell you
whether the page contents are messed up. Perhaps this could be worked
around if you tried hard enough, but I don't see what we get out of
it. I don't think that putting the checksum at the very end of
every page precludes using variable-size special space in the AMs, or
even complicates it much, because if there's a fixed-length block of
stuff at the end of every page, you can easily account for that.
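
As a minimal sketch of what a fixed offset buys a tool, assuming a
hypothetical amount of trailer space reserved at initdb time:

    /* hypothetical: bytes reserved at the end of every page */
    #define EXT_CHECKSUM_SIZE   8

    /* locate the checksum without interpreting the page contents */
    static inline char *
    PageGetExtendedChecksum(char *page)
    {
        return page + BLCKSZ - EXT_CHECKSUM_SIZE;
    }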

There's a lot less code that cares about the space above pd_special
than there is code that cares about any other portion of the page.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

[1] https://en.wikipedia.org/wiki/Cryptographic_hash_function



Re: better page-level checksums

From
Peter Eisentraut
Date:
On 10.06.22 15:16, Robert Haas wrote:
> I'm not perfectly attached to the idea of using SHA here, but it seems
> to me that's pretty much the standard thing these days. Stephen Frost
> and David Steele pushed hard for SHA checksums in backup manifests,
> and actually wanted it to be the default.

That seems like a reasonable use in that application, since you might 
want to verify whether a backup has been (maliciously?) altered rather 
than just accidentally bit flipped.

> I think that if you're the kind of person who looks at our existing
> page checksums and finds them too weak, I doubt that CRC-32C is going
> to make you feel any better. You're probably the sort of person who
> thinks that checksums should have a lot of bits, and you're probably
> not going to be satisfied with the properties of an algorithm invented
> in the 1960s. Of course if there's anyone out there who thinks that
> our existing 16-bit checksums are a pile of garbage but would be much
> happier if CRC-32C is an option, I am happy to have them show up here
> and say so, but I find it much more likely that people who want this
> kind of feature would advocate for a more modern algorithm.

I think there ought to be a bit more principled analysis here than just 
"let's add a lot more bits".  There is probably some kind of information 
to be had about how many CRC bits are useful for a given block size, say.
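
As a rough baseline, assuming a random-corruption model in which a
well-mixed k-bit code passes a corrupted page with probability about
2^-k (CRCs additionally make deterministic burst-detection guarantees,
and those do depend on message length):

    P(\mathrm{undetected}) \approx 2^{-k}:\quad
    2^{-16} \approx 1.5\times 10^{-5},\quad
    2^{-32} \approx 2.3\times 10^{-10},\quad
    2^{-64} \approx 5.4\times 10^{-20}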

And then there is the question of performance.  When data checksums were
first added, there was a lot of concern about that.  CRC is usually
baked directly into hardware, so it's about as cheap as we can hope for.
SHA, not so much.



Re: better page-level checksums

From
Robert Haas
Date:
On Thu, Jun 9, 2022 at 8:00 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Why so? We already dole out per-page space in 4-byte increments
> through pd_linp, and I see no reason why we can't reserve some line
> pointers for per-page metadata if we decide that we need extra
> per-page ~overhead~ metadata.

Hmm, that's an interesting approach. I was thinking that putting data
after the PageHeaderData struct would be a non-starter because the
code that looks up a line pointer by index is currently just
multiply-and-add and complicating it seems bad for performance.
However, if we treated the space there as overlapping the line pointer
array and making some line pointers unusable rather than something
inserted prior to the line pointer array, we could avoid that. I still
think it would be kind of complicated, though, because we'd have to
find every bit of code that loops over the line pointer array or
accesses it by index and make sure that it doesn't try to access the
low-numbered line pointers.

> Isn't the goal of a checksum to find - and where possible, correct -
> bit flips and other broken pages? I would suggest not to use
> cryptographic hash functions for that, as those are rarely
> error-correcting.

I wasn't thinking of trying to do error correction, just error
detection. See also my earlier reply to Peter Geoghegan.

> Isn't that expected for most of those places? With the current
> bufpage.h description of Page, it seems obvious that all bytes on a
> page except those in the "hole" and those in the page header are under
> full control of the AM. Of course AMs will pre-calculate limits and
> offsets during compilation, that saves recalculation cycles and/or
> cache lines with constants to keep in L1.

Yep.

> Can't we add some extra fork that stores this extra per-page
> information, and contains this extra metadata in a double-buffered
> format, so that both before the actual page is written the metadata
> too is written to disk, while the old metadata is available too for
> recovery purposes. This allows us to maintain the current format with
> its low per-page overhead, and only have extra overhead (up to 2x
> writes for each page, but the writes for these metadata pages need not
> be BLCKSZ in size) for those that opt-in to the more computationally
> expensive features of larger checksums, nonces, and/or other non-AM
> per-page ~overhead~ metadata.

It's not impossible, I'm sure, but it doesn't seem very appealing to
me. Those extra reads and writes could be expensive, and there's no
place to cleanly integrate them into the code structure. A function
like PageIsVerified() -- which is where we currently validate
checksums -- only gets the page. It can't go off and read some other
page from disk to perform the checksum calculation.

I'm not exactly sure what you have in mind when you say that the
writes need not be BLCKSZ in size. Technically I guess that's true,
but then the checksums have to be crash safe, or they're not much
good. If they're not part of the page, how do they get updated in a
way that makes them crash safe? I guess it could be done: every time
we write an FPW, enlarge the page image by the number of bytes that are
stored in this location. When replaying an FPW, update those bytes
too. And every time we read or write a page, also read or write those
bytes. In essence, we'd be deciding that pages are 8192+n bytes, but
the last n bytes are stored in a different file - and, in memory, a
different buffer pool. I think that would be hugely invasive and
unpleasant to make work and I think the performance would be poor,
too.

> I'd prefer if we didn't change the way pages are presented to AMs.
> Currently, it is clear what area is available to you if you write an
> AM that uses the bufpage APIs. Changing the page format to have the
> buffer manager also touch / reserve space in the special areas seems
> like a break of abstraction: Quoting from bufpage.h:
>
>  * AM-generic per-page information is kept in PageHeaderData.
>  *
>  * AM-specific per-page data (if any) is kept in the area marked "special
>  * space"; each AM has an "opaque" structure defined somewhere that is
>  * stored as the page trailer.  an access method should always
>  * initialize its pages with PageInit and then set its own opaque
>  * fields.
>
> I'd rather we keep this contract: am-generic stuff belongs in
> PageHeaderData, with the rest of the page fully available for the AM
> to use (including the special area).

I don't think that changing the contract has to mean that it becomes
unclear what the contract is. And you can't improve any system without
changing some stuff. But you certainly don't have to like my ideas or
anything....

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Robert Haas
Date:
On Fri, Jun 10, 2022 at 9:36 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> I think there ought to be a bit more principled analysis here than just
> "let's add a lot more bits".  There is probably some kind of information
> to be had about how many CRC bits are useful for a given block size, say.
>
> And then there is the question of performance.  When data checksums were
> first added, there was a lot of concern about that.  CRC is usually
> baked directly into hardware, so it's about as cheap as we can hope for.
> SHA, not so much.

That's all pretty fair. I have to admit that SHA checksums sound quite
expensive, and also that I'm no expert on what kinds of checksums
would be best for this sort of application. Based on the earlier
discussions around TDE, I do think that people want tamper-resistant
checksums here too -- like maybe something where you can't recompute
the checksum without access to some secret. I could propose naive ways
to do that, like prepending a fixed chunk of secret bytes to the
beginning of every block and then running SHA512 or something over the
result, but I'm sure that people with actual knowledge of cryptography
have developed much better and more robust ways of doing this sort of
thing.
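
For what it's worth, the textbook construction for a keyed checksum is
an HMAC rather than a bare secret-prefix hash (secret-prefix SHA-512 is
vulnerable to length extension). A minimal sketch using OpenSSL, purely
illustrative:

    #include <openssl/evp.h>
    #include <openssl/hmac.h>

    /* keyed digest over the page image; error handling omitted */
    unsigned char tag[EVP_MAX_MD_SIZE];
    unsigned int  taglen;

    HMAC(EVP_sha256(), key, key_len,
         (const unsigned char *) page, BLCKSZ, tag, &taglen);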

I've really been devoting most of my mental energy here to
understanding what problems there are at the PostgreSQL level - i.e.
when we carve out bytes for a wider checksum, what breaks? The only
research that I did to try to understand what algorithms might make
sense was a quick Google search, which led me to the list of
algorithms that btrfs uses. I figured that was a good starting point
because, like a filesystem, we're encrypting fixed-size blocks of
data. However, I didn't intend to present the results of that quick
look as the definitive answer to the question of what might make sense
for PostgreSQL, and would be interested in hearing what you or anyone
else thinks about that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Jun 10, 2022 at 9:36 AM Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
> > I think there ought to be a bit more principled analysis here than just
> > "let's add a lot more bits".  There is probably some kind of information
> > to be had about how many CRC bits are useful for a given block size, say.
> >
> > And then there is the question of performance.  When data checksum were
> > first added, there was a lot of concern about that.  CRC is usually
> > baked directly into hardware, so it's about as cheap as we can hope for.
> >   SHA not so much.
>
> That's all pretty fair. I have to admit that SHA checksums sound quite
> expensive, and also that I'm no expert on what kinds of checksums
> would be best for this sort of application. Based on the earlier
> discussions around TDE, I do think that people want tamper-resistant
> checksums here too -- like maybe something where you can't recompute
> the checksum without access to some secret. I could propose naive ways
> to do that, like prepending a fixed chunk of secret bytes to the
> beginning of every block and then running SHA512 or something over the
> result, but I'm sure that people with actual knowledge of cryptography
> have developed much better and more robust ways of doing this sort of
> thing.

So, it's not quite as simple as use X or use Y, we need to be
considering the use case too.  In particular, the amount of data that's
being hash'd is relevant when it comes to making a decision about what
hash or checksum to use.  When you're talking about (potentially) 1G
segment files, you'll want to use something different (like SHA) vs.
when you're talking about an 8K block (not that you couldn't use SHA,
but it may very well be overkill for it).

In terms of TDE, that's yet a different use-case and you'd want to use
AE (authenticated encryption) + AAD (additional authenticated data) and
the result of that operation is a block which has some amount of
unencrypted data (eg: LSN, potentially used as the IV), some amount of
encrypted data (eg: everything else), and then space to store the tag
(which can be thought of as, but is *distinct* from, a hash of the
encrypted data + the additional unencrypted data, where the latter would
include the unencrypted data on the block, like the LSN, plus other
information that we want to include like the qualified path+filename of
the file as relevant to the PGDATA root).  If our goal is
cryptographically authenticated and encrypted data pages (which I believe
is at least one of our goals) then we're talking about encryption
methods like AES GCM which handle production of the tag for us and with
that tag we would *not* need to have any independent hash or checksum for
the block (though we could, but that should really be included in the
*encrypted* section, as hashing unencrypted data and then storing that
hash unencrypted could potentially leak information that we'd rather
not).
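
To make that concrete, a minimal sketch of the AES-GCM flow with
OpenSSL's EVP interface (variable names and the AAD contents here are
illustrative, error handling omitted):

    EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);

    /* AAD: authenticated but not encrypted (e.g. LSN, relation path) */
    EVP_EncryptUpdate(ctx, NULL, &len, aad, aad_len);

    /* encrypt the page body */
    EVP_EncryptUpdate(ctx, ciphertext, &len, plaintext, plaintext_len);
    EVP_EncryptFinal_ex(ctx, ciphertext + len, &len);

    /* fetch the 16-byte authentication tag to store on the page */
    EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);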

Note that NIST has put out information regarding how big a tag is
appropriate for how much data is being encrypted with a given
authenticated encryption method such as AES GCM.  I recall Robert
finding similar information for hashing/checksumming of unencrypted
data from a similar source and that'd make sense to consider when
talking about *just* adding a hash/checksum for unencrypted data blocks.

This is the relevant discussion from NIST on this subject:

https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-38d.pdf

Note particularly Appendix C: Requirements and Guidelines for Using
Short Tags (though, really, the whole thing is good to read..).

> I've really been devoting most of my mental energy here to
> understanding what problems there are at the PostgreSQL level - i.e.
> when we carve out bytes for a wider checksum, what breaks? The only
> research that I did to try to understand what algorithms might make
> sense was a quick Google search, which led me to the list of
> algorithms that btrfs uses. I figured that was a good starting point
> because, like a filesystem, we're encrypting fixed-size blocks of
> data. However, I didn't intend to present the results of that quick
> look as the definitive answer to the question of what might make sense
> for PostgreSQL, and would be interested in hearing what you or anyone
> else thinks about that.

In the thread about checksum/hashes for the backup manifest, I was
pretty sure you found some information regarding the amount of data
being hashed vs. the size you want the hash/checksum to be and that
seems like it'd be particularly relevant for this discussion (as it was
for backups, at least as I recall..).  Hopefully we can go find that.

Thanks,

Stephen


Re: better page-level checksums

From
Stephen Frost
Date:
Greetings,

* Andrey Borodin (x4m@double.cloud) wrote:
> On Fri, Jun 10, 2022 at 5:00 AM Matthias van de Meent <
> boekewurm+postgres@gmail.com> wrote:
> > Can't we add some extra fork that stores this extra per-page
> > information, and contains this extra metadata
>
> +1 for this approach. I had observed some painful corruption cases where
> block storage simply returned a stale version of a range of blocks. This
> is only possible because the checksum is stored on the page itself.
> A special fork for checksums would allow us to better detect failures in
> SSD firmwares, MMU SEUs, etc., OS page cache, backup software and storage.
> It may seem that this kind of stuff never happens. But the probability of
> such a failure is drastically higher than the probability of a hardware
> failure going undetected due to a CRC16 collision.

This is another possible approach, sure, but it has its own downsides:
clearly more IO ends up being involved and then you also have to deal
with the fact that the fork's page would certainly end up covering a lot
of the pages in the main relation, not to mention the question of what
to do when we want to get checksums *on forks*, which we surely will
want to have...

> Also I'm skeptical about correcting detected errors with the information
> from checksum. This approach requires very very large checksum. It's much
> easier to obtain fresh block copy from HA standby.

Yeah, error correcting checksums are yet another use-case and one that
would require a lot more space.

Thanks,

Stephen


Re: better page-level checksums

From
Stephen Frost
Date:
Greetings,

* Fabien COELHO (coelho@cri.ensmp.fr) wrote:
> >I think for this purpose we should limit ourselves to algorithms
> >whose output size is, at minimum, 64 bits, and ideally, a multiple of
> >64 bits. I'm sure there are plenty of options other than the ones that
> >btrfs uses; I mentioned them only as a way of jump-starting the
> >discussion. Note that SHA-256 and BLAKE2B apparently emit enormously
> >wide 32-byte checksums. That's a lot of space to consume with a
> >checksum, but your chances of a collision are very small indeed.
>
> My 0.02€ about that:
>
> You do not have to store the whole hash algorithm output, you can truncate
> or reduce (eg by xoring parts) the size to what makes sense for your
> application and security requirements. ISTM that 64 bits is more than enough
> for a page checksum, whatever the underlying hash algorithm.

Agreed on this- but we shouldn't be guessing at what the correct answers
are here, there's published information from standards bodies about this
sort of thing.

> Also, ISTM that a checksum algorithm does not really need to be
> cryptographically strong, which means that cheaper alternatives are ok,
> although good quality should be sought nevertheless.

Right, if we aren't doing encryption then we just need to focus on what
is needed for the amount of error detection that we want and we can go
look at how much space we need when we're talking about 8K or so worth
of data.  When we *are* doing encryption, what's interesting is the tag
length and that's a different thing which has its own published
information from standards bodies about and we should be looking at
that.  While the general "need X amount of space on the page to store
the hash/authentication data" problem is the same, the answer to "how
much space is needed" will depend on which use case the user requested
(well ... probably anyway, maybe we'll get lucky and find that there's a
reasonable answer to both which fits in the same amount of space and
could possibly leverage that, but let's not try to force that to happen
as we'll surely get called out if we go against the guidance from the
standards bodies who study this stuff).

Thanks,

Stephen


Re: better page-level checksums

From
Robert Haas
Date:
On Fri, Jun 10, 2022 at 12:08 PM Stephen Frost <sfrost@snowman.net> wrote:
> So, it's not quite as simple as use X or use Y, we need to be
> considering the use case too.  In particular, the amount of data that's
> being hash'd is relevant when it comes to making a decision about what
> hash or checksum to use.  When you're talking about (potentially) 1G
> segment files, you'll want to use something different (like SHA) vs.
> when you're talking about an 8K block (not that you couldn't use SHA,
> but it may very well be overkill for it).

Interesting. I expected you to be cheerleading for SHA like a madman.

> In terms of TDE, that's yet a different use-case and you'd want to use
> AE (authenticated encryption) + AAD (additional authenticated data) and
> the result of that operation is a block which has some amount of
> unencrypted data (eg: LSN, potentially used as the IV), some amount of
> encrypted data (eg: everything else), and then space to store the tag
> (which can be thought of as, but is *distinct* from, a hash of the
> encrypted data + the additional unencrypted data, where the latter would
> include the unencrypted data on the block, like the LSN, plus other
> information that we want to include like the qualified path+filename of
> the file as relevant to the PGDATA root).  If our goal is
> cryptographically authenticated and encrypted data pages (which I believe
> is at least one of our goals) then we're talking about encryption
> methods like AES GCM which handle production of the tag for us and with
> that tag we would *not* need to have any independent hash or checksum for
> the block (though we could, but that should really be included in the
> *encrypted* section, as hashing unencrypted data and then storing that
> hash unencrypted could potentially leak information that we'd rather
> not).

Yeah, and I feel there was discussion of how much space AES-GCM-SIV
would need per page and I can't find that discussion now. Pretty sure
it was a pretty meaty number of bytes, and I assume it's also not that
cheap to compute.

> Note that NIST has put out information regarding how big a tag is
> appropriate for how much data is being encrypted with a given
> authenticated encryption method such as AES GCM.  I recall Robert
> finding similar information for hashing/checksumming of unencrypted
> data from a similar source and that'd make sense to consider when
> talking about *just* adding a hash/checksum for unencrypted data blocks.
>
> This is the relevant discussion from NIST on this subject:
>
> https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-38d.pdf
>
> Note particularly Appendix C: Requirements and Guidelines for Using
> Short Tags (though, really, the whole thing is good to read..).

I don't see that as very relevant. That's talking about using 32-bit
or 64-bit tags for things like VOIP packets where a single compromised
packet wouldn't reveal a whole lot. I think we have to take the view
that a single compromised disk block is a big deal.

> In the thread about checksum/hashes for the backup manifest, I was
> pretty sure you found some information regarding the amount of data
> being hashed vs. the size you want the hash/checksum to be and that
> seems like it'd be particularly relevant for this discussion (as it was
> for backups, at least as I recall..).  Hopefully we can go find that.

I went back and looked and found that I had written this:

https://www.postgresql.org/message-id/CA+TgmoYOKC_8o-AR1jTQs0mOrFx=_Rcy5udor1m-LjyJNiSWPQ@mail.gmail.com

I think that gets us a whole lot of nowhere, honestly. I think this
email from Andres is more on point:

http://postgr.es/m/20200327195624.xthhd4xuwabvd3ou@alap3.anarazel.de

I don't really understand all the details of the smhasher pages to
which he links, but I understand that they're measuring the quality of
the bit-mixing, which does matter. So does speed, because people like
their database to be fast even if it's using checksums (or TDE if we
had that). And I think the output size is also a relevant
consideration, because more output bits means both more chance of
detecting errors (assuming good bit-mixing, at least) and also more
wasted space that isn't being used to store your actual data.

I haven't really seen anything that makes me believe that there's a
particularly strong relationship between block size and ideal checksum
size. There's some relationship, for sure. For instance, you want the
checksum to be small relative to the size of the input, so as not to
waste a whole bunch of storage space. I wouldn't propose to hash 32
byte blocks with SHA-256, because my checksums would be as big as the
original data. But I don't really think such considerations are
relevant here. An 8kB block is big enough that any checksum algorithm
in common use today is going to produce output that is well under 1%
of the page size, so you're not going to be wasting tons of storage.

You might be wasting your time, though. One big knock on the
Fletcher-16 approach we're using today is that the chance of an
accidental hash collision is noticeably more than 0. Generally, I think
we're right to think that's acceptable, because your chances of
noticing even a single corrupted block are very high. However, if
you're operating tens or hundreds of thousands of PostgreSQL clusters
containing terabytes or petabytes of data, it's quite likely that
there will be instances of corruption which you fail to detect because
the checksum collided. Maybe you care about that. If you do, you
probably need at least a 64-bit checksum before the risk of missing
even a single instance of corruption due to a checksum collision
becomes negligible. Maybe even slightly larger if the amount of data
you're managing is truly vast.
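
Back-of-the-envelope, assuming the checksum mixes well: a corrupted
block keeps a matching n-bit checksum with probability about 2^-n, so:

    16-bit: 2^-16 ~ 1.5e-5   (one silent miss per ~65,000 corrupted blocks)
    32-bit: 2^-32 ~ 2.3e-10
    64-bit: 2^-64 ~ 5.4e-20  (negligible even at fleet scale)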

So I think there's probably a good argument that if you're just
concerned about detecting corruption due to bugs, operator error,
hardware failure, etc., something like a 512-bit checksum is overkill
if the only purpose is to detect random bit flips. I think the time
when you need more bits is when you have some goal beyond being really
likely to detect a random error - e.g. if you want 100% guaranteed
detection of every single-bit error, or if you want error correction,
or if you want to foil an adversary who is trying to construct
checksums for maliciously modified blocks. But that is also true for
the backup manifest case, and we allowed SHA-n as an option there. I
feel like there are bound to be some people who want something like
SHA-n just because it sounds good, regardless of whether they really
need it.

We can tell them "no," though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Aleksander Alekseev
Date:
Hi hackers,

> > Can't we add some extra fork that stores this extra per-page
> > information, and contains this extra metadata
> >
> +1 for this approach. I had observed some painful corruption cases where
> block storage simply returned a stale version of a range of blocks. This
> is only possible because the checksum is stored on the page itself.

That's very interesting, Andrey. Thanks for sharing.

> One of my questions is what algorithm(s) we'd want to support.

Should it necessarily be a fixed list? Why not support pluggable algorithms?

An extension implementing a checksum algorithm is going to need:

- several hooks: check_page_after_reading, calc_checksum_before_writing
- register_checksum()/deregister_checksum()
- an API to save the checksums to a separate fork

By knowing the block number and the hash size the extension knows
exactly where to look for the checksum in the fork.
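
A rough sketch of that offset math, assuming a hypothetical fork page
that holds nothing but fixed-size checksums (names invented):

    #define CKSUMS_PER_PAGE \
        ((BLCKSZ - SizeOfPageHeaderData) / CHECKSUM_SIZE)

    BlockNumber fork_blkno = blkno / CKSUMS_PER_PAGE;
    Size        offset     = SizeOfPageHeaderData +
                             (blkno % CKSUMS_PER_PAGE) * CHECKSUM_SIZE;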

-- 
Best regards,
Aleksander Alekseev



Re: better page-level checksums

From
Robert Haas
Date:
On Mon, Jun 13, 2022 at 9:23 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> Should it necessarily be a fixed list? Why not support pluggable algorithms?
>
> An extension implementing a checksum algorithm is going to need:
>
> - several hooks: check_page_after_reading, calc_checksum_before_writing
> - register_checksum()/deregister_checksum()
> - an API to save the checksums to a separate fork
>
> By knowing the block number and the hash size the extension knows
> exactly where to look for the checksum in the fork.

I don't think that a separate fork is a good option for reasons that I
articulated previously: I think it will be significantly more complex
to implement and add extra I/O.

I am not completely opposed to the idea of making the algorithm
pluggable but I'm not very excited about it either. Making the
algorithm pluggable probably wouldn't be super-hard, but allowing a
checksum of arbitrary size rather than one of a short list of fixed
sizes might complicate efforts to ensure this doesn't degrade
performance. And I'm not sure what the benefit is, either. This isn't
like archive modules or custom backup targets where the feature
proposes to interact with things outside the server and we don't know
what's happening on the other side and so need to offer an interface
that can accommodate what the user wants to do. Nor is it like a
custom background worker or a custom data type which lives fully
inside the database but the desired behavior could be anything. It's
not even like column compression where I think that the same small set
of strategies are probably fine for everybody but some people think
that customizing the behavior by datatype would be a good idea. All
it's doing is taking a fixed size block of data and checksumming it. I
don't see that as being something where there's a lot of interesting
things to experiment with from an extension point of view.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Aleksander Alekseev
Date:
Hi Robert,

> I don't think that a separate fork is a good option for reasons that I
> articulated previously: I think it will be significantly more complex
> to implement and add extra I/O.
>
> I am not completely opposed to the idea of making the algorithm
> pluggable but I'm not very excited about it either. Making the
> algorithm pluggable probably wouldn't be super-hard, but allowing a
> checksum of arbitrary size rather than one of a short list of fixed
> sizes might complicate efforts to ensure this doesn't degrade
> performance. And I'm not sure what the benefit is, either. This isn't
> like archive modules or custom backup targets where the feature
> proposes to interact with things outside the server and we don't know
> what's happening on the other side and so need to offer an interface
> that can accommodate what the user wants to do. Nor is it like a
> custom background worker or a custom data type which lives fully
> inside the database but the desired behavior could be anything. It's
> not even like column compression where I think that the same small set
> of strategies are probably fine for everybody but some people think
> that customizing the behavior by datatype would be a good idea. All
> it's doing is taking a fixed size block of data and checksumming it. I
> don't see that as being something where there's a lot of interesting
> things to experiment with from an extension point of view.

I see your point. Makes sense.

So, to clarify, what we are trying to achieve here is to reduce the
probability of an event when a page gets corrupted but the checksum is
accidentally the same as it was before the corruption, correct? And we
also assume that neither the file system nor the hardware caught this
corruption.

If that's the case I would say that using something like SHA256 would
be overkill, not only because of the consumed disk space but also
because SHA256 is expensive. Allowing the user to choose from 16-bit,
32-bit and maybe 64-bit checksums should be enough. I would also
suggest that no matter how we do it, if the user chooses 16-bit
checksums the performance and the disk consumption should remain as
they currently are.

Regarding the particular choice of a hash function I would suggest the
MurmurHash family [1]. This is basically the industry standard (it's
good, it's fast, and relatively simple), and we already have
murmurhash32() in the core. We also have hash_bytes_extended() to get
64-bit checksums, but I have no strong opinion on whether this
particular hash function should be used for pages or not. I believe
some benchmarking is appropriate.
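
One caveat: the in-core murmurhash32() is a finalizer for a single
uint32, so checksumming a whole page would presumably go through
hash_bytes_extended() instead, roughly:

    /* signature roughly as in src/include/common/hashfn.h */
    extern uint64 hash_bytes_extended(const unsigned char *k,
                                      int keylen, uint64 seed);

    /* hypothetical 64-bit page checksum; checksum field zeroed first */
    uint64 cksum = hash_bytes_extended((const unsigned char *) page,
                                       BLCKSZ, (uint64) blkno);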

There is also a 128-bit version of MurmurHash. Personally I doubt it
would be of value in practice, but it would not hurt to support it
either, while we are at it. (Probably not in the MVP, though). And if
we are going to choose this path, I see no reason not to support
SHA256 as well, for the paranoid users.

[1]: https://en.wikipedia.org/wiki/MurmurHash

-- 
Best regards,
Aleksander Alekseev



Re: better page-level checksums

From
Robert Haas
Date:
On Mon, Jun 13, 2022 at 12:59 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> So, to clarify, what we are trying to achieve here is to reduce the
> probability of an event when a page gets corrupted but the checksum is
> accidentally the same as it was before the corruption, correct? And we
> also assume that neither the file system nor the hardware caught this
> corruption.

Yeah, I think so, although it also depends on what the filesystem and
the hardware would do if they did catch the corruption. If they would
have made our read() or write() operation fail, then any checksum
feature at the PostgreSQL level is superfluous. If they would have
noticed the operation but not caused a failure, and say just logged
something in the system log, a PostgreSQL check could still be useful,
because the PostgreSQL user might not be looking at the system log,
but will definitely notice if they get an ERROR rather than a query
result from PostgreSQL. And if the lower-level systems wouldn't have
caught the failure at all, then checksums are useful in that case,
too.

> If that's the case I would say that using something like SHA256 would
> be an overkill, not only because of the consumed disk space but also
> because SHA256 is expensive. Allowing the user to choose from 16-bit,
> 32-bit and maybe 64-bit checksums should be enough. I would also
> suggest that no matter how we do it, if the user chooses 16-bit
> checksums the performance and the disk consumption should remain as
> they currently are.

If the user wants 16-bit checksums, the feature we've already got
seems good enough -- and, as you say, it doesn't use any extra disk
space. This proposal is just about making people happy if they want a
bigger checksum.

On the topic of which algorithm to use, I'd be inclined to think that
it is going to be more useful to offer checksums that are 64 bits or
more, since IMHO 32 is not all that much more than 16, and I still
think there are going to be alignment issues. Beyond that I don't have
anything against your specific suggestions, but I'd like to hear what
other people think.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Matthias van de Meent
Date:
On Fri, 10 Jun 2022 at 15:58, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 9, 2022 at 8:00 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Why so? We already dole out per-page space in 4-byte increments
> > through pd_linp, and I see no reason why we can't reserve some line
> > pointers for per-page metadata if we decide that we need extra
> > per-page ~overhead~ metadata.
>
> Hmm, that's an interesting approach. I was thinking that putting data
> after the PageHeaderData struct would be a non-starter because the
> code that looks up a line pointer by index is currently just
> multiply-and-add and complicating it seems bad for performance.
> However, if we treated the space there as overlapping the line pointer
> array and making some line pointers unusable rather than something
> inserted prior to the line pointer array, we could avoid that. I still
> think it would be kind of complicated, though, because we'd have to
> find every bit of code that loops over the line pointer array or
> accesses it by index and make sure that it doesn't try to access the
> low-numbered line pointers.
>
> > Isn't the goal of a checksum to find - and where possible, correct -
> > bit flips and other broken pages? I would suggest not to use
> > cryptographic hash functions for that, as those are rarely
> > error-correcting.
>
> I wasn't thinking of trying to do error correction, just error
> detection. See also my earlier reply to Peter Geoghegan.

The use of CRC in our current page format implies that we can correct
(some) bit errors, which is why I presumed that that was a goal of
page checksums. I stand corrected.

> > Isn't that expected for most of those places? With the current
> > bufpage.h description of Page, it seems obvious that all bytes on a
> > page except those in the "hole" and those in the page header are under
> > full control of the AM. Of course AMs will pre-calculate limits and
> > offsets during compilation, that saves recalculation cycles and/or
> > cache lines with constants to keep in L1.
>
> Yep.
>
> > Can't we add some extra fork that stores this extra per-page
> > information, and contains this extra metadata in a double-buffered
> > format, so that both before the actual page is written the metadata
> > too is written to disk, while the old metadata is available too for
> > recovery purposes. This allows us to maintain the current format with
> > its low per-page overhead, and only have extra overhead (up to 2x
> > writes for each page, but the writes for these metadata pages need not
> > be BLCKSZ in size) for those that opt-in to the more computationally
> > expensive features of larger checksums, nonces, and/or other non-AM
> > per-page ~overhead~ metadata.
>
> It's not impossible, I'm sure, but it doesn't seem very appealing to
> me. Those extra reads and writes could be expensive, and there's no
> place to cleanly integrate them into the code structure. A function
> like PageIsVerified() -- which is where we currently validate
> checksums -- only gets the page. It can't go off and read some other
> page from disk to perform the checksum calculation.

It could be part of the buffer IO code to provide
PageIsVerifiedExtended with a pointer to the block metadata buffer.

> I'm not exactly sure what you have in mind when you say that the
> writes need not be BLCKSZ in size.

What I meant was that when the extra metadata is stored separately
from the block itself, it could be written directly to the file offset
instead of having to track BLCKSZ data for N blocks, so the
metadata-write would be << BLCKSZ in length, while the block itself
would still be the normal BLCKSZ write.

> Technically I guess that's true,
> but then the checksums have to be crash safe, or they're not much
> good. If they're not part of the page, how do they get updated in a
> way that makes them crash safe? I guess it could be done: every time
> we write a FPW, enlarge the page image by the number of bytes that are
> stored in this location. When replaying an FPW, update those bytes
> too. And every time we read or write a page, also read or write those
> bytes. In essence, we'd be deciding that pages are 8192+n bytes, but
> the last n bytes are stored in a different file - and, in memory, a
> different buffer pool. I think that would be hugely invasive and
> unpleasant to make work and I think the performance would be poor,
> too.

I agree that this wouldn't be as performant from a R/W perspective as
keeping that metadata inside the block. But on the other hand, that is
only for block R/W operations, and not for in-memory block
manipulations.

> > I'd prefer if we didn't change the way pages are presented to AMs.
> > Currently, it is clear what area is available to you if you write an
> > AM that uses the bufpage APIs. Changing the page format to have the
> > buffer manager also touch / reserve space in the special areas seems
> > like a break of abstraction: Quoting from bufpage.h:
> >
> >  * AM-generic per-page information is kept in PageHeaderData.
> >  *
> >  * AM-specific per-page data (if any) is kept in the area marked "special
> >  * space"; each AM has an "opaque" structure defined somewhere that is
> >  * stored as the page trailer.  an access method should always
> >  * initialize its pages with PageInit and then set its own opaque
> >  * fields.
> >
> > I'd rather we keep this contract: am-generic stuff belongs in
> > PageHeaderData, with the rest of the page fully available for the AM
> > to use (including the special area).
>
> I don't think that changing the contract has to mean that it becomes
> unclear what the contract is. And you can't improve any system without
> changing some stuff. But you certainly don't have to like my ideas or
> anything....

It's not that I disagree with (or dislike the idea of) increasing the
resilience of checksums, I just want to be very careful that we don't
trade (potentially significant) runtime performance for features
people might not use. This thread seems very related to the 'storing
an explicit nonce'-thread, which also wants to reclaim space from a
page that is currently used by AMs, while AMs would lose access to
certain information on pages and certain optimizations that they could
do before. I'm very hesitant to let just any modification to the page
format go through because someone needs extra metadata attached to a
page.

That reminds me, there's one more item to be put on the compatibility
checklist: Currently, the FSM code assumes it can use all space on a
page (except the page header) for its total of 3 levels of FSM data.
Mixing page formats would break how it currently works, as changing
the space that is available on a page will change the fanout level of
each leaf in the tree, which our current code can't handle. To change
the page format of one page in the FSM would thus either require a
rewrite of the whole FSM fork, or extra metadata attached to the
relation that details where the format changes. A similar issue exists
with the VM fork.

That being said, I think that it could be possible to reuse
pd_checksum as an extra area indicator between pd_upper and
pd_special, so that we'd get [pageheader][pd_linp...] pd_lower [hole]
pd_upper [datas] pd_storage_ext [blackbox] pd_special [special area].
This should require limited rework in current AMs, especially if we
provide a global MAX_STORAGE_EXT_SIZE that AMs can use to get some
upper limit on how much overhead the storage uses per page (see the
sketch below).
Alternatively, we could claim some space on a page using a special
line pointer at the start of the page referring to storage data, while
having the same limitation on size.
One last option is we recognise that there are two storage locations
of pages that have different data requirements -- on-disk that
requires checksums, and in-memory that requires LSNs. Currently, those
fields are both stored on the page in distinct fields, but we could
(_could_) update the code to drop LSN when we store the page, and drop
the checksum when we load the page (at the cost of redo speed when
recovering from an unclean shutdown). That would provide an extra 64
bits on the page without breaking storage, assuming AMs don't already
misuse pd_lsn.
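
To make the first option above concrete, a hypothetical accessor pair,
if pd_checksum were repurposed as the offset of the storage blob (names
invented):

    /* storage blob runs from pd_checksum up to pd_special */
    #define PageGetStorageExt(page) \
        ((char *) (page) + ((PageHeader) (page))->pd_checksum)
    #define PageGetStorageExtSize(page) \
        (((PageHeader) (page))->pd_special - \
         ((PageHeader) (page))->pd_checksum)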

- Matthias



Re: better page-level checksums

From
Peter Geoghegan
Date:
On Fri, Jun 10, 2022 at 6:16 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > My preference is for an approach that builds on that, or at least
> > doesn't significantly complicate it. So a cryptographic hash or nonce
> > can go in the special area proper (structs like BTPageOpaqueData don't
> > need any changes), but at a page offset before the special area proper
> > -- not after.
> >
> > What disadvantages does that approach have, if any, from your point of view?
>
> I think it would be an extremely good idea to store the extended
> checksum at the same offset in every page. Right now, code that wants
> to compute checksums, or a tool like pg_checksums that wants to verify
> them, can find the checksum without needing to interpret any of the
> remaining page contents. Things get sticky if you have to interpret
> the page contents to locate the checksum that's going to tell you
> whether the page contents are messed up. Perhaps this could be worked
> around if you tried hard enough, but I don't see what we get out of
> it.

Is that how the block-level encryption feature from EDB Advanced Server does it?

-- 
Peter Geoghegan



Re: better page-level checksums

From
Bruce Momjian
Date:
On Mon, Jun 13, 2022 at 02:44:41PM -0700, Peter Geoghegan wrote:
> On Fri, Jun 10, 2022 at 6:16 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > My preference is for an approach that builds on that, or at least
> > > doesn't significantly complicate it. So a cryptographic hash or nonce
> > > can go in the special area proper (structs like BTPageOpaqueData don't
> > > need any changes), but at a page offset before the special area proper
> > > -- not after.
> > >
> > > What disadvantages does that approach have, if any, from your point of view?
> >
> > I think it would be an extremely good idea to store the extended
> > checksum at the same offset in every page. Right now, code that wants
> > to compute checksums, or a tool like pg_checksums that wants to verify
> > them, can find the checksum without needing to interpret any of the
> > remaining page contents. Things get sticky if you have to interpret
> > the page contents to locate the checksum that's going to tell you
> > whether the page contents are messed up. Perhaps this could be worked
> > around if you tried hard enough, but I don't see what we get out of
> > it.
> 
> Is that how the block-level encryption feature from EDB Advanced Server does it?

Uh, EDB Advanced Server doesn't have a block-level encryption feature.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




Re: better page-level checksums

From
Peter Geoghegan
Date:
On Mon, Jun 13, 2022 at 2:54 PM Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Jun 13, 2022 at 02:44:41PM -0700, Peter Geoghegan wrote:
> > Is that how the block-level encryption feature from EDB Advanced Server does it?
>
> Uh, EDB Advanced Server doesn't have a block-level encryption feature.

Apparently there is something called "Vormetric Transparent Encryption
(VTE) – Transparent block-level encryption with access controls":

https://www.enterprisedb.com/blog/enhanced-security-edb-postgres-advanced-server-vormetric-data-security-platform

Perhaps there is some kind of confusion around the terminology here?

--
Peter Geoghegan



Re: better page-level checksums

From
Bruce Momjian
Date:
On Mon, Jun 13, 2022 at 03:03:17PM -0700, Peter Geoghegan wrote:
> On Mon, Jun 13, 2022 at 2:54 PM Bruce Momjian <bruce@momjian.us> wrote:
> > On Mon, Jun 13, 2022 at 02:44:41PM -0700, Peter Geoghegan wrote:
> > > Is that how the block-level encryption feature from EDB Advanced Server does it?
> >
> > Uh, EDB Advanced Server doesn't have a block-level encryption feature.
> 
> Apparently there is something called "Vormetric Transparent Encryption
> (VTE) – Transparent block-level encryption with access controls":
> 
> https://www.enterprisedb.com/blog/enhanced-security-edb-postgres-advanced-server-vormetric-data-security-platform
> 
> Perhaps there is some kind of confusion around the terminology here?

That is encryption done in a virtual file system independent of
Postgres.  So, I guess the answer to your question is that this is not
how EDB Advanced Server does it.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




Re: better page-level checksums

From
Peter Geoghegan
Date:
On Mon, Jun 13, 2022 at 3:06 PM Bruce Momjian <bruce@momjian.us> wrote:
> That is encryption done in a virtual file system independent of
> Postgres.  So, I guess the answer to your question is that this is not
> how EDB Advanced Server does it.

Okay, thanks for clearing that up. The term "block based" does appear
in the article I linked to, so you can see why I didn't understand it
that way initially.

Anyway, I can see how it would be useful to be able to know the offset
of a nonce or of a hash digest on any given page, without access to a
running server. But why shouldn't that be possible with other designs,
including designs closer to what I've outlined?

A known fixed offset in the special area already assumes that all
pages must have a value in the first place, even though that won't be
true for the majority of individual Postgres servers. There is
implicit information involved in a design like the one Robert has
proposed; your backup tool (or whatever) already has to understand to
expect something other than no encryption at all, or no checksum at
all. Tools like pg_filedump already rely on implicit information about
the special area.

I'm not against the idea of picking a handful of checksum/encryption
schemes, with the understanding that we'll be committing to those
particular schemes indefinitely -- it's not reasonable to expect
infinite flexibility here (and so I don't). But why should we accept
something that seems to me to be totally inflexible, and doesn't
compose with other things?

-- 
Peter Geoghegan



Re: better page-level checksums

From
Robert Haas
Date:
On Mon, Jun 13, 2022 at 5:14 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> It's not that I disagree with (or dislike the idea of) increasing the
> resilience of checksums, I just want to be very careful that we don't
> trade (potentially significant) runtime performance for features
> people might not use. This thread seems very related to the 'storing
> an explicit nonce'-thread, which also wants to reclaim space from a
> page that is currently used by AMs, while AMs would lose access to
> certain information on pages and certain optimizations that they could
> do before. I'm very hesitant to let just any modification to the page
> format go through because someone needs extra metadata attached to a
> page.

Right. So, to be clear, I think there is an opportunity to store ONE
extra blob of data in the page. It might be an extended checksum, or
it might be a nonce for cryptographic authentication, but it can't be
both. I think this is OK, because in earlier discussions of TDE, it
seems that if you're using encryption and also want to verify page
integrity, you'll use an encryption system that produces some kind of
verifier, and you'll store that into this space in the page instead of
using an enhanced-checksum feature.

In other words, I'm imagining creating a space at the end of each page
for some sort of enhanced security or data integrity feature, and you
can either choose not to use one (in which case things work as they do
today), or you can choose an extended checksums feature, or maybe in
the future you can choose some form of TDE that involves storing a
nonce or a page verifier in the page. But you just get one.

Now, the logical question to ask is: well, if there's only one
opportunity to store an extra blob of data on every page, is this the
best way to use it? What if someone comes along with another feature
that also wants to store a blob of data on every page, and they can't
do it because this proposal got there first? My answer is: well, if
that additional feature is something that provides encryption or
tamper-resistance or data integrity or security in any form, then it
can just be added as a new option for how you use this blob of space,
and users who prefer the new thing to the existing options can pick
it. If it's something else, then .... what is it, exactly? It seems to
me that the kinds of things that require space in *every* page of the
cluster are really the things that fall into this category.

For example, Stephen mused earlier that maybe while we're at it we
could find a way to include an XID epoch in every page. Maybe so, but
we wouldn't actually want that in *every* page. We would only want it
in the heap pages. And as far as I can see that's pretty generally how
things go. There are plenty of projects that might want extra space in
each page *for a certain AM* and I don't see any reason why what I
propose to do here would rule that out. I think this and that could
both be done, and doing this might even make doing that easier by
putting in place some useful infrastructure. What I don't think we can
get away with is having multiple systems that are each taking a bite
out of every page for every AM -- but I think that's OK, because I
don't think there's a lot of need for multiple such systems.

> That reminds me, there's one more item to be put on the compatibility
> checklist: Currently, the FSM code assumes it can use all space on a
> page (except the page header) for its total of 3 levels of FSM data.
> Mixing page formats would break how it currently works, as changing
> the space that is available on a page will change the fanout level of
> each leaf in the tree, which our current code can't handle. To change
> the page format of one page in the FSM would thus either require a
> rewrite of the whole FSM fork, or extra metadata attached to the
> relation that details where the format changes. A similar issue exists
> with the VM fork.

I agree with all of this except I think that "mixing page formats" is
a thing we can't do.

> That being said, I think that it could be possible to reuse
> pd_checksum as an extra area indicator between pd_upper and
> pd_special, so that we'd get [pageheader][pd_linp...] pd_lower [hole]
> pd_upper [datas] pd_storage_ext [blackbox] pd_special [special area].
> This should require limited rework in current AMs, especially if we
> provide a global MAX_STORAGE_EXT_SIZE that AMs can use to get some
> upper limit on how much overhead the storage uses per page.

This is an interesting alternative. It's unclear to me that it makes
anything better if the [blackbox] area is before the special area vs.
afterward. And either way, if that area is fixed-size across the
cluster, you don't really need to use pd_checksum to find it, because
you can just know where it is. A possible advantage of this approach
is that it might make it simpler to cope with a scenario where some
pages in the cluster have this blackbox space and others don't. I
wasn't really thinking that on-line page format conversions were
likely to be practical, but certainly the chances are better if we've
got an explicit pointer to the extra space vs. just knowing where it
has to be.

> Alternatively, we could claim some space on a page using a special
> line pointer at the start of the page referring to storage data, while
> having the same limitation on size.

That sounds messy.

> One last option is we recognise that there are two storage locations
> of pages that have different data requirements -- on-disk that
> requires checksums, and in-memory that requires LSNs. Currently, those
> fields are both stored on the page in distinct fields, but we could
> (_could_) update the code to drop LSN when we store the page, and drop
> the checksum when we load the page (at the cost of redo speed when
> recovering from an unclean shutdown). That would provide an extra 64
> bits on the page without breaking storage, assuming AMs don't already
> misuse pd_lsn.

It seems wrong to me to say that we don't need the LSN for a page
stored on disk. Recovery relies on it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Matthias van de Meent
Date:
On Tue, 14 Jun 2022 at 14:56, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jun 13, 2022 at 5:14 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > It's not that I disagree with (or dislike the idea of) increasing the
> > resilience of checksums, I just want to be very careful that we don't
> > trade (potentially significant) runtime performance for features
> > people might not use. This thread seems very related to the 'storing
> > an explicit nonce'-thread, which also wants to reclaim space from a
> > page that is currently used by AMs, while AMs would lose access to
> > certain information on pages and certain optimizations that they could
> > do before. I'm very hesitant to let just any modification to the page
> > format go through because someone needs extra metadata attached to a
> > page.
>
> Right. So, to be clear, I think there is an opportunity to store ONE
> extra blob of data in the page. It might be an extended checksum, or
> it might be a nonce for cryptographic authentication, but it can't be
> both. I think this is OK, because in earlier discussions of TDE, it
> seems that if you're using encryption and also want to verify page
> integrity, you'll use an encryption system that produces some kind of
> verifier, and you'll store that into this space in the page instead of
> using an enhanced-checksum feature.

Agreed.

> In other words, I'm imagining creating a space at the end of each page
> for some sort of enhanced security or data integrity feature, and you
> can either choose not to use one (in which case things work as they do
> today), or you can choose an extended checksums feature, or maybe in
> the future you can choose some form of TDE that involves storing a
> nonce or a page verifier in the page. But you just get one.
>
> Now, the logical question to ask is: well, if there's only one
> opportunity to store an extra blob of data on every page, is this the
> best way to use it? What if someone comes along with another feature
> that also wants to store a blob of data on every page, and they can't
> do it because this proposal got there first? My answer is: well, if
> that additional feature is something that provides encryption or
> tamper-resistance or data integrity or security in any form, then it
> can just be added as a new option for how you use this blob of space,
> and users who prefer the new thing to the existing options can pick
> it. If it's something else, then .... what is it, exactly? It seems to
> me that the kinds of things that require space in *every* page of the
> cluster are really the things that fall into this category.
>
> For example, Stephen mused earlier that maybe while we're at it we
> could find a way to include an XID epoch in every page. Maybe so, but
> we wouldn't actually want that in *every* page. We would only want it
> in the heap pages. And as far as I can see that's pretty generally how
> things go. There are plenty of projects that might want extra space in
> each page *for a certain AM* and I don't see any reason why what I
> propose to do here would rule that out. I think this and that could
> both be done, and doing this might even make doing that easier by
> putting in place some useful infrastructure. What I don't think we can
> get away with is having multiple systems that are each taking a bite
> out of every page for every AM -- but I think that's OK, because I
> don't think there's a lot of need for multiple such systems.

I agree with the premise of one only needing one such blob on the
page, yet I don't think that putting it on the exact end of the page
is the best option.

PageGetSpecialPointer is much simpler when you can rely on the
location of the special area. As special areas can be accessed N times
each time a buffer is loaded from disk, and yet the 'storage system
extra blob' only twice (once read, once write), I think the special
area should have priority when handing out page space.
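
For reference, today's accessor is constant-offset pointer arithmetic,
roughly (bufpage.h, assertion checks omitted):

    #define PageGetSpecialPointer(page) \
        ((char *) ((char *) (page) + ((PageHeader) (page))->pd_special))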

> > That reminds me, there's one more item to be put on the compatibility
> > checklist: Currently, the FSM code assumes it can use all space on a
> > page (except the page header) for its total of 3 levels of FSM data.
> > Mixing page formats would break how it currently works, as changing
> > the space that is available on a page will change the fanout level of
> > each leaf in the tree, which our current code can't handle. To change
> > the page format of one page in the FSM would thus either require a
> > rewrite of the whole FSM fork, or extra metadata attached to the
> > relation that details where the format changes. A similar issue exists
> > with the VM fork.
>
> I agree with all of this except I think that "mixing page formats" is
> a thing we can't do.

I'm not sure it's impossible, but I would indeed agree it would not be
a trivial issue to solve.

> > That being said, I think that it could be possible to reuse
> > pd_checksum as an extra area indicator between pd_upper and
> > pd_special, so that we'd get [pageheader][pd_linp...] pd_lower [hole]
> > pd_upper [datas] pd_storage_ext [blackbox] pd_special [special area].
> > This should require limited rework in current AMs, especially if we
> > provide a global MAX_STORAGE_EXT_SIZE that AMs can use to get some
> > upper limit on how much overhead the storage uses per page.
>
> This is an interesting alternative. It's unclear to me that it makes
> anything better if the [blackbox] area is before the special area vs.
> afterward.

The main benefit of this order is that an AM will see its special
area at a fixed location if it always uses a fixed-size Opaque struct,
i.e. that an AM may still use (Page + BLCKSZ - sizeof(IndexOpaque)) as
seen in [0]. There might be little to gain, but alternatively there's
also little to lose for the storage system -- page read/write to the
FS happens at most once for each time the page is accessed/written to.
I'd thus much rather let the IO subsystem pay this cost than the AM:
offloading it to the AM would impose a constant overhead on all
in-memory operations, while charging it to the IO path means it is only
felt once per swapped block, on average.

The best point for this layout is that this lets us determine what the
data on each page is for without requiring access to shmem variables.
Appending or prepending storage-special areas to the pd_special area
would confuse AMs about what data is theirs on the page -- making it
explicit in the page format would remove this potential for
confustion, while allowing this storage-blob area to be dynamically
sized.

> And either way, if that area is fixed-size across the
> cluster, you don't really need to use pd_checksum to find it, because
> you can just know where it is. A possible advantage of this approach
> is that it might make it simpler to cope with a scenario where some
> pages in the cluster have this blackbox space and others don't. I
> wasn't really thinking that on-line page format conversions were
> likely to be practical, but certainly the chances are better if we've
> got an explicit pointer to the extra space vs. just knowing where it
> has to be.
>
> > Alternatively, we could claim some space on a page using a special
> > line pointer at the start of the page referring to storage data, while
> > having the same limitation on size.
>
> That sounds messy.

Yep. It isn't my first choice either, but it is something that I did
consider - it has the potentially desirable effect of the AM being
able to relocate this blob.

> > One last option is we recognise that there are two storage locations
> > of pages that have different data requirements -- on-disk that
> > requires checksums, and in-memory that requires LSNs. Currently, those
> > fields are both stored on the page in distinct fields, but we could
> > (_could_) update the code to drop LSN when we store the page, and drop
> > the checksum when we load the page (at the cost of redo speed when
> > recovering from an unclean shutdown). That would provide an extra 64
> > bits on the page without breaking storage, assuming AMs don't already
> > misuse pd_lsn.
>
> It seems wrong to me to say that we don't need the LSN for a page
> stored on disk. Recovery relies on it.

It's not critical for recovery, "just" very useful; but indeed this
too isn't great.

- Matthias

[0] https://commitfest.postgresql.org/38/3543
[1] https://www.postgresql.org/message-id/CA+TgmoaD8wMN6i1mmuo+4ZNeGE3Hd57ys8uV8UZm7cneqy3W2g@mail.gmail.com



Re: better page-level checksums

From
Robert Haas
Date:
On Mon, Jun 13, 2022 at 6:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Anyway, I can see how it would be useful to be able to know the offset
> of a nonce or of a hash digest on any given page, without access to a
> running server. But why shouldn't that be possible with other designs,
> including designs closer to what I've outlined?

I don't know what you mean by this. As far as I'm aware, the only
design you've outlined is one where the space wasn't at the same
offset on every page.

> A known fixed offset in the special area already assumes that all
> pages must have a value in the first place, even though that won't be
> true for the majority of individual Postgres servers. There is
> implicit information involved in a design like the one Robert has
> proposed; your backup tool (or whatever) already has to understand to
> expect something other than no encryption at all, or no checksum at
> all. Tools like pg_filedump already rely on implicit information about
> the special area.

In general, I was imagining that you'd need to look at the control
file to understand how much space had been reserved per page in this
particular cluster. I agree that's a bit awkward, especially for
pg_filedump. However, pg_filedump and I think also some code internal
to PostgreSQL try to figure out what kind of page we've got by looking
at the *size* of the special space. It's only good luck that we
haven't had a collision there yet, and continuing to rely on that
seems like a dead end. Perhaps we should start including a per-AM
magic number at the beginning of the special space.

> I'm not against the idea of picking a handful of checksum/encryption
> schemes, with the understanding that we'll be committing to those
> particular schemes indefinitely -- it's not reasonable to expect
> infinite flexibility here (and so I don't). But why should we accept
> something that seems to me to be totally inflexible, and doesn't
> compose with other things?

We shouldn't accept something that's totally inflexible, but I don't
know why this seems that way to you.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Peter Geoghegan
Date:
On Tue, Jun 14, 2022 at 8:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 13, 2022 at 6:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Anyway, I can see how it would be useful to be able to know the offset
> > of a nonce or of a hash digest on any given page, without access to a
> > running server. But why shouldn't that be possible with other designs,
> > including designs closer to what I've outlined?
>
> I don't know what you mean by this. As far as I'm aware, the only
> design you've outlined is one where the space wasn't at the same
> offset on every page.

I am skeptical of that particular aspect, yes. Though I would define
it the other way around (now the true special area struct isn't
necessarily at the same offset for a given AM, at least across data
directories).

My main concern is maintaining the ability to interpret much about the
contents of a page without context, and to not make it any harder to
grow the special area dynamically -- which is a broader concern.
Your patch isn't going to be the last one that wants to do something
with the special area. This needs to be carefully considered.

I see a huge amount of potential for adding new optimizations that use
subsidiary space on the page, presumably implemented via a special
area that can grow dynamically. For example, an ad-hoc compression
technique for heap pages that temporarily "absorbs" some extra
versions in the event of opportunistic pruning running and failing to
free enough space. Such a design would operate on similar principles
to deduplication in unique indexes, where the goal is to buy time
rather than buy space. When we fail to keep the contents of a heap
page together today, we often barely fail, so I expect something like
this to have an outsized impact on some workloads.

> In general, I was imagining that you'd need to look at the control
> file to understand how much space had been reserved per page in this
> particular cluster. I agree that's a bit awkward, especially for
> pg_filedump. However, pg_filedump and I think also some code internal
> to PostgreSQL try to figure out what kind of page we've got by looking
> at the *size* of the special space. It's only good luck that we
> haven't had a collision there yet, and continuing to rely on that
> seems like a dead end. Perhaps we should start including a per-AM
> magic number at the beginning of the special space.

It's true that that approach is just a hack -- we probably can do
better. I don't think that it's okay to break it, though. At least not
without providing a comparable alternative, that doesn't rely on
context from the control file.

--
Peter Geoghegan



Re: better page-level checksums

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Tue, Jun 14, 2022 at 8:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> However, pg_filedump and I think also some code internal
>> to PostgreSQL try to figure out what kind of page we've got by looking
>> at the *size* of the special space. It's only good luck that we
>> haven't had a collision there yet, and continuing to rely on that
>> seems like a dead end. Perhaps we should start including a per-AM
>> magic number at the beginning of the special space.

It's been some years since I had much to do with pg_filedump, but
my recollection is that the size of the special space is only one
part of its heuristics, because there already *are* collisions.
Moreover, there already are per-AM magic numbers in there that
it uses to resolve those cases.  They're not at the front though.
Nobody has ever wanted to break on-disk compatibility just to make
pg_filedump's page-type identification less klugy, so I find it
hard to believe that the above suggestion isn't a non-starter.
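
(If memory serves, the IDs in question are the last field of each AM's
opaque struct, e.g.

    hash:    hasho_page_id  == HASHO_PAGE_ID  (0xFF80)
    gist:    gist_page_id   == GIST_PAGE_ID   (0xFF81)
    spgist:  spgist_page_id == SPGIST_PAGE_ID (0xFF82)

and I believe pg_filedump consults those along with the special-space
size.)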

            regards, tom lane



Re: better page-level checksums

From
Robert Haas
Date:
On Tue, Jun 14, 2022 at 11:08 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I agree with the premise of one only needing one such blob on the
> page, yet I don't think that putting it on the exact end of the page
> is the best option.
>
> PageGetSpecialPointer is much simpler when you can rely on the
> location of the special area. As special areas can be accessed N times
> each time a buffer is loaded from disk, and yet the 'storage system
> extra blob' only twice (once read, once write), I think the special
> area should have priority when handing out page space.

Hmm, but on the other hand, if you imagine a scenario in which the
"storage system extra blob" is actually a nonce for TDE, you need to
be able to find it before you've decrypted the rest of the page. If
pd_checksum gives you the offset of that data, you need to exclude it
from what gets encrypted, which means that you need to encrypt three
separate non-contiguous areas of the page whose combined size is
unlikely to be a multiple of the encryption algorithm's block size.
That kind of sucks (and putting it at the end of the page makes it way
better).
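
To make that concrete, here is a rough sketch (sizes and helper names
invented, not taken from any patch) of why the end-of-page placement
keeps the to-be-encrypted region contiguous:

    /*
     * Hypothetical layout: nonce in the last NONCE_SIZE bytes of the
     * page.  Everything that needs encrypting is then one contiguous
     * run instead of three fragments split around a mid-page nonce.
     */
    #define NONCE_SIZE  16      /* invented for illustration */

    char       *start = (char *) page + SizeOfPageHeaderData;
    Size        len = BLCKSZ - SizeOfPageHeaderData - NONCE_SIZE;

    encrypt_range(start, len, nonce);   /* encrypt_range is hypothetical */

Whether the page header itself belongs inside the encrypted run is a
separate question.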

That said, I certainly agree that finding the special space needs to
be fast. The question in my mind is HOW fast it needs to be, and what
techniques we might be able to use to dodge the problem. For instance,
suppose that, during the startup sequence, we look at the control
file, figure out the size of the 'storage system extra blob', and
based on that each AM figures out the byte-offset of its special space
and caches that in a global variable. Then, instead of
PageGetSpecialSpace(page) it does PageGetBtreeSpecialSpace(page) or
whatever, where the implementation is ((char*) page) +
the_aforementioned_global_variable. Is that going to be too slow?
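
For illustration, a minimal sketch of what I have in mind (the control
file field and the init function are invented names):

    /* Computed once at startup; constant for the backend's lifetime. */
    static Size reserved_page_tail;     /* from a hypothetical pg_control field */
    static Size btree_special_offset;

    void
    InitBtreeSpecialOffset(Size reserved_from_control_file)
    {
        reserved_page_tail = reserved_from_control_file;
        btree_special_offset = BLCKSZ - reserved_page_tail -
            MAXALIGN(sizeof(BTPageOpaqueData));
    }

    #define PageGetBtreeSpecialSpace(page) \
        ((BTPageOpaque) ((char *) (page) + btree_special_offset))

That costs one global-variable load per call where today there is a
compile-time constant.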

If it is, then I think this whole effort may be in more trouble than I
can get it out of, because it's not just the location of the special
space that is an issue here, and indeed from what I can see that's not
even the most important issue. There's tons of constants that are
computed based on the amount of usable space in the page, and I don't
have a better idea than turning those constants into global variables
that are computed once ... well, perhaps in some cases we could
multiply compile hot bits of code, once per possible value of the
compile-time constant, but I'm pretty sure we don't want to do that
for the entire index AM.

There's going to have to be some compromise here. On the one hand
you're going to have people who want to be able to do run-time
conversions between page formats even at the cost of extra runtime
overhead on top of what the basic feature necessarily implies. On the
other hand you're going to have people who don't think any overhead at
all is acceptable, even if it's purely nominal and only visible on a
microbenchmark. Such arguments can easily become holy wars. I think we
should take a pragmatic approach: big slowdowns are categorically
unacceptable, and every effort must be made to minimize overhead, but
if the only permissible amount of overhead is exactly zero, then
there's no hope of ever implementing any of these kinds of features. I
don't think that's actually what most people want.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Peter Geoghegan
Date:
On Tue, Jun 14, 2022 at 9:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's been some years since I had much to do with pg_filedump, but
> my recollection is that the size of the special space is only one
> part of its heuristics, because there already *are* collisions.

Right, there are collisions even today. The heuristics are kludgey,
but they work perfectly in practice. That's not just due to luck --
it's due to people making sure that they continued to work over time.

> Moreover, there already are per-AM magic numbers in there that
> it uses to resolve those cases.  They're not at the front though.
> Nobody has ever wanted to break on-disk compatibility just to make
> pg_filedump's page-type identification less klugy, so I find it
> hard to believe that the above suggestion isn't a non-starter.

There is no doubt that it's not worth breaking on-disk compatibility
just for pg_filedump. The important principle here is that
high-context page formats are bad, and should be avoided whenever
possible.

Why isn't it possible to avoid it here? We have all the bits we need
for it in the page header, and then some. Why should we assume that
it'll never be useful to apply encryption selectively, perhaps at the
relation level?

-- 
Peter Geoghegan



Re: better page-level checksums

From
Peter Geoghegan
Date:
On Tue, Jun 14, 2022 at 10:43 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm, but on the other hand, if you imagine a scenario in which the
> "storage system extra blob" is actually a nonce for TDE, you need to
> be able to find it before you've decrypted the rest of the page. If
> pd_checksum gives you the offset of that data, you need to exclude it
> from what gets encrypted, which means that you need to encrypt three
> separate non-contiguous areas of the page whose combined size is
> unlikely to be a multiple of the encryption algorithm's block size.
> That kind of sucks (and putting it at the end of the page makes it way
> better).

I don't have a great understanding of how that cost will be felt in
detail right now, because I don't know enough about the project and
the requirements for TDE in general.

> That said, I certainly agree that finding the special space needs to
> be fast. The question in my mind is HOW fast it needs to be, and what
> techniques we might be able to use to dodge the problem. For instance,
> suppose that, during the startup sequence, we look at the control
> file, figure out the size of the 'storage system extra blob', and
> based on that each AM figures out the byte-offset of its special space
> and caches that in a global variable. Then, instead of
> PageGetSpecialSpace(page) it does PageGetBtreeSpecialSpace(page) or
> whatever, where the implementation is ((char*) page) +
> the_aforementioned_global_variable. Is that going to be too slow?

Who knows? For now the important point is that there is a tension
between the requirements of TDE, and the requirements of access
methods (especially index access methods). It's possible that this
will turn out not to be much of a problem. But the burden of proof is
yours. Making a big change to the on-disk format like this (a change
that affects every access method) should be held to an exceptionally
high standard.

There are bound to be tacit or even explicit assumptions made by
access methods that you risk breaking here. The reality is that all of
the access method code evolved in an environment where the special
space size was constant and generic for a given BLCKSZ. I don't have
much sympathy for any suggestion that code written 20 years ago should
have known not to make these assumptions. I have a lot more sympathy
for the idea that it's a general problem with our infrastructure
(particularly code in bufpage.c and the delicate assumptions made by
its callers) -- a problem that is worth addressing with a broad
solution that enables lots of different work.

We don't necessarily get another shot at this if we get it wrong now.

> There's going to have to be some compromise here. On the one hand
> you're going to have people who want to be able to do run-time
> conversions between page formats even at the cost of extra runtime
> overhead on top of what the basic feature necessarily implies. On the
> other hand you're going to have people who don't think any overhead at
> all is acceptable, even if it's purely nominal and only visible on a
> microbenchmark. Such arguments can easily become holy wars.

How many times has a big change to the on-disk format of this kind of
magnitude taken place, post-pg_upgrade? I would argue that this would
be the first, since it is the moral equivalent of extending the size
of the generic page header.

For all I know the overhead will be perfectly fine, and everybody
wins. I just want to be adamant that we're making the right
trade-offs, and maximizing the benefit from any new cost imposed on
access method code.

-- 
Peter Geoghegan



Re: better page-level checksums

From
Robert Haas
Date:
On Tue, Jun 14, 2022 at 1:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> There is no doubt that it's not worth breaking on-disk compatibility
> just for pg_filedump. The important principle here is that
> high-context page formats are bad, and should be avoided whenever
> possible.

I agree.

> Why isn't it possible to avoid it here? We have all the bits we need
> for it in the page header, and then some. Why should we assume that
> it'll never be useful to apply encryption selectively, perhaps at the
> relation level?

We can have anything we want here, but we can't have everything we
want at the same time. There are irreducible engineering trade-offs
here. If all pages in a given cluster are the same, backends can
compute the values of things that are currently compile-time constants
upon startup and continue to use them for the lifetime of the backend.
If pages can vary, some encrypted or checksummed and others not, then
you have to recompute those values for every page. That's bound to
have some cost. It is also more flexible.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Peter Geoghegan
Date:
On Tue, Jun 14, 2022 at 11:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
> We can have anything we want here, but we can't have everything we
> want at the same time. There are irreducible engineering trade-offs
> here. If all pages in a given cluster are the same, backends can
> compute the values of things that are currently compile-time constants
> upon startup and continue to use them for the lifetime of the backend.
> If pages can vary, some encrypted or checksummed and others not, then
> you have to recompute those values for every page. That's bound to
> have some cost. It is also more flexible.

Maybe not -- it depends on the particulars of the code. For example,
it might be okay for the B-Tree code to assume that B-Tree pages have
a special area at a known fixed offset, determined at compile time. At
the same time, it might very well not be okay for a backup tool to
make any such assumption, because it doesn't have the same context.

Even within TDE, it might be okay to assume that it's a feature that
the user must commit to using for a whole cluster at initdb time. What
isn't okay is committing to that assumption now and forever, by
leaving the door open to a world in which that assumption no longer
holds. Like when you do finally get around to making TDE something
that can work at the relation level, for example. Even if there is
only a small chance of that ever happening, why wouldn't we be
prepared for it, just on general principle?

-- 
Peter Geoghegan



Re: better page-level checksums

From
Robert Haas
Date:
On Tue, Jun 14, 2022 at 2:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Maybe not -- it depends on the particulars of the code. For example,
> it might be okay for the B-Tree code to assume that B-Tree pages have
> a special area at a known fixed offset, determined at compile time. At
> the same time, it might very well not be okay for a backup tool to
> make any such assumption, because it doesn't have the same context.
>
> Even within TDE, it might be okay to assume that it's a feature that
> the user must commit to using for a whole cluster at initdb time. What
> isn't okay is committing to that assumption now and forever, by
> leaving the door open to a world in which that assumption no longer
> holds. Like when you do finally get around to making TDE something
> that can work at the relation level, for example. Even if there is
> only a small chance of that ever happening, why wouldn't we be
> prepared for it, just on general principle?

To the extent that we can leave ourselves room to do new things in the
future without incurring unreasonable costs in the present, I'm in
favor of that, as I believe anyone would be. But as you say, a lot
depends on the specifics. Theoretical flexibility that can only be
used in practice by really slow code doesn't help anybody.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Peter Geoghegan
Date:
On Tue, Jun 14, 2022 at 11:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Even within TDE, it might be okay to assume that it's a feature that
> > the user must commit to using for a whole cluster at initdb time. What
> > isn't okay is committing to that assumption now and forever, by
> > leaving the door open to a world in which that assumption no longer
> > holds. Like when you do finally get around to making TDE something
> > that can work at the relation level, for example. Even if there is
> > only a small chance of that ever happening, why wouldn't we be
> > prepared for it, just on general principle?
>
> To the extent that we can leave ourselves room to do new things in the
> future without incurring unreasonable costs in the present, I'm in
> favor of that, as I believe anyone would be. But as you say, a lot
> depends on the specifics. Theoretical flexibility that can only be
> used in practice by really slow code doesn't help anybody.

A tool like pg_filedump or a backup tool can easily afford this
overhead. The only cost that TDE has to pay for this added flexibility
is that it has to set one of the PD_* bits in a code path that is
already bound to be very expensive. What's so bad about that?

Honestly, I'm a bit surprised that you're pushing back on this
particular point. A nonce for TDE is just something that code in
places like bufpage.h ought to know about. It has to be negotiated at
that level, because it will in fact affect a lot of callers to the
bufpage.h functions.

-- 
Peter Geoghegan



Re: better page-level checksums

From
Robert Haas
Date:
On Tue, Jun 14, 2022 at 3:01 PM Peter Geoghegan <pg@bowt.ie> wrote:
> A tool like pg_filedump or a backup tool can easily afford this
> overhead. The only cost that TDE has to pay for this added flexibility
> is that it has to set one of the PD_* bits in a code path that is
> already bound to be very expensive. What's so bad about that?
>
> Honestly, I'm a bit surprised that you're pushing back on this
> particular point. A nonce for TDE is just something that code in
> places like bufpage.h ought to know about. It has to be negotiated at
> that level, because it will in fact affect a lot of callers to the
> bufpage.h functions.

Peter, unless I have missed something, this email is the very first
one where you or anyone else have said anything at all about a PD_*
bit. Even here, it's not very clear exactly what you are proposing.
Therefore I have neither said anything bad about it in the past, nor
can I now answer the question as to what is "so bad about it." If you
want to make a concrete proposal, I will be happy to tell you what I
think about it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Peter Geoghegan
Date:
On Tue, Jun 14, 2022 at 12:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Peter, unless I have missed something, this email is the very first
> one where you or anyone else have said anything at all about a PD_*
> bit. Even here, it's not very clear exactly what you are proposing.
> Therefore I have neither said anything bad about it in the past, nor
> can I now answer the question as to what is "so bad about it." If you
> want to make a concrete proposal, I will be happy to tell you what I
> think about it.

I am proposing that we not commit ourselves to relying on implicit
information about what must be true for every page in the cluster.
Just having a little additional page-header metadata (in pd_flags)
would accomplish that much, and wouldn't in itself impose any real
overhead on TDE.

It's not like the PageHeaderData.pd_flags bits are already a precious
commodity, in the same way as the heap tuple infomask status bits are.
We can afford to use some of them for this purpose, and then some.

Why wouldn't we do it that way, just on general principle?

You may still find it useful to rely on high level context at the
level of code that runs on the server, perhaps for performance reasons
(though it's unclear how much it matters). In which case the status
bit is technically redundant information as far as the code is
concerned. That may well be fine.

--
Peter Geoghegan



Re: better page-level checksums

From
Robert Haas
Date:
On Tue, Jun 14, 2022 at 3:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I am proposing that we not commit ourselves to relying on implicit
> information about what must be true for every page in the cluster.
> Just having a little additional page-header metadata (in pd_flags)
> would accomplish that much, and wouldn't in itself impose any real
> overhead on TDE.
>
> It's not like the PageHeaderData.pd_flags bits are already a precious
> commodity, in the same way as the heap tuple infomask status bits are.
> We can afford to use some of them for this purpose, and then some.
>
> Why wouldn't we do it that way, just on general principle?
>
> You may still find it useful to rely on high level context at the
> level of code that runs on the server, perhaps for performance reasons
> (though it's unclear how much it matters). In which case the status
> bit is technically redundant information as far as the code is
> concerned. That may well be fine.

I still am not clear on precisely what you are proposing here. I do
agree that there is significant bit space available in pd_flags and
that consuming some of it wouldn't be stupid, but that doesn't add up
to a proposal. Maybe the proposal is: figure out how many different
configurations there are for this new kind of page space, let's say N,
and then reserve ceil(log2(N)) bits from pd_flags to indicate which
one we've got.
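
A sketch of roughly what I understand you to be proposing (bit values
invented; the real PD_* definitions live in bufpage.h):

    /* hypothetical pd_flags additions */
    #define PD_EXTENDED_CHECKSUM    0x0008  /* page carries a wide checksum */
    #define PD_PAGE_ENCRYPTED       0x0010  /* page body is encrypted */

    #define PageHasExtendedChecksum(page) \
        ((((PageHeader) (page))->pd_flags & PD_EXTENDED_CHECKSUM) != 0)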

One possible problem with this is that, if the page is actually
encrypted, we might want pd_flags to also be encrypted. The existing
contents of pd_flags disclose some information about the tuples that
are on the page, so having them exposed to prying eyes does not seem
appealing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Peter Geoghegan
Date:
On Tue, Jun 14, 2022 at 1:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I still am not clear on precisely what you are proposing here. I do
> agree that there is significant bit space available in pd_flags and
> that consuming some of it wouldn't be stupid, but that doesn't add up
> to a proposal. Maybe the proposal is: figure out how many different
> configurations there are for this new kind of page space, let's say N,
> and then reserve ceil(log2(N)) bits from pd_flags to indicate which
> one we've got.

I'm just making a general point. Why wouldn't we start out with the
assumption that we use some pd_flags bit space for this stuff?

> One possible problem with this is that, if the page is actually
> encrypted, we might want pd_flags to also be encrypted. The existing
> contents of pd_flags disclose some information about the tuples that
> are on the page, so having them exposed to prying eyes does not seem
> appealing.

I'm skeptical of the idea that we want to avoid leaving any metadata
unencrypted. But I'm not an expert on TDE, and don't want to say too
much about it without having done some more research. I would like to
see some justification for just encrypting everything on the page
without concern for the loss of debuggability, though. What is the
underlying theory behind that particular decision? Are there any
examples that we can draw from, from other systems or published
designs?

Let's assume for now that we don't leave pd_flags unencrypted, as you
have suggested. We're still discussing new approaches to checksumming
in the scope of this work, which of course includes many individual
cases that don't involve any encryption. Plus even with encryption
there are things like defensive assertions that can be added by using
a flag bit for this.

-- 
Peter Geoghegan



Re: better page-level checksums

From
Peter Geoghegan
Date:
On Tue, Jun 14, 2022 at 1:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jun 14, 2022 at 1:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I still am not clear on precisely what you are proposing here. I do
> > agree that there is significant bit space available in pd_flags and
> > that consuming some of it wouldn't be stupid, but that doesn't add up
> > to a proposal. Maybe the proposal is: figure out how many different
> > configurations there are for this new kind of page space, let's say N,
> > and then reserve ceil(log2(N)) bits from pd_flags to indicate which
> > one we've got.
>
> I'm just making a general point. Why wouldn't we start out with the
> assumption that we use some pd_flags bit space for this stuff?

Technically we don't already do that today, with the 16-bit checksums
that are stored in PageHeaderData.pd_checksum. But we do something
equivalent: low-level tools can still infer that checksums must not be
enabled on the page (really the cluster) indirectly in the event of a
0 checksum. A 0 value can reasonably be interpreted as a page from a
cluster without checksums (barring page corruption). This is basically
reasonable because our implementation of checksums is guaranteed to
not generate 0 as a valid checksum value.
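
(If I remember checksum_impl.h correctly, the final reduction is
roughly this, which maps the folded value into 1..65535 and therefore
can never produce 0:

    /* paraphrased from memory, not copied verbatim */
    return (uint16) ((checksum % 65535) + 1);

)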

While pg_filedump does not rely on the 0 checksum convention
currently, it doesn't really need to. When the user uses the -k option
to verify checksums in passing, pg_filedump can assume that checksums
must be enabled ("the user said they must be, so expect it" is a
reasonable assumption at that point). This also depends on there being
only one approach to checksums.

-- 
Peter Geoghegan



Re: better page-level checksums

From
Robert Haas
Date:
On Tue, Jun 14, 2022 at 4:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I'm just making a general point. Why wouldn't we start out with the
> assumption that we use some pd_flags bit space for this stuff?

Well, the reason that wasn't my starting assumption is because I
didn't think of the idea.

> I'm skeptical of the idea that we want to avoid leaving any metadata
> unencrypted. But I'm not an expert on TDE, and don't want to say too
> much about it without having done some more research. I would like to
> see some justification for just encrypting everything on the page
> without concern for the loss of debuggability, though. What is the
> underlying theory behind that particular decision? Are there any
> examples that we can draw from, from other systems or published
> designs?

I don't really think there is much controversy about the idea that
it's a good idea to encrypt all of the data rather than only some of
it. I mean, that's what side channel attacks are: failure to secure
all of the information that an attacker might find useful.
Unfortunately, it seems inevitable that any TDE implementation in
PostgreSQL is going to leak some information that an attacker might
consider useful - e.g. we can't conceal how many files they are, or
what they're called, or the lengths of those files. But it seems
absolutely clear that our goal ought to be to leak as little
information as possible.

> Let's assume for now that we don't leave pd_flags unencrypted, as you
> have suggested. We're still discussing new approaches to checksumming
> in the scope of this work, which of course includes many individual
> cases that don't involve any encryption. Plus even with encryption
> there are things like defensive assertions that can be added by using
> a flag bit for this.

True. I don't think we should be too profligate with those bits just
in case somebody needs a bunch of them for something important in the
future, but it's probably fine to use up one or two.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Robert Haas
Date:
On Tue, Jun 14, 2022 at 9:56 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Technically we don't already do that today, with the 16-bit checksums
> that are stored in PageHeaderData.pd_checksum. But we do something
> equivalent: low-level tools can still infer that checksums must not be
> enabled on the page (really the cluster) indirectly in the event of a
> 0 checksum. A 0 value can reasonably be interpreted as a page from a
> cluster without checksums (barring page corruption). This is basically
> reasonable because our implementation of checksums is guaranteed to
> not generate 0 as a valid checksum value.

I don't think that 'pg_checksums -d' zeroes the checksum values on the
pages in the cluster.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Peter Geoghegan
Date:
On Tue, Jun 14, 2022 at 7:17 PM Robert Haas <robertmhaas@gmail.com> wrote:
> But it seems
> absolutely clear that our goal ought to be to leak as little
> information as possible.

But at what cost?

Basically I think that this is giving up rather a lot. For example,
isn't it possible that we'd have corruption that could be a bug in
either the checksum code or in recovery?

I'd feel a lot better about it if there was some sense of both the
costs and the benefits.

> > Let's assume for now that we don't leave pd_flags unencrypted, as you
> > have suggested. We're still discussing new approaches to checksumming
> > in the scope of this work, which of course includes many individual
> > cases that don't involve any encryption. Plus even with encryption
> > there are things like defensive assertions that can be added by using
> > a flag bit for this.
>
> True. I don't think we should be too profligate with those bits just
> in case somebody needs a bunch of them for something important in the
> future, but it's probably fine to use up one or two.

Sure, but how many could possibly be needed for this? I can't see it
being more than 2 or 3. Which seems absolutely fine. They *definitely*
have no value if nobody ever uses them for anything.

-- 
Peter Geoghegan



Re: better page-level checksums

From
Michael Paquier
Date:
On Tue, Jun 14, 2022 at 10:21:16PM -0400, Robert Haas wrote:
> On Tue, Jun 14, 2022 at 9:56 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> Technically we don't already do that today, with the 16-bit checksums
>> that are stored in PageHeaderData.pd_checksum. But we do something
>> equivalent: low-level tools can still infer that checksums must not be
>> enabled on the page (really the cluster) indirectly in the event of a
>> 0 checksum. A 0 value can reasonably be interpreted as a page from a
>> cluster without checksums (barring page corruption). This is basically
>> reasonable because our implementation of checksums is guaranteed to
>> not generate 0 as a valid checksum value.
>
> I don't think that 'pg_checksums -d' zeroes the checksum values on the
> pages in the cluster.

Saving the suspense..  pg_checksums --disable only updates the control
file to keep the operation cheap.
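
For example (the data directory path is illustrative):

    $ pg_checksums --disable -D /path/to/pgdata   # control file flag only
    $ pg_checksums --enable -D /path/to/pgdata    # this one rewrites every page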
--
Michael


Re: better page-level checksums

From
Peter Geoghegan
Date:
On Tue, Jun 14, 2022 at 7:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 14, 2022 at 9:56 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Technically we don't already do that today, with the 16-bit checksums
> > that are stored in PageHeaderData.pd_checksum. But we do something
> > equivalent: low-level tools can still infer that checksums must not be
> > enabled on the page (really the cluster) indirectly in the event of a
> > 0 checksum. A 0 value can reasonably be interpreted as a page from a
> > cluster without checksums (barring page corruption). This is basically
> > reasonable because our implementation of checksums is guaranteed to
> > not generate 0 as a valid checksum value.
>
> I don't think that 'pg_checksums -d' zeroes the checksum values on the
> pages in the cluster.

Obviously there are limitations on when and how we can infer something
about the whole cluster based on one single page image -- it all
depends on the context. I'm only arguing that we ought to make this
kind of analysis as easy as we reasonably can. I just don't see any
downside to having a status bit per checksum or encryption algorithm
at the page level, and plenty of upside (especially in the event of
bugs).

This seems like the absolute bare minimum to me, and I'm genuinely
surprised that there is even a question about whether or not we should
do that much.

-- 
Peter Geoghegan



Re: better page-level checksums

From
Peter Eisentraut
Date:
On 13.06.22 20:20, Robert Haas wrote:
> If the user wants 16-bit checksums, the feature we've already got
> seems good enough -- and, as you say, it doesn't use any extra disk
> space. This proposal is just about making people happy if they want a
> bigger checksum.

It's hard to get any definite information about what size of checksum is 
"good enough", since after all it depends on what kinds of errors you 
expect and what kinds of probabilities you want to accept.  But the best 
I could gather so far is that 16-bit CRC are good until about 16 kB 
block size.

Which leads to the question whether there is really a lot of interest in 
catering to larger block sizes.  The recent thread about performance 
impact of different block sizes might renew interest in this.  But 
unless we really want to encourage playing with the block sizes (and if 
my claim above is correct), then a larger checksum size might not be needed.

> On the topic of which algorithm to use, I'd be inclined to think that
> it is going to be more useful to offer checksums that are 64 bits or
> more, since IMHO 32 is not all that much more than 16, and I still
> think there are going to be alignment issues. Beyond that I don't have
> anything against your specific suggestions, but I'd like to hear what
> other people think.

Again, gathering some vague information ...

The benefits of doubling the checksum size are exponential rather than 
linear, so there is no significant benefit of using a 64-bit checksum 
over a 32-bit one, for supported block sizes (current max is 32 kB).
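
To put rough numbers on that: assuming a well-mixed checksum, a random
corruption escapes detection with probability about 2^-bits, i.e.

    16 bits:  ~1 in 6.6 * 10^4
    32 bits:  ~1 in 4.3 * 10^9
    64 bits:  ~1 in 1.8 * 10^19

Each doubling of the width squares the denominator instead of doubling
it -- that is the "exponential rather than linear" part.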



Re: better page-level checksums

From
Robert Haas
Date:
On Wed, Jun 15, 2022 at 4:54 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> It's hard to get any definite information about what size of checksum is
> "good enough", since after all it depends on what kinds of errors you
> expect and what kinds of probabilities you want to accept.  But the best
> I could gather so far is that 16-bit CRC are good until about 16 kB
> block size.

Not really. There's a lot of misinformation on this topic floating
around on this mailing list, and some of that misinformation is my
fault. I keep learning more about this topic. However, I'm pretty
confident that, on the one hand, there's no hard limit on the size of
the data that can be effectively validated via a CRC, and on the other
hand, CRC isn't a particularly great algorithm, although it does have
certain interesting advantages for certain purposes.

For example, according to
https://en.wikipedia.org/wiki/Mathematics_of_cyclic_redundancy_checks#Error_detection_strength
a CRC is guaranteed to detect all single-bit errors. This property is
easy to achieve: for example, a parity bit has this property.
According to the same source, a CRC is guaranteed to detect two-bit
errors only if the distance between them is less than some limit that
gets larger as the CRC gets wider. Imagine that you have a CRC-16 of a
message 64k+1 bits in length. Suppose that an error in the first bit
changes the result from v to v'. Can we, by flipping a second bit
later in the message, change the final result from v' back to v? The
calculation only has 64k possible answers, and we have 64k bits we can
flip to try to get the desired answer. If every one of those bit flips
produces a different answer, then one of those answers must be v --
which means detection of two-bits errors is not guaranteed. If at
least two of those bit flips produce the same answer, then consider
the messages produced by those two different bit flips. They differ
from each other by exactly two bits and yet produced the same CRC, so
detection of two-bit errors is still not guaranteed.

On the other hand, it's still highly likely. If a message of length
2^16+1 bits contains two bit errors one of which is in the first bit,
the chances that the other one is in exactly the right place to cancel
out the first error are about 2^-16. That's not zero, but it's just as
good as our chances of detecting a replacement of the entire message
with some other message chosen completely at random. I think the
reason why discussion of CRCs tends to focus on the types of bit
errors they can detect is that the algorithm was designed when
people were doing stuff like sending files over a modem. It's easy to
understand how individual bits could get garbled without anybody
noticing, while large-scale corruption would be less likely, but the
risks are not necessarily the same for a PostgreSQL data file. Lower
levels of the stack are probably already using checksums to try to
detect errors at the level of the physical medium. I'm sure some stuff
slips through the cracks, but in practice we also see failure modes
where the filesystem substitutes 8kB of data from an unrelated file,
or where a torn write in combination with unreliable fsync results in
half of the page contents being from an older version of the page.
These kinds of large-scale replacements aren't what CRCs are designed
to detect, and the chances that we will detect them are roughly
1-2^-bits, whether we use a CRC or something else.

Of course, that partly depends on the algorithm quality. If an
algorithm is more likely to generate some results than others, then
its actual error detection rate will not be as good as the number of
output bits would suggest. If the result doesn't depend equally on
every input bit, then the actual error detection rate will not be as
good as the number of output bits would suggest. And CRC-32 is
apparently not great by modern standards:

https://github.com/rurban/smhasher

Compare the results for CRC-32 with, say, Spooky32. Apparently the
latter is faster yet produces better output. So maybe we would've been
better off if we'd made Spooky32 the default algorithm for backup
manifest checksums rather than CRC-32.

> The benefits of doubling the checksum size are exponential rather than
> linear, so there is no significant benefit of using a 64-bit checksum
> over a 32-bit one, for supported block sizes (current max is 32 kB).

I'm still unconvinced that the block size is very relevant here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Robert Haas
Date:
On Tue, Jun 14, 2022 at 10:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Basically I think that this is giving up rather a lot. For example,
> isn't it possible that we'd have corruption that could be a bug in
> either the checksum code, or in recovery?
>
> I'd feel a lot better about it if there was some sense of both the
> costs and the benefits.

I think that, if and when we get TDE, debuggability is likely to be a
huge issue. Something will go wrong for someone at some point, and
when it does, what they'll have is a supposedly-encrypted page that
cannot be decrypted, and it will be totally unclear what has gone
wrong. Did the page get corrupted on disk by a random bit flip? Is
there a bug in the algorithm? Torn page? As things stand today, when a
page gets corrupted, a human being can look at the page and make an
educated guess about what has gone wrong and whether PostgreSQL or
some other system is to blame, and if it's PostgreSQL, perhaps have
some ideas as to where to look for the bug. If the pages are
encrypted, that's a lot harder. I think what will happen, depending on
the encryption mode, is probably that either (a) the page will decrypt
to complete garbage or (b) the page will fail some kind of
verification and you won't be able to decrypt it at all. Either way,
you won't be able to infer anything about what caused the problem. All
you'll know is that something is wrong. That sucks - a lot - and I
don't have a lot of good ideas as to what can be done about it. The
idea that an encrypted page is unintelligible and that small changes
to either the encrypted or unencrypted data should result in large
changes to the other is intrinsic to the nature of encryption. It's
more or less un-debuggable by design.

With extended checksums, I don't think the issues are anywhere near as
bad. I'm not deeply opposed to setting a page-level flag but I expect
nominal benefits. A human being looking at the page isn't going to
have a ton of trouble figuring out whether or not the extended
checksum is present unless the page is horribly, horribly garbled, and
even if that happens, will debugging that problem really be any worse
than debugging a horribly, horribly garbled page today? I don't think
so. I likewise expect that pg_filedump could use heuristics to figure
out what's going on just by looking at the page, even if no external
information is available. You are probably right when you say that
there's no need to be so parsimonious with pd_flags space as all that,
but I believe that if we did decide to set no bit in pd_flags, whoever
maintains pg_filedump these days would not have huge difficulty
inventing a suitable heuristic. A page with an extended checksum is
basically still an intelligible page, and we shouldn't understate the
value of that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Peter Geoghegan
Date:
On Wed, Jun 15, 2022 at 1:27 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think what will happen, depending on
> the encryption mode, is probably that either (a) the page will decrypt
> to complete garbage or (b) the page will fail some kind of
> verification and you won't be able to decrypt it at all. Either way,
> you won't be able to infer anything about what caused the problem. All
> you'll know is that something is wrong. That sucks - a lot - and I
> don't have a lot of good ideas as to what can be done about it. The
> idea that an encrypted page is unintelligible and that small changes
> to either the encrypted or unencrypted data should result in large
> changes to the other is intrinsic to the nature of encryption. It's
> more or less un-debuggable by design.

It's pretty clear that there must be a lot of truth to that. But that
doesn't mean that there aren't meaningful gradations beyond that.

I think that it's worth doing the following exercise (humor me): Why
wouldn't it be okay to just encrypt the tuple space and the line
pointer array, leaving both the page header and page special area
unencrypted? What kind of user would find that trade-off to be
unacceptable, and why? What's the nuance of it?
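
Spelled out against the bufpage.h layout, the trade-off I mean is
something like this (byte ranges sketched, not exact proposals):

    cleartext:  [0, SizeOfPageHeaderData)           page header
    encrypted:  [SizeOfPageHeaderData, pd_special)  line pointers, hole, tuples
    cleartext:  [pd_special, BLCKSZ)                special area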

For all I know you're right (about encrypting the whole page, metadata
and all). I just want to know why that is. I understand that this
whole area is one where in general we may have to live with a certain
amount of uncertainty about what really matters.

> With extended checksums, I don't think the issues are anywhere near as
> bad. I'm not deeply opposed to setting a page-level flag but I expect
> nominal benefits.

I also expect only a small benefit. But that isn't a particularly
important factor in my mind.

Let's suppose that it turns out to be significantly more useful than
we originally expected, for whatever reason. Assuming all that, what
else can be said about it now? Isn't it now *relatively* likely that
including that status bit metadata will be *extremely* valuable, and
not merely somewhat more valuable?

I guess it doesn't matter much now (since you have all but conceded
that using a bit for this makes sense), but FWIW that's the main
reason why I almost took it for granted that we'd need to use a status
bit (or bits) for this.

-- 
Peter Geoghegan



Re: better page-level checksums

From
Robert Haas
Date:
On Wed, Jun 15, 2022 at 5:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I think that it's worth doing the following exercise (humor me): Why
> wouldn't it be okay to just encrypt the tuple space and the line
> pointer array, leaving both the page header and page special area
> unencrypted? What kind of user would find that trade-off to be
> unacceptable, and why? What's the nuance of it?

Let's consider a continuum where, on the one end, you encrypt the
entire disk. Then, consider a solution where you encrypt each
individual file, block by block. Next, let's imagine that we don't
encrypt some kinds of files at all, if we think the data in them isn't
sensitive enough. CLOG, maybe. Perhaps pg_class, because that'd be
useful for debugging, and how sensitive can the names of the database
tables be? Then, let's adopt your proposal here and leave some parts
of each block unencrypted for debuggability. As a next step, we could
take the further step of separately encrypting each tuple, but only
the data, leaving the tuple header unencrypted. Then, going further,
we could encrypt each individual column value within the tuple
separately, rather than encrypting the tuple together. Then, let's
additionally decide that we're not going to encrypt all the columns,
but just the ones the user says are sensitive. Now I think we've
pretty much reached the other end of the continuum, unless someone is
going to propose something like encrypting only part of each column,
or storing some unencrypted data along with each encrypted column that
is somehow dependent on the column contents.

I think it is undeniable that every step along that continuum has
weakened security in some way. The worst case scenario for an attacker
must be that the entire disk is encrypted and they can gain no
meaningful information at all without having to break that encryption.
As the encryption touches fewer things, it becomes easier and easier
to make inferences about the unseen data based on the data that you
can see. One can sit around and argue about whether the amount of
information that is leaked at any given step is enough for anyone to
care, but to some extent that's an opinion question where any position
can be defended by someone. I would argue that even leaking the
lengths of the files is not great at all. Imagine that the table is
scheduled_nuclear_missile_launches. I definitely do not want my
adversaries to know even as much as whether that table is zero-length
or non-zero-length. In fact I would prefer that they be unable to
infer that I have such a table at all. Back in 2019 I constructed a
similar example for how access to pg_clog could leak meaningful
information:

http://postgr.es/m/CA+TgmoZhbeYmRoAccJ1oCN03Jz2Uak18QN4afx4WD7g+j7SVcQ@mail.gmail.com

Now, obviously, anyone can debate how realistic such cases are, but
they definitely exist. If you can read btpo_prev, btpo_next,
btpo_level, and btpo_flags for every page in the btree, you can
probably infer some things about the distribution of keys in the table
-- especially if you can read all the pages at time T1 and then read
them all again later at time T2 (and maybe further times T3..Tn). You
can make inferences about which parts of the keyspace are receiving new
index insertions and which are not. If that's the index on the
current_secret_missions.country_code column, well then that sucks.
Your adversary may be able to infer where in the world your secret
organization is operating and round up all your agents.

Now, I do realize that if we're ever going to get TDE in PostgreSQL,
we will probably have to make some compromises. Actually concealing
file lengths would require a redesign of the entire storage system,
and so is probably not practical in the short term. Concealing SLRU
contents would require significant changes too, some of which I think
are things Thomas wants to do anyway, but we might have to punt that
goal for the first version of a TDE feature, too. Surely, that weakens
security, but if it gets us to a feature that some people can use
before the heat death of the universe, there's a reasonable argument
that that's better than nothing. Still, conceding that we may not
realistically be able to conceal all the information in v1 is
different from arguing that concealing it isn't desirable, and I think
the latter argument is pretty hard to defend.

People who want to break into computers have gotten incredibly good at
exploiting incredibly subtle bits of information in order to infer the
contents of unseen data.
https://en.wikipedia.org/wiki/Spectre_(security_vulnerability) is a
good example: somebody figured out that the branch prediction hardware
could initiate speculative accesses to RAM that the user doesn't
actually have permission to execute, and thus a JavaScript program
running in your browser can read out the entire contents of RAM by
measuring exactly how long mis-predicted code takes to execute.
There's got to be at least one chip designer out there somewhere who
was involved in the design of that branch prediction system, knew that
it didn't perform the permissions checks before accessing RAM, and
thought to themselves "that should be ok - what's the worst thing that
can happen?". I imagine that (those) chip designer(s) had a really bad
day when they found out someone had written a program to use that
information leakage to read out the entire contents of RAM ... not
even using C, but using JavaScript running inside a browser!

That's only an example, but I think it's pretty typical of how these
sort of things go. I believe computer security literature is literally
riddled with attacks where the exposure of seemingly-innocent
information turned out to be a big problem. I don't think the
information exposed in the btree special space is very innocent: it's
not the keys themselves, but if you have the contents of every btree
special space in the btree there are definitely cases where you can
draw inferences from that information.

> I also expect only a small benefit. But that isn't a particularly
> important factor in my mind.
>
> Let's suppose that it turns out to be significantly more useful than
> we originally expected, for whatever reason. Assuming all that, what
> else can be said about it now? Isn't it now *relatively* likely that
> including that status bit metadata will be *extremely* valuable, and
> not merely somewhat more valuable?

This is too hypothetical for me to have an intelligent opinion.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: better page-level checksums

From
Bruce Momjian
Date:
On Tue, Jun 14, 2022 at 01:42:55PM -0400, Robert Haas wrote:
> Hmm, but on the other hand, if you imagine a scenario in which the
> "storage system extra blob" is actually a nonce for TDE, you need to
> be able to find it before you've decrypted the rest of the page. If
> pd_checksum gives you the offset of that data, you need to exclude it
> from what gets encrypted, which means that you need to encrypt three
> separate non-contiguous areas of the page whose combined size is
> unlikely to be a multiple of the encryption algorithm's block size.
> That kind of sucks (and putting it at the end of the page makes it way
> better).

I continue to believe that a nonce is not needed for XTS encryption
mode, and that adding a tamper-detection GCM hash is of limited
usefulness since malicious writes can be done to other critical files
and can be used to find the cluster or encryption keys.
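
To illustrate the XTS point (a sketch only, using plain OpenSSL EVP
calls, not code from the patch -- function and parameter names are
illustrative): the per-block tweak can be derived from the block's
location, so nothing extra has to be stored on the page.

    #include <openssl/evp.h>
    #include <string.h>
    #include <stdint.h>

    /* Returns ciphertext length, or -1 on error. */
    static int
    encrypt_block_xts(const unsigned char *key,    /* 64 bytes: AES-256-XTS */
                      uint32_t file_id, uint32_t block_no,
                      const unsigned char *in, unsigned char *out, int len)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        unsigned char tweak[16] = {0};
        int         outlen = 0,
                    tmplen = 0;

        /* Derive the tweak from the block address; none of it is stored. */
        memcpy(tweak, &file_id, sizeof(file_id));
        memcpy(tweak + 4, &block_no, sizeof(block_no));

        if (ctx == NULL ||
            EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak) != 1 ||
            EVP_EncryptUpdate(ctx, out, &outlen, in, len) != 1 ||
            EVP_EncryptFinal_ex(ctx, out + outlen, &tmplen) != 1)
        {
            EVP_CIPHER_CTX_free(ctx);
            return -1;
        }
        EVP_CIPHER_CTX_free(ctx);
        return outlen + tmplen;
    }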

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson