Transparent column encryption - Mailing list pgsql-hackers

From Peter Eisentraut
Subject Transparent column encryption
Date
Msg-id 89157929-c2b6-817b-6025-8e4b2d89d88f@enterprisedb.com
Whole thread Raw
Responses Re: Transparent column encryption
Re: Transparent column encryption
Re: Transparent column encryption
Re: Transparent column encryption
List pgsql-hackers
I want to present my proof-of-concept patch for the transparent column
encryption feature.  (Some might also think of it as automatic
client-side encryption or similar, but I like my name.)  This feature
enables the {automatic,transparent} encryption and decryption of
particular columns in the client.  The data for those columns then
only ever appears in ciphertext on the server, so it is protected from
the "prying eyes" of DBAs, sysadmins, cloud operators, etc.  The
canonical use case for this feature is storing credit card numbers
encrypted, in accordance with PCI DSS, as well as similar situations
involving social security numbers etc.  Of course, you can't do any
computations with encrypted values on the server, but for these use
cases, that is not necessary.  This feature does support deterministic
encryption as an alternative to the default randomized encryption, so
in that mode you can do equality lookups, at the cost of some
security.

This functionality also exists in other SQL database products, so the
overall concepts weren't invented by me by any means.

Also, this feature has nothing to do with the on-disk encryption
feature being contemplated in parallel.  Both can exist independently.

The attached patch has all the necessary pieces in place to make this
work, so you can have an idea how the overall system works.  It
contains some documentation and tests to help illustrate the
functionality.  But it's missing the remaining 90% of the work,
including additional DDL support, error handling, robust memory
management, protocol versioning, forward and backward compatibility,
pg_dump support, psql \d support, refinement of the cryptography, and
so on.  But I think obvious solutions exist to all of those things, so
it isn't that interesting to focus on them for now.

------

Now to the explanation of how it works.

You declare a column as encrypted in a CREATE TABLE statement.  The
column value is encrypted by a symmetric key called the column
encryption key (CEK).  The CEK is a catalog object.  The CEK key
material is in turn encrypted by an assymmetric key called the column
master key (CMK).  The CMK is not stored in the database but somewhere
where the client can get to it, for example in a file or in a key
management system.  When a server sends rows containing encrypted
column values to the client, it first sends the required CMK and CEK
information (new protocol messages), which the client needs to record.
Then, the client can use this information to automatically decrypt the
incoming row data and forward it in plaintext to the application.

For the CMKs, the catalog object specifies a "provider" and generic
options.  Right now, libpq has a "file" provider hardcoded, and it
takes a "filename" option.  Via some mechanism to be determined,
additional providers could be loaded and then talk to key management
systems via http or whatever.  I have left some comments in the libpq
code where the hook points for this could be.

The general idea would be for an application to have one CMK per area
of secret stuff, for example, for credit card data.  The CMK can be
rotated: each CEK can be represented multiple times in the database,
encrypted by a different CMK.  (The CEK can't be rotated easily, since
that would require reading out all the data from a table/column and
reencrypting it.  We could/should add some custom tooling for that,
but it wouldn't be a routine operation.)

The encryption algorithms are mostly hardcoded right now, but there
are facilities for picking algorithms and adding new ones that will be
expanded.  The CMK process uses RSA-OAEP.  The CEK process uses
AES-128-CBC right now; a more complete solution should probably
involve some HMAC thrown in.

In the server, the encrypted datums are stored in types called
encryptedr and encryptedd (for randomized and deterministic
encryption).  These are essentially cousins of bytea.  For the rest of
the database system below the protocol handling, there is nothing
special about those.  For example, encryptedr has no operators at all,
encryptedd has only an equality operator.  pg_attribute has a new
column attrealtypid that stores the original type of the data in the
column.  This is only used for providing it to clients, so that
higher-level clients can convert the decrypted value to their
appropriate data types in their environments.

Some protocol extensions are required.  These should be guarded by
some _pq_... setting, but this is not done in this patch yet.  As
mentioned above, extra messages are added for sending the CMKs and
CEKs.  In the RowDescription message, I have commandeered the format
field to add a bit that indicates that the field is encrypted.  This
could be made a separate field, and there should probably be
additional fields to indicate the algorithm and CEK name, but this was
easiest for now.  The ParameterDescription message is extended to
contain format fields for each parameter, for the same purpose.
Again, this could be done differently.

Speaking of parameter descriptions, the trickiest part of this whole
thing appears to be how to get transparently encrypted data into the
database (as opposed to reading it out).  It is required to use
protocol-level prepared statements (i.e., extended query) for this.
The client must first prepare a statement, then describe the statement
to get parameter metadata, which indicates which parameters are to be
encrypted and how.  So this will require some care by applications
that want to do this, but, well, they probably should be careful
anyway.  In libpq, the existing APIs make this difficult, because
there is no way to pass the result of a describe-statement call back
into execute-statement-with-parameters.  I added new functions that do
this, so you then essentially do

     res0 = PQdescribePrepared(conn, "");
     res = PQexecPrepared2(conn, "", 2, values, NULL, NULL, 0, res0);

(The name could obviously be improved.)  Other client APIs that have a
"statement handle" concept could do this more elegantly and probably
without any API changes.

Another challenge is that the parse analysis must check which
underlying column a parameter corresponds to.  This is similar to
resorigtbl and resorigcol in the opposite direction.  The current
implementation of this works for the test cases, but I know it has
some problems, so I'll continue working in this.  This functionality
is in principle available to all prepared-statement variants, not only
protocol-level.  So you can see in the tests that I expanded the
pg_prepared_statements view to show this information as well, which
also provides an easy way to test and debug this functionality
independent of column encryption.

And also, psql doesn't use prepared statements, so writing into
encrypted columns currently doesn't work at all via psql.  (Reading
works no problem.)  All the test code currently uses custom libpq C
programs.  We should think about a way to enable prepared statements
in psql, perhaps something like

INSERT INTO t1 VALUES ($1, $2) \gg 'val1' 'val2'

(\gexec and \gx are already taken.)

------

This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
on the direction.  As I mentioned above, a lot of the remaining work
is arguably mostly straightforward.  Some closer examination of the
issues surrounding the libpq API changes and psql would be useful.
Perhaps there are other projects where that kind of functionality
would also be useful.
Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: keepliaves etc. as environment variables
Next
From: Tom Lane
Date:
Subject: Re: Assorted improvements in pg_dump