Re: Transparent column encryption - Mailing list pgsql-hackers
From | Peter Eisentraut |
---|---|
Subject | Re: Transparent column encryption |
Date | |
Msg-id | 79f08a39-a7da-5157-cef4-378fb60c18f8@enterprisedb.com Whole thread Raw |
In response to | Re: Transparent column encryption (Peter Eisentraut <peter.eisentraut@enterprisedb.com>) |
Responses |
Re: Transparent column encryption
|
List | pgsql-hackers |
Rebased patch, no new functionality. On 29.06.22 01:29, Peter Eisentraut wrote: > Here is a new version of this patch. See also the original description > quoted below. I have done a significant amount of work on this over the > last few months. Some important news include: > > - The cryptography has been improved. It now uses an AEAD scheme, and > for deterministic encryption a proper SIV construction. > > - The OpenSSL-specific parts have been moved to a separate file in > libpq. Non-OpenSSL builds compile and work (without functionality, of > course). > > - libpq handles multiple CEKs and CMKs, including changing keys on the fly. > > - libpq supports a mode to force encryption of certain values. > > - libpq supports a flexible configuration system for looking up CMKs, > including support for external key management systems. > > - psql has a new \gencr command that allows passing in bind parameters > for (potential) encryption. > > - There is some more pg_dump and psql support. > > - The new data types for storing encrypted data have been renamed for > clarity. > > - Various changes to the protocol compared to the previous patch. > > - The patch contains full documentation of the protocol changes, > glossary entries, and more new documentation. > > The major pieces that are still missing are: > > - DDL support for registering keys > > - Protocol versioning or feature flags > > Other than that it's pretty complete in my mind. > > For interested reviewers, I have organized the patch so that you can > start reading it top to bottom: The documentation comes first, then the > tests, then the code changes. Even some feedback on the first or first > two aspects would be valuable to me. > > Old news follows: > > On 03.12.21 22:32, Peter Eisentraut wrote: >> I want to present my proof-of-concept patch for the transparent column >> encryption feature. (Some might also think of it as automatic >> client-side encryption or similar, but I like my name.) This feature >> enables the {automatic,transparent} encryption and decryption of >> particular columns in the client. The data for those columns then >> only ever appears in ciphertext on the server, so it is protected from >> the "prying eyes" of DBAs, sysadmins, cloud operators, etc. The >> canonical use case for this feature is storing credit card numbers >> encrypted, in accordance with PCI DSS, as well as similar situations >> involving social security numbers etc. Of course, you can't do any >> computations with encrypted values on the server, but for these use >> cases, that is not necessary. This feature does support deterministic >> encryption as an alternative to the default randomized encryption, so >> in that mode you can do equality lookups, at the cost of some >> security. >> >> This functionality also exists in other SQL database products, so the >> overall concepts weren't invented by me by any means. >> >> Also, this feature has nothing to do with the on-disk encryption >> feature being contemplated in parallel. Both can exist independently. >> >> The attached patch has all the necessary pieces in place to make this >> work, so you can have an idea how the overall system works. It >> contains some documentation and tests to help illustrate the >> functionality. But it's missing the remaining 90% of the work, >> including additional DDL support, error handling, robust memory >> management, protocol versioning, forward and backward compatibility, >> pg_dump support, psql \d support, refinement of the cryptography, and >> so on. But I think obvious solutions exist to all of those things, so >> it isn't that interesting to focus on them for now. >> >> ------ >> >> Now to the explanation of how it works. >> >> You declare a column as encrypted in a CREATE TABLE statement. The >> column value is encrypted by a symmetric key called the column >> encryption key (CEK). The CEK is a catalog object. The CEK key >> material is in turn encrypted by an assymmetric key called the column >> master key (CMK). The CMK is not stored in the database but somewhere >> where the client can get to it, for example in a file or in a key >> management system. When a server sends rows containing encrypted >> column values to the client, it first sends the required CMK and CEK >> information (new protocol messages), which the client needs to record. >> Then, the client can use this information to automatically decrypt the >> incoming row data and forward it in plaintext to the application. >> >> For the CMKs, the catalog object specifies a "provider" and generic >> options. Right now, libpq has a "file" provider hardcoded, and it >> takes a "filename" option. Via some mechanism to be determined, >> additional providers could be loaded and then talk to key management >> systems via http or whatever. I have left some comments in the libpq >> code where the hook points for this could be. >> >> The general idea would be for an application to have one CMK per area >> of secret stuff, for example, for credit card data. The CMK can be >> rotated: each CEK can be represented multiple times in the database, >> encrypted by a different CMK. (The CEK can't be rotated easily, since >> that would require reading out all the data from a table/column and >> reencrypting it. We could/should add some custom tooling for that, >> but it wouldn't be a routine operation.) >> >> The encryption algorithms are mostly hardcoded right now, but there >> are facilities for picking algorithms and adding new ones that will be >> expanded. The CMK process uses RSA-OAEP. The CEK process uses >> AES-128-CBC right now; a more complete solution should probably >> involve some HMAC thrown in. >> >> In the server, the encrypted datums are stored in types called >> encryptedr and encryptedd (for randomized and deterministic >> encryption). These are essentially cousins of bytea. For the rest of >> the database system below the protocol handling, there is nothing >> special about those. For example, encryptedr has no operators at all, >> encryptedd has only an equality operator. pg_attribute has a new >> column attrealtypid that stores the original type of the data in the >> column. This is only used for providing it to clients, so that >> higher-level clients can convert the decrypted value to their >> appropriate data types in their environments. >> >> Some protocol extensions are required. These should be guarded by >> some _pq_... setting, but this is not done in this patch yet. As >> mentioned above, extra messages are added for sending the CMKs and >> CEKs. In the RowDescription message, I have commandeered the format >> field to add a bit that indicates that the field is encrypted. This >> could be made a separate field, and there should probably be >> additional fields to indicate the algorithm and CEK name, but this was >> easiest for now. The ParameterDescription message is extended to >> contain format fields for each parameter, for the same purpose. >> Again, this could be done differently. >> >> Speaking of parameter descriptions, the trickiest part of this whole >> thing appears to be how to get transparently encrypted data into the >> database (as opposed to reading it out). It is required to use >> protocol-level prepared statements (i.e., extended query) for this. >> The client must first prepare a statement, then describe the statement >> to get parameter metadata, which indicates which parameters are to be >> encrypted and how. So this will require some care by applications >> that want to do this, but, well, they probably should be careful >> anyway. In libpq, the existing APIs make this difficult, because >> there is no way to pass the result of a describe-statement call back >> into execute-statement-with-parameters. I added new functions that do >> this, so you then essentially do >> >> res0 = PQdescribePrepared(conn, ""); >> res = PQexecPrepared2(conn, "", 2, values, NULL, NULL, 0, res0); >> >> (The name could obviously be improved.) Other client APIs that have a >> "statement handle" concept could do this more elegantly and probably >> without any API changes. >> >> Another challenge is that the parse analysis must check which >> underlying column a parameter corresponds to. This is similar to >> resorigtbl and resorigcol in the opposite direction. The current >> implementation of this works for the test cases, but I know it has >> some problems, so I'll continue working in this. This functionality >> is in principle available to all prepared-statement variants, not only >> protocol-level. So you can see in the tests that I expanded the >> pg_prepared_statements view to show this information as well, which >> also provides an easy way to test and debug this functionality >> independent of column encryption. >> >> And also, psql doesn't use prepared statements, so writing into >> encrypted columns currently doesn't work at all via psql. (Reading >> works no problem.) All the test code currently uses custom libpq C >> programs. We should think about a way to enable prepared statements >> in psql, perhaps something like >> >> INSERT INTO t1 VALUES ($1, $2) \gg 'val1' 'val2' >> >> (\gexec and \gx are already taken.) >> >> ------ >> >> This is not targeting PostgreSQL 15. But I'd appreciate some feedback >> on the direction. As I mentioned above, a lot of the remaining work >> is arguably mostly straightforward. Some closer examination of the >> issues surrounding the libpq API changes and psql would be useful. >> Perhaps there are other projects where that kind of functionality >> would also be useful.
Attachment
pgsql-hackers by date: