Transparent column encryption - Mailing list pgsql-hackers
From | Peter Eisentraut |
---|---|
Subject | Transparent column encryption |
Date | |
Msg-id | 89157929-c2b6-817b-6025-8e4b2d89d88f@enterprisedb.com Whole thread Raw |
Responses |
Re: Transparent column encryption
Re: Transparent column encryption Re: Transparent column encryption Re: Transparent column encryption |
List | pgsql-hackers |
I want to present my proof-of-concept patch for the transparent column encryption feature. (Some might also think of it as automatic client-side encryption or similar, but I like my name.) This feature enables the {automatic,transparent} encryption and decryption of particular columns in the client. The data for those columns then only ever appears in ciphertext on the server, so it is protected from the "prying eyes" of DBAs, sysadmins, cloud operators, etc. The canonical use case for this feature is storing credit card numbers encrypted, in accordance with PCI DSS, as well as similar situations involving social security numbers etc. Of course, you can't do any computations with encrypted values on the server, but for these use cases, that is not necessary. This feature does support deterministic encryption as an alternative to the default randomized encryption, so in that mode you can do equality lookups, at the cost of some security. This functionality also exists in other SQL database products, so the overall concepts weren't invented by me by any means. Also, this feature has nothing to do with the on-disk encryption feature being contemplated in parallel. Both can exist independently. The attached patch has all the necessary pieces in place to make this work, so you can have an idea how the overall system works. It contains some documentation and tests to help illustrate the functionality. But it's missing the remaining 90% of the work, including additional DDL support, error handling, robust memory management, protocol versioning, forward and backward compatibility, pg_dump support, psql \d support, refinement of the cryptography, and so on. But I think obvious solutions exist to all of those things, so it isn't that interesting to focus on them for now. ------ Now to the explanation of how it works. You declare a column as encrypted in a CREATE TABLE statement. The column value is encrypted by a symmetric key called the column encryption key (CEK). The CEK is a catalog object. The CEK key material is in turn encrypted by an assymmetric key called the column master key (CMK). The CMK is not stored in the database but somewhere where the client can get to it, for example in a file or in a key management system. When a server sends rows containing encrypted column values to the client, it first sends the required CMK and CEK information (new protocol messages), which the client needs to record. Then, the client can use this information to automatically decrypt the incoming row data and forward it in plaintext to the application. For the CMKs, the catalog object specifies a "provider" and generic options. Right now, libpq has a "file" provider hardcoded, and it takes a "filename" option. Via some mechanism to be determined, additional providers could be loaded and then talk to key management systems via http or whatever. I have left some comments in the libpq code where the hook points for this could be. The general idea would be for an application to have one CMK per area of secret stuff, for example, for credit card data. The CMK can be rotated: each CEK can be represented multiple times in the database, encrypted by a different CMK. (The CEK can't be rotated easily, since that would require reading out all the data from a table/column and reencrypting it. We could/should add some custom tooling for that, but it wouldn't be a routine operation.) The encryption algorithms are mostly hardcoded right now, but there are facilities for picking algorithms and adding new ones that will be expanded. The CMK process uses RSA-OAEP. The CEK process uses AES-128-CBC right now; a more complete solution should probably involve some HMAC thrown in. In the server, the encrypted datums are stored in types called encryptedr and encryptedd (for randomized and deterministic encryption). These are essentially cousins of bytea. For the rest of the database system below the protocol handling, there is nothing special about those. For example, encryptedr has no operators at all, encryptedd has only an equality operator. pg_attribute has a new column attrealtypid that stores the original type of the data in the column. This is only used for providing it to clients, so that higher-level clients can convert the decrypted value to their appropriate data types in their environments. Some protocol extensions are required. These should be guarded by some _pq_... setting, but this is not done in this patch yet. As mentioned above, extra messages are added for sending the CMKs and CEKs. In the RowDescription message, I have commandeered the format field to add a bit that indicates that the field is encrypted. This could be made a separate field, and there should probably be additional fields to indicate the algorithm and CEK name, but this was easiest for now. The ParameterDescription message is extended to contain format fields for each parameter, for the same purpose. Again, this could be done differently. Speaking of parameter descriptions, the trickiest part of this whole thing appears to be how to get transparently encrypted data into the database (as opposed to reading it out). It is required to use protocol-level prepared statements (i.e., extended query) for this. The client must first prepare a statement, then describe the statement to get parameter metadata, which indicates which parameters are to be encrypted and how. So this will require some care by applications that want to do this, but, well, they probably should be careful anyway. In libpq, the existing APIs make this difficult, because there is no way to pass the result of a describe-statement call back into execute-statement-with-parameters. I added new functions that do this, so you then essentially do res0 = PQdescribePrepared(conn, ""); res = PQexecPrepared2(conn, "", 2, values, NULL, NULL, 0, res0); (The name could obviously be improved.) Other client APIs that have a "statement handle" concept could do this more elegantly and probably without any API changes. Another challenge is that the parse analysis must check which underlying column a parameter corresponds to. This is similar to resorigtbl and resorigcol in the opposite direction. The current implementation of this works for the test cases, but I know it has some problems, so I'll continue working in this. This functionality is in principle available to all prepared-statement variants, not only protocol-level. So you can see in the tests that I expanded the pg_prepared_statements view to show this information as well, which also provides an easy way to test and debug this functionality independent of column encryption. And also, psql doesn't use prepared statements, so writing into encrypted columns currently doesn't work at all via psql. (Reading works no problem.) All the test code currently uses custom libpq C programs. We should think about a way to enable prepared statements in psql, perhaps something like INSERT INTO t1 VALUES ($1, $2) \gg 'val1' 'val2' (\gexec and \gx are already taken.) ------ This is not targeting PostgreSQL 15. But I'd appreciate some feedback on the direction. As I mentioned above, a lot of the remaining work is arguably mostly straightforward. Some closer examination of the issues surrounding the libpq API changes and psql would be useful. Perhaps there are other projects where that kind of functionality would also be useful.
Attachment
pgsql-hackers by date: