I'm planning to store cryptographic hashes (BLAKE3) in a column of a
postgresql table. I'm going to be doing a large number of operations of
roughly this form:
- Receive a large number of hashes via a web API call.
- Check which hashes aren't already in the database.
- Send back a bitmap to the user of which hashes they need to send.
- Receive data from the user corresponding to those hashes.
- Store the data (not in postgresql).
- Add the new hashes to the database, along with a tiny value indicating
what group of objects they were stored in.
A few questions:
- Is there a way to tell postgresql "this column contains cryptographic
hashes, so you can do hash joins using any subset of the bits, without
having to hash them again"? If not, should there be?
- Is `bit(256)` the right type to use to store 32-byte hash values with
no additional overhead?
- What would be the simplest way, given an input array of hashes (which
I may have to pass in as an array and use `unnest`), to filter out all
the values that already exist, *and* generate a corresponding bitmap
in the same order for present/not-present for the entire array (to
send back to the user)? Filtering seems easy enough, but generating
the bitmap less so.
- Does it make more sense to store the values as one row per value, or
as one row per group of values? I know that postgresql can store an
entire array in one column; could that efficiently support operations
like "tell me which of these objects don't exist in any array in this
column" or "for all of these objects, tell me all the group-id values
for rows containing them"?
Thank you,
Josh Triplett