Thread: [MASSMAIL]Storing and comparing columns of cryptographic hashes?
I'm planning to store cryptographic hashes (BLAKE3) in a column of a postgresql table. I'm going to be doing a large number of operations of roughly this form: - Receive a large number of hashes via a web API call. - Check which hashes aren't already in the database. - Send back a bitmap to the user of which hashes they need to send. - Receive data from the user corresponding to those hashes. - Store the data (not in postgresql). - Add the new hashes to the database, along with a tiny value indicating what group of objects they were stored in. A few questions: - Is there a way to tell postgresql "this column contains cryptographic hashes, so you can do hash joins using any subset of the bits, without having to hash them again"? If not, should there be? - Is `bit(256)` the right type to use to store 32-byte hash values with no additional overhead? - What would be the simplest way, given an input array of hashes (which I may have to pass in as an array and use `unnest`), to filter out all the values that already exist, *and* generate a corresponding bitmap in the same order for present/not-present for the entire array (to send back to the user)? Filtering seems easy enough, but generating the bitmap less so. - Does it make more sense to store the values as one row per value, or as one row per group of values? I know that postgresql can store an entire array in one column; could that efficiently support operations like "tell me which of these objects don't exist in any array in this column" or "for all of these objects, tell me all the group-id values for rows containing them"? Thank you, Josh Triplett
On Mon, Apr 8, 2024 at 10:08 AM Josh Triplett <josh@joshtriplett.org> wrote:
- Is there a way to tell postgresql "this column contains cryptographic hashes, so you can do hash joins using any subset of the bits, without having to hash them again"? If not, should there be?
No, and no. (if I understand your question correctly). You could use a functional index, I suppose, but seems premature optimization.
- Is `bit(256)` the right type to use to store 32-byte hash values with no additional overhead?
No, you would want bytea. I would store the value in a TEXT field, unless you really worried about space savings. The hexadecimal value will be far easier to debug and work with, and you can use a simple b-tree index.
- What would be the simplest way, given an input array of hashes (which
I may have to pass in as an array and use `unnest`), to filter out all
the values that already exist, *and* generate a corresponding bitmap
in the same order for present/not-present for the entire array (to
send back to the user)? Filtering seems easy enough, but generating
the bitmap less so.
Something like this:
SELECT array_agg(case when t.bhash is null then 1 else 0 end)
from unnest(array['blakehash1', 'blakehash2', etc...]) as a(x)
left join mytable t on t.bhash = a.x;
- Does it make more sense to store the values as one row per value, or
as one row per group of values?
Hard to answer without knowing more, but I'd lean towards simple and one row per value.
Your proposal (query db, do external work, update db) also sets of lots of concurrency red flags, so be mindful of that.
Cheers,
Greg
On Mon, Apr 8, 2024 at 10:08 AM Josh Triplett <josh@joshtriplett.org> wrote:
- Is there a way to tell postgresql "this column contains cryptographic
hashes, so you can do hash joins using any subset of the bits, without
having to hash them again"? If not, should there be?
if you know the specific subset of the bits from the hash value ahead of time, you can create an index on a function to extract the subset of bits ahead of time by putting them in an index. Consider
create table dd2 (id integer, hash_b bytea); --create example table
insert into dd2 values (1, '/x1234568'::bytea); --throw a record at it
create index on dd2 (substr ( hash_b,1,5)); -- create btree index on subset of value
select * from dd2 WHERE substr(hash_b,1,5) = '/x123'; --this query will use the index once there enough data to justify an index scan
create table dd2 (id integer, hash_b bytea); --create example table
insert into dd2 values (1, '/x1234568'::bytea); --throw a record at it
create index on dd2 (substr ( hash_b,1,5)); -- create btree index on subset of value
select * from dd2 WHERE substr(hash_b,1,5) = '/x123'; --this query will use the index once there enough data to justify an index scan
- Is `bit(256)` the right type to use to store 32-byte hash values with
no additional overhead?
Keep them in the bytea format, can easily cast from bytea to hex for debugging purposes.
select encode(hash_b, 'hex') from dd2
The Larger the data size the slower all operations are best to keep them in same data type instead of casting in and out different data types
- What would be the simplest way, given an input array of hashes (which
I may have to pass in as an array and use `unnest`), to filter out all
the values that already exist, *and* generate a corresponding bitmap
in the same order for present/not-present for the entire array (to
send back to the user)? Filtering seems easy enough, but generating
the bitmap less so.
If you have unique index on the bytea hash, you can unnest the array to a table, then insert values with ON CONFLICT
CREATE unique index on dd2(hash_b);
INSERT into dd2 (hash_b)
(SELECT * from unnest(array['/x1234568','/x12345','/x12346', '/x12347', '/x12348' ]::bytea[]))
ON CONFLICT (hash_b) DO NOTHING
RETURNING hash_b -- this will return the rows that got inserted
Then you can compare the inserted vs the inputted array to know what is present vs not present.
(SELECT * from unnest(array['/x1234568','/x12345','/x12346', '/x12347', '/x12348' ]::bytea[]))
ON CONFLICT (hash_b) DO NOTHING
RETURNING hash_b -- this will return the rows that got inserted
Then you can compare the inserted vs the inputted array to know what is present vs not present.
- Does it make more sense to store the values as one row per value, or
as one row per group of values? I know that postgresql can store an
entire array in one column; could that efficiently support operations
like "tell me which of these objects don't exist in any array in this
column" or "for all of these objects, tell me all the group-id values
for rows containing them"?
Putting them into arrays to keep them grouped together adds additional overhead to unnest for searching or join operations. If the data is going to be large and have to scan over large data sets with arrays containing many hash values thats allot of data processing. Arrays can be indexed but have to use GIN index, which has many drawbacks compared to btree. None of the above queries are possible with GIN indexes or using array columns without a lot more code.
Arrays are not data sets if the design needs to access a specific hash value for update,delete, append new values, an array probably not the best solution.
Hope this helps
Justin