Re: Unique index on hash of jsonb value - correct solution? - Mailing list pgsql-general

From Albrecht Dreß
Subject Re: Unique index on hash of jsonb value - correct solution?
Date
Msg-id IJDZZMMB.LMTMAQYN.FUFORC4U@HZY2Z3HA.DW4FJGKP.U2BKBLCU
In response to Re: Unique index on hash of jsonb value - correct solution?  (Stephen Frost <sfrost@snowman.net>)
List pgsql-general
Hi,

first, thanks a lot for your input!

On 19.05.20 at 16:16, Stephen Frost wrote:
> That's only going to work up to a certain size for that text and blob value too, of course..

These three columns (stripped-down code, there are more in the table) are a “lazy” import from various python scripts.
The text and bytea (actually a sha256 hash) fields always have a fixed, short size, but the json varies between a
single item and a quite large dict, so extracting the data into columns seems to be cumbersome.
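
For reference, a rough sketch of what this boils down to (table and column names invented, the real table has more
columns); the question was whether a unique index over a hash of the jsonb's text form is safe:

  -- stripped-down sketch, names invented for illustration
  CREATE TABLE import_data (
      label text  NOT NULL,   -- fixed, short size
      blob  bytea NOT NULL,   -- actually a sha256 hash, fixed size
      attrs jsonb NOT NULL    -- anything from a single item to a large dict
  );

  -- the approach under discussion: uniqueness via a hash of the
  -- jsonb's text representation
  CREATE UNIQUE INDEX import_data_attrs_md5_key
      ON import_data (md5(attrs::text));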

> This is looking like it might be some kind of KV store which is generally discouraged.

Well, yes, somehow…

>> "jsonb" uses an internal binary representation that reorders the attributes in a deterministic fashion, so yes, that
>> is guaranteed.
>
> Sure- but the md5() isn't going to run on the jsonb data directly, it'll run on the text representation that's
> returned, and I'm not at all convinced that the project has agreed to make that text form always be canonical and
> identical forever, including across major version upgrades..

Ok, I see.

> Further, there are some other cases to consider- such as: do you think that 0.00001230 is different from 0.0000123?
> Since we'll store numbers in jsonb as numeric, we preserve trailing fractional zeroes.

Good point; currently no problem for me (strings and ints only in the json dicts), but might be a serious issue in the
future.
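
If I understand correctly, a quick check (invented example value) shows the effect on a hash taken over the text form:

  SELECT '{"x": 0.00001230}'::jsonb::text      AS text_form,
         md5('{"x": 0.00001230}'::jsonb::text) AS hash_a,
         md5('{"x": 0.0000123}'::jsonb::text)  AS hash_b;
  -- text_form keeps the trailing zero, so hash_a and hash_b differ
  -- even though the two numbers are numerically equal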

> Overall, if what you need is uniqueness here, I'd strongly reconsider the choice to use jsonb to store this data and
> instead see if you can break the data out into proper columns with a proper unique constraint across them (or across
> some hash of the combination of them that's done in a way that is clear and unambiguous).

Got your point, I hope – probably the clean solution would be a defined, sorted serialisation plus hashing of the json
(and the text and bytea columns), either in the python producers, or in the (already existing) db function called by
the python scripts for inserting data…
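
Roughly along these lines, I guess (names and the separator byte are made up, and it assumes pgcrypto's digest(); the
defined, sorted serialisation of the json would still have to be produced on the python side or inside the function):

  CREATE EXTENSION IF NOT EXISTS pgcrypto;

  CREATE TABLE import_data_v2 (
      label   text  NOT NULL,
      blob    bytea NOT NULL,        -- the sha256 delivered by the scripts
      attrs   jsonb NOT NULL,
      row_key bytea NOT NULL UNIQUE  -- sha256 over the canonical serialisation
  );

  CREATE FUNCTION import_row(p_label text, p_blob bytea,
                             p_attrs jsonb, p_canonical text)
  RETURNS void LANGUAGE sql AS $$
      -- p_canonical is the defined, sorted serialisation built by the caller;
      -- the digest also covers the text and bytea columns
      INSERT INTO import_data_v2 (label, blob, attrs, row_key)
      VALUES (p_label, p_blob, p_attrs,
              digest(p_label || E'\x01' || encode(p_blob, 'hex')
                     || E'\x01' || p_canonical, 'sha256'));
  $$;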

Thanks again,
Albrecht.
