Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence |
Date | |
Msg-id | CAH2-Wzn_Zx6=iFbbow9xO85M=Av4qaT8DcHW5oM-QMd0_ttCsQ@mail.gmail.com |
In response to | Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence
|
List | pgsql-hackers |
On Sun, Aug 25, 2019 at 1:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I agree that teaching opclasses to say whether this is okay is a
> reasonable approach.

I've begun working on this, with help from Anastasia.

My working assumption is that I only need to care about opclass-declared input data types (pg_opclass.opcintype), plus the corresponding collations -- the former can be used to look up an appropriate pg_amproc entry (i.e. B-Tree support function 4), while the latter are passed to the support function to get an answer about whether or not it's okay to use deduplication. This approach seems to be good enough as far as the deduplication project's needs are concerned. However, I think that I probably need to take a broader view of the problem than that. Any guidance would be much appreciated.

> > Consumers of this new infrastructure probably won't be limited to the
> > deduplication feature;
>
> Indeed, we run up against this sort of thing all the time in, eg, planner
> optimizations. I think some sort of "equality is precise" indicator
> would be really useful for a lot of things.

Suppose I wanted to add support for deduplication of a B-Tree index on an array of integers. This probably wouldn't be very compelling, but just suppose. It's not clear how this could work within the confines of the type and operator class systems.

I can hardly determine that it's safe or unsafe to do so at CREATE INDEX time, since the opclass-declared input data type is always the pg_type.oid corresponding to 'anyarray' -- I am forced to make a generic assumption that deduplication is not safe. I must make this conservative assumption since, in general, the indexed column could turn out to be an array of numeric datums -- a "transitively unsafe" anyarray (numeric's display scale issue could leak into anyarray). I'm not actually worried about any practical downside that this may create for users of the B-Tree deduplication feature; a B-Tree index on an array *is* a pretty niche thing. Does seem like I should make sure that I get this right, though.

Code like the 'anyarray' B-Tree support function 1 (i.e. btarraycmp()/array_cmp()) doesn't hint at a solution -- it merely does a lookup of the underlying type's comparator using the typcache. That depends on having actual anyarray datums to do something with, which isn't something that this new infrastructure can rely on in any way.

I suppose that the only thing that would work here would be to somehow look through the pg_attribute entry for the index column, which will have the details required to distinguish (say) an array of integers (which is safe, I think) from an array of numerics (which is unsafe). From there, the information about the element type could (say) be passed to the anyarray default opclass' support function 4, which could do its own internal lookup. That seems like it might be a solution in search of a problem, though.

BTW, I currently forbid cross-type support function 4 entries for an opclass, on the grounds that that isn't sensible for deduplication. Do you think that that restriction is appropriate in general, given the likelihood that this support function will be used in several other areas?

Thanks
--
Peter Geoghegan
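
To make the anyarray case above concrete, here is a minimal sketch in C of the element-type lookup being floated. It is not PostgreSQL code: every name in it (SketchType, SketchIndexColumn, sketch_scalar_equalimage, sketch_dedup_safe) is hypothetical, and the struct merely stands in for what pg_opclass.opcintype and the index column's pg_attribute entry would supply. The per-type answers follow the discussion (integers assumed safe, numeric unsafe because of display scale); the rule that text is safe only under a deterministic collation is an illustrative assumption, not something stated in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for type OIDs; not real catalog values. */
typedef enum
{
    SK_INT4,
    SK_TEXT,
    SK_NUMERIC,
    SK_ANYARRAY
} SketchType;

/* Hypothetical summary of one index column. */
typedef struct
{
    SketchType  opcintype;      /* opclass-declared input type */
    bool        elemtype_known; /* element type recovered from the column? */
    SketchType  elemtype;       /* element type, valid if elemtype_known */
    bool        collation_is_deterministic; /* stand-in for the collation */
} SketchIndexColumn;

/*
 * Models a per-type "support function 4": does opclass equality imply
 * that equal datums are interchangeable ("equality is precise")?
 */
bool
sketch_scalar_equalimage(SketchType type, bool collation_is_deterministic)
{
    switch (type)
    {
        case SK_INT4:
            return true;    /* assumed safe, per the discussion */
        case SK_TEXT:
            /* Assumption: only deterministic collations are safe. */
            return collation_is_deterministic;
        case SK_NUMERIC:
            return false;   /* equal values can differ in display scale */
        default:
            return false;   /* unknown type: stay conservative */
    }
}

/*
 * Decide whether deduplication is safe for a column. For anyarray, the
 * opclass input type alone says nothing, so the answer is delegated to
 * the element type when it is known, and is "unsafe" otherwise.
 */
bool
sketch_dedup_safe(const SketchIndexColumn *col)
{
    assert(col != NULL);

    if (col->opcintype == SK_ANYARRAY)
    {
        if (!col->elemtype_known)
            return false;   /* the generic conservative assumption */
        return sketch_scalar_equalimage(col->elemtype,
                                        col->collation_is_deterministic);
    }
    return sketch_scalar_equalimage(col->opcintype,
                                    col->collation_is_deterministic);
}
```

With only the opclass input type available, an anyarray column is always reported as unsafe; once the element type is supplied, an integer array is reported as safe and a numeric array stays unsafe, which is the "transitively unsafe" distinction described above.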