Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence |
Date | |
Msg-id | CAH2-Wzkmb6WyEUV98UMQ5nvV2cFu5M967zOO2vZ1AcouKD4HWA@mail.gmail.com |
In response to | Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>) |
List | pgsql-hackers |
On Mon, Jan 13, 2020 at 12:49 PM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> In attachment you can find the WIP patch that adds support function for
> btree opclasses.

Cool. Thanks!

> Current version of the patch adds:
>
> 1) new syntax, which allow to provide support function:
>
> CREATE OPERATOR CLASS int4_ops_test
> FOR TYPE int4 USING btree AS
>         OPERATOR 1 =(int4, int4),
>         FUNCTION 1 btint4cmp(int4, int4),
>         SUPPORT datum_image_eqisbitwise;

Hmm. Do we really need these grammar changes? If so, why? I think that
you wanted to make this something that could work with any opclass of
any index access method, but I don't see that as a useful goal. (If it
was useful, it could be considered later, on a case by case basis.)

I imagined that this infrastructure would consist of inventing a new
variety of B-Tree opclass support function -- something like
sortsupport. You could generalize from the example of commit
c6e3ac11b60, which added SortSupport functions (also known as "B-Tree
support function 2" functions). You might also take a look at the much
more recent commit 0a459cec96d, which added in_range functions (also
known as "B-Tree support function 3"). Note that neither of those two
commits had grammar changes for CREATE OPERATOR CLASS, or anything
like that. What I have in mind is a "B-Tree support function 4",
obviously.

You should probably add a C function that's similar to
PrepareSortSupportFromIndexRel()/FinishSortSupportFunction() that will
be called from the B-Tree code. This will give a simple yes/no answer
to the question: "Is it safe to apply deduplication to this Relation"?
This C function will know to return false for an opclass that doesn't
have any support function 4 set for any single attribute. It can
provide a general overview of what we're telling the caller about the
opclass here, etc.
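The yes/no check described above might look roughly like the following
standalone sketch. It is not PostgreSQL code: KeyAttr, image_equal_proc,
generic_image_equal() and index_is_dedup_safe() are hypothetical names
invented for illustration, and a real implementation would look up the
support function through the catalogs rather than through a function
pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one key column of an index. */
typedef struct KeyAttr
{
    const char *opclass_name;       /* label for illustration only */

    /*
     * Hypothetical "support function 4". NULL means the opclass does not
     * provide one. When present, it returns true if opclass equality
     * also implies image equality.
     */
    bool        (*image_equal_proc) (void);
} KeyAttr;

/* Trivial function that simple scalar opclasses could share. */
bool
generic_image_equal(void)
{
    return true;
}

/*
 * Single yes/no answer for the whole index: deduplication is treated as
 * safe only if every key attribute has the support function and that
 * function returns true.
 */
bool
index_is_dedup_safe(const KeyAttr *attrs, size_t nkeyatts)
{
    assert(attrs != NULL || nkeyatts == 0);

    for (size_t i = 0; i < nkeyatts; i++)
    {
        if (attrs[i].image_equal_proc == NULL)
            return false;           /* no support function 4 */
        if (!attrs[i].image_equal_proc())
            return false;           /* opclass says "not image equal" */
    }
    return true;
}
```

In this sketch a single column without the function makes the whole
index ineligible, which matches the "based on all columns" behavior the
nbtree code would expect.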
Another patch could add a similar function that works with a plain
operator, a bit like PrepareSortSupportFromOrderingOp() -- but that
isn't necessary now.

I suppose that this approach requires something a bit like a struct
SortSupportData, with filled-out collation information, etc. The
nbtree code expects a simple yes/no answer based on all columns in the
index, so it will be necessary to serialize that information to send
it across the SQL function interface -- the pg_proc support function
will have one argument of type "internal". And, I suppose that you'll
also need some basic btvalidate() validation code.

> We probably can add more words to specify the purpose of the support
> function.

Right -- some documentation is needed in btree.sgml, alongside the
existing stuff for support functions 1, 2, and 3.

> 2) trivial support function that always returns true
> 'datum_image_eqisbitwise'.
> It is named after 'datum_image_eq', because we define this support
> function via its behavior.

I like the idea of a generic, trivial SQL-callable function that all
simple scalar types can use -- one that just returns true. Maybe we
should call this general class of function an "image_equal" function,
and refer to "image equality" in the btree.sgml docs. I don't think
that using the term "bitwise" is helpful, since it sounds very precise
but is actually slightly inaccurate (since TOASTable datums can be
"image equal", but not bitwise equal according to datumIsEqual()).

How does everyone feel about "image equality" as a name? As I said
before, it seems like a good idea to tie this new infrastructure to
existing infrastructure used for things like REFRESH MATERIALIZED VIEW
CONCURRENTLY.

> If this prototype is fine, I will continue this work and add support
> functions for other opclasses, update pg_dump and documentation.

If this work is structured as a new support function, then it isn't
really a user-visible feature -- you won't need pg_dump support, psql
support, etc. Most of the documentation will be for operator class
authors rather than regular users.

We can document the specific opclasses that have support for
deduplication later, if at all. I think it would be fine if the
deduplication docs (not the docs for this infrastructure) just pointed
out specific cases that we *cannot* support -- there are not many
exceptions (numeric, text with a nondeterministic collation, a few
others like that).

--
Peter Geoghegan
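As a rough illustration of why some types have to be exceptions, the
sketch below shows two values that an equality operator treats as equal
even though their stored bytes differ. It uses IEEE 754 signed zero in a
C double purely as an analogy (this choice is an assumption, not one of
the cases named above), and op_equal() and image_equal() are
hypothetical helpers, not PostgreSQL functions.

```c
#include <stdbool.h>
#include <string.h>

/* Operator-style equality: what a comparison operator reports. */
bool
op_equal(double a, double b)
{
    return a == b;
}

/* Image equality: byte-for-byte comparison of the representation. */
bool
image_equal(double a, double b)
{
    return memcmp(&a, &b, sizeof(double)) == 0;
}
```

On an IEEE 754 platform, op_equal(0.0, -0.0) is true while
image_equal(0.0, -0.0) is false, because the sign bit differs. Merging
such values into one stored copy would lose the distinction, which is
the kind of case where equality is not also equivalence.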