Thread: Extending opfamilies for GIN indexes

Extending opfamilies for GIN indexes

From: Tom Lane
I just got annoyed by the fact that contrib/intarray has support for
queries on GIN indexes on integer[] columns, but they only work if you
use the intarray-provided opclass, not the core-provided GIN opclass for
integer[] columns.  In general, of course, two different GIN opclasses
aren't compatible, but here there is precious little reason why not:
the contents of the index are the same both ways, ie, all the individual
integer keys in the arrays.  It would be a real usability improvement,
and would eliminate a foot-gun, if contrib/intarray could somehow be an
extension to the core opclass instead of an independent thing.

It seems to me that this should be possible within the opfamily/opclass
data structure.  Right now, there isn't any real application for
opfamilies for GIN (or GiST) indexes, because both of those AMs pay
attention only to the "default" support procs that are bound into the
opclass for an index.  But that could change.

In particular, only two of the five support procs used by GIN are
actually associated with "the index", in the sense of having some impact
on what's stored in the index: the compare() and extractValue() procs.
The other three are more associated with queries, though they do depend
on having knowledge about the behavior of the compare and extractValue
procs.

So here's what I'm thinking: we could redefine a GIN opclass, per se, as
needing only compare() and extractValue() procs to be bound into it.
The other three procs, as well as the query operators, could be "loose"
in the containing opfamily.  The index AM would choose which set of the
other support procedures to use for a specific query by matching their
amproclefttype/amprocrighttype to the declared input types of the query
operator, much as btree does.

Having done that, contrib/intarray could work by adding "loose"
operators and support procs to the core opfamily for integer[].
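
As a rough illustration of the lookup rule proposed above, here is a minimal Python sketch, not PostgreSQL code. It assumes a toy in-memory opfamily in which the three query-side support procs (extractQuery, consistent, comparePartial) are stored "loose" and selected by matching their left/right types against the query operator's declared input types. The class, its methods, and the function names (core_extract_query, contrib_consistent, and so on) are hypothetical placeholders.

```python
from typing import Dict, NamedTuple, Tuple

# Support procedure numbers for the three query-side GIN procs.
EXTRACT_QUERY = 3
CONSISTENT = 4
COMPARE_PARTIAL = 5


class ProcKey(NamedTuple):
    lefttype: str
    righttype: str
    procnum: int


class OpFamily:
    """Toy model of an opfamily holding "loose" operators and support procs."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Simplification: operators are keyed by name only, so this toy
        # model does not handle overloaded operator names.
        self.operators: Dict[str, Tuple[str, str]] = {}
        self.procs: Dict[ProcKey, str] = {}

    def add_operator(self, opname: str, lefttype: str, righttype: str) -> None:
        self.operators[opname] = (lefttype, righttype)

    def add_proc(self, lefttype: str, righttype: str,
                 procnum: int, funcname: str) -> None:
        self.procs[ProcKey(lefttype, righttype, procnum)] = funcname

    def query_procs_for(self, opname: str) -> Dict[int, str]:
        """Pick the support procs whose types match the operator's inputs."""
        left, right = self.operators[opname]
        return {key.procnum: func for key, func in self.procs.items()
                if (key.lefttype, key.righttype) == (left, right)}


family = OpFamily("array_ops")

# Entries the core system would provide (function names are placeholders).
family.add_operator("@>", "anyarray", "anyarray")
family.add_proc("anyarray", "anyarray", EXTRACT_QUERY, "core_extract_query")
family.add_proc("anyarray", "anyarray", CONSISTENT, "core_consistent")

# Entries a contrib module could add later, with different input types.
family.add_operator("@@", "integer[]", "query_int")
family.add_proc("integer[]", "query_int", EXTRACT_QUERY, "contrib_extract_query")
family.add_proc("integer[]", "query_int", CONSISTENT, "contrib_consistent")
```

In this model, family.query_procs_for("@@") finds only the contrib entries and family.query_procs_for("@>") only the core ones, although both operators would be served by the same index.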

It's possible that this scheme would also make it really useful to have
multiple opclasses within one GIN opfamily; though offhand I'm not sure
of an application for that.  (Right now, the only reason to do that is
if you want to give opclasses for different types the same name, as we
do with the core "array_ops".)

Perhaps the same could be done with GiST, although I'm less sure about
the possible usefulness there.

Comments?

BTW, this idea means that amproc entries would no longer be tightly
associated with specific GIN opclasses, so the contentious patch for
getObjectDescription should indeed get applied.
        regards, tom lane


Re: Extending opfamilies for GIN indexes

From: Tom Lane
I wrote:
> So here's what I'm thinking: we could redefine a GIN opclass, per se, as
> needing only compare() and extractValue() procs to be bound into it.
> The other three procs, as well as the query operators, could be "loose"
> in the containing opfamily.  The index AM would choose which set of the
> other support procedures to use for a specific query by matching their
> amproclefttype/amprocrighttype to the declared input types of the query
> operator, much as btree does.

> Having done that, contrib/intarray could work by adding "loose"
> operators and support procs to the core opfamily for integer[].

Oh, wait a minute: there's a bad restriction there, namely that a
contrib module could only add "loose" operators that had different
declared input types from the ones known to the core opclass.  Otherwise
there'd be a conflict with the contrib module and core needing to insert
similarly-keyed support functions.  This would actually be enough for
contrib/intarray (because the core operator entries are for "anyarray"
not for "integer[]") but it is easy to foresee cases where that wouldn't
be good enough.  Seems like we'd need an additional key column in
pg_amproc to really make this cover all cases.
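
The conflict can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the real catalog code: the unique key is taken to be (family, lefttype, righttype, procnum), and the extra key column is given the made-up name "procset" purely for illustration, as are the function names.

```python
class DuplicateKeyError(Exception):
    """Raised when a row with the same unique key already exists."""


class Catalog:
    """Toy stand-in for pg_amproc with a configurable unique key."""

    def __init__(self, key_columns):
        self.key_columns = tuple(key_columns)
        self.rows = {}

    def insert(self, func, **cols):
        key = tuple(cols[c] for c in self.key_columns)
        if key in self.rows:
            raise DuplicateKeyError(key)
        self.rows[key] = func


def try_insert(catalog, func, **cols):
    """Return True if the row was inserted, False on a key conflict."""
    try:
        catalog.insert(func, **cols)
    except DuplicateKeyError:
        return False
    return True


CURRENT_KEY = ("family", "lefttype", "righttype", "procnum")
# "procset" is a hypothetical name for the additional key column.
EXTENDED_KEY = CURRENT_KEY + ("procset",)

core_row = dict(family="array_ops", lefttype="anyarray",
                righttype="anyarray", procnum=4, procset="core")
# A contrib operator that happens to take the same input types as core.
contrib_row = dict(family="array_ops", lefttype="anyarray",
                   righttype="anyarray", procnum=4, procset="contrib")

current = Catalog(CURRENT_KEY)
core_ok = try_insert(current, "core_consistent", **core_row)
contrib_ok = try_insert(current, "contrib_consistent", **contrib_row)

extended = Catalog(EXTENDED_KEY)
core_ok_ext = try_insert(extended, "core_consistent", **core_row)
contrib_ok_ext = try_insert(extended, "contrib_consistent", **contrib_row)
```

With the current key the second insert is rejected (contrib_ok is False); with the extra column both rows coexist.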
        regards, tom lane


Re: Extending opfamilies for GIN indexes

From: Dimitri Fontaine
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Oh, wait a minute: there's a bad restriction there, namely that a
> contrib module could only add "loose" operators that had different
> declared input types from the ones known to the core opclass.  Otherwise
> there'd be a conflict with the contrib module and core needing to insert
> similarly-keyed support functions.  This would actually be enough for
> contrib/intarray (because the core operator entries are for "anyarray"
> not for "integer[]") but it is easy to foresee cases where that wouldn't
> be good enough.  Seems like we'd need an additional key column in
> pg_amproc to really make this cover all cases.

I would have thought that such a contrib module would then need to offer its own
opfamily and opclasses, and users would have to select the specific opclass
manually, as they do e.g. for text_pattern_ops.  Can't it work that way?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Extending opfamilies for GIN indexes

From: Tom Lane
Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Oh, wait a minute: there's a bad restriction there, namely that a
>> contrib module could only add "loose" operators that had different
>> declared input types from the ones known to the core opclass.

> I would have thought that such contrib would then need to offer their own
> opfamily and opclasses, and users would have to use the specific opclass
> manually like they do e.g. for text_pattern_ops.  Can't it work that way?

I think you missed the point: right now, to use both the core and
intarray operators on an integer[] column, you have to create *two*
GIN indexes, which will have exactly identical contents.  I'm looking
for a way to let intarray extend the core opfamily definition so that
one index can serve.
        regards, tom lane


Re: Extending opfamilies for GIN indexes

From: Robert Haas
On Wed, Jan 19, 2011 at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> writes:
>>> Oh, wait a minute: there's a bad restriction there, namely that a
>>> contrib module could only add "loose" operators that had different
>>> declared input types from the ones known to the core opclass.
>
>> I would have thought that such contrib would then need to offer their own
>> opfamily and opclasses, and users would have to use the specific opclass
>> manually like they do e.g. for text_pattern_ops.  Can't it work that way?
>
> I think you missed the point: right now, to use both the core and
> intarray operators on an integer[] column, you have to create *two*
> GIN indexes, which will have exactly identical contents.  I'm looking
> for a way to let intarray extend the core opfamily definition so that
> one index can serve.

Maybe this is a dumb question, but why not just put whatever stuff
intarray[] adds directly into the core opfamily?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Extending opfamilies for GIN indexes

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jan 19, 2011 at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think you missed the point: right now, to use both the core and
>> intarray operators on an integer[] column, you have to create *two*
>> GIN indexes, which will have exactly identical contents. I'm looking
>> for a way to let intarray extend the core opfamily definition so that
>> one index can serve.

> Maybe this is a dumb question, but why not just put whatever stuff
> intarray[] adds directly into the core opfamily?

AFAICS that means integrating contrib/intarray into core.  Independently
of whether that's a good idea or not, PG is supposed to be an extensible
system, so it would be nice to have a solution that supported add-on
extensions.

The subtext here is that GIN, unlike the other index AMs, uses a
representation that seems pretty amenable to supporting a wide variety
of query types with a single index.  contrib/intarray's "query_int"
operators are not at all like the subset-inclusion-testing operators
that the core opclass supports, and it's not very hard to think of
additional cases that could be of interest to somebody (example: find
all arrays that contain some/all entries within a given integer range).
I think we're going to come up against similar situations over and over
until we find a solution.
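
To make the point concrete, here is a small Python sketch of an inverted index over integer arrays. It is a toy model, not GIN itself: the index maps each integer key to the set of rows containing it, and three different query styles (containment, a simple AND/OR query in the spirit of query_int, and the range example above) are all answered from the same posting lists. All function names and the sample data are made up for illustration.

```python
from collections import defaultdict


def build_index(rows):
    """Map each distinct integer key to the set of row ids containing it."""
    index = defaultdict(set)
    for rowid, arr in rows.items():
        for key in set(arr):
            index[key].add(rowid)
    return dict(index)


def contains_all(index, keys):
    """Rows whose array contains every key (subset-inclusion test)."""
    postings = [index.get(k, set()) for k in keys]
    return set.intersection(*postings) if postings else set()


def and_or(index, must, any_of):
    """Rows containing all of `must` and at least one of `any_of`."""
    any_rows = set()
    for k in any_of:
        any_rows |= index.get(k, set())
    return contains_all(index, must) & any_rows


def any_in_range(index, lo, hi):
    """Rows whose array has at least one entry in the range [lo, hi]."""
    result = set()
    for key, posting in index.items():
        if lo <= key <= hi:
            result |= posting
    return result


rows = {1: [1, 2, 3], 2: [2, 5], 3: [7, 8, 9], 4: [3, 9]}
index = build_index(rows)
```

The index is built once from the individual integers; only the query-side functions differ, which is why a single index could in principle serve operators from both core and an add-on module.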
        regards, tom lane


Re: Extending opfamilies for GIN indexes

From: Robert Haas
On Wed, Jan 19, 2011 at 1:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Jan 19, 2011 at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I think you missed the point: right now, to use both the core and
>>> intarray operators on an integer[] column, you have to create *two*
>>> GIN indexes, which will have exactly identical contents. I'm looking
>>> for a way to let intarray extend the core opfamily definition so that
>>> one index can serve.
>
>> Maybe this is a dumb question, but why not just put whatever stuff
>> intarray[] adds directly into the core opfamily?
>
> AFAICS that means integrating contrib/intarray into core.  Independently
> of whether that's a good idea or not, PG is supposed to be an extensible
> system, so it would be nice to have a solution that supported add-on
> extensions.

Yeah, I'm just wondering if it's worth the effort, especially in view
of a rather large patch queue we seem to have outstanding at the
moment.

> The subtext here is that GIN, unlike the other index AMs, uses a
> representation that seems pretty amenable to supporting a wide variety
> of query types with a single index.  contrib/intarray's "query_int"
> operators are not at all like the subset-inclusion-testing operators
> that the core opclass supports, and it's not very hard to think of
> additional cases that could be of interest to somebody (example: find
> all arrays that contain some/all entries within a given integer range).
> I think we're going to come up against similar situations over and over
> until we find a solution.

Interesting.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Extending opfamilies for GIN indexes

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jan 19, 2011 at 1:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> AFAICS that means integrating contrib/intarray into core.  Independently
>> of whether that's a good idea or not, PG is supposed to be an extensible
>> system, so it would be nice to have a solution that supported add-on
>> extensions.

> Yeah, I'm just wondering if it's worth the effort, especially in view
> of a rather large patch queue we seem to have outstanding at the
> moment.

Oh, maybe we're not on the same page here: I wasn't really proposing
to do this right now, it's more of a TODO item.

Offhand the only reason to do it now would be if we settled on something
that required a layout change in pg_amop/pg_amproc.  Since we already
have one such change in 9.1, getting the additional change done in the
same release would be valuable to reduce the number of distinct cases
for pg_dump and other clients to support.
        regards, tom lane


Re: Extending opfamilies for GIN indexes

From: Dimitri Fontaine
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I think you missed the point: right now, to use both the core and
> intarray operators on an integer[] column, you have to create *two*
> GIN indexes, which will have exactly identical contents.  I'm looking
> for a way to let intarray extend the core opfamily definition so that
> one index can serve.

That I think I understood, but then I mixed opfamily and opclasses
badly.  Let's try again.

For the GIN indexes, we have 2 methods for building the index and 3
others to search it to solve the query.  You're proposing that the 2
former methods would be in the opfamily and the 3 latter in the opclass.

We'd like to be able to use the same index (whose build depends on
the opfamily) for solving different kinds of queries, for which we can
use different traversal and search algorithms; that's the opclass.

So we would want the planner to know that in the GIN case an index built
with any opclass of a given opfamily can help answer a query that would
need any opclass of the opfamily.  Right?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Extending opfamilies for GIN indexes

From: Tom Lane
Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:
> For the GIN indexes, we have 2 methods for building the index and 3
> others to search it to solve the query.  You're proposing that the 2
> former methods would be in the opfamily and the 3 latter in the opclass.

Actually the other way around.  An opclass is the subset of an opfamily
that is tightly bound to an index.  The "build" methods have to be
associatable with an index, so they're part of the index's opclass.
The "query" methods could be loose in the opfamily.

> So we would want the planner to know that in the GIN case an index built
> with any opclass of a given opfamily can help answer a query that would
> need any opclass of the opfamily.  Right?

The planner's not the problem here --- what's missing is the rule for
the index AM to look up the right support functions to call at runtime.

The trick is to associate the proper query support methods with any
given query operator (which'd also be loose in the family, probably).
The existing schema for pg_amop and pg_amproc is built on the assumption
that the amoplefttype/amoprighttype are sufficient for making this
association; but that seems to fall down if we would like to allow
contrib modules to add new query operators that coincidentally take the
same input types as an existing opfamily member.
        regards, tom lane


Re: Extending opfamilies for GIN indexes

From: Dimitri Fontaine
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Actually the other way around.  An opclass is the subset of an opfamily
> that is tightly bound to an index.  The "build" methods have to be
> associatable with an index, so they're part of the index's opclass.
> The "query" methods could be loose in the opfamily.

I had understood your proposal as changing that for GIN.  Thinking again,
now keeping opfamily and opclass as they are today: an opclass is the
code we run to build and scan the index, an opfamily is a way to use the
same index data and code in more contexts than strictly covered by an
opclass.

> The planner's not the problem here --- what's missing is the rule for
> the index AM to look up the right support functions to call at runtime.
>
> The trick is to associate the proper query support methods with any
> given query operator (which'd also be loose in the family, probably).
> The existing schema for pg_amop and pg_amproc is built on the assumption
> that the amoplefttype/amoprighttype are sufficient for making this
> association; but that seems to fall down if we would like to allow
> contrib modules to add new query operators that coincidentally take the
> same input types as an existing opfamily member.

Well, the opfamily machinery allows giving query support to any index
whose opclass is in the family.  That is, the same set of operators is
covered by more than one opclass.

What we want to add is that more than one set of operators can find data
support in more than one "index kind".  But you still want to run
specific search code here.  So it seems to me we shouldn't attack the
problem at the operator left and right type level, but rather model that
we need another level of flexibility, separating somewhat the index data
building and maintaining from the code that's used to access it.

The example that we're working from seems to be covered if we are able to
instruct PostgreSQL that a set of opclasses are "binary coercible"; I
think that's the term here.

Then the idea would be to have PostgreSQL able to figure out that a
given index can be used with any binary coercible opclass, rather than
only the one used to maintain it.  What do you think?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support