Thread: Fixing GIN for empty/null/full-scan cases

Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
I've been thinking about how to fix GIN's assorted corner-case problems,
as has been discussed several times, most recently here:
http://archives.postgresql.org/pgsql-hackers/2010-10/msg00521.php
See also
http://wiki.postgresql.org/wiki/Todo#GIN

There are basically three related issues:

1. GIN doesn't store anything in the index for a NULL item.
2. GIN doesn't store anything in the index for an empty (zero-key) item.
3. GIN can't deal with NULL key values.

(An "item" is a composite value to be indexed, such as a tsvector or
array.  A "key" is an individual indexable value, such as a lexeme or
array element.)

Because of #1 and #2, GIN can't handle full-index scans.  This is not
because the code can't scan all of the index, but because doing so
wouldn't necessarily return a TID for every existing heap row.

We have to fix #1 and #2, and then get rid of the prohibition on
full-index scans, in order to deal with the complaints that appear in our
TODO list.  The problem with NULL key values is somewhat less pressing,
because most GIN-indexable operators are strict enough to not care about
null keys, so we could perhaps just ignore nulls.  But I'm inclined to
think that it'd be best to fix that now while we're whacking the code and
opclass APIs around, rather than probably having to go back for yet
another round later.  A concrete example of a hit we'll take if we don't
index nulls is that array-is-contained-in would have to be treated as a
lossy rather than lossless index search, since there'd be no way to tell
from the index whether the array item contains any nulls (rendering it not
contained in anything).


As far as storage in the index goes, NULL key values seem perfectly simple
to deal with: just allow a null to get stored for the key value of a GIN
index entry.  What we discussed doing to fix #1 and #2 was to store a
"dummy" index entry, rather than no entries at all.  The least complicated
way to do that would be to store a NULL key.  That would mean that,
for example with integer arrays, all three of these item values would have
identical index entries:

    NULL::int[]
    '{}'::int[]
    '{NULL}'::int[]

So any searches that need to yield different answers for these cases
(ie find some but not all of them) would have to be treated as lossy.
That's not necessarily a show-stopper, but I'm inclined to think that
it is worth working a bit harder so that we can distinguish them.

It's already true that GIN requires special-case code to construct its
index entries (look at GinFormTuple).  What I'm thinking we could do
without too much additional ugliness is store either the key value (for
the normal, non-null key case) or an int16 representing a "category":

    1 = null key value
    2 = placeholder for zero-key item
    3 = placeholder for null item

The index entry's IndexTupleHasNulls flag would be sufficient to
distinguish whether a key or a category flag is present, since there
are no other potentially-null values in a GIN index entry.  There is
room in this scheme for more "categories" if we ever need any, though
I can't think of what they'd be.
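
For concreteness, the stored representation could look about like this
(sketch only; the GinNullCategory name and the exact codes are
illustrative, not settled):

    #include "postgres.h"

    /* Illustrative sketch: category codes stored in place of a key value */
    typedef int16 GinNullCategory;

    #define GIN_CAT_NULL_KEY    1       /* null key value */
    #define GIN_CAT_EMPTY_ITEM  2       /* placeholder for zero-key item */
    #define GIN_CAT_NULL_ITEM   3       /* placeholder for null item */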


The other sticky problem is how to extend the GIN opclass support function
API definitions for all this.  I propose the following:

compare(): doesn't really need any changes.  Just as in btree, we only
need to call the key comparison function for non-null keys.  The various
categories of nulls can have hard-wired comparison behavior.
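
For instance, the hard-wired part could amount to something like this
(sketch only; the helper name and argument layout are illustrative):

    #include "postgres.h"
    #include "fmgr.h"

    /*
     * Illustrative sketch: total ordering over entries.  Regular keys sort
     * before the null categories, two entries of the same category compare
     * equal, and only the regular-key case calls the opclass compare().
     */
    static int
    sketch_compare_entries(FmgrInfo *compareFn,
                           Datum a, bool a_isnull, int16 a_category,
                           Datum b, bool b_isnull, int16 b_category)
    {
        if (a_isnull || b_isnull)
        {
            if (!a_isnull)
                return -1;      /* regular key sorts before any null category */
            if (!b_isnull)
                return 1;
            if (a_category == b_category)
                return 0;
            return (a_category < b_category) ? -1 : 1;
        }
        return DatumGetInt32(FunctionCall2(compareFn, a, b));
    }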

extractValue(): needs an extension so it can return null key values.
I propose adding a third argument:

    Datum *extractValue(Datum inputValue, int32 *nkeys,
                        bool **nullFlags)

If the function wants to return any nulls, it has to palloc an array
of nkeys bools and store a pointer to it at *nullFlags.  We can initialize
*nullFlags to NULL (implying no null keys) for backwards compatibility
with existing functions that aren't aware of the third argument.  In the
case of a null item value, we needn't call the function at all, we can
just generate the dummy entry directly.  Zero-key items can be handled
compatibly with the current behavior: if nkeys is returned as zero,
we'll generate a dummy entry instead of generating nothing at all.
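
A sketch of what such an extractValue might look like for integer arrays
(illustrative only; it is written against the conceptual signature above
rather than the real fmgr calling convention, and the function name is
made up):

    #include "postgres.h"
    #include "catalog/pg_type.h"
    #include "utils/array.h"

    /* Illustrative sketch: extract int4 keys, reporting nulls via nullFlags */
    static Datum *
    sketch_extractValue(Datum inputValue, int32 *nkeys, bool **nullFlags)
    {
        ArrayType  *array = DatumGetArrayTypeP(inputValue);
        Datum      *keys;
        bool       *nulls;
        int         nelems;

        deconstruct_array(array, INT4OID, sizeof(int32), true, 'i',
                          &keys, &nulls, &nelems);

        /*
         * If nelems comes back zero, the core code now makes a placeholder
         * entry instead of indexing nothing; nothing special to do here.
         */
        *nkeys = nelems;
        *nullFlags = nulls;     /* palloc'd array, one flag per key */
        return keys;
    }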

extractQuery(): likewise, needs to be able to return null query element
values.  I propose adding a sixth argument:

    Datum *extractQuery(Datum query, int32 *nkeys, StrategyNumber n,
                        bool **pmatch, Pointer **extra_data,
                        bool **nullFlags)

As above, we can initialize *nullFlags to NULL for backwards
compatibility.  We will continue to assume that a null query value means
an unsatisfiable query, so we don't need to be able to call extractQuery
with a null input query.  We'll keep the current convention that returning
nkeys = -1 means an unsatisfiable query while returning zero requests a
full index scan.
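
Sketched against the same conventions (illustrative only; an
overlaps-style operator and a made-up function name are assumed):

    #include "postgres.h"
    #include "access/skey.h"
    #include "catalog/pg_type.h"
    #include "utils/array.h"

    /*
     * Illustrative sketch: extractQuery for an "overlaps" strategy.
     * Returning *nkeys = -1 marks the query unsatisfiable; returning
     * zero keys would instead request a full index scan.
     */
    static Datum *
    sketch_overlap_extractQuery(Datum query, int32 *nkeys, StrategyNumber n,
                                bool **pmatch, Pointer **extra_data,
                                bool **nullFlags)
    {
        ArrayType  *array = DatumGetArrayTypeP(query);
        Datum      *keys;
        bool       *nulls;
        int         nelems;

        deconstruct_array(array, INT4OID, sizeof(int32), true, 'i',
                          &keys, &nulls, &nelems);

        if (nelems == 0)
        {
            *nkeys = -1;        /* nothing can overlap an empty array */
            return NULL;
        }

        *nkeys = nelems;
        *nullFlags = nulls;     /* null elements become null query keys */
        return keys;
    }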

consistent(): needs to be able to deal with possible nulls in the
extracted query values.  I think the best way to deal with this is to pass
through not just the null array but also the extracted datum array,
instead of forcing consistent() to possibly repeat the work of
extractQuery().  So the proposed signature is
    bool consistent(bool check[], StrategyNumber n, Datum query,
                    int32 nkeys, Pointer extra_data[], bool *recheck,
                    Datum query_keys[], bool is_null[])

where the last two arguments are new and will just be ignored by existing
opclasses.  The check[] array will now have IS NOT DISTINCT FROM
semantics: that is, we'll set check[i] true if query_keys[i] is a null and
there's a null key value in the index for the current heap TID.  If the
consistent() function wants to verify that check[i]==true means real
equality, it has to check is_null[i] as well.
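
For illustration, a contains-style consistent() under this definition
might look about like this (sketch only; the function name is made up and
the real thing is reached through fmgr):

    #include "postgres.h"
    #include "access/skey.h"

    /*
     * Illustrative sketch: "contains" (@>) consistent().  check[] has
     * IS NOT DISTINCT FROM semantics, so a true check[i] for a null
     * query key must not be taken as real equality.
     */
    static bool
    sketch_contains_consistent(bool check[], StrategyNumber n, Datum query,
                               int32 nkeys, Pointer extra_data[],
                               bool *recheck,
                               Datum query_keys[], bool is_null[])
    {
        int32       i;

        *recheck = false;       /* element multiplicity doesn't matter for @> */

        for (i = 0; i < nkeys; i++)
        {
            if (!check[i])
                return false;   /* some query element isn't present at all */
            if (is_null[i])
                return false;   /* null "matched" null, not real equality */
        }
        /* nkeys == 0 falls through: an empty query array is contained in anything */
        return true;
    }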

An important point in connection with full-index scans is that it will now
be possible for consistent() to be called with nkeys == 0.  The reason to
do that rather than just assuming the result is TRUE is that we need to
know whether to recheck.  Also, I have noticed that some consistent()
functions assume that there's always at least one TRUE element in check[]
(see ginarrayconsistent for example).  This assumption now fails, since
for a zero-key placeholder entry we will need to call the consistent()
function with no check[] flags set.  Since no such assumption is
sanctioned by the documented spec for consistent(), I don't feel bad about
breaking it.  (Note: for a null-item placeholder, we'll assume the
consistent() result is FALSE without calling it.  Such an index entry
could only give rise to an index hit in a qual-free index scan.)

comparePartial(): I think we can avoid changing the API here.  It should
be okay to assume that null index values never match any form of partial
search.  If we arrange to sort them to the end, we can just stop a partial
scan as soon as we hit a null.

The upshot of the above conventions is that it'll still be the case that
GIN-indexable operators must be strict: they cannot return TRUE if either
directly given argument is NULL.  However, operators that can succeed even
though some element of a composite argument is NULL will work properly now.


At this point you're probably asking yourself "so, how
backwards-compatible is all this?".  Here's my analysis:

* Existing GIN indexes are upwards compatible so far as on-disk storage
goes, but they will of course be missing entries for empty, null, or
null-containing items.  Users who want to do searches that should find
such items will need to reindex after updating to 9.1.

* Existing extractValue() and extractQuery() functions will work as well
(or poorly) as before.  Presumably they either ignore potential null key
values or throw error if they find one.

* Existing consistent() functions will still work as before, if they are
being used in conjunction with existing extractValue() and extractQuery()
functions, since those will never return any null key or query values.
There's a small possibility that one would give a false-positive answer
if called with no true check[] items; but such cases would have failed
outright in 9.0 or before, so no previously-working queries would be
broken.


Comments?
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
Robert Haas
Date:
On Tue, Jan 4, 2011 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> * Existing GIN indexes are upwards compatible so far as on-disk storage
> goes, but they will of course be missing entries for empty, null, or
> null-containing items.  Users who want to do searches that should find
> such items will need to reindex after updating to 9.1.

This is the only part of this proposal that bothers me a little bit.
It would be nice if the system could determine whether a GIN index is
"upgraded from 9.0 or earlier and thus doesn't contain these entries"
- and avoid trying to use the index for these sorts of queries in
cases where it might return wrong answers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jan 4, 2011 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * Existing GIN indexes are upwards compatible so far as on-disk storage
>> goes, but they will of course be missing entries for empty, null, or
>> null-containing items.  Users who want to do searches that should find
>> such items will need to reindex after updating to 9.1.

> This is the only part of this proposal that bothers me a little bit.
> It would be nice if the system could determine whether a GIN index is
> "upgraded from 9.0 or earlier and thus doesn't contain these entries"
> - and avoid trying to use the index for these sorts of queries in
> cases where it might return wrong answers.

I don't think it's really worth the trouble.  The GIN code has been
broken for these types of queries since day one, and yet we've had only
maybe half a dozen complaints about it.  Moreover there's no practical
way to "avoid trying to use the index", since in many cases the fact
that a query requires a full-index scan isn't determinable at plan time.

The best we could really do is throw an error at indexscan start, and
that doesn't seem all that helpful.  But it probably wouldn't take much
code either, if you're satisfied with that answer.  (I'm envisioning
adding a version ID to the GIN metapage and then checking that before
proceeding with a full-index scan.)
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
Robert Haas
Date:
On Tue, Jan 4, 2011 at 4:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jan 4, 2011 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> * Existing GIN indexes are upwards compatible so far as on-disk storage
>>> goes, but they will of course be missing entries for empty, null, or
>>> null-containing items.  Users who want to do searches that should find
>>> such items will need to reindex after updating to 9.1.
>
>> This is the only part of this proposal that bothers me a little bit.
>> It would be nice if the system could determine whether a GIN index is
>> "upgraded from 9.0 or earlier and thus doesn't contain these entries"
>> - and avoid trying to use the index for these sorts of queries in
>> cases where it might return wrong answers.
>
> I don't think it's really worth the trouble.  The GIN code has been
> broken for these types of queries since day one, and yet we've had only
> maybe half a dozen complaints about it.  Moreover there's no practical
> way to "avoid trying to use the index", since in many cases the fact
> that a query requires a full-index scan isn't determinable at plan time.
>
> The best we could really do is throw an error at indexscan start, and
> that doesn't seem all that helpful.  But it probably wouldn't take much
> code either, if you're satisfied with that answer.  (I'm envisioning
> adding a version ID to the GIN metapage and then checking that before
> proceeding with a full-index scan.)

I'd be satisfied with that answer.  It at least makes it a lot more
clear when you've got a problem.  If this were a more common scenario,
I'd probably advocate for a better solution, but the one you propose
seems adequate given the frequency of the problem as you describe it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Fixing GIN for empty/null/full-scan cases

From
Josh Berkus
Date:
On 1/4/11 1:49 PM, Tom Lane wrote:
> I don't think it's really worth the trouble.  The GIN code has been
> broken for these types of queries since day one, and yet we've had only
> maybe half a dozen complaints about it.  Moreover there's no practical
> way to "avoid trying to use the index", since in many cases the fact
> that a query requires a full-index scan isn't determinable at plan time.

Actually, there's been a *lot* of complaining about the GIN issues.
It's just that most of that complaining doesn't reach -hackers.

The common pattern I've seen in our practice and on IRC is:

1) user has GiST indexes
2) user tries converting them to GIN
3) user gets "full index scan" errors
4) user switches back and gives up

I agree that backwards compatibility should not be a priority; it is
sufficient to tell users to reindex.  For one thing, anyone who *is*
using GIN presently will have written their application code to avoid
full index scans.

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com


Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
Attached is a WIP patch along the lines I suggested earlier: this covers
the core code and docs, but doesn't yet do anything to the various
opclass-specific functions.  Testing it, I find that there was a gap in
my previous thinking.  The patch works for cases like

    select ... where arraycol = '{}'

which previously failed for lack of ability to do a full-index scan.
However, it doesn't work for cases like

    select ... where arraycol <@ '{1,2}'

This correctly returns {1}, {2}, and {1,2} --- but it doesn't find {}.
There are empty-item placeholder entries in the index for the latter,
but since the query does contain some real keys, it's not a full-index
scan and thus it doesn't look for the empty-item placeholders.

There are basically two ways we could fix this: with or without the
extractQuery function's cooperation.  Without any help from
extractQuery, it would be necessary for *every* GIN query to include
empty-item placeholders in the search and then see if the consistentFn
would succeed on those rows.  That seems pretty unattractive from an
efficiency standpoint, not to mention that it opens the possibility
I mentioned before of old consistentFns assuming without checking
that there's at least one true check[] entry.

What seems like a better idea is for extractQuery to explicitly tell
us whether or not an empty indexed item can match the query.  In the
terms of this patch, that amounts to including a GIN_CAT_EMPTY_ITEM
placeholder as one of the search targets.  Now, there are two ways
we could do that, too:

1. Explicitly include the EMPTY_ITEM target as part of the Datum[] array
returned by extractQuery.  That seems a bit grotty on a couple of
grounds, one being that we couldn't use just a bool isNull[] auxiliary
array but would have to expose GinNullCategory or something similar to
the opclasses.  Another problem with it is that then the "empty" target
would presumably be included in the check[] array, with the result that
an "empty" heap item would have one check[] value set, which would mess
up simple AND/OR combination logic such as you see in
ginarrayconsistent().

2. Add another output bool parameter to extractQuery that it must set
true (from a default false state) if the query could match with no check
values set.  This would prompt the GIN code to search for EMPTY_ITEM
placeholders, but they'd not be part of the check[] array.

With either one of these approaches, we can avoid the
backwards-compatibility problem of a consistentFn being passed cases
that it's not expecting, because it's not going to see any such cases
without an update to its cohort extractQueryFn.  (In particular, if the
extractQueryFn returns zero keys and doesn't take an additional action
to say that an empty item can match, then we'll treat that as an
unsatisfiable query instead of calling a consistentFn that might
misbehave on such a case.)

I think I like option #2 better.  Comments?

            regards, tom lane



Re: Fixing GIN for empty/null/full-scan cases

From
Euler Taveira de Oliveira
Date:
On 06-01-2011 21:31, Tom Lane wrote:
> I think I like option #2 better.  Comments?
>
+1.


--
Euler Taveira de Oliveira
http://www.timbira.com/


Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
I wrote:
> 2. Add another output bool parameter to extractQuery that it must set
> true (from a default false state) if the query could match with no check
> values set.  This would prompt the GIN code to search for EMPTY_ITEM
> placeholders, but they'd not be part of the check[] array.

On further reflection: if we're going to go this route, we really ought
to take one more step and allow the opclass to demand a full-index scan.
The reason for this is cases like tsvector's NOT operator:
SELECT ... WHERE tsvectorcol @@ '! unwanted'::tsquery

Right now, this will do what it says on the tin if implemented as a
seqscan.  It will fail (silently, I think) if implemented as a GIN index
search.  We didn't use to have any way of making it behave sanely as
an indexsearch, but the mechanisms I'm building now would support doing
this right.

So, instead of just a bool, I'm now proposing adding an int return
argument specified like this:
   searchMode is an output argument that allows extractQuery to specify
   details about how the search will be done.  If *searchMode is set to
   GIN_SEARCH_MODE_DEFAULT (which is the value it is initialized to
   before call), only items that match at least one of the returned
   keys are considered candidate matches.  If *searchMode is set to
   GIN_SEARCH_MODE_INCLUDE_EMPTY, then in addition to items containing
   at least one matching key, items that contain no keys at all are
   considered candidate matches.  (This mode is useful for implementing
   is-subset-of operators, for example.)  If *searchMode is set to
   GIN_SEARCH_MODE_ALL, then all non-null items in the index are
   considered candidate matches, whether they match any of the returned
   keys or not.  (This mode is much slower than the other two choices,
   since it requires scanning essentially the entire index, but it may
   be necessary to implement corner cases correctly.  An operator that
   needs this mode in most cases is probably not a good candidate for a
   GIN operator class.)  The symbols to use for setting this mode are
   defined in access/gin.h.

The default mode is equivalent to what used to happen implicitly, so
this is still backwards-compatible with existing opclasses.

Don't have code to back up this spec yet, but I believe I see how to do
it.
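
The opclass-side usage ought to end up looking roughly like this (sketch
only; it assumes searchMode is tacked on as one more output argument after
nullFlags, and assumes an is-contained-by operator on integer arrays):

    #include "postgres.h"
    #include "access/gin.h"
    #include "access/skey.h"
    #include "catalog/pg_type.h"
    #include "utils/array.h"

    /* Illustrative sketch: extractQuery for <@ under the proposed searchMode spec */
    static Datum *
    sketch_contained_extractQuery(Datum query, int32 *nkeys, StrategyNumber n,
                                  bool **pmatch, Pointer **extra_data,
                                  bool **nullFlags, int32 *searchMode)
    {
        ArrayType  *array = DatumGetArrayTypeP(query);
        Datum      *keys;
        bool       *nulls;
        int         nelems;

        deconstruct_array(array, INT4OID, sizeof(int32), true, 'i',
                          &keys, &nulls, &nelems);

        /*
         * Empty items put no keys in the index at all, yet they are
         * contained in any query value, so ask the scan to consider
         * zero-key placeholder entries as candidates too.
         */
        *searchMode = GIN_SEARCH_MODE_INCLUDE_EMPTY;

        *nkeys = nelems;
        *nullFlags = nulls;
        return keys;
    }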
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
David E. Wheeler
Date:
On Jan 4, 2011, at 3:18 PM, Josh Berkus wrote:

> Actually, there's been a *lot* of complaining about the GIN issues.
> It's just that most of that complaining doesn't reach -hackers.
>
> The common pattern I've seen in our practice and on IRC is:
>
> 1) user has GiST indexes
> 2) user tries converting them to GIN
> 3) user gets "full index scan" errors
> 4) user switches back and gives up

We (PGX) actually have a client who could use this. Tom, if you have patches as you work on this (or, better, a branch in a Git repo), I could do some testing on your client's code with it. It would involve converting from a GiST to a GIN index and then seeing how well the queries fare. Would that be helpful to you?

Best,

David



Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> We (PGX) actually have a client who could use this. Tom, if you have patches as you work on this (or, better, a branch in a Git repo), I could do some testing on your client's code with it. It would involve converting from a GiST to a GIN index and then seeing how well the queries fare. Would that be helpful to you?

Well, actually, I just committed it.  If you want to test, feel free.
Note that right now only the anyarray && <@ @> operators are genuinely
fixed ... I plan to hack on tsearch and contrib pretty soon though.
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
"David E. Wheeler"
Date:
On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:

> Well, actually, I just committed it.  If you want to test, feel free.
> Note that right now only the anyarray && <@ @> operators are genuinely
> fixed ... I plan to hack on tsearch and contrib pretty soon though.

Hrm, the queries I wrote for this sort of thing use intarray:
   WHERE blah @@ '(12|14)'::query_int

That's not done yet though, right?

Best,

David




Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:
>> Well, actually, I just committed it.  If you want to test, feel free.
>> Note that right now only the anyarray && <@ @> operators are genuinely
>> fixed ... I plan to hack on tsearch and contrib pretty soon though.

> Hrm, the queries I wrote for this sort of thing use intarray:

>     WHERE blah @@ '(12|14)'::query_int

> That's not done yet though, right?

Right.  Maybe sometime this weekend.
        regards, tom lane


contrib/intarray (was Re: Fixing GIN for empty/null/full-scan cases)

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:
>> Well, actually, I just committed it.  If you want to test, feel free.
>> Note that right now only the anyarray && <@ @> operators are genuinely
>> fixed ... I plan to hack on tsearch and contrib pretty soon though.

> Hrm, the queries I wrote for this sort of thing use intarray:

I'm going to work on contrib/intarray first (before tsearch etc)
so that you can do whatever testing you want sooner.

One of the things that first got me annoyed about the whole GIN
situation is that intarray's definitions of the <@ and @> operators were
inconsistent with the core operators of the same names.  I believe that
the inconsistency has to go away.  Really the only reason that intarray
has its own versions of these operators at all is that it can be faster
than the generic anyarray versions in core.  There seem to be three ways
in which intarray is simpler/faster than the generic operators:
    * restricted to integer arrays
    * restricted to 1-D arrays
    * doesn't allow nulls in the arrays

The first of these is pretty important from a speed perspective, and
it's also basically free because of the type system: the parser won't
attempt to apply intarray's operators to anything that's not an integer
array.  The second one seems a little more dubious.  AFAICS the code
isn't actually exploiting 1-D-ness anywhere; it always uses
ArrayGetNItems() to compute the array size, for example.  I propose that
we just drop that restriction and let it accept arrays that are
multidimensional, implicitly linearizing the elements in storage order.
(Any created arrays will still be 1-D though.)

The third restriction is a bit harder to decide what to do about.
If we keep it then intarray's <@ etc will throw errors in some cases
where core would not have.  However, dealing with nulls will make the
code significantly uglier and probably slower than it is now; and that's
not work that I'm excited about doing right now anyway.  So what I
propose for the moment is that intarray still have that restriction.
Maybe someone else will feel like fixing it later.

I will however fix the issue described here:
http://archives.postgresql.org/pgsql-bugs/2010-12/msg00032.php
that intarray sometimes throws "nulls not allowed" errors on
arrays that once contained nulls but now don't.  That can be
done with a relatively localized patch --- we just need to look
a little harder when the ARR_HASNULL flag is set.
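
Roughly, the idea is something like this (sketch only, not the actual
patch; the function name is made up):

    #include "postgres.h"
    #include "utils/array.h"

    /*
     * Illustrative sketch: instead of rejecting any array whose ARR_HASNULL
     * flag is set, walk the null bitmap and complain only if a null element
     * is actually present.
     */
    static void
    sketch_check_no_nulls(ArrayType *a)
    {
        if (ARR_HASNULL(a))
        {
            bits8      *bitmap = ARR_NULLBITMAP(a);
            int         nitems = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));
            int         i;

            for (i = 0; i < nitems; i++)
            {
                /* a zero bit in the null bitmap means this element is null */
                if (!(bitmap[i / 8] & (1 << (i % 8))))
                    ereport(ERROR,
                            (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
                             errmsg("array must not contain nulls")));
            }
        }
    }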

Comments, objections?
        regards, tom lane


Re: contrib/intarray (was Re: Fixing GIN for empty/null/full-scan cases)

From
"David E. Wheeler"
Date:
On Jan 8, 2011, at 1:59 PM, Tom Lane wrote:

>> Hrm, the queries I wrote for this sort of thing use intarray:
>
> I'm going to work on contrib/intarray first (before tsearch etc)
> so that you can do whatever testing you want sooner.

No, of course not.

> One of the things that first got me annoyed about the whole GIN
> situation is that intarray's definitions of the <@ and @> operators were
> inconsistent with the core operators of the same names.  I believe that
> the inconsistency has to go away.  Really the only reason that intarray
> has its own versions of these operators at all is that it can be faster
> than the generic anyarray versions in core.  There seem to be three ways
> in which intarray is simpler/faster than the generic operators:
>
>     * restricted to integer arrays
>     * restricted to 1-D arrays
>     * doesn't allow nulls in the arrays

My understanding is that they also perform much better if the values in an integer array are ordered. Does that matter?

Best,

David



Re: contrib/intarray (was Re: Fixing GIN for empty/null/full-scan cases)

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> On Jan 8, 2011, at 1:59 PM, Tom Lane wrote:
>> There seem to be three ways
>> in which intarray is simpler/faster than the generic operators:
>> 
>> * restricted to integer arrays
>> * restricted to 1-D arrays
>> * doesn't allow nulls in the arrays

> My understanding is that they also perform much better if the values in an integer array are ordered. Does that
matter?

Some of the operations sort the array contents as an initial step.  I'm
not sure how much faster they'll be if the array is already ordered,
but in any case they don't *require* presorted input.
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:
>> Well, actually, I just committed it.  If you want to test, feel free.
>> Note that right now only the anyarray && <@ @> operators are genuinely
>> fixed ... I plan to hack on tsearch and contrib pretty soon though.

> Hrm, the queries I wrote for this sort of thing use intarray:
>     WHERE blah @@ '(12|14)'::query_int
> That's not done yet though, right?

intarray is done now, feel free to test ...
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
"David E. Wheeler"
Date:
Tom,

On Jan 8, 2011, at 9:41 PM, Tom Lane wrote:

> "David E. Wheeler" <david@kineticode.com> writes:
>> On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:
>>> Well, actually, I just committed it.  If you want to test, feel free.
>>> Note that right now only the anyarray && <@ @> operators are genuinely
>>> fixed ... I plan to hack on tsearch and contrib pretty soon though.
>
>> Hrm, the queries I wrote for this sort of thing use intarray:
>>    WHERE blah @@ '(12|14)'::query_int
>> That's not done yet though, right?
>
> intarray is done now, feel free to test ...

Thanks, working on it now. I'm restoring a dump from 8.4, but got these errors:

pg_restore: [archiver (db)] Error from TOC entry 3227; 2616 46485 OPERATOR CLASS gin__int_ops postgres
pg_restore: [archiver (db)] could not execute query: ERROR:  function ginarrayextract(anyarray, internal) does not exist
    Command was: CREATE OPERATOR CLASS gin__int_ops
   FOR TYPE integer[] USING gin AS
   STORAGE integer ,
   OPERATOR 3 &&(integer[],int...
pg_restore: [archiver (db)] could not execute query: ERROR:  operator class "contrib.gin__int_ops" does not exist for access method "gin"
    Command was: ALTER OPERATOR CLASS contrib.gin__int_ops USING gin OWNER TO postgres;
pg_restore: [archiver (db)] Error from TOC entry 3254; 3600 16245434 TEXT SEARCH DICTIONARY en_mls_20101103 postgres

Did a signature change or something? Is there something that needs a compatibility interface of some kind?

Thanks,

David



Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> Thanks, working on it now. I'm restoring a dump from 8.4, but got these erors:

> pg_restore: [archiver (db)] Error from TOC entry 3227; 2616 46485 OPERATOR CLASS gin__int_ops postgres
> pg_restore: [archiver (db)] could not execute query: ERROR:  function ginarrayextract(anyarray, internal) does not
exist

> Did a signature change or something?

Yeah.  I think if you just load up the current contrib/intarray first,
you'll be fine (ignore all the object-already-exists errors).

> Is there something that needs a compatibility interface of some kind?

No, what we need is a decent extension package manager ;-)
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
"David E. Wheeler"
Date:
On Jan 12, 2011, at 4:35 PM, Tom Lane wrote:

>> Did a signature change or something?
>
> Yeah.  I think if you just load up the current contrib/intarray first,
> you'll be fine (ignore all the object-already-exists errors).

Oh, from 9.1devel? yeah, okay. Will do that tomorrow (finishing the current load to make sure that there are no other errors I can't ignore, but won't see among all the intarray stuff tomorrow).

>> Is there something that needs a compatibility interface of some kind?
>
> No, what we need is a decent extension package manager ;-)

Yeah. Maybe you can do that this weekend? Or, I dunno, while you “sleep” tonight?

Best,

David




Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> On Jan 12, 2011, at 4:35 PM, Tom Lane wrote:
>> No, what we need is a decent extension package manager ;-)

> Yeah. Maybe you can do that this weekend? Or, I dunno, while you “sleep” tonight?

Supposedly it's in the queue for the upcoming CF :-)
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
Dimitri Fontaine
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> "David E. Wheeler" <david@kineticode.com> writes:
>> On Jan 12, 2011, at 4:35 PM, Tom Lane wrote:
>>> No, what we need is a decent extension package manager ;-)
>
>> Yeah. Maybe you can do that this weekend? Or, I dunno, while you “sleep” tonight?
>
> Supposedly it's in the queue for the upcoming CF :-)

Hehe, and some provisions have been made to support upgrading from 9.0 to
9.1 too:
 http://pgsql.tapoueh.org/extensions/doc/html/extend-extension.html#AEN50748

But that won't solve the dump-from-9.0-and-restore-into-9.1 problem by
itself; the only way I can think of to solve that would be to backpatch
a new feature.  Doing it the old way until you upgrade from 9.1 to a
later release might be our answer here.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Fixing GIN for empty/null/full-scan cases

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Tue, Jan 4, 2011 at 4:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> On Tue, Jan 4, 2011 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> * Existing GIN indexes are upwards compatible so far as on-disk storage
> >>> goes, but they will of course be missing entries for empty, null, or
> >>> null-containing items.  Users who want to do searches that should find
> >>> such items will need to reindex after updating to 9.1.
> >
> >> This is the only part of this proposal that bothers me a little bit.
> >> It would be nice if the system could determine whether a GIN index is
> >> "upgraded from 9.0 or earlier and thus doesn't contain these entries"
> >> - and avoid trying to use the index for these sorts of queries in
> >> cases where it might return wrong answers.
> >
> > I don't think it's really worth the trouble.  The GIN code has been
> > broken for these types of queries since day one, and yet we've had only
> > maybe half a dozen complaints about it.  Moreover there's no practical
> > way to "avoid trying to use the index", since in many cases the fact
> > that a query requires a full-index scan isn't determinable at plan time.
> >
> > The best we could really do is throw an error at indexscan start, and
> > that doesn't seem all that helpful.  But it probably wouldn't take much
> > code either, if you're satisfied with that answer.  (I'm envisioning
> > adding a version ID to the GIN metapage and then checking that before
> > proceeding with a full-index scan.)
> 
> I'd be satisfied with that answer.  It at least makes it a lot more
> clear when you've got a problem.  If this were a more common scenario,
> I'd probably advocate for a better solution, but the one you propose
> seems adequate given the frequency of the problem as you describe it.

What does pg_upgrade need to do about this for 9.1?  Just tell people
they might get a GIN error someday?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> What does pg_upgrade need to do about this for 9.1?

Nothing.  An existing GIN index can still do all the same queries it
could do before.
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
David E. Wheeler
Date:
On Jan 8, 2011, at 9:41 PM, Tom Lane wrote:

> "David E. Wheeler" <david@kineticode.com> writes:
>> On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:
>>> Well, actually, I just committed it.  If you want to test, feel free.
>>> Note that right now only the anyarray && <@ @> operators are genuinely
>>> fixed ... I plan to hack on tsearch and contrib pretty soon though.
>
>> Hrm, the queries I wrote for this sort of thing use intarray:
>>    WHERE blah @@ '(12|14)'::query_int
>> That's not done yet though, right?
>
> intarray is done now, feel free to test ...

Tom,

Well, I regret to say that what I found is…all over the place.

So I have a rather wide table with two GIST indexes, one on an integer[] column and one on a tsvector column. I duped the table and replaced those indexes with GIN indexes. And the results of my testing don't make much sense.

Well, first the good news: I got no NULL-related errors at all. There are a lot of rows with an empty array in the integer[] column. And I got the same results for my queries against the table with the GIN indexes as the one with the GiST indexes. So all that's good.

One of the reasons our client wants GIN for the integer[] column so bad is because recreating the GiST integer[] index is quite painful. Before I duped the table, I was just dropping and recreating the index on the original table. It was great to create the GIN index; for 400K rows, it took 1300 ms. After my initial round of testing, I dropped it and created the GiST index. That ran for…well, *hours*. Four at least, maybe six or seven (I forgot \timing and was letting it run on screen while I did some iOS hacking). I think something might be really wrong with GiST index creation for integer arrays, because the difference was just appalling.

As a sanity check, I did the same thing today with the same table(s) on their ts_vector columns. Creating the GiST index took just under 7 seconds; the GIN index took 23.4 seconds. On a second attempt, GiST took 16371 ms and GIN 30452. I had expected GIN to be faster here, but both are within the realms of the acceptable, compared to the time it took to create the GiST index on the integer[] column.

As for the queries, here too I was surprised. As I said, the integer[] column had a lot of empty arrays. And the indexes look like so:

    "idx_gist_features" gist (features) WHERE deleted_at IS NULL AND status = 1
    "idx_gist_textsearch" gist (ts_index_col) WHERE deleted_at IS NULL AND status = 1

Just s/gist/gin/ for the gin indexes. For the integer[] column, I ran a bunch of queries like so (again, just s/gist/gin/ for the gin versions):

    explain analyze SELECT count(*) FROM gist_listings
     WHERE features @@ '(1369043)'::query_int
       AND deleted_at IS NULL AND mls_status_id = 1;

This integer had pretty high selectivity, 86 out of 406K rows.
Gist: 117.444 on the first run, around 3.2 ms thereafter
GIN:  Around 325 ms on all runs
    explain analyze SELECT count(*) FROM gist_listings
     WHERE features @@ '(1368798|1369043)'::query_int
       AND deleted_at IS NULL AND mls_status_id = 1;

Rows selected: 91528.
Gist: 4030.282 ms on the first run, around 210 ms thereafter.
GIN:  4309.259 ms on the first run, around 400 ms thereafter
    explain analyze SELECT count(*) FROM gist_listings
     WHERE features @@ '(1368799&1368800&1369043)'::query_int
       AND deleted_at IS NULL AND mls_status_id = 1;

Rows selected: 91528
Gist: 1738.568 ms on the first run, around  24 ms thereafter.
GIN:  4427.517 ms on the first run, around 340 ms thereafter

These numbers are a bit crazy-making, but the upshot is that Gist is slow out of the gate, but with data cached, it's pretty speedy. With indexscan and bitmapscan disabled, these queries all took 300-400 ms. So GIN was never better performing than a table scan. So while GIN is a big win for re-indexing (this database gets its intarray Gist indexes updated quite frequently, as they get quite bloated in a hurry), it's not a win at all for querying.

So, thinking that there might be something funky with intarray GIN support, we wanted to test performance of GiST vs. GIN on the tsvector column. Here GIN was a much bigger win. With a query like this:

    SELECT l.* FROM gist_listings
    WHERE ts_index_col @@ to_tsquery(regexp_replace(
        plainto_tsquery('english', '1 Infinite Loop')::text,
        '''(?=[[:space:]]|$)', ''':B', 'g'
    )) and deleted_at IS NULL AND status = 1;

With zero rows returned, GIN consistently executed in 20 ms. Gist took 838.274 for the first run and 25-30 ms on subsequent runs. So GIN is the clear winner here, except for index creation as noted above.

And here's one that selects a single row:

    SELECT l.* FROM gist_listings
    WHERE ts_index_col @@ to_tsquery(regexp_replace(
        plainto_tsquery('english', 'volkswagon')::text,
        '''(?=[[:space:]]|$)', ''':B', 'g'
    )) and deleted_at IS NULL AND status = 1;

GiST: 495.867 first run, 380 thereafter
GIN:  83.980 first run, 330 thereafter

Again, GIN is the clear winner here, though it's negligible when the data is in the cache.

So some questions:

* Is something seriously wrong with GiST index creation on integer[] columns?

* Why does GIN performance appear to be no better than table scans on integer[] columns?

* Why does it take 3-4x longer to create the GIN than the GiST index on tsvector? I thought that GIN was supposed to be faster to update.

Hope this is helpful, and please do let me know if there are any other tests you'd like me to run against this data.

Best,

David

Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> So some questions:

> * Is something seriously wrong with GiST index creation on integer[] columns?

> * Why does GIN performance appear to be no better than table scans on integer[] columns?

> * Why does it take 3-4x longer to create the GIN than the GiST index on tsvector? I thought that GIN was supposed to be faster to update

Hard to comment on any of this without a concrete example (including
data) to look at.  Given the bugs we've recently found in the picksplit
algorithms for other contrib modules, I wouldn't be too surprised if the
sucky GiST performance traced to a similar bug in intarray.  But I'm not
excited about devising my own test case.

One other point here is that GIN index build time is quite sensitive to
maintenance_work_mem --- what did you have that set to?
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
"David E. Wheeler"
Date:
On Jan 14, 2011, at 5:54 PM, Tom Lane wrote:

> "David E. Wheeler" <david@kineticode.com> writes:
>> So some questions:
>
>> * Is something seriously wrong with GiST index creation on integer[] columns?
>
>> * Why does GIN performance appear to be no better than table scans on integer[] columns?
>
>> * Why does it take 3-4x longer to create the GIN than the GiST index on tsvector? I thought that GIN was supposed to be faster to update
>
> Hard to comment on any of this without a concrete example (including
> data) to look at.  Given the bugs we've recently found in the picksplit
> algorithms for other contrib modules, I wouldn't be too surprised if the
> sucky GiST performance traced to a similar bug in intarray.  But I'm not
> excited about devising my own test case.

I could give you access to the box in question if you'd like to poke at it. Send me a public key.

> One other point here is that GIN index build time is quite sensitive to
> maintenance_work_mem --- what did you have that set to?

64MB

Best,

David



Re: Fixing GIN for empty/null/full-scan cases

From
"David E. Wheeler"
Date:
On Jan 14, 2011, at 11:37 PM, David E. Wheeler wrote:

>> Hard to comment on any of this without a concrete example (including
>> data) to look at.  Given the bugs we've recently found in the picksplit
>> algorithms for other contrib modules, I wouldn't be too surprised if the
>> sucky GiST performance traced to a similar bug in intarray.  But I'm not
>> excited about devising my own test case.

FWIW, it looks like we're able to fix the GiST performance by using gist__intbig_ops. Relevant thread:
 http://archives.postgresql.org/pgsql-performance/2009-03/msg00254.php

Perhaps time to revisit using gist__int_ops as the default?

> I could give you access to the box in question if you'd like to poke at it. Send me a public key.
>
>> One other point here is that GIN index build time is quite sensitive to
>> maintenance_work_mem --- what did you have that set to?
>
> 64MB

Best,

David




Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> One of the reasons our client wants GIN for the integer[] column so bad is because recreating the GiST integer[] index is quite painful. Before I duped the table, I was just dropping and recreating the index on the original table. It was great to create the GIN index; for 400K rows, it took 1300 ms. After my initial round of testing, I dropped it and created the GiST index. That ran for…well, *hours*. Four at least, maybe six or seven (I forgot \timing and was letting it run on screen while I did some iOS hacking). I think something might be really wrong with GiST index creation for integer arrays, because the difference was just appalling.

I poked into this a bit, although it's really off-topic for what I was
doing in GIN, and I agree that there is something fundamentally rotten
in the gist__int_ops logic.  oprofile shows that the code is spending
all its time in g_int_compress and g_int_union during index build:

samples  %        app name                 symbol name
52747    40.2486  _int.so                  g_int_compress
27092    20.6726  libc-2.12.2.so           _wordcopy_fwd_aligned
18938    14.4506  _int.so                  compASC
18256    13.9302  postgres                 pg_qsort
5741      4.3807  no-vmlinux               /no-vmlinux
3297      2.5158  postgres                 swapfunc
864       0.6593  _int.so                  g_int_decompress
826       0.6303  _int.so                  _int_unique
700       0.5341  _int.so                  inner_int_union
617       0.4708  postgres                 med3
588       0.4487  libc-2.12.2.so           memset
371       0.2831  _int.so                  g_int_same
143       0.1091  libc-2.12.2.so           memcpy
123       0.0939  postgres                 ArrayGetNItems
67        0.0511  _int.so                  inner_int_inter
48        0.0366  postgres                 AllocSetAlloc
47        0.0359  libc-2.12.2.so           memmove
40        0.0305  postgres                 MemoryContextAllocZero

The time in g_int_compress is all on this loop:

    while (len > MAXNUMRANGE * 2)
    {
        min = 0x7fffffff;
        for (i = 2; i < len; i += 2)
            if (min > (dr[i] - dr[i - 1]))
            {
                min = (dr[i] - dr[i - 1]);
                cand = i;
            }
        memmove((void *) &dr[cand - 1], (void *) &dr[cand + 1],
                (len - cand - 1) * sizeof(int32));
        len -= 2;
    }

which is not too surprising since that's obviously O(N^2).  The other
high-percentage functions are qsort and subsidiaries, and those calls
are coming from g_int_union.  That part could probably be sped up, since
most of the calls are unioning two inputs, which could be done a lot
cheaper than this (the inputs are always presorted no?).  But there is
something really fundamentally wrong in the logic, I think, because
while poking at it with gdb I saw it union-ing fifty-thousand-element
arrays.  Surely it should get lossy before it gets to that.  I'm also
wondering how it tells the difference between lossy and non-lossy keys
--- although it's hard to tell what such spectacularly uncommented code
is supposed to be doing, it looks like the lossy case is supposed to be
pairs of values representing ranges, in which case g_int_compress is
surely doing the Wrong Thing with already-lossy inputs, and union'ing
lossy inputs is an entirely insane operation as well.
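
(For reference, the union of two already-sorted, duplicate-free inputs can
be formed in a single merge pass instead of concatenate-and-qsort; a
sketch, illustrative only and not the existing _int code:)

    #include "postgres.h"

    /*
     * Illustrative sketch: merge-union of two sorted, duplicate-free int32
     * arrays.  "out" must have room for na + nb elements; the number of
     * elements actually written is returned.
     */
    static int
    sketch_sorted_union(const int32 *a, int na,
                        const int32 *b, int nb, int32 *out)
    {
        int         i = 0, j = 0, n = 0;

        while (i < na && j < nb)
        {
            if (a[i] < b[j])
                out[n++] = a[i++];
            else if (a[i] > b[j])
                out[n++] = b[j++];
            else
            {
                out[n++] = a[i];        /* common element: emit once */
                i++;
                j++;
            }
        }
        while (i < na)
            out[n++] = a[i++];
        while (j < nb)
            out[n++] = b[j++];

        return n;
    }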

At the moment my opinion is that gist__int_ops is too broken to be
usable, and it's also too uncommented to be fixable by anyone other
than the original author.
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> These numbers are a bit crazy-making, but the upshot is that Gist is
> slow out of the gate, but with data cached, it's pretty speedy. With
> indexscan and bitmapscan disabled, these queries all took 300-400
> ms. So GIN was never better performing than a table scan.

I could not replicate that here at all --- GIN indexscans were
consistently better than seqscans for me, eg

regression=# set enable_bitmapscan TO 1;
SET
Time: 0.673 ms
regression=# explain analyze SELECT count(*) FROM listings
    WHERE features @@ '(1368799&1368800&1369043)'::query_int
      AND deleted_at IS NULL AND status = 1;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1159.20..1159.21 rows=1 width=0) (actual time=23.964..23.964 rows=1 loops=1)
   ->  Bitmap Heap Scan on listings  (cost=31.15..1158.18 rows=406 width=0) (actual time=23.014..23.876 rows=772 loops=1)
         Recheck Cond: ((features @@ '1368799 & 1368800 & 1369043'::query_int) AND (deleted_at IS NULL) AND (status = 1))
         ->  Bitmap Index Scan on idx_gin_features  (cost=0.00..31.05 rows=406 width=0) (actual time=22.913..22.913 rows=772 loops=1)
               Index Cond: (features @@ '1368799 & 1368800 & 1369043'::query_int)
 Total runtime: 24.040 ms
(6 rows)

Time: 24.968 ms
regression=# set enable_bitmapscan TO 0;
SET
Time: 0.458 ms
regression=# explain analyze SELECT count(*) FROM listings
    WHERE features @@ '(1368799&1368800&1369043)'::query_int
      AND deleted_at IS NULL AND status = 1;
                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=9158.24..9158.25 rows=1 width=0) (actual time=145.121..145.121 rows=1 loops=1)
   ->  Seq Scan on listings  (cost=0.00..9157.22 rows=406 width=0) (actual time=0.025..144.982 rows=772 loops=1)
         Filter: ((deleted_at IS NULL) AND (features @@ '1368799 & 1368800 & 1369043'::query_int) AND (status = 1))
 Total runtime: 145.177 ms
(4 rows)

Time: 146.228 ms

I'm noticing also that I get different rowcounts than you do, although
possibly that has something to do with the partial-index conditions,
which I'm not trying to duplicate here (all rows in my table pass those
two tests).

> * Why does it take 3-4x longer to create the GIN than the GiST index
> on tsvector?

Perhaps more maintenance_work_mem would help with that; although the
fine manual says specifically that GIN text search indexes take about
three times longer to build than equivalent GiST indexes, so maybe that
behavior is as designed.
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
"David E. Wheeler"
Date:
On Jan 18, 2011, at 1:39 PM, Tom Lane wrote:

> At the moment my opinion is that gist__int_ops is too broken to be
> usable, and it's also too uncommented to be fixable by anyone other
> than the original author.

That seems to jibe with your comments from last year:
 http://archives.postgresql.org/pgsql-performance/2009-03/msg00265.php

Best,

David


Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> On Jan 18, 2011, at 1:39 PM, Tom Lane wrote:
>> At the moment my opinion is that gist__int_ops is too broken to be
>> usable, and it's also too uncommented to be fixable by anyone other
>> than the original author.

> That seems to jibe with your comments from last year:
>   http://archives.postgresql.org/pgsql-performance/2009-03/msg00265.php

Well, based on what I saw just now, I believe there are one or more
actual bugs in there, not just that it's straining the design capacity
of the opclass.  I think the "lossy" aspect isn't working at all.
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
"David E. Wheeler"
Date:
On Jan 18, 2011, at 1:58 PM, Tom Lane wrote:

> I'm noticing also that I get different rowcounts than you do, although
> possibly that has something to do with the partial-index conditions,
> which I'm not trying to duplicate here (all rows in my table pass those
> two tests).

Shall I send you data with the other two columns?

>> * Why does it take 3-4x longer to create the GIN than the GiST index
>> on tsvector?
> 
> Perhaps more maintenance_work_mem would help with that; although the
> fine manual says specifically that GIN text search indexes take about
> three times longer to build than equivalent GiST indexes, so maybe that
> behavior is as designed.

Okay then, thanks.

David



Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> On Jan 18, 2011, at 1:58 PM, Tom Lane wrote:
>> I'm noticing also that I get different rowcounts than you do, although
>> possibly that has something to do with the partial-index conditions,
>> which I'm not trying to duplicate here (all rows in my table pass those
>> two tests).

> Shall I send you data with the other two columns?

No, I see no reason to think that has much to do with it.  I'm wondering
if your table is itself a bit bloated ...
        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
"David E. Wheeler"
Date:
On Jan 18, 2011, at 3:08 PM, Tom Lane wrote:

>> Shall I send you data with the other two columns?:
>
> No, I see no reason to think that has much to do with it.  I'm wondering
> if your table is itself a bit bloated ...

Nope. Just ran the bloat query from check_postgres.pl. Bloat is 0. Not surprising since the table was created by inserting from another table (the original with the gist indexes) and has not been updated since.

Any other ideas?

David



Re: Fixing GIN for empty/null/full-scan cases

From
Tom Lane
Date:
I wrote:
> No, I see no reason to think that has much to do with it.  I'm wondering
> if your table is itself a bit bloated ...

Actually ... I notice you did not show EXPLAIN ANALYZE output for your
tests.  Now I'm wondering whether you tested the right thing at all.
I got burnt that way too.  Observe:

regression=# create index idx_gin_features on listings using gin(features) WHERE deleted_at IS NULL AND status = 1;
CREATE INDEX
regression=# explain analyze SELECT count(*) FROM listings
    WHERE features @@ '(1368799&1368800&1369043)'::query_int
      AND deleted_at IS NULL AND status = 1;
                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=9158.24..9158.25 rows=1 width=0) (actual time=153.633..153.634 rows=1 loops=1)
   ->  Seq Scan on listings  (cost=0.00..9157.22 rows=406 width=0) (actual time=0.048..153.493 rows=772 loops=1)
         Filter: ((deleted_at IS NULL) AND (features @@ '1368799 & 1368800 & 1369043'::query_int) AND (status = 1))
 Total runtime: 153.713 ms
(4 rows)

regression=# set enable_seqscan TO 0;
SET
regression=# explain analyze SELECT count(*) FROM listings
    WHERE features @@ '(1368799&1368800&1369043)'::query_int
      AND deleted_at IS NULL AND status = 1;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=13253.42..13253.43 rows=1 width=0) (actual time=331.990..331.990 rows=1 loops=1)
   ->  Bitmap Heap Scan on listings  (cost=4095.18..13252.40 rows=406 width=0) (actual time=164.785..331.858 rows=772 loops=1)
         Recheck Cond: ((deleted_at IS NULL) AND (status = 1))
         Filter: (features @@ '1368799 & 1368800 & 1369043'::query_int)
         ->  Bitmap Index Scan on idx_gin_features  (cost=0.00..4095.07 rows=406215 width=0) (actual time=164.045..164.045 rows=406215 loops=1)
 Total runtime: 332.169 ms
(6 rows)

The above is "using" the index, but only as a guide to where the rows
satisfying the partial-index predicate are --- note the lack of any
index condition in the indexscan node.  That's because the query_int
query is not in fact compatible with the core-provided index opclass.
We get much better results using intarray's gin__int_ops opclass:

regression=# drop index idx_gin_features;
DROP INDEX
regression=# create index idx_gin_features on listings using gin(features gin__int_ops) WHERE deleted_at IS NULL AND status = 1;
CREATE INDEX
regression=# explain analyze SELECT count(*) FROM listings
    WHERE features @@ '(1368799&1368800&1369043)'::query_int
      AND deleted_at IS NULL AND status = 1;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1159.20..1159.21 rows=1 width=0) (actual time=23.896..23.896 rows=1 loops=1)
   ->  Bitmap Heap Scan on listings  (cost=31.15..1158.18 rows=406 width=0) (actual time=22.912..23.813 rows=772 loops=1)
         Recheck Cond: ((features @@ '1368799 & 1368800 & 1369043'::query_int) AND (deleted_at IS NULL) AND (status = 1))
         ->  Bitmap Index Scan on idx_gin_features  (cost=0.00..31.05 rows=406 width=0) (actual time=22.811..22.811 rows=772 loops=1)
               Index Cond: (features @@ '1368799 & 1368800 & 1369043'::query_int)
 Total runtime: 24.036 ms
(6 rows)

        regards, tom lane


Re: Fixing GIN for empty/null/full-scan cases

From
"David E. Wheeler"
Date:
On Jan 18, 2011, at 3:46 PM, Tom Lane wrote:

> The above is "using" the index, but only as a guide to where the rows
> satisfying the partial-index predicate are --- note the lack of any
> index condition in the indexscan node.  That's because the query_int
> query is not in fact compatible with the core-provided index opclass.
> We get much better results using intarray's gin__int_ops opclass:

Ah-hah! Confirmed, thank you. I now get 23.2 ms and an index condition. Much, much better.

Thank you! And I think we can use the improved GIN indexes for this app, once 9.1 drops.

Best,

David