Re: Fixing GIN for empty/null/full-scan cases - Mailing list pgsql-hackers

From David E. Wheeler
Subject Re: Fixing GIN for empty/null/full-scan cases
Date
Msg-id EACA86F3-83D2-431B-8071-5136E28FF90B@kineticode.com
Whole thread Raw
In response to Re: Fixing GIN for empty/null/full-scan cases  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Fixing GIN for empty/null/full-scan cases  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Fixing GIN for empty/null/full-scan cases  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Fixing GIN for empty/null/full-scan cases  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Jan 8, 2011, at 9:41 PM, Tom Lane wrote:

> "David E. Wheeler" <david@kineticode.com> writes:
>> On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:
>>> Well, actually, I just committed it.  If you want to test, feel free.
>>> Note that right now only the anyarray && <@ @> operators are genuinely
>>> fixed ... I plan to hack on tsearch and contrib pretty soon though.
>
>> Hrm, the queries I wrote for this sort of thing use intarray:
>>    WHERE blah @@ '(12|14)'::query_int
>> That's not done yet though, right?
>
> intarray is done now, feel free to test ...

Tom,

Well, I regret to say that what I found is…all over the place.

So I have a rather wide table with two GIST indexes, one on an integer[] column and one on a tsvector column. I duped
thetable and replaced those indexes with GIN indexes. And the results of my testing don't make much sense. 

Well, first the good news: I got no NULL-related errors at all. There are a lot of rows with an empty array in the
integer[]column. And I got the same results for my queries against the table with the GIN indexes as the one with the
GiSTindexes. So all that's good. 

One of the reasons our client wants GIN for the integer[] column so bad is because recreating the GiST integer[] index
isquite painful. Before I duped the table, I was just dropping and recreating the index on the original table. It was
greatto create the GIN index; for 400K rows, it took 1300 ms. After my initial round of testing, I dropped it and
createdthe GiST index. That ran for…well, *hours*. Four at least, maybe six or seven (I forgot \timing and was letting
itrun on screen while I did some iOS hacking). I think something might be really wrong with GiST index creation for
integerarrays, because the difference was just appalling.  

As a sanity check, I did the same thing today with the same table(s) on their ts_vector columns. Creating the GiST
indextook just under 7 seconds; the GIN index took 23.4 seconds. On a second attempt, GiST took 16371 ms and GIN 30452.
Ihad expected GIN to be faster here, but both are within the realms of the acceptable, compared to the time it took to
createthe GiST index on the integer[] column. 

As for the queries, here too I was surprised. As I said, the integer[] column had a lot of empty arrays. And the
indexeslook like so: 
 "idx_gist_features" gist (features) WHERE deleted_at IS NULL AND status = 1 "idx_gist_textsearch" gist (ts_index_col)
WHEREdeleted_at IS NULL AND status = 1 

Just s/gist/gin/ for the gin indexes. For the integer[] column, I ran a bunch of queries like so (again, just
s/gist/gin/for the gin versions): 
   explain analyze SELECT count(*) FROM gist_listings     WHERE features @@ '(1369043)'::query_int       AND deleted_at
ISNULL AND mls_status_id = 1; 

This integer had pretty high selectvity, 86 out of 406K rows.
Gist: 117.444 on the first run, around 3.2 ms thereafter
GIN:  Around 325 ms on all runs
   explain analyze SELECT count(*) FROM gist_listings     WHERE features @@ '(1368798|1369043)'::query_int       AND
deleted_atIS NULL AND mls_status_id = 1; 

Rows selected: 91528.
Gist: 4030.282 ms on the first run, around 210 ms thereafter.
GIN:  4309.259 ms on the first run, around 400 ms thereafter
   explain analyze SELECT count(*) FROM gist_listings     WHERE features @@ '(1368799&1368800&1369043)'::query_int
AND deleted_at IS NULL AND mls_status_id = 1; 

Rows selected: 91528
Gist: 1738.568 ms on the first run, around  24 ms thereafter.
GIN:  4427.517 ms on the first run, around 340 ms thereafter

These numbers are a bit crazy-making, but the upshot is that Gist is slow out of the gate, but with data cached, it's
prettyspeedy. With indexscan and bitmapscan disabled, these queries all took 300-400 ms. So GIN was never better
performingthan a table scan. So while GIN is a big win for re-indexing (this database gets its intarray Gist indexes
updatedquite frequently, as they get quited bloated in a hurry), it's not a win at all for querying. 

So, thinking that there might be something funky with intarray GIN support, we wanted to test performance of GIST vs.
GINon the tsquery column. Here GIN was a much bigger win. With a query like this: 
   SELECT l.* FROM gist_listings    WHERE ts_index_col @@ to_tsquery(regexp_replace(        plainto_tsquery('english',
'1Infinite Loop')::text,        '''(?=[[:space:]]|$)', ''':B', 'g'    )) and deleted_at IS NULL AND status = 1; 

With zero rows returned, GIN consistently executed in 20 ms. Gist took 838.274 for the first run and 25-30 ms on
subsequentruns. So GIN is the clear winner here, except for index creation as noted above. 

And here's one that selects a single row:
   SELECT l.* FROM gist_listings    WHERE ts_index_col @@ to_tsquery(regexp_replace(        plainto_tsquery('english',
'volkswagon')::text,       '''(?=[[:space:]]|$)', ''':B', 'g'    )) and deleted_at IS NULL AND status = 1; 

GiST: 495.867 first run, 380 thereafter
GIN:  83.980 first run, 330 thereafter

Again, GIN is the clear winner here, though it's negligible when the data is in the cache.

So some questions:

* Is something seriously wrong with GiST index creation on integer[] columns?

* Why does GIN performance appear to be no better than table scans on integer[] columns?

* Why does it take 3-4x longer to create the GIN than the GiST index on tsvector? I thought that GIN was supposed to be
fasterto update 

Hope this is helpful, and please do let me know if there are any other tests you'd like me to run against this data.

Best,

David

pgsql-hackers by date:

Previous
From: Euler Taveira de Oliveira
Date:
Subject: Re: Per-column collation, the finale
Next
From: Euler Taveira de Oliveira
Date:
Subject: Re: Named restore points