Re: GIN improvements part2: fast scan - Mailing list pgsql-hackers

From Rod Taylor
Subject Re: GIN improvements part2: fast scan
Date
Msg-id CAKddOFCqZw-t0e1aoz7i+de0itVU9x1=bNcn8kv=g_92TK1BWA@mail.gmail.com
Whole thread Raw
In response to Re: GIN improvements part2: fast scan  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
I tried again this morning using gin-packed-postinglists-16.patch and gin-fast-scan.6.patch. No crashes during index building.

Pg was compiled with debug enabled in both cases. The data is a ~0.1% random sample of production data (10,000,000 records for the test) with the below structure.

      Table "public.kp"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | bigint  | not null
 string | text    | not null
 score1 | integer |
 score2 | integer |
 score3 | integer |
 score4 | integer |
Indexes:
    "kp_pkey" PRIMARY KEY, btree (id)
    "kp_string_key" UNIQUE CONSTRAINT, btree (string)
    "textsearch_gin_idx" gin (to_tsvector('simple'::regconfig, string)) WHERE score1 IS NOT NULL



All data is in Pg buffer cache for these timings. Words like "the" and "and" are very common (~9% of entries, each) and a word like "hotel" is much less common (~0.2% of entries). Below is the query tested:

  SELECT id,string
    FROM kp
   WHERE score1 IS NOT NULL
     AND to_tsvector('simple', string) @@ to_tsquery('simple', ?)
             -- ? is substituted with the query strings
ORDER BY score1 DESC, score2 ASC
LIMIT 1000;

 Limit  (cost=56.04..56.04 rows=1 width=37) (actual time=250.010..250.032 rows=142 loops=1)
   ->  Sort  (cost=56.04..56.04 rows=1 width=37) (actual time=250.008..250.017 rows=142 loops=1)
         Sort Key: score1, score2
         Sort Method: quicksort  Memory: 36kB
         ->  Bitmap Heap Scan on kp  (cost=52.01..56.03 rows=1 width=37) (actual time=249.711..249.945 rows=142 loops=1)
               Recheck Cond: ((to_tsvector('simple'::regconfig, string) @@ '''hotel'' & ''and'' & ''the'''::tsquery) AND (score1 IS NOT NULL))
               ->  Bitmap Index Scan on textsearch_gin_idx  (cost=0.00..52.01 rows=1 width=0) (actual time=249.681..249.681 rows=142 loops=1)
                     Index Cond: (to_tsvector('simple'::regconfig, string) @@ '''hotel'' & ''and'' & ''the'''::tsquery)
 Total runtime: 250.096 ms



Times are from \timing on.

MASTER
=======
the:               888.436 ms   926.609 ms   885.502 ms
and:               944.052 ms   937.732 ms   920.050 ms
hotel:              53.992 ms    57.039 ms    65.581 ms
and & the & hotel: 260.308 ms   248.275 ms   248.098 ms

These numbers roughly match what we get with Pg 9.2. The time savings between 'the' and 'and & the & hotel' is mostly heap lookups for the score and the final sort.



The size of the index on disk is about 2% smaller in the patched version.

PATCHED
=======
the:              1055.169 ms 1081.976 ms  1083.021 ms
and:               912.173 ms  949.364 ms   965.261 ms
hotel:              62.591 ms   64.341 ms    62.923 ms
and & the & hotel: 268.577 ms  259.293 ms   257.408 ms
hotel & and & the: 253.574 ms  258.071 ms  250.280 ms

I was hoping that the 'and & the & hotel' case would improve with this patch to be closer to the 'hotel' search, as I thought that was the kind of thing it targeted. Unfortunately, it did not. I actually applied the patches, compiled, initdb/load data, and ran it again thinking I made a mistake.

Reordering the terms 'hotel & and & the' doesn't change the result.



On Fri, Nov 15, 2013 at 1:51 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Fri, Nov 15, 2013 at 3:25 AM, Rod Taylor <rbt@simple-knowledge.com> wrote:
I checked out master and put together a test case using a small percentage of production data for a known problem we have with Pg 9.2 and text search scans.

A small percentage in this case means 10 million records randomly selected; has a few billion records.


Tests ran for master successfully and I recorded timings.



Applied the patch included here to master along with gin-packed-postinglists-14.patch.
Run make clean; ./configure; make; make install.
make check (All 141 tests passed.)

initdb, import dump


The GIN index fails to build with a segfault.

Thanks for testing. See fixed version in thread about packed posting lists.

------
With best regards,
Alexander Korotkov. 

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Database disconnection and switch inside a single bgworker
Next
From: Kevin Grittner
Date:
Subject: Re: strncpy is not a safe version of strcpy