Re: tsearch2, large data and indexes - Mailing list pgsql-performance

From Ivan Voras
Subject Re: tsearch2, large data and indexes
Date
Msg-id CAF-QHFWbee3Anp+KKc5DEe2wufHtWYZ8zc1ysqXu2-0SW+v-UA@mail.gmail.com
Whole thread Raw
In response to Re: tsearch2, large data and indexes  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: tsearch2, large data and indexes
Re: tsearch2, large data and indexes
List pgsql-performance
On 22 April 2014 08:40, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 04/20/2014 02:15 AM, Ivan Voras wrote:
>> More details: after thinking about it some more, it might have
>> something to do with tsearch2 and indexes: the large data in this case
>> is a tsvector, indexed with GIN, and the query plan involves a
>> re-check condition.
>>
>> The query is of the form:
>> SELECT simple_fields FROM table WHERE fts @@ to_tsquery('...').
>>
>> Does the "re-check condition" mean that the original tsvector data is
>> always read from the table in addition to the index?
>
>
> Yes, if the re-check condition involves the fts column. I don't see why you
> would have a re-check condition with a query like that, though. Are there
> some other WHERE-conditions that you didn't show us?

Yes, I've read about tsearch2 and GIN indexes and there shouldn't be a
recheck condition - but there is.
This is the query:

SELECT documents.id, title, raw_data, q, ts_rank(fts_data, q, 4) AS
rank, html_filename
            FROM documents, to_tsquery('document') AS q
            WHERE fts_data @@ q
         ORDER BY rank DESC  LIMIT 25;

And here is the explain analyze: http://explain.depesz.com/s/4xm
It clearly shows a bitmap index scan operation is immediately followed
by a recheck operation AND that the recheck operation actually does
something, because it reduces the number of records from 61 to 58
(!!!).

This is the table structure:

nn=# \d documents
                             Table "public.documents"
    Column     |   Type   |                       Modifiers
---------------+----------+--------------------------------------------------------
 id            | integer  | not null default
nextval('documents_id_seq'::regclass)
 ctime         | integer  | not null default unix_ts(now())
 dtime         | integer  | not null
 title         | text     | not null
 html_filename | text     | not null
 raw_data      | text     |
 fts_data      | tsvector | not null
 tags          | text[]   |
 dtype         | integer  | not null default 0
 flags         | integer  | not null default 0
Indexes:
    "documents_pkey" PRIMARY KEY, btree (id)
    "documents_html_filename" UNIQUE, btree (html_filename)
    "documents_dtime" btree (dtime)
    "documents_fts_data" gin (fts_data)
    "documents_tags" gin (tags)


> The large fields are stored in the toast table. You can check if the toast
> table is accessed with a query like this:
>
> select * from pg_stat_all_tables where relid = (select reltoastrelid from
> pg_class where relname='table');
>
> Run that before and after your query, and see if the numbers change.

Before:


relid|schemaname|relname|seq_scan|seq_tup_read|idx_scan|idx_tup_fetch|n_tup_ins|n_tup_upd|n_tup_del|n_tup_hot_upd|n_live_tup|n_dead_tup|last_vacuum|last_autovacuum|last_analyze|last_autoanalyze|vacuum_count|autovacuum_count|analyze_count|autoanalyze_count
27290|pg_toast|pg_toast_27283|3|0|2481289|10631453|993194|0|266306|0|147931|2514||2014-04-18
00:49:11.066443+02|||0|11|0|0

After:


relid|schemaname|relname|seq_scan|seq_tup_read|idx_scan|idx_tup_fetch|n_tup_ins|n_tup_upd|n_tup_del|n_tup_hot_upd|n_live_tup|n_dead_tup|last_vacuum|last_autovacuum|last_analyze|last_autoanalyze|vacuum_count|autovacuum_count|analyze_count|autoanalyze_count
27290|pg_toast|pg_toast_27283|3|0|2481347|10632814|993194|0|266306|0|147931|2514||2014-04-18
00:49:11.066443+02|||0|11|0|0

idx_scan has changed from 2481289 to 2481347 (58)
idx_tup_fetch has changed from 10631453 to 10632814 (1361)

Number 58 corresponds to the number of rows found by the index, seen
in the EXPLAIN output, I don't know where 1361 comes from.

I'm also surprised by the amount of memory used for sorting (23 kB),
since the actually returned data from my query (all the tuples from
all the 58 rows) amount to around 2 kB - but this is not an actual
problem.


pgsql-performance by date:

Previous
From: Oleg Bartunov
Date:
Subject: Re: Workaround: Planner preference for tsquery filter vs. GIN index in fast text search
Next
From: Souquieres Adam
Date:
Subject: Query on partitioned table not using index