Thread: GIN index always doing Re-check condition, postgres 9.1

From: Andrey Osenenko
I have a table with roughly 300 000 rows, each row containing two large strings - one is news article HTML, the other is news article plaintext. The table has a bigint primary key.

A GIN index is used to do fulltext search on the plaintext part. All I want to retrieve when I do fulltext search is values of primary key column.

With a popular word, the number of results from a fulltext search query can be pretty high - a query can return 23,000 rows, some return more, and the counts will keep rising as the database continues to grow.

The problem I have is that postgres always does a Re-check condition step for my request. That query with 23k rows takes 20 seconds to execute, and EXPLAIN shows that almost all of that time is spent
re-checking the condition. The second time I run the same query, I get results immediately.
What this means is that every time a user searches for some word no one has searched before, they have to wait a very long time, which is unacceptable for us.

Is this happening by design, or am I doing something wrong? The way I see it, since the docs say GIN indexes are lossless, the database should be able to fetch just the primary key values of matching rows for me.

Here's schema, query, explain and database version:

CREATE TABLE kard_md.fulldata
(
  id_iu bigint NOT NULL,
  url character varying NOT NULL,
  original text,
  edited text,
  plaintext text,
  date timestamp without time zone,
  CONSTRAINT fulldata_pkey PRIMARY KEY (id_iu)
);

CREATE INDEX fulldata_plaintext_idx
  ON kard_md.fulldata
  USING gin
  (to_tsvector('russian'::regconfig, plaintext));


EXPLAIN (ANALYZE, BUFFERS) select id_iu from kard_md.fulldata where to_tsvector('russian',fulldata.plaintext) @@ plainto_tsquery('russian','москва');

1st run:
Bitmap Heap Scan on fulldata  (cost=266.79..39162.57 rows=23069 width=8) (actual time=135.727..19499.667 rows=23132 loops=1)
  Recheck Cond: (to_tsvector('russian'::regconfig, plaintext) @@ '''москв'''::tsquery)
  Buffers: shared hit=115 read=13000
  ->  Bitmap Index Scan on fulldata_plaintext_idx  (cost=0.00..261.02 rows=23069 width=0) (actual time=104.834..104.834 rows=23132 loops=1)
        Index Cond: (to_tsvector('russian'::regconfig, plaintext) @@ '''москв'''::tsquery)
        Buffers: shared hit=3 read=21
Total runtime: 19512.479 ms

2nd run:
Bitmap Heap Scan on fulldata  (cost=266.79..39162.57 rows=23069 width=8) (actual time=25.423..48.649 rows=23132 loops=1)
  Recheck Cond: (to_tsvector('russian'::regconfig, plaintext) @@ '''москв'''::tsquery)
  Buffers: shared hit=13115
  ->  Bitmap Index Scan on fulldata_plaintext_idx  (cost=0.00..261.02 rows=23069 width=0) (actual time=18.057..18.057 rows=23132 loops=1)
        Index Cond: (to_tsvector('russian'::regconfig, plaintext) @@ '''москв'''::tsquery)
        Buffers: shared hit=24
Total runtime: 49.612 ms

select version()
'PostgreSQL 9.1.15, compiled by Visual C++ build 1500, 64-bit'


Re: GIN index always doing Re-check condition, postgres 9.1

From: Jeff Janes
On Sun, Nov 1, 2015 at 10:52 PM, Andrey Osenenko
<andrey.osenenko@gmail.com> wrote:
> I have a table with roughly 300 000 rows, each row containing two large
> strings - one is news article HTML, the other is news article plaintext. The
> table has a bigint primary key.
>
> A GIN index is used to do fulltext search on the plaintext part. All I want
> to retrieve when I do fulltext search is values of primary key column.
>
> With a popular word, the amount of results from fulltext search query can be
> pretty high - a query can return 23 000 rows and some can more, and will
> return more as the database continues to grow.
>
> The problem I have is that postgres always does Re-check condition step for
> my request. That query with 23k rows takes 20 seconds to execute, and
> EXPLAIN shows that almost all of that time is spent
> re-checking condition.

Explain does not address the issue of how much time was spent doing
rechecks.  You are misinterpreting something.

> The second time I run the same query, I get results
> immediately.

That suggests the time is spent reading data from disk the first time,
not spent doing rechecks.  Rechecks do not get faster by repeated
execution, except indirectly to the extent the data has already been
pulled into memory.  But other things get faster due to that, as well.

Now those are not mutually exclusive, as doing a recheck might lead to
reading toast tables that don't need to get read at all in the absence
of rechecks.  So rechecks can lead to IO problems.  But there is no
evidence that that is the case for you.

>
> 1st run:
> Bitmap Heap Scan on fulldata  (cost=266.79..39162.57 rows=23069 width=8)
> (actual time=135.727..19499.667 rows=23132 loops=1)
>   Recheck Cond: (to_tsvector('russian'::regconfig, plaintext) @@
> '''москв'''::tsquery)

This tells you what condition will be applied to the recheck, in case
a recheck is needed due to bitmap memory overflow.  It doesn't tell
how many times, if any, that was actually done, or how much time was
spent doing it.

As far as I know, there is no way to distinguish a "lossy index"
recheck from a "lossy bitmap" recheck in version 9.1.

>   Buffers: shared hit=115 read=13000

That you needed only 13115 blocks to deliver 23069 rows tells me that
there is little if any recheck-driven toast table access going on.
That the second execution was very fast tells me that there is little
rechecking at all going on, because actual rechecking is CPU
intensive.

I don't think your problem has anything to do with rechecking.  You
simply have too much data that is not in memory.  You need more
memory, or some way to keep your memory pinned with what you need.  If
you are on a RAID, you could also increase effective_io_concurrency,
which lets the bitmap scan prefetch table blocks.
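
For what it's worth, effective_io_concurrency can be set per session or per database; the value below is only an illustration (the database name is made up), and it should be tuned to the number of drives in the array:

```sql
-- Illustrative only: tune the value to your RAID's effective drive count.
SET effective_io_concurrency = 8;                       -- current session
ALTER DATABASE mydb SET effective_io_concurrency = 8;   -- assumed db name
```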

Cheers,

Jeff


Re: GIN index always doing Re-check condition, postgres 9.1

From: Andrey Osenenko
Thank you.

That's really sad news. It means that even though there is an index that lets you find the rows you want almost immediately, retrieving just their primary keys still requires a lot of disk I/O.

I created a new table that contains only the primary key and the tsvector value, and (at least that's how I'm interpreting it) since there is less data to read per row, it returns the same results about twice as quickly (I restarted the computer to make sure nothing was in memory).

Original table:
Bitmap Heap Scan on fulldata  (cost=266.79..39162.57 rows=23069 width=8) (actual time=113.472..18368.769 rows=23132 loops=1)
  Recheck Cond: (to_tsvector('russian'::regconfig, plaintext) @@ '''москв'''::tsquery)
  Buffers: shared hit=1 read=13114
  ->  Bitmap Index Scan on fulldata_plaintext_idx  (cost=0.00..261.02 rows=23069 width=0) (actual time=90.859..90.859 rows=23132 loops=1)
        Index Cond: (to_tsvector('russian'::regconfig, plaintext) @@ '''москв'''::tsquery)
        Buffers: shared hit=1 read=23
Total runtime: 18425.265 ms

Table with only key and vector:
Bitmap Heap Scan on fts  (cost=273.67..27903.61 rows=23441 width=8) (actual time=219.896..10095.159 rows=23132 loops=1)
  Recheck Cond: (vector @@ '''москв'''::tsquery)
  Buffers: shared hit=1 read=10877
  ->  Bitmap Index Scan on fts_vector_idx  (cost=0.00..267.81 rows=23441 width=0) (actual time=204.631..204.631 rows=23132 loops=1)
        Index Cond: (vector @@ '''москв'''::tsquery)
        Buffers: shared hit=1 read=23
Total runtime: 10103.858 ms

It also looks like if there were a way to create a table with just the primary key and add an index to it that indexes data from another table, it would work much, much faster, since there would be very little to read from disk after the index lookup. But it looks like there isn't.

So am I correct in the assumption that as the number of rows grows, query times for rows that are not in memory (and considering how many of them there are, most won't be) will grow linearly?


Re: GIN index always doing Re-check condition, postgres 9.1

From: Jim Nasby
On 11/2/15 2:19 AM, Andrey Osenenko wrote:
> It also looks like if there was a way to create a table with just
> primary key and add an index to it that indexes data from another table,
> it would work much, much faster since there would be very little to read
> from disk after index lookup. But looks like there isn't.

That probably wouldn't help as much as you'd hope, because heap tuples
in Postgres have a minimum 24-byte overhead. Add in 8 bytes for the
bigint and that's 32 bytes extra per row.
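
If you want to see what the wide rows actually cost, pg_column_size and pg_total_relation_size can measure it directly (table name taken from the schema earlier in the thread):

```sql
-- Rough per-row footprint of a few rows:
SELECT pg_column_size(t.*) AS row_bytes
FROM kard_md.fulldata t
LIMIT 5;

-- Total on-disk size of the table, including toast and indexes:
SELECT pg_size_pretty(pg_total_relation_size('kard_md.fulldata'));
```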

I think what might gain you more is moving to 9.2 and getting
index-only scans. Though, if you're getting lossy results, I don't
think that'll help [1].

> So am I correct in assumption that as the amount of rows grows, query
> times for rows that are not in memory (and considering how many of them
> there are, most won't be) will grow linearly?

Maybe, maybe not. Query times for data that has to come from the disk
can vary wildly based on what other activity is happening on the IO
system. Ultimately, your IO system can only do so many IOs Per Second.

[1]
https://wiki.postgresql.org/wiki/Index-only_scans#Index-only_scans_and_index-access_methods
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: GIN index always doing Re-check condition, postgres 9.1

From: Jeff Janes
On Mon, Nov 2, 2015 at 12:19 AM, Andrey Osenenko
<andrey.osenenko@gmail.com> wrote:
>
> It also looks like if there was a way to create a table with just primary
> key and add an index to it that indexes data from another table, it would
> work much, much faster since there would be very little to read from disk
> after index lookup. But looks like there isn't.

There is a way to do this, but it is pretty gross.

You can define a function which takes the primary key as input and
returns the data to index.  Mark the function as immutable, even
though it isn't.  Something like:

CREATE OR REPLACE FUNCTION public.filler_by_aid(integer)
 RETURNS text
 LANGUAGE sql
 IMMUTABLE
AS $function$ select filler::text from pgbench_accounts where aid=$1 $function$;

Then create a table which has just the primary key, and create a
functional index on that table

create table foobar as select aid from pgbench_accounts;
create index on foobar (filler_by_aid(aid));

Now you can query the skinny table by reference to the data in the wide table:

explain analyze select count(*) from foobar where filler_by_aid(aid)='zebra';

Since you fibbed to PostgreSQL about the function's immutability, it is
pretty easy to get a corrupt index here.  Every time the parent is
updated, you have to be sure to delete and reinsert the primary key in
the corresponding skinny table, otherwise it will not reflect the
updated value.

What you gain in the skinny table you could very well lose with the
triggers needed to maintain it.  Not to mention the fragility.
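
A minimal sketch of that maintenance, assuming the pgbench tables from the example above (untested; it handles only inserts and updates, and ignores deletes and concurrency):

```sql
-- Hypothetical sync trigger: delete and reinsert the key in the skinny
-- table so its functional index re-evaluates the "immutable" function.
CREATE OR REPLACE FUNCTION sync_foobar() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  DELETE FROM foobar WHERE aid = NEW.aid;
  INSERT INTO foobar (aid) VALUES (NEW.aid);
  RETURN NEW;
END $$;

CREATE TRIGGER pgbench_accounts_sync
  AFTER INSERT OR UPDATE OF filler ON pgbench_accounts
  FOR EACH ROW EXECUTE PROCEDURE sync_foobar();
```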


It would be simpler if you could just force the wide data to always be
toasted, even if it is not wide enough to trigger the default toast
threshold.  You could get a skinnier table (although not quite as
skinny as one with only a single column), without having to do
unsupported hacks.  (I am assuming here, without real evidence other
than intuition, that most of your news articles are in fact coming in
under the toast threshold).


> So am I correct in assumption that as the amount of rows grows, query times
> for rows that are not in memory (and considering how many of them there are,
> most won't be) will grow linearly?

Yep.

What you really need are index only scans.  But those are not
supported by gin indexes, and given the gin index structure it seems
unlikely they will ever support index-only scans, at least not in a
way that would help you.

What are you going to do with these 23,000 primary keys once you get
them, anyway?  Perhaps you can push that analysis into the database
and gain some efficiencies there.

Or you could change your data structure.  If all you are doing is
searching for one tsvector token at a time, you could unnest the
tsvector and store it in a table like (ts_token text, id_iu bigint).
Then build a regular btree index on (ts_token, id_iu) and get
index-only scans (once you upgrade from 9.1 to something newer).
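
That layout might look something like this (table and index names are made up; note that unnest(tsvector) itself only appeared in PostgreSQL 9.6, so on 9.1 the token table would have to be populated some other way):

```sql
-- Sketch of the unnested token table, using the thread's schema:
CREATE TABLE fulldata_tokens AS
  SELECT t.lexeme AS ts_token, f.id_iu
  FROM kard_md.fulldata f,
       unnest(to_tsvector('russian', f.plaintext)) AS t;

CREATE INDEX ON fulldata_tokens (ts_token, id_iu);

-- With the btree covering both columns, this can be an index-only scan:
SELECT id_iu FROM fulldata_tokens WHERE ts_token = 'москв';
```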

Cheers,

Jeff


Re: GIN index always doing Re-check condition, postgres 9.1

From: Andrey Osenenko
Jeff Janes:

That is such a beautiful little trick. I made a table with just ids, and a query against it reads almost 10 times fewer buffers (as reported by EXPLAIN (ANALYZE, BUFFERS)), and sure enough, after another reboot, the query executes about 10 times faster.

I'm not doing anything special with those results. I have a main table "core" with various information about entries. Some entries have plaintexts attached, and those are stored in the additional table "fulldata". fulldata's primary key refers to core's primary key, and to do a fulltext search that filters results on core's other fields I have to retrieve primary keys from fulldata.
We tried many different ways to join rows from fulldata and core for that query, and ended up with something along the lines of:

where core.id_iu in (with ids as(select id_iu from fulldata where <fulltext condition here>) select * from ids) and <other core conditions here>

It was just as fast/slow as table joins and subqueries, but it always used the fulltext index no matter what the planner had in mind.
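
Spelled out with the fulltext condition from earlier in the thread (the core table's columns and the extra filter are assumed purely for illustration):

```sql
SELECT core.id_iu
FROM kard_md.core
WHERE core.id_iu IN (
        WITH ids AS (
            SELECT id_iu
            FROM kard_md.fulldata
            WHERE to_tsvector('russian', plaintext)
                  @@ plainto_tsquery('russian', 'москва')
        )
        SELECT * FROM ids)
  AND core.date > now() - interval '30 days';  -- assumed extra condition
```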

I'll be sure to play around with the fake immutable function, and I think it might even be worth it to add the index to core instead of the thin table.

Thanks!

Jim Nasby:
Well, it's 32 bytes per row vs only God knows how many bytes per row, since every row contains a tsvector value (although since practice shows 10 times fewer buffers read, it's probably somewhere around 320 bytes on average?).