Thread: TSearch2: find a QUERY that does match a single document

TSearch2: find a QUERY that does match a single document

From
"Dmitry Koterov"
Date:
Hello.

TSearch2 allows to search a table of tsvectors by a single tsquery.
I need to solve the reverse problem.

I have a large table of tsquery. I need to find all tsqueries in that table that match a single document tsvector:

CREATE TABLE "test"."test_tsq" (
  "id" SERIAL,
  "q" TSQUERY NOT NULL,
  CONSTRAINT "test_tsq_pkey" PRIMARY KEY("id")
);

insert into test.test_tsq(q)
select to_tsquery(g || 'x' || g) from generate_series(100000, 900000) as g;

explain analyze
select * from test.test_tsq
where to_tsvector('400000x400000') @@ q

This gets a strange explain analyze:

QUERY PLAN
Seq Scan on test_tsq  (cost=0.00..17477.01 rows=800 width=36) (actual time=68.698..181.458 rows=1 loops=1)
  Filter: ('''400000x400000'':1'::tsvector @@ q)
Total runtime: 181.484 ms

No matter if I use GIST index on test_tsq.q or not: the explain analyze result is the same.
So, why "rows=800"? The table contains much more rows...

Re: TSearch2: find a QUERY that does match a single document

From
Oleg Bartunov
Date:
On Fri, 12 Sep 2008, Dmitry Koterov wrote:

> Hello.
>
> TSearch2 allows to search a table of tsvectors by a single tsquery.
> I need to solve the reverse problem.
>
> *I have a large table of tsquery. I need to find all tsqueries in that table
> that match a single document tsvector:
> *
> CREATE TABLE "test"."test_tsq" (
>  "id" SERIAL,
>  "q" TSQUERY NOT NULL,
>  CONSTRAINT "test_tsq_pkey" PRIMARY KEY("id")
> );
>
> insert into test.test_tsq(q)
> select to_tsquery(g || 'x' || g) from generate_series(100000, 900000) as g;
>
> explain analyze
> select * from test.test_tsq
> where to_tsvector('400000x400000') @@ q

why do you need tsvector @@ q ? Much better to use  tsquery = tsquery

test=# explain analyze select * from test_tsq where q = '400000x400000'::tsque>
                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------
  Seq Scan on test_tsq  (cost=0.00..16667.01 rows=1 width=38) (actual time=129.208..341.111 rows=1 loops=1)
    Filter: (q = '''400000x400000'''::tsquery)
  Total runtime: 341.134 ms
(3 rows)

Time: 341.478 ms


>
> This gets a strange explain analyze:
>
> QUERY PLAN
> Seq Scan on test_tsq  (cost=0.00..17477.01 rows=800 width=36) (actual
> time=68.698..181.458 rows=1 loops=1)
>  Filter: ('''400000x400000'':1'::tsvector @@ q)
> Total runtime: 181.484 ms
>
> No matter if I use GIST index on test_tsq.q or not: the explain analyze
> result is the same.
> So, why "rows=800"? The table contains much more rows...

'800' is the number of estimated rows, which is not good, since you got only
1 row.

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: TSearch2: find a QUERY that does match a single document

From
"Dmitry Koterov"
Date:
explain analyze
select * from test.test_tsq
where to_tsvector('400000x400000') @@ q

why do you need tsvector @@ q ? Much better to use  tsquery = tsquery

test=# explain analyze select * from test_tsq where q = '400000x400000'::tsque>
                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on test_tsq  (cost=0.00..16667.01 rows=1 width=38) (actual time=129.208..341.111 rows=1 loops=1)
  Filter: (q = '''400000x400000'''::tsquery)
 Total runtime: 341.134 ms
(3 rows)
M-mmm... Seems your understood me incorrectly.

I have to find NOT queries which are exactly equal to another query, BUT queries which MATCH the GIVEN document. '400000x400000' was a sample only, in real cases it will be 1-2K document.

Here is a more realistic sample:

explain analyze
select * from test.test_tsq
where to_tsvector('
  Here is a real document text. It may be long, 1-2K.
  In this sample it contains a lexem "400000x400000", so there is a tsquery
  in test_tsq.q which matches this document. I need to find all such queries fast.
  Of course, in real cases the document text is unpredictable.
') @@ q

 
QUERY PLAN
Seq Scan on test_tsq  (cost=0.00..17477.01 rows=800 width=36) (actual
time=68.698..181.458 rows=1 loops=1)
 Filter: ('''400000x400000'':1'::tsvector @@ q)
Total runtime: 181.484 ms

'800' is the number of estimated rows, which is not good, since you got only 1 row.
Why 800? The table contains 800000 rows, and seqscan is used. Does it scan the whole table or not? If yes, possibly there is a bug in explain output? (No mater if I create GIST index on test_tsq.q or not, the number of rows is still 800, so it seems to me that GIST index is not used at all.)