Re: tsearch2: ts_headline performance problem - Mailing list pgsql-sql

From Oleg Bartunov
Subject Re: tsearch2: ts_headline performance problem
Date
Msg-id Pine.LNX.4.64.0801061804490.13244@sn.sai.msu.ru
Whole thread Raw
In response to tsearch2: ts_headline performance problem  (Anders Østergaard Jensen <aj@itersys.dk>)
List pgsql-sql
Anders,

headline is a slow, since it should read and process an original document,
so you should avoid simple search query, which calculates ts_headline()
for each document in the result set. Use subselect instead !

In example below, 2nd query is way faster the 1rd !

select id,headline(body,q),rank(ti,q) as rank
from apod, to_tsquery('stars') q 
where ti @@ q order by rank desc limit 10;

and

select id,headline(body,q),rank
from ( select id,body,q, rank(ti,q) as rank from apod, to_tsquery('stars') q
where ti @@ q order by rank desc limit 10) as foo;


Oleg
On Sun, 6 Jan 2008, Anders ?stergaard Jensen wrote:

> Hi there people,
>
> I'm trying to build a text search engine for relatively large PDF documents 
> with a web frontend. I use PostgreSQL 8.3 beta3/4 with tsearch2 for full-text 
> indexing. Everything performs quite well, until I start indexing 800 pages of 
> text into the database (with the utf8 danish locale and dictionary) -- then 
> ts_headline performs unreasonably slow. I use a GIN index on the table 
> (called document_revisions which is linked to a master document table called 
> 'documents' -- the system is made for version tracking of multiple documents 
> into the same index) and an update trigger for maintaining the tsvector row 
> on the table.
>
> Omitting ts_headline yields the following EXPLAIN result:
>
> <snip>
> metabase=# explain analyze select * from f_search_revision('website');
> NOTICE:  f_search_revisions_arr: loop: (1) doc_id = (1000)
> CONTEXT:  SQL statement "SELECT f_search_revision_arr( $1 ) AS res"
> PL/pgSQL function "f_search_revision_int" line 13 at SQL statement
> SQL function "f_search_revision" statement 1
> NOTICE:  hest: ({1000})
> CONTEXT:  SQL function "f_search_revision" statement 1
> NOTICE:  total ranking for (1) eq (0.0865452) for id = (1000)
> CONTEXT:  SQL function "f_search_revision" statement 1
>                                                        QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------
> Function Scan on f_search_revision  (cost=0.00..260.00 rows=1000 width=408) 
> (actual time=2413.384..2413.384 rows=1 loops=1)
> Total runtime: 2413.465 ms
> (2 rows)
> </snip>
>
> The function f_search_revision invokes another function that fetches all 
> matching document_revisions without calling ts_headline (method 
> f_search_revision_arr), followed by a post-processing of the matching rows 
> for fetching the ts_ranking values and ts_headline. The following snippet 
> shows the query plan for the function call with the same query:
>
> <snip>
> metabase=# explain analyze select * from f_search_revision_arr('website');
> NOTICE:  f_search_revisions_arr: loop: (1) doc_id = (1000)
>                                                    QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------
> Function Scan on f_search_revision_arr  (cost=0.00..0.26 rows=1 width=32) 
> (actual time=3.417..3.418 rows=1 loops=1)
> Total runtime: 3.442 ms
> (2 rows)
> </snip>
>
> Here is the code for the functions that I am using:
>
> CREATE OR REPLACE FUNCTION f_search_revision_arr(q TEXT) RETURNS INTEGER[] AS 
> $$
>     declare
>         iter        record;            results integer[];
>         i                integer;
>     begin
>         i := 1;
>         for iter in SELECT * FROM documents doc, plainto_tsquery(q) 
> AS tsq
>             WHERE (doc.search_idx @@ tsq)
>             OR doc.id IN (SELECT document_id FROM 
> document_revisions dr, plainto_tsquery(q) AS tsq2
>             WHERE dr.search_idx @@ tsq2) loop
>             results[i] = iter.id;
>             raise notice 'f_search_revisions_arr: loop: (%) 
> doc_id = (%)', i, results[i];
>             i := i + 1;
>         end loop;
>         return results;
>     end;
> $$ LANGUAGE plpgsql;
>
> CREATE OR REPLACE FUNCTION f_search_revision_int(q TEXT) RETURNS 
> t_search_result[] AS $$
>     declare
>         doc_ids integer[];
>         doc_rankings float[];
>         i integer;
>         iter record;
>         pointer_doc record;
>         pointer_rev record;
>         pointer_type record;
>         results    t_search_result[];
>     begin
>         i := 1;
>
>         SELECT INTO iter f_search_revision_arr(q) AS res;
>         doc_ids := iter.res;
>         raise notice 'hest: (%)', doc_ids;
>
>         -- Ensure that something was found:
>         if (doc_ids IS NULL) then
>             return results;
>         end if;
>
>         for i in 1..array_upper(doc_ids, 1) loop
>             -- raise notice 'upper = (%)', doc_ids[i];
>             SELECT INTO pointer_doc ts_rank(doc.search_idx, tsq) 
> AS rank_doc,
>                 ts_headline(doc.name, tsq) AS headline_name,
>                 ts_headline(doc.description, tsq) AS 
> headline_description,
>                 ts_headline(doc.keywords, tsq) AS 
> headline_keywords,
>                 doc.name,
>                 doc.description,
>                 doc.keywords,
>                 doc.document_type_id
>                  FROM documents doc, plainto_tsquery(q) AS tsq
>                 WHERE doc.id = doc_ids[i];
>
>             SELECT INTO pointer_rev ts_rank(rev.search_idx, tsq) 
> AS rank_rev,
>                 ts_headline(rev.content, tsq) AS 
> headline_content,
>                 rev.id AS revision_id
>                 FROM document_revisions rev, 
> plainto_tsquery(q) AS tsq
>                 WHERE rev.document_id = doc_ids[i];
>
>             -- Select the document type:
>             SELECT INTO pointer_type mime_type, type_name, 
> image_url, extension, internal_type
>                 FROM document_types WHERE id = 
> pointer_doc.document_type_id;
>
>             doc_rankings[i] = pointer_doc.rank_doc + 
> pointer_rev.rank_rev; -- SUM!
>             raise notice 'total ranking for (%) eq (%) for id = 
> (%)', i, doc_rankings[i], doc_ids[i];
>             results[i] = ROW(pointer_rev.revision_id,
>                      doc_ids[i],
>                      pointer_doc.name, 
> pointer_doc.description, pointer_doc.keywords,
>                     pointer_doc.document_type_id, 
> pointer_type.mime_type, pointer_type.type_name, pointer_type.image_url, 
> pointer_type.extension, pointer_type.internal_type,
>                      doc_rankings[i], 
> pointer_doc.headline_name, pointer_doc.headline_description, 
> pointer_doc.headline_keywords, pointer_rev.headline_content,
>                      0);
>         end loop;
>         return results;
>     end;
> $$ LANGUAGE plpgsql;
>
> CREATE OR REPLACE FUNCTION f_search_revision(q TEXT) RETURNS SETOF 
> t_search_result AS $$
>     SELECT * FROM f_unnest(f_search_revision_int($1));
> $$ LANGUAGE sql;
>
> ------
>
> How come that ts_headline yields so bad performance? Is there any way of 
> tuning ts_headline? I need the highlighting functionality for presenting the 
> search results in the web frontend properly. Alternately, are there any 
> alternatives to the ts_headline function?
>
> Sincerely,
>
> Anders
>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


pgsql-sql by date:

Previous
From: Anders Østergaard Jensen
Date:
Subject: tsearch2: ts_headline performance problem
Next
From: Gerardo Herzig
Date:
Subject: reading WAL files in python