tsearch2: ts_headline performance problem - Mailing list pgsql-sql

From Anders Østergaard Jensen
Subject tsearch2: ts_headline performance problem
Date
Msg-id 51B2467A-B7B0-40DE-8975-4E8ACF5BFA28@itersys.dk
Whole thread Raw
Responses Re: tsearch2: ts_headline performance problem  (Oleg Bartunov <oleg@sai.msu.su>)
List pgsql-sql
Hi there people,

I'm trying to build a text search engine for relatively large PDF  
documents with a web frontend. I use PostgreSQL 8.3 beta3/4 with  
tsearch2 for full-text indexing. Everything performs quite well, until  
I start indexing 800 pages of text into the database (with the utf8  
danish locale and dictionary) -- then ts_headline performs  
unreasonably slow. I use a GIN index on the table (called  
document_revisions which is linked to a master document table called  
'documents' -- the system is made for version tracking of multiple  
documents into the same index) and an update trigger for maintaining  
the tsvector row on the table.

Omitting ts_headline yields the following EXPLAIN result:

<snip>
metabase=# explain analyze select * from f_search_revision('website');
NOTICE:  f_search_revisions_arr: loop: (1) doc_id = (1000)
CONTEXT:  SQL statement "SELECT f_search_revision_arr( $1 ) AS res"
PL/pgSQL function "f_search_revision_int" line 13 at SQL statement
SQL function "f_search_revision" statement 1
NOTICE:  hest: ({1000})
CONTEXT:  SQL function "f_search_revision" statement 1
NOTICE:  total ranking for (1) eq (0.0865452) for id = (1000)
CONTEXT:  SQL function "f_search_revision" statement 1                                                         QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------
FunctionScan on f_search_revision  (cost=0.00..260.00 rows=1000  
 
width=408) (actual time=2413.384..2413.384 rows=1 loops=1) Total runtime: 2413.465 ms
(2 rows)
</snip>

The function f_search_revision invokes another function that fetches  
all matching document_revisions without calling ts_headline (method  
f_search_revision_arr), followed by a post-processing of the matching  
rows for fetching the ts_ranking values and ts_headline. The following  
snippet shows the query plan for the function call with the same query:

<snip>
metabase=# explain analyze select * from  
f_search_revision_arr('website');
NOTICE:  f_search_revisions_arr: loop: (1) doc_id = (1000)                                                     QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------
FunctionScan on f_search_revision_arr  (cost=0.00..0.26 rows=1  
 
width=32) (actual time=3.417..3.418 rows=1 loops=1) Total runtime: 3.442 ms
(2 rows)
</snip>

Here is the code for the functions that I am using:

CREATE OR REPLACE FUNCTION f_search_revision_arr(q TEXT) RETURNS  
INTEGER[] AS $$declare    iter        record;        results integer[];    i                integer;begin    i := 1;
foriter in SELECT * FROM documents doc, plainto_tsquery(q) AS tsq        WHERE (doc.search_idx @@ tsq)        OR doc.id
IN(SELECT document_id FROM document_revisions dr,  
 
plainto_tsquery(q) AS tsq2        WHERE dr.search_idx @@ tsq2) loop        results[i] = iter.id;        raise notice
'f_search_revisions_arr:loop: (%) doc_id = (%)', i,  
 
results[i];        i := i + 1;    end loop;    return results;end;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION f_search_revision_int(q TEXT) RETURNS  
t_search_result[] AS $$declare    doc_ids integer[];    doc_rankings float[];    i integer;    iter record;
pointer_docrecord;    pointer_rev record;    pointer_type record;    results    t_search_result[];begin    i := 1;
 
    SELECT INTO iter f_search_revision_arr(q) AS res;    doc_ids := iter.res;    raise notice 'hest: (%)', doc_ids;
    -- Ensure that something was found:    if (doc_ids IS NULL) then        return results;    end if;
    for i in 1..array_upper(doc_ids, 1) loop        -- raise notice 'upper = (%)', doc_ids[i];        SELECT INTO
pointer_docts_rank(doc.search_idx, tsq) AS rank_doc,            ts_headline(doc.name, tsq) AS headline_name,
ts_headline(doc.description,tsq) AS headline_description,            ts_headline(doc.keywords, tsq) AS
headline_keywords,           doc.name,            doc.description,            doc.keywords,
doc.document_type_id            FROM documents doc, plainto_tsquery(q) AS tsq            WHERE doc.id = doc_ids[i];
 
        SELECT INTO pointer_rev ts_rank(rev.search_idx, tsq) AS rank_rev,            ts_headline(rev.content, tsq) AS
headline_content,           rev.id AS revision_id            FROM document_revisions rev, plainto_tsquery(q) AS tsq
      WHERE rev.document_id = doc_ids[i];
 
        -- Select the document type:        SELECT INTO pointer_type mime_type, type_name, image_url,  
extension, internal_type            FROM document_types WHERE id = pointer_doc.document_type_id;
   
 
        doc_rankings[i] = pointer_doc.rank_doc + pointer_rev.rank_rev; --  
SUM!        raise notice 'total ranking for (%) eq (%) for id = (%)', i,  
doc_rankings[i], doc_ids[i];        results[i] = ROW(pointer_rev.revision_id,                 doc_ids[i],
 pointer_doc.name, pointer_doc.description, pointer_doc.keywords,                pointer_doc.document_type_id,
pointer_type.mime_type, 
 
pointer_type.type_name, pointer_type.image_url,  
pointer_type.extension, pointer_type.internal_type,                 doc_rankings[i], pointer_doc.headline_name,  
pointer_doc.headline_description, pointer_doc.headline_keywords,  
pointer_rev.headline_content,                 0);    end loop;    return results;end;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION f_search_revision(q TEXT) RETURNS SETOF  
t_search_result AS $$SELECT * FROM f_unnest(f_search_revision_int($1));
$$ LANGUAGE sql;

------

How come that ts_headline yields so bad performance? Is there any way  
of tuning ts_headline? I need the highlighting functionality for  
presenting the search results in the web frontend properly.  
Alternately, are there any alternatives to the ts_headline function?

Sincerely,

Anders





pgsql-sql by date:

Previous
From: "Fernando Hevia"
Date:
Subject: Re: Argentinian timezone change at the last moment. How to change pgsql tz db?
Next
From: Oleg Bartunov
Date:
Subject: Re: tsearch2: ts_headline performance problem