Tsearch2 - bad performance with concatenated ts-vectors - Mailing list pgsql-performance

From Jan Wielgus
Subject Tsearch2 - bad performance with concatenated ts-vectors
Date
Msg-id 3479994c.402ea88e.4e3797ac.972cd@tlen.pl
Whole thread Raw
Responses Re: Tsearch2 - bad performance with concatenated ts-vectors  (Gavin Flower <GavinFlower@archidevsys.co.nz>)
Re: Tsearch2 - bad performance with concatenated ts-vectors  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-performance
Hello everyone,

This is my first post on this list. I tried to look for possible solutions in the archive, as well as on Google, but I could not find an explanation for such a specific situation.

I am facing a performance problem connected with Postgres Tsearch2 FTS mechanism.

Here is my query:

select participant.participant_id from participant participant
join person person on person.person_participant_id = participant.participant_id
left join registration registration on registration.registration_registered_participant_id = participant.participant_id
left join enrollment enrollment on registration.registration_enrollment_id = enrollment.enrollment_id
join registration_configuration registration_configuration on enrollment.enrollment_configuration_id =
registration_configuration.configuration_id
left join event_context context on context.context_id = registration_configuration.configuration_context_id
where participant.participant_type = 'PERSON'
and participant_status = 'ACTIVE'
and context.context_code in ('GB2TST2010A')
and registration_configuration.configuration_type in ('VISITOR')
and registration_configuration.configuration_id is not null
and participant.participant_tsv || person.person_tsv @@ to_tsquery('simple',to_tsquerystring('Abigail'))
limit 100

As you can see, I am concatenating two vectors and checking the result against a tsquery.

Both vectors are indexed with GIN and kept up to date by their respective triggers, as follows:

ALTER TABLE person ALTER COLUMN person_tsv SET STORAGE EXTENDED;
CREATE INDEX person_ft_index ON person USING gin(person_tsv);
CREATE OR REPLACE FUNCTION update_person_tsv() RETURNS trigger AS $$
BEGIN
  NEW.person_tsv := to_tsvector('simple', create_tsv(ARRAY[NEW.person_first_name, NEW.person_last_name, NEW.person_middle_name]));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER person_tsv_update BEFORE INSERT or UPDATE ON person FOR EACH ROW EXECUTE PROCEDURE update_person_tsv();

ALTER TABLE participant ALTER COLUMN participant_tsv SET STORAGE EXTENDED;
CREATE INDEX participant_ft_index ON participant USING gin(participant_tsv);
CREATE OR REPLACE FUNCTION update_participant_tsv() RETURNS trigger AS $$
BEGIN
  NEW.participant_tsv := to_tsvector('simple', create_tsv(ARRAY[NEW.participant_login, NEW.participant_email]));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER participant_tsv_update BEFORE INSERT OR UPDATE ON participant FOR EACH ROW EXECUTE PROCEDURE update_participant_tsv();

The database is quite big - it has almost one million participant records. The above query took almost 67 seconds to execute and fetch 100 rows, which is unacceptable for us.

As I assume, the problem is that when the vectors are concatenated, the individual indexes on each vector are not used. The execution plan, taken after the first execution of the query:

"Limit  (cost=46063.13..93586.79 rows=100 width=4) (actual time=4963.620..39703.645 rows=100 loops=1)"
"  ->  Nested Loop  (cost=46063.13..493736.04 rows=942 width=4) (actual time=4963.617..39703.349 rows=100 loops=1)"
"        Join Filter: (registration_configuration.configuration_id = enrollment.enrollment_configuration_id)"
"        ->  Nested Loop  (cost=46063.13..493662.96 rows=3769 width=8) (actual time=4963.517..39701.557 rows=159 loops=1)"
"              ->  Nested Loop  (cost=46063.13..466987.33 rows=3769 width=8) (actual time=4963.498..39698.542 rows=159 loops=1)"
"                    ->  Hash Join  (cost=46063.13..430280.76 rows=4984 width=8) (actual time=4963.464..39692.676 rows=216 loops=1)"
"                          Hash Cond: (participant.participant_id = person.person_participant_id)"
"                          Join Filter: ((participant.participant_tsv || person.person_tsv) @@ to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
"                          ->  Seq Scan on participant  (cost=0.00..84680.85 rows=996741 width=42) (actual time=0.012..3132.944 rows=1007151 loops=1)"
"                                Filter: (((participant_type)::text = 'PERSON'::text) AND ((participant_status)::text = 'ACTIVE'::text))"
"                          ->  Hash  (cost=25495.39..25495.39 rows=1012539 width=38) (actual time=3145.628..3145.628 rows=1007151 loops=1)"
"                                Buckets: 2048  Batches: 128  Memory Usage: 556kB"
"                                ->  Seq Scan on person  (cost=0.00..25495.39 rows=1012539 width=38) (actual time=0.062..1582.990 rows=1007151 loops=1)"
"                    ->  Index Scan using idx_registration_registered_participant_id on registration  (cost=0.00..7.35 rows=1 width=8) (actual time=0.018..0.019 rows=1 loops=216)"
"                          Index Cond: (registration.registration_registered_participant_id = person.person_participant_id)"
"              ->  Index Scan using enrollment_pkey on enrollment  (cost=0.00..7.07 rows=1 width=8) (actual time=0.011..0.013 rows=1 loops=159)"
"                    Index Cond: (enrollment.enrollment_id = registration.registration_enrollment_id)"
"        ->  Materialize  (cost=0.00..16.55 rows=1 width=4) (actual time=0.002..0.005 rows=2 loops=159)"
"              ->  Nested Loop  (cost=0.00..16.55 rows=1 width=4) (actual time=0.056..0.077 rows=2 loops=1)"
"                    Join Filter: (registration_configuration.configuration_context_id = context.context_id)"
"                    ->  Index Scan using idx_configuration_type on registration_configuration  (cost=0.00..8.27 rows=1 width=8) (actual time=0.018..0.022 rows=3 loops=1)"
"                          Index Cond: ((configuration_type)::text = 'VISITOR'::text)"
"                          Filter: (configuration_id IS NOT NULL)"
"                    ->  Index Scan using idx_event_context_code on event_context context  (cost=0.00..8.27 rows=1 width=4) (actual time=0.008..0.010 rows=1 loops=3)"
"                          Index Cond: ((context.context_code)::text = 'GB2TST2010A'::text)"
"Total runtime: 39775.578 ms"

The assumption seems to be correct: no indexes on the vectors are used, and sequential scans are done instead. An index on the concatenated expression is not an option either, since the expression spans columns from two different tables:

"                          Join Filter: ((participant.participant_tsv || person.person_tsv) @@ to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
"                          ->  Seq Scan on participant  (cost=0.00..84680.85 rows=996741 width=42) (actual time=0.012..3132.944 rows=1007151 loops=1)"
"                                Filter: (((participant_type)::text = 'PERSON'::text) AND ((participant_status)::text = 'ACTIVE'::text))"
"                          ->  Hash  (cost=25495.39..25495.39 rows=1012539 width=38) (actual time=3145.628..3145.628 rows=1007151 loops=1)"
"                                Buckets: 2048  Batches: 128  Memory Usage: 556kB"
"                                ->  Seq Scan on person  (cost=0.00..25495.39 rows=1012539 width=38) (actual time=0.062..1582.990 rows=1007151 loops=1)"


After I removed one of the vectors from the query and used only a single vector:
...
and person.person_tsv @@ to_tsquery('simple', to_tsquerystring('Abigail'))
...
then the execution was much faster - about 5 seconds.

Plan afterwards:

"Limit  (cost=41.14..8145.82 rows=100 width=4) (actual time=3.776..13.454 rows=100 loops=1)"
"  ->  Nested Loop  (cost=41.14..21923.77 rows=270 width=4) (actual time=3.773..13.248 rows=100 loops=1)"
"        ->  Nested Loop  (cost=41.14..19730.17 rows=270 width=8) (actual time=3.760..11.971 rows=100 loops=1)"
"              Join Filter: (registration_configuration.configuration_id = enrollment.enrollment_configuration_id)"
"              ->  Nested Loop  (cost=0.00..16.55 rows=1 width=4) (actual time=0.051..0.051 rows=1 loops=1)"
"                    Join Filter: (registration_configuration.configuration_context_id = context.context_id)"
"                    ->  Index Scan using idx_configuration_type on registration_configuration  (cost=0.00..8.27 rows=1 width=8) (actual time=0.020..0.022 rows=2 loops=1)"
"                          Index Cond: ((configuration_type)::text = 'VISITOR'::text)"
"                          Filter: (configuration_id IS NOT NULL)"
"                    ->  Index Scan using idx_event_context_code on event_context context  (cost=0.00..8.27 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=2)"
"                          Index Cond: ((context.context_code)::text = 'GB2TST2010A'::text)"
"              ->  Nested Loop  (cost=41.14..19700.12 rows=1080 width=12) (actual time=3.578..11.431 rows=269 loops=1)"
"                    ->  Nested Loop  (cost=41.14..12056.27 rows=1080 width=12) (actual time=3.568..8.203 rows=269 loops=1)"
"                          ->  Bitmap Heap Scan on person  (cost=41.14..3687.07 rows=1080 width=4) (actual time=3.553..4.401 rows=346 loops=1)"
"                                Recheck Cond: (person_tsv @@ to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
"                                ->  Bitmap Index Scan on person_ft_index  (cost=0.00..40.87 rows=1080 width=0) (actual time=3.353..3.353 rows=1060 loops=1)"
"                                      Index Cond: (person_tsv @@ to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
"                          ->  Index Scan using idx_registration_registered_participant_id on registration  (cost=0.00..7.74 rows=1 width=8) (actual time=0.006..0.007 rows=1 loops=346)"
"                                Index Cond: (registration.registration_registered_participant_id = person.person_participant_id)"
"                    ->  Index Scan using enrollment_pkey on enrollment  (cost=0.00..7.07 rows=1 width=8) (actual time=0.006..0.007 rows=1 loops=269)"
"                          Index Cond: (enrollment.enrollment_id = registration.registration_enrollment_id)"
"        ->  Index Scan using participant_pkey on participant  (cost=0.00..8.11 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=100)"
"              Index Cond: (participant.participant_id = person.person_participant_id)"
"              Filter: (((participant.participant_type)::text = 'PERSON'::text) AND ((participant.participant_status)::text = 'ACTIVE'::text))"
"Total runtime: 13.858 ms"

Now the index on the vector was used:

"                                Recheck Cond: (person_tsv @@ to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
"                                ->  Bitmap Index Scan on person_ft_index  (cost=0.00..40.87 rows=1080 width=0) (actual time=3.353..3.353 rows=1060 loops=1)"
"                                      Index Cond: (person_tsv @@ to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"

So, there is apparently a problem with vector concatenation - the indexes are not used then. I tried to use the vectors separately and make an 'OR' comparison between the single vector @@ tsquery checks,
but it didn't help much (performance was better, but still over 20 seconds):
...
(participant.participant_tsv @@ to_tsquery('simple', to_tsquerystring('Abigail'))) OR (person.person_tsv @@ to_tsquery('simple', to_tsquerystring('Abigail')))
...

Is there a way to make this work with better performance? Or is it necessary to create a single vector that contains data from multiple tables and then add an index on it? That would be problematic for us,
because we use multiple complex queries with a variable number of selected columns. I know that another solution might be a union among multiple queries, each of which uses a single vector,
but this solution is inconvenient too.
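For completeness, the union variant I mean would look roughly like this - just a sketch, with the remaining joins and filters abbreviated; each branch tests only a single tsvector, so each branch can use its own GIN index:

```sql
-- Sketch of the union variant ("..." stands for the remaining joins and
-- filters from the original query). Each branch matches one tsvector, so
-- person_ft_index and participant_ft_index can be used respectively.
select participant.participant_id
from participant
join person on person.person_participant_id = participant.participant_id
-- ... remaining joins and filters as in the original query ...
where person.person_tsv @@ to_tsquery('simple', to_tsquerystring('Abigail'))
union
select participant.participant_id
from participant
join person on person.person_participant_id = participant.participant_id
-- ... remaining joins and filters as in the original query ...
where participant.participant_tsv @@ to_tsquery('simple', to_tsquerystring('Abigail'))
limit 100;
```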

Greetings

Jan
