Thread: Performance regression between 8.3 and 8.4 on heavy text indexing

Performance regression between 8.3 and 8.4 on heavy text indexing

From
gael@pilotsystems.net (Gaël Le Mignot)
Date:
Hello,

We are using PostgreSQL to index a huge collection (570 000) of articles for a French daily newspaper (Libération). We
make heavy use of the full text search feature. I attach to this mail the schema of the database we use.

Overall, performance is very good, except in a few cases when combining a full text condition that has a lot of
matches with a date ordering and a LIMIT (which is a very common use case: asking for the 50 most recent articles
about a famous person, for example).

The reason for this mail is that we noticed a performance drop from PostgreSQL 8.3 to PostgreSQL 8.4.

In order to try to locate the performance cost, I changed a few settings in 8.4 to have the same values as in 8.3
(and reran ANALYZE afterwards) ::

 cursor_tuple_fraction = 1.0
 default_statistics_target = 10
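
Concretely, this amounts to something like the following (a sketch; a per-session SET followed by a new ANALYZE is only one way to apply it) ::

 SET cursor_tuple_fraction = 1.0;
 SET default_statistics_target = 10;
 -- default_statistics_target only affects statistics gathered afterwards,
 -- hence the new ANALYZE run
 ANALYZE libeindex;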

With the modified settings, the performance drop is much smaller, but still
present. Here are the statistics from sequentially replaying a batch of
real-life queries against the two versions of the database:

With 8.3 ::

 7334 queries, average time is 0.20 s
 6 queries above 20.00 s (0.08 %)
 20 queries above 10.00 s (0.27 %)
 116 queries above 2.00 s (1.58 %)
 top ten:  15.09 15.15 15.19 16.60 20.40 63.05 67.89 78.21 90.30 97.56

With 8.4 ::

 7334 queries, average time is 0.23 s
 12 queries above 20.00 s (0.16 %)
 24 queries above 10.00 s (0.33 %)
 112 queries above 2.00 s (1.53 %)
 top ten:  31.76 31.94 32.63 47.21 48.80 63.50 79.57 83.36 96.44 113.61


Here is an example query that is significantly slower in 8.4 (31.76 seconds) than in 8.3 (10.52 seconds) ::

 SELECT classname, id FROM libeindex
 WHERE (classname = 'article'
        AND (source IN ('methode', 'nica')
             AND (keywords_tsv @@ plainto_tsquery('french', 'assassinat')
                  AND fulltext_tsv @@ to_tsquery('french', 'claude & duviau'))))
 ORDER BY publicationDate DESC, pageNumber ASC LIMIT 50

And the EXPLAIN output for it:

With 8.3 ::

 Limit  (cost=752.67..752.67 rows=1 width=24)
   ->  Sort  (cost=752.67..752.67 rows=1 width=24)
         Sort Key: publicationdate, pagenumber
         ->  Bitmap Heap Scan on libeindex  (cost=748.64..752.66 rows=1 width=24)
               Recheck Cond: ((keywords_tsv @@ '''assassinat'''::tsquery) AND (fulltext_tsv @@ '''claud'' & ''duviau'''::tsquery))
               Filter: (((source)::text = ANY ('{methode,nica}'::text[])) AND ((classname)::text = 'article'::text))
               ->  BitmapAnd  (cost=748.64..748.64 rows=1 width=0)
                     ->  Bitmap Index Scan on keywords_index  (cost=0.00..48.97 rows=574 width=0)
                           Index Cond: (keywords_tsv @@ '''assassinat'''::tsquery)
                     ->  Bitmap Index Scan on fulltext_index  (cost=0.00..699.42 rows=574 width=0)
                           Index Cond: (fulltext_tsv @@ '''claud'' & ''duviau'''::tsquery)
 (11 rows)

With 8.4 ::

 Limit  (cost=758.51..758.51 rows=1 width=24)
   ->  Sort  (cost=758.51..758.51 rows=1 width=24)
         Sort Key: publicationdate, pagenumber
         ->  Bitmap Heap Scan on libeindex  (cost=14.03..758.50 rows=1 width=24)
               Recheck Cond: (keywords_tsv @@ '''assassinat'''::tsquery)
               Filter: (((source)::text = ANY ('{methode,nica}'::text[])) AND (fulltext_tsv @@ '''claud'' & ''duviau'''::tsquery) AND ((classname)::text = 'article'::text))
               ->  Bitmap Index Scan on keywords_index  (cost=0.00..14.03 rows=192 width=0)
                     Index Cond: (keywords_tsv @@ '''assassinat'''::tsquery)
 (8 rows)
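
For reference, the bitmap scans above use the indexes on the two tsvector columns; they look something like this (a sketch, GIN assumed, the exact definitions are in the attached schema) ::

 -- sketch of the full text indexes referenced by the plans above;
 -- GIN is an assumption, the real definitions are in the attached schema
 CREATE INDEX keywords_index ON libeindex USING gin (keywords_tsv);
 CREATE INDEX fulltext_index ON libeindex USING gin (fulltext_tsv);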

More information on the setup:

- PostgreSQL 8.3.7 from Debian Lenny;

- PostgreSQL 8.4.0 from Debian Lenny backports;

- running in a Xen virtual machine, using a 64-bit kernel;

- 2 cores of a 2 GHz Core2Quad and 2 GB of RAM dedicated to the VM.

If you need additional information, we'll gladly provide it. If you have any tips or advice that could make
8.4 behave at least as well as 8.3, it would be very nice.

Hoping this can help you to improve this great software.

Regards,

--
Gaël Le Mignot - gael@pilotsystems.net
Pilot Systems - 9, rue Desargues - 75011 Paris
Tel : +33 1 44 53 05 55 - www.pilotsystems.net
Manage your contacts and newsletters: www.cockpit-mailing.com

Re: Performance regression between 8.3 and 8.4 on heavy text indexing

From
Guillaume Smet
Date:
Hi Gaël,

On Fri, Aug 21, 2009 at 3:37 PM, Gaël Le Mignot <gael@pilotsystems.net> wrote:
> With 8.3 ::
>
>  Limit  (cost=752.67..752.67 rows=1 width=24)
>  (11 rows)
>
> With 8.4 ::
>  (8 rows)

Could you provide us the EXPLAIN *ANALYZE* output of both plans?

From what I can see, one of the differences is that the estimated
number of rows is divided by 3 for this part of the query:
8.3 ->  Bitmap Index Scan on keywords_index  (cost=0.00..48.97 rows=574 width=0)
8.4 ->  Bitmap Index Scan on keywords_index  (cost=0.00..14.03 rows=192 width=0)

It might be interesting to see if 8.4 is right or not.

Before 8.4, the selectivity for full text search was a constant (as
you can see in your 8.3 plan: the row estimates are equal in both
bitmap index scans). 8.4 is smarter, which might lead to different plans.
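
A quick way to check is to run the keywords_tsv condition on its own on 8.4 and compare the estimated and actual row counts, something like this (just a sketch, reusing your table and column names):

 -- compare the planner's estimate with the real number of matching rows
 -- for the keywords_tsv condition alone
 EXPLAIN ANALYZE
 SELECT count(*) FROM libeindex
 WHERE keywords_tsv @@ plainto_tsquery('french', 'assassinat');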

--
Guillaume

Re: Performance regression between 8.3 and 8.4 on heavy text indexing

From
gael@pilotsystems.net (Gaël Le Mignot)
Date:
Hello Guillaume!

Sun, 23 Aug 2009 14:49:05 +0200, you wrote:

 > Hi Gaël,
 > On Fri, Aug 21, 2009 at 3:37 PM, Gaël Le Mignot <gael@pilotsystems.net> wrote:
 >> With 8.3 ::
 >>
 >>  Limit  (cost=752.67..752.67 rows=1 width=24)
 >>  (11 rows)
 >>
 >> With 8.4 ::
 >>  (8 rows)

 > Could you provide us the EXPLAIN *ANALYZE* output of both plans?

Sure, here it is:

With 8.3 ::

libearticles=> explain analyze SELECT classname, id FROM libeindex
    WHERE (classname = 'article'
           AND (source IN ('methode', 'nica')
                AND (keywords_tsv @@ plainto_tsquery('french', 'assassinat')
                     AND fulltext_tsv @@ to_tsquery('french', 'claude & duviau'))))
    ORDER BY publicationDate DESC, pageNumber ASC LIMIT 50;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=760.74..760.74 rows=1 width=24) (actual time=449.057..449.080 rows=9 loops=1)
   ->  Sort  (cost=760.74..760.74 rows=1 width=24) (actual time=449.053..449.061 rows=9 loops=1)
         Sort Key: publicationdate, pagenumber
         Sort Method:  quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on libeindex  (cost=756.71..760.73 rows=1 width=24) (actual time=420.704..448.571 rows=9 loops=1)
               Recheck Cond: ((keywords_tsv @@ '''assassinat'''::tsquery) AND (fulltext_tsv @@ '''claud'' & ''duviau'''::tsquery))
               Filter: (((source)::text = ANY ('{methode,nica}'::text[])) AND ((classname)::text = 'article'::text))
               ->  BitmapAnd  (cost=756.71..756.71 rows=1 width=0) (actual time=420.612..420.612 rows=0 loops=1)
                     ->  Bitmap Index Scan on keywords_index  (cost=0.00..48.96 rows=573 width=0) (actual time=129.338..129.338 rows=10225 loops=1)
                           Index Cond: (keywords_tsv @@ '''assassinat'''::tsquery)
                     ->  Bitmap Index Scan on fulltext_index  (cost=0.00..707.50 rows=573 width=0) (actual time=289.775..289.775 rows=14 loops=1)
                           Index Cond: (fulltext_tsv @@ '''claud'' & ''duviau'''::tsquery)
 Total runtime: 471.905 ms
(13 rows)

With 8.4 ::

libebench=> explain analyze SELECT classname, id FROM libeindex
    WHERE (classname = 'article'
           AND (source IN ('methode', 'nica')
                AND (keywords_tsv @@ plainto_tsquery('french', 'assassinat')
                     AND fulltext_tsv @@ to_tsquery('french', 'claude & duviau'))))
    ORDER BY publicationDate DESC, pageNumber ASC LIMIT 50;
                                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=758.51..758.51 rows=1 width=24) (actual time=50816.635..50816.660 rows=9 loops=1)
   ->  Sort  (cost=758.51..758.51 rows=1 width=24) (actual time=50816.628..50816.637 rows=9 loops=1)
         Sort Key: publicationdate, pagenumber
         Sort Method:  quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on libeindex  (cost=14.03..758.50 rows=1 width=24) (actual time=8810.133..50816.484 rows=9 loops=1)
               Recheck Cond: (keywords_tsv @@ '''assassinat'''::tsquery)
               Filter: (((source)::text = ANY ('{methode,nica}'::text[])) AND (fulltext_tsv @@ '''claud'' & ''duviau'''::tsquery) AND ((classname)::text = 'article'::text))
               ->  Bitmap Index Scan on keywords_index  (cost=0.00..14.03 rows=192 width=0) (actual time=158.563..158.563 rows=10222 loops=1)
                     Index Cond: (keywords_tsv @@ '''assassinat'''::tsquery)
 Total runtime: 50817.040 ms
(10 rows)

So it seems it was quite wrong about the estimated number of matching rows (192 predicted, 10222 real).

 > From what I can see, one of the differences is that the estimated
 > number of rows is divided by 3 for this part of the query:
 > 8.3 ->  Bitmap Index Scan on keywords_index  (cost=0.00..48.97 rows=574 width=0)
 > 8.4 ->  Bitmap Index Scan on keywords_index  (cost=0.00..14.03 rows=192 width=0)

 > It might be interesting to see if 8.4 is right or not.

 > Before 8.4, the selectivity for full text search was a constant (as
 > you can see in your 8.3 plan: the row estimates are equal in both
 > bitmap index scans). 8.4 is smarter, which might lead to different plans.

I see, thanks for your answer. What's weird is that this "smartness"
leads to overall worse results in our case. Is there some tweaking we
can do? I didn't see anything in the documentation about changing the
weighting inside the text-match heuristic.

--
Gaël Le Mignot - gael@pilotsystems.net
Pilot Systems - 9, rue Desargues - 75011 Paris
Tel : +33 1 44 53 05 55 - www.pilotsystems.net
Manage your contacts and newsletters: www.cockpit-mailing.com

Re: Performance regression between 8.3 and 8.4 on heavy text indexing

From
Tom Lane
Date:
gael@pilotsystems.net (Gaël Le Mignot) writes:
> So it seems it was quite wrong about the estimated number of matching rows (192 predicted, 10222 real).

Yup.  What's even more interesting is that it seems the real win would
have been to use just the 'claude & duviau' condition (which apparently
matched only 14 rows).  8.3 had no hope whatever of understanding that,
it just got lucky.  8.4 should have figured it out, I'm thinking.
Does it help if you increase the statistics target for fulltext_tsv?
(Don't forget to re-ANALYZE after doing so.)

            regards, tom lane

Re: Performance regression between 8.3 and 8.4 on heavy text indexing

From
Guillaume Smet
Date:
On Wed, Aug 26, 2009 at 6:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> gael@pilotsystems.net (Gaël Le Mignot) writes:
>> So it seems it was quite wrong about the estimated number of matching rows (192 predicted, 10222 real).
>
> Yup.  What's even more interesting is that it seems the real win would
> have been to use just the 'claude & duviau' condition (which apparently
> matched only 14 rows).  8.3 had no hope whatever of understanding that,
> it just got lucky.  8.4 should have figured it out, I'm thinking.
> Does it help if you increase the statistics target for fulltext_tsv?
> (Don't forget to re-ANALYZE after doing so.)

It could be interesting to run the query without the condition
(keywords_tsv @@ '''assassinat'''::tsquery) to see the estimate of
(fulltext_tsv @@ '''claud'' & ''duviau'''::tsquery) in 8.4.

Btw, what Tom means by increasing the statistics is executing the
following queries:

 ALTER TABLE libeindex ALTER COLUMN fulltext_tsv SET STATISTICS 500;
 ANALYZE;
 -- run your query with EXPLAIN ANALYZE
 ALTER TABLE libeindex ALTER COLUMN fulltext_tsv SET STATISTICS 1000;
 ANALYZE;
 -- run your query with EXPLAIN ANALYZE
 ALTER TABLE libeindex ALTER COLUMN fulltext_tsv SET STATISTICS 5000;
 ANALYZE;
 -- run your query with EXPLAIN ANALYZE

to see if it improves the estimates.

--
Guillaume

Re: Performance regression between 8.3 and 8.4 on heavy text indexing

From
gael@pilotsystems.net (Gaël Le Mignot)
Date:
Hello Guillaume!

Wed, 26 Aug 2009 23:59:25 +0200, you wrote:

 > On Wed, Aug 26, 2009 at 6:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
 >> gael@pilotsystems.net (Gaël Le Mignot) writes:
 >>> So it seems it was quite wrong about the estimated number of matching rows (192 predicted, 10222 real).
 >>
 >> Yup.  What's even more interesting is that it seems the real win would
 >> have been to use just the 'claude & duviau' condition (which apparently
 >> matched only 14 rows).  8.3 had no hope whatever of understanding that,
 >> it just got lucky.  8.4 should have figured it out, I'm thinking.
 >> Does it help if you increase the statistics target for fulltext_tsv?
 >> (Don't forget to re-ANALYZE after doing so.)

 > It could be interesting to run the query without the condition
 > (keywords_tsv @@ '''assassinat'''::tsquery) to see the estimate of
 > (fulltext_tsv @@ '''claud'' & ''duviau'''::tsquery) in 8.4.

Here it is ::

libebench=> explain analyze SELECT classname, id FROM libeindex
    WHERE (classname = 'article'
           AND (source IN ('methode', 'nica')
                AND fulltext_tsv @@ to_tsquery('french', 'claude & duviau')))
    ORDER BY publicationDate DESC, pageNumber ASC LIMIT 50;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=12264.98..12265.11 rows=50 width=24) (actual time=3.799..3.825 rows=10 loops=1)
   ->  Sort  (cost=12264.98..12271.03 rows=2421 width=24) (actual time=3.794..3.802 rows=10 loops=1)
         Sort Key: publicationdate, pagenumber
         Sort Method:  quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on libeindex  (cost=2363.10..12184.56 rows=2421 width=24) (actual time=3.579..3.693 rows=10 loops=1)
               Recheck Cond: (fulltext_tsv @@ '''claud'' & ''duviau'''::tsquery)
               Filter: (((source)::text = ANY ('{methode,nica}'::text[])) AND ((classname)::text = 'article'::text))
               ->  Bitmap Index Scan on fulltext_index  (cost=0.00..2362.49 rows=2877 width=0) (actual time=3.499..3.499 rows=14 loops=1)
                     Index Cond: (fulltext_tsv @@ '''claud'' & ''duviau'''::tsquery)
 Total runtime: 166.772 ms
(10 rows)

So it estimates 2877 rows for that, while in reality it's 14.

 > Btw, what Tom means by increasing the statistics is executing the
 > following queries:
 > ALTER TABLE libeindex ALTER COLUMN fulltext_tsv SET STATISTICS 500;

Ok, I did it with 500 also on the keywords_tsv column, which was the
other contender. Here we have a clear improvement: the search in
keywords_tsv is now estimated at 10398 rows (the real number being 10222)
and the one on fulltext_tsv at 1 (the real number being 14).

I did it at 1000 too; the result is almost the same.
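
In concrete terms, the change boils down to something like this (a sketch of the statements, with 1000 being the value we kept) ::

 -- applied to both tsvector columns, then re-analyzed
 ALTER TABLE libeindex ALTER COLUMN keywords_tsv SET STATISTICS 1000;
 ALTER TABLE libeindex ALTER COLUMN fulltext_tsv SET STATISTICS 1000;
 ANALYZE libeindex;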

By re-running our sample of 7334 queries on the database with the
statistics target at 1000 on both fulltext_tsv and keywords_tsv, we get
overall better results than with 8.3! So a great thanks to everyone.

The weird thing was that with the default of 100 for the statistics
target, it was worse than when we moved back to 10. So I didn't try
with 1000, but I should have.

I'll do more tests and keep the list informed if it can be of any help.

--
Gaël Le Mignot - gael@pilotsystems.net
Pilot Systems - 9, rue Desargues - 75011 Paris
Tel : +33 1 44 53 05 55 - www.pilotsystems.net
Manage your contacts and newsletters: www.cockpit-mailing.com

Re: Performance regression between 8.3 and 8.4 on heavy text indexing

From
Guillaume Smet
Date:
2009/8/27 Gaël Le Mignot <gael@pilotsystems.net>:
> The weird thing was that with the default of 100 for the statistics
> target, it was worse than when we moved back to 10. So I didn't try
> with 1000, but I should have.

When you have so much data and a statistics target so low, you can't
expect the sample taken to be representative: between different runs
of ANALYZE, you can get totally different estimates and thus totally
different plans. You were just lucky at 10 and unlucky at 100.

Thanks for your feedback and it's nice to see your problem solved.

--
Guillaume