Re: tsearch parser inefficiency if text includes urls or emails - new version - Mailing list pgsql-hackers

From Andres Freund
Subject Re: tsearch parser inefficiency if text includes urls or emails - new version
Date
Msg-id 200912091906.46404.andres@anarazel.de
Whole thread Raw
In response to Re: tsearch parser inefficiency if text includes urls or emails - new version  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses Re: tsearch parser inefficiency if text includes urls or emails - new version  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List pgsql-hackers
On Tuesday 08 December 2009 17:15:36 Kevin Grittner wrote:
> Andres Freund <andres@anarazel.de> wrote:
> > Could you show your testcase?
> 
> OK.  I was going to try to check other platforms first, and package
> up the information better, but here goes.
> 
> I created 10000 lines with random IP-based URLs for a test.  The
> first few lines are:
> 
> create table t1 (c1 int not null primary key, c2 text);
> insert into t1 values (2,
> 'http://255.102.51.212/*/quick/brown/fox?jumps&over&*&lazy&dog.html
>  http://204.56.222.143/*/quick/brown/fox?jumps&over&*&lazy&dog.html
>  http://138.183.168.227/*/quick/brown/fox?jumps&over&*&lazy&dog.html
I think you see no real benefit, because your strings are rather short - the 
documents I scanned when noticing the issue where rather long.
If your strings are short, they and the copy will fit into cpu cache anyway, so 
copying them around/converting them to some other string format is not that 
expensive compared to the rest of the work done.

Also after each copying step for large strings the complete cache is filled 
with unrelated information (namely the end of the string). So every charwise 
access will need to wait for a memory access.

A rather extreme/contrived example:


postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
1) g(i)), ' -  ')); ?column? ----------        1(1 row)
Time: 3.740 ms
postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
1000) g(i)), ' -  ')); ?column? ----------        1(1 row)
Time: 115.027 ms
postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
10000) g(i)), ' -  ')); ?column? ----------        1(1 row) 
Time: 24355.339 ms
postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
20000) g(i)), ' -  ')); ?column? ----------        1(1 row)
Time: 47276.739 ms


One easily can see the quadratic complexity here. The quadratic complexity 
lies in the length/amount of emails/urls of the strings, not in the number of 
to_tsvector calls!

After the patch:

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
20000) g(i)), ' -  ')); ?column? ----------        1(1 row)
Time: 168.384 ms


I could not reproduce the slowdown you mentioned:


Without patch:
postgres=# SELECT to_tsvector('andres@anarazel.de 
http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 100000) g(i) 
ORDER BY g.i LIMIT 1;                                  to_tsvector
-------------------------------------------------------------------------------'/1':4 'andres@anarazel.de':1
'www.postgresql.org':3
 
'www.postgresql.org/1':2(1 row)
Time: 1109.833 ms

With patch:
postgres=# SELECT to_tsvector('andres@anarazel.de 
http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 100000) g(i) 
ORDER BY g.i LIMIT 1;                                  to_tsvector
-------------------------------------------------------------------------------'/1':4 'andres@anarazel.de':1
'www.postgresql.org':3
 
'www.postgresql.org/1':2(1 row)
Time: 1036.689 ms

So on the hardware I tried its even a little bit faster for small strings 
(Older Xeon32bit, Core2 Duo, 64bit, Nehalem based Xeon 64bit).


I could not reproduce any difference with strings not involving urls or emails:

Without patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text) 
FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;                      to_tsvector
--------------------------------------------------------'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6(1 row)
 
Time: 988.426 ms

With patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text) 
FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;                      to_tsvector
--------------------------------------------------------'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6(1 row)
 
Time: 975.339 ms


So at least in my testing I do see no danger in the patch ;-)

Andres



PS: I averaged all the time result over multiple runs where it was relevant.


pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Need a mentor, and a project.
Next
From: Greg Smith
Date:
Subject: Re: XLogInsert