Re: tsearch parser inefficiency if text includes urls or emails - new version - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Msg-id: 200912091906.46404.andres@anarazel.de
In response to: Re: tsearch parser inefficiency if text includes urls or emails - new version ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses: Re: tsearch parser inefficiency if text includes urls or emails - new version ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List: pgsql-hackers
On Tuesday 08 December 2009 17:15:36 Kevin Grittner wrote:
> Andres Freund <andres@anarazel.de> wrote:
> > Could you show your testcase?
>
> OK. I was going to try to check other platforms first, and package
> up the information better, but here goes.
>
> I created 10000 lines with random IP-based URLs for a test. The
> first few lines are:
>
> create table t1 (c1 int not null primary key, c2 text);
> insert into t1 values (2,
> 'http://255.102.51.212/*/quick/brown/fox?jumps&over&*&lazy&dog.html
> http://204.56.222.143/*/quick/brown/fox?jumps&over&*&lazy&dog.html
> http://138.183.168.227/*/quick/brown/fox?jumps&over&*&lazy&dog.html

I think you see no real benefit because your strings are rather short - the documents I scanned when noticing the issue were rather long.

If the strings are short, they and the copy fit into the CPU cache anyway, so copying them around and converting them to another string representation is not that expensive compared to the rest of the work done. With large strings, however, each copying step fills the complete cache with unrelated data (namely the end of the string), so every character-wise access has to wait for a memory access.

A rather extreme/contrived example:

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 1) g(i)), ' - '));
 ?column?
----------
        1
(1 row)

Time: 3.740 ms

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 1000) g(i)), ' - '));
 ?column?
----------
        1
(1 row)

Time: 115.027 ms

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 10000) g(i)), ' - '));
 ?column?
----------
        1
(1 row)

Time: 24355.339 ms

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 20000) g(i)), ' - '));
 ?column?
----------
        1
(1 row)

Time: 47276.739 ms

One can easily see the quadratic complexity here. The quadratic behavior depends on the length of the string and the number of emails/URLs it contains, not on the number of to_tsvector calls!

After the patch:

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 20000) g(i)), ' - '));
 ?column?
----------
        1
(1 row)

Time: 168.384 ms

I could not reproduce the slowdown you mentioned.

Without patch:

postgres=# SELECT to_tsvector('andres@anarazel.de http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;
                                   to_tsvector
--------------------------------------------------------------------------------
 '/1':4 'andres@anarazel.de':1 'www.postgresql.org':3 'www.postgresql.org/1':2
(1 row)

Time: 1109.833 ms

With patch:

postgres=# SELECT to_tsvector('andres@anarazel.de http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;
                                   to_tsvector
--------------------------------------------------------------------------------
 '/1':4 'andres@anarazel.de':1 'www.postgresql.org':3 'www.postgresql.org/1':2
(1 row)

Time: 1036.689 ms

So on the hardware I tried (an older 32-bit Xeon, a 64-bit Core2 Duo, and a Nehalem-based 64-bit Xeon) it is even a little bit faster for small strings.
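To make the shape of the problem concrete: the quadratic blowup above comes from re-copying (and re-converting) the remainder of the input for every URL/email token, instead of advancing through one shared buffer. A minimal Python sketch of the two strategies (purely illustrative - this is not the tsearch parser code, and `tokenize_copying`/`tokenize_offset` are made-up names):

```python
def tokenize_copying(text):
    # Each iteration builds a fresh copy of everything after the current
    # token (str.partition returns a new string for the tail), so the
    # total work is O(n^2) in the input length.
    tokens = []
    rest = text
    while rest:
        head, _, rest = rest.partition(' ')
        tokens.append(head)
    return tokens

def tokenize_offset(text):
    # Keep a single buffer and advance an offset; only the tokens
    # themselves are ever copied out, so the total work is O(n).
    tokens = []
    pos, n = 0, len(text)
    while pos < n:
        nxt = text.find(' ', pos)
        if nxt == -1:
            nxt = n
        tokens.append(text[pos:nxt])
        pos = nxt + 1
    return tokens
```

Both functions produce the same token list; only the copying behavior differs, which is why the difference only shows up on long inputs that no longer fit in cache.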
I could not reproduce any difference with strings not involving URLs or emails.

Without patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text) FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;
                       to_tsvector
--------------------------------------------------------
 'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6
(1 row)

Time: 988.426 ms

With patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text) FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;
                       to_tsvector
--------------------------------------------------------
 'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6
(1 row)

Time: 975.339 ms

So at least in my testing I see no danger in the patch ;-)

Andres

PS: I averaged all the time results over multiple runs where it was relevant.