Re: tsearch parser inefficiency if text includes urls or emails - new version - Mailing list pgsql-hackers

From	Kevin Grittner
Subject	Re: tsearch parser inefficiency if text includes urls or emails - new version
Date	December 8, 2009 15:15:57
Msg-id	4B1E2748020000250002D1EF@gw.wicourts.gov
In response to	Re: tsearch parser inefficiency if text includes urls or emails - new version (Andres Freund <andres@anarazel.de>)
Responses	Re: tsearch parser inefficiency if text includes urls or emails - new version (Andres Freund <andres@anarazel.de>) Re: tsearch parser inefficiency if text includes urls or emails - new version ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>) Re: tsearch parser inefficiency if text includes urls or emails - new version (Andres Freund <andres@anarazel.de>)
List	pgsql-hackers

Andres Freund <andres@anarazel.de> wrote: 
> Could you show your testcase?
OK.  I was going to try to check other platforms first, and package
up the information better, but here goes.
I created 10000 lines with random IP-based URLs for a test.  The
first few lines are:
create table t1 (c1 int not null primary key, c2 text);
insert into t1 values (2,
'http://255.102.51.212/*/quick/brown/fox?jumps&over&*&lazy&dog.htmlhttp://204.56.222.143/*/quick/brown/fox?jumps&over&*&lazy&dog.htmlhttp://138.183.168.227/*/quick/brown/fox?jumps&over&*&lazy&dog.html
Actually, the special character was initially the word "the", but I
wanted to see if having non-ASCII characters in the value made any
difference.  It didn't.
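
For readers who want to build similar input, here is a minimal sketch in C of how one row's text could be generated. The function name build_url_text, the seed parameter, and the number of URLs per row are assumptions made for illustration; the message does not say how the original 10000 lines were produced. The path uses the word "the", which the message says was the original content before the special character was substituted.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Path modelled on the sample row, with "the" in place of the special character. */
static const char URL_PATH[] = "/the/quick/brown/fox?jumps&over&the&lazy&dog.html";

/*
 * Hypothetical helper: write n_urls random IP-based URLs, back to back with
 * no separator, into buf.  Returns the number of bytes written (excluding
 * the terminating NUL), or -1 if buf is too small.
 */
int build_url_text(char *buf, size_t bufsz, int n_urls, unsigned int seed)
{
    size_t used = 0;

    if (bufsz == 0)
        return -1;
    buf[0] = '\0';
    srand(seed);

    for (int i = 0; i < n_urls; i++)
    {
        /* Each octet is drawn independently from 0..255. */
        int n = snprintf(buf + used, bufsz - used, "http://%d.%d.%d.%d%s",
                         rand() % 256, rand() % 256,
                         rand() % 256, rand() % 256, URL_PATH);

        /* snprintf reports the length it wanted; treat truncation as failure. */
        if (n < 0 || (size_t) n >= bufsz - used)
            return -1;
        used += (size_t) n;
    }
    return (int) used;
}
```

The resulting string can be pasted into an insert statement like the one above as the value of c2.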
Unfortunately, I was testing at home last night and forgot to bring
the exact test query with me, but it was this or something close to
it:
\timing
select to_tsvector(c2) from t1, (select generate_series(1,200)) x where c1 = 2;
I was running on Ubuntu 9.10 on a dual-core AMD CPU (I don't have the
model number handy), with UTF-8 encoding and the en_US.UTF8 locale.
> I dont see why it could get slower?
I don't either.  As best I can tell, following the pointer from
orig to any of its elements seems to be way more expensive than I
would ever have guessed.  The only thing that seemed to improve the
speed was minimizing that by using a local variable to capture any
element referenced more than once.  (Although, there is overlap
between the timings for the original patch and the one which seemed
a slight improvement; I would need to do more testing to really rule
out noise and have complete confidence that my changes actually are
an improvement on the original patch.)
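
The kind of change described above can be sketched in C as follows. This is only an illustration under stated assumptions: ParserSketch and both functions are hypothetical and are not the tsearch parser code or either version of the patch. An optimizing compiler may perform the same hoisting by itself, so the sketch shows the shape of the change and makes no claim about its measured effect.

```c
#include <stddef.h>

/* Hypothetical stand-in for a parser struct; not the real tsearch parser state. */
typedef struct ParserSketch
{
    const char *str;    /* input text */
    size_t      len;    /* length of str in bytes */
    size_t      pos;    /* current offset into str */
} ParserSketch;

/* Variant 1: every field access goes through the orig pointer. */
size_t count_dots_direct(ParserSketch *orig)
{
    size_t n = 0;

    for (orig->pos = 0; orig->pos < orig->len; orig->pos++)
        if (orig->str[orig->pos] == '.')
            n++;
    return n;
}

/* Variant 2: fields referenced more than once are captured in locals. */
size_t count_dots_cached(ParserSketch *orig)
{
    const char *str = orig->str;
    size_t      len = orig->len;
    size_t      pos;
    size_t      n = 0;

    for (pos = 0; pos < len; pos++)
        if (str[pos] == '.')
            n++;

    orig->pos = pos;    /* write the position back once, after the loop */
    return n;
}
```

Both variants return the same count and leave pos at the end of the input; they differ only in how often the struct is reached through the pointer.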
Perhaps it is some quirk of using 32-bit pointers on the 64-bit AMD
CPU?  (I'm looking forward to testing this today on a 64-bit build
on an Intel CPU.)
-Kevin
