Re: procost for to_tsvector - Mailing list pgsql-hackers

From Tom Lane
Subject Re: procost for to_tsvector
Date
Msg-id 55153.1430541600@sss.pgh.pa.us
Whole thread Raw
In response to Re: procost for to_tsvector  (Andrew Gierth <andrew@tao11.riddles.org.uk>)
Responses Re: procost for to_tsvector  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>>> In the OP, he suggested "on the order of 100".  Maybe we could just
>>> go with 100.

>  Tom> I'm OK with that in view of <87h9trs0zm.fsf@news-spur.riddles.org.uk>

> Note that the results from that post suggest 100 as a bare minimum,
> higher values would be quite reasonable.

I'm not entirely convinced that your experiments disentangled the CPU cost
of to_tsvector itself from the costs of detoasting its input, which is an
issue that we ought to address separately.  In particular, comparing to
textlen() is unreliable for this purpose since in single-byte encodings
textlen() does not have to dereference a TOAST pointer at all.

It is possible to prove that to_tsvector() is much more expensive per-byte
than, say, md5():

regression=# select sum(length((repeat('xyzzy ', i)))) from generate_series(1,10000) i;   sum    
-----------300030000
(1 row)

Time: 360.423 ms
regression=# select sum(length(md5(repeat('xyzzy ', i)))) from generate_series(1,10000) i; sum   
--------320000
(1 row)

Time: 1339.806 ms
regression=# select sum(length(to_tsvector(repeat('xyzzy ', i)))) from generate_series(1,10000) i; sum  
-------10000
(1 row)

Time: 78564.333 ms

These numbers put md5() at about 3.3 nsec/input byte on my machine, and
to_tsvector() with the 'english' configuration at about 260 nsec/byte.
It's certainly possible that lots of repetitions of 'xyzzy ' isn't a very
representative sample of typical to_tsvector input; but at least this
test does not involve any toasted-value access.  So, as I said, I'm okay
with costing to_tsvector() at 100x the cost of md5().  I'm not convinced
that any factor above that is to_tsvector's fault.

>  Tom> and some experiments of my own, but I wonder why we are only
>  Tom> thinking of to_tsvector.  Isn't to_tsquery, for example, just
>  Tom> about as expensive?  What of other text search functions?

> Making the same change for to_tsquery and plainto_tsquery would be
> reasonable; that would help with the seqscan cost for cases like
> to_tsvector('config',col) @@ to_tsquery('blah') where the non-immutable
> form of to_tsquery is used.

Works for me.

> I don't recall seeing cases of any of the other functions figuring into
> planner decisions.

It's not so much "are they popular" as "do they involve parsing raw
text".  Once you've got the tsvector or tsquery, later steps are
(I think) much more efficient.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Auditing extension for PostgreSQL (Take 2)
Next
From: Kohei KaiGai
Date:
Subject: Re: feature freeze and beta schedule