Re: gsoc, oprrest function for text search take 2 - Mailing list pgsql-hackers

From Jan Urbański
Subject Re: gsoc, oprrest function for text search take 2
Date
Msg-id 48A410B7.3020004@students.mimuw.edu.pl
In response to Re: gsoc, oprrest function for text search take 2  ("Heikki Linnakangas" <heikki@enterprisedb.com>)
Responses Re: gsoc, oprrest function for text search take 2  (Alvaro Herrera <alvherre@commandprompt.com>)
List pgsql-hackers
Heikki Linnakangas wrote:
> Jan Urbański wrote:
>> So right now the idea is to:
>>  (1) pre-sort STATISTIC_KIND_MCELEM values
>>  (2) build an array of pointers to detoasted values in tssel()
>>  (3) use binary search when looking for MCELEMs during tsquery analysis
>
> Sounds like a plan. In (2), it's even better to detoast the values
> lazily. For a typical one-word tsquery, the binary search will only look
> at a small portion of the elements.

Hm, how can I do that? TOAST is still a bit of black magic to me... Do you
mean I should stick to having Datums in TextFreq? And use DatumGetTextP
in bsearch() (assuming I'll get rid of qsort())? I wanted to avoid that,
so I won't detoast the same value multiple times, but it's true: a
binary search won't touch most elements.
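
For illustration, here is a minimal sketch of lazy detoasting combined with a cache, so that a value is expanded at most once and only if the binary search actually visits it. It is not PostgreSQL code: LazyElem, lazy_get() and lazy_bsearch() are hypothetical names, plain C strings stand in for Datums, and a string copy stands in for the real detoast call. It also assumes the array is already sorted in strcmp() order.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one MCELEM entry. "stored" plays the role of
 * the possibly-toasted Datum; "detoasted" is filled in on first access. */
typedef struct
{
    const char *stored;
    char       *detoasted;     /* NULL until the search touches this entry */
} LazyElem;

/* Counts simulated detoast operations, to show how few are needed. */
static int detoast_calls = 0;

/* Return the expanded value, "detoasting" (here: copying) at most once. */
static const char *
lazy_get(LazyElem *e)
{
    if (e->detoasted == NULL)
    {
        size_t len = strlen(e->stored) + 1;

        e->detoasted = malloc(len);
        if (e->detoasted == NULL)
            abort();
        memcpy(e->detoasted, e->stored, len);
        detoast_calls++;
    }
    return e->detoasted;
}

/* Binary search over a presorted array; returns the index or -1.
 * Only the entries probed by the search are ever expanded. */
static int
lazy_bsearch(LazyElem *elems, int n, const char *key)
{
    int lo = 0;
    int hi = n - 1;

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(key, lazy_get(&elems[mid]));

        if (cmp == 0)
            return mid;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}
```

With n presorted entries, a single lookup expands at most about log2(n) + 1 of them, and the cache keeps repeated lookups from expanding the same entry twice. In a real patch, the cached copies would also need to be freed once selectivity estimation is done.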

> Another thing is, how significant is the time spent in tssel() anyway,
> compared to actually running the query? You ran pgbench on EXPLAIN,
> which is good to see where in tssel() the time is spent, but if the time
> spent in tssel() is say 1% of the total execution time, there's no point
> optimizing it further.

Changed the pgbench script to
select * from manual where tsvector @@ to_tsquery('foo');
and the parameters to
pgbench -n -f tssel-bench.sql -t 1000 postgres

and got

number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 12.238282 (including connections establishing)
tps = 12.238606 (excluding connections establishing)

samples  %        symbol name
174731   31.6200  pglz_decompress
88105    15.9438  tsvectorout
17280     3.1271  pg_mblen
13623     2.4653  AllocSetAlloc
13059     2.3632  hash_search_with_hash_value
10845     1.9626  pg_utf_mblen
10335     1.8703  internal_text_pattern_compare
9196      1.6641  index_getnext
9102      1.6471  bttext_pattern_cmp
8075      1.4613  pg_detoast_datum_packed
7437      1.3458  LWLockAcquire
7066      1.2787  hash_any
6811      1.2325  AllocSetFree
6623      1.1985  pg_qsort
6439      1.1652  LWLockRelease
5793      1.0483  DirectFunctionCall2
5322      0.9631  _bt_compare
4664      0.8440  tsCompareString
4636      0.8389  .plt
4539      0.8214  compare_two_textfreqs

But I think I'll go with pre-sorting anyway, it feels cleaner and neater.
--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


