Home > mailing lists

Re: gsoc, oprrest function for text search take 2 - Mailing list pgsql-hackers

From	Jan Urbański
Subject	Re: gsoc, oprrest function for text search take 2
Date	August 13, 2008 20:44:18
Msg-id	48A3477E.7090205@students.mimuw.edu.pl Whole thread Raw
In response to	Re: gsoc, oprrest function for text search take 2 ("Heikki Linnakangas" <heikki@enterprisedb.com>)
Responses	Re: gsoc, oprrest function for text search take 2 (Heikki Linnakangas <heikki@enterprisedb.com>)
List	pgsql-hackers

Tree view

Heikki Linnakangas wrote:
> Jan Urbański wrote:
>> through it. The only tiny ugliness is that there's one function used
>> for qsort() and another for bsearch(), because I'm sorting an array of
>> texts (from pg_statistic) and I'm binary searching for a lexeme
>> (non-NULL terminated string with length).
>
> It would be nice to clean that up a bit. I think you could convert the
> lexeme to a TextFreq, or make the TextFreq.element a "text *" instead of
> Datum (ie., detoast it with PG_DETOAST_DATUM while you build the array
> for qsort).

Hm, making it a text * won't help, I think, because the problem is: - pg_statistics holds an array of text datums and
anarray of float  
datums, ordered by the latter - a tsquery is a tree of WordEntries (lexemes), ie. non-NULL
terminated strings

The approach was to: (1) create an array of (text, float) pairs by zipping the two
pg_statistics arrays into one (2) sort that array on text values (3) every time a frequency of a WordEntry needs to be
determined,look  
it up using binary search in the sorted array

So for (2) I needed a function that compares text with text (I reused
bttext_pattern_cmp for that). And to do (3) I needed a function that
compares text with WordEntries. I didn't want to convert WordEntries
into "text *" values, because that would require a palloc().
Hmm, maybe I could build an array of (lexeme, float) in (1) instead,
turning "text *" into a lexeme is very straightforward. Then I'd use the
same function in (2) and (3) - cleaner.

However, maybe I should just skip all that qsort() nonsense and use a
hashtable?
Another bold thought is: keep the pg_statistics arrays for tsvectors
ordered by text datums, and not frequencies. Would've been awkward,
'cause people expect the most common frequencies array to be sorted and
not the most common values/elements one, but it'd help a lot and
simplify the code quite a bit. It would induce one extra qsort() in
ts_typanalyze(), but would allow only bsearch()es in tssel().

>> My medicore gprof skills got me:
>> [... nonsense ...]

> I'd like to see a little bit more testing of that. I can't read gprof
> myself, so the above doesn't give me much confidence. I use oprofile,
> which I find is much simpler to use.
> I think the worst case scenario is with statistics_target set to
> maximum, with a simplest possible query and simplest possible tsquery.

One kernel recompile later...
I got oprofile running, here's the setup:
$ pgbench -n -f tssel-bench.sql -c 10 -t 1000 postgres
And here's tssel-bench.sql:
explain select * from manuals where tsvector @@ to_tsquery('foo');

The "manuals" table was rather small (but that's irrelevant I think) and
statistic_target for the "tsvector" column were set to 100.

Obviously foo() isn't a most common element in my test data, so the
bsearch()es always miss. The results are:
samples  %        symbol name
101507   13.4461  internal_text_pattern_compare
91398    12.1070  bttext_pattern_cmp
82753    10.9618  pg_detoast_datum_packed
66434     8.8001  pg_qsort
54005     7.1537  DirectFunctionCall2
48925     6.4808  pglz_decompress
44931     5.9518  compare_two_textfreqs
40178     5.3222  AllocSetAlloc
26763     3.5451  AllocSetCheck
20839     2.7604  AtEOXact_CatCache
16057     2.1270  AllocSetFree
13772     1.8243  swapfunc
10001     1.3248  .plt
7859      1.0410  text_to_cstring
7556      1.0009  datumCopy

So, maybe qsorting() every time you plan a query is not that cheap after
all. I think hashing would also be an overkill. How do peole feel about
storing the statistics sorted on values and not on frequencies?

Cheers,
Jan

PS:
I used a simple
$ opreport --symbols /path/to/postgres
are there any magic switches I need to add?

J

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

pgsql-hackers by date:

From: Heikki Linnakangas
Date: 13 August 2008, 20:33:45
Subject: Re: Replay attack of query cancel

From: Decibel!
Date: 13 August 2008, 20:52:35
Subject: Re: SeqScan costs

Re: gsoc, oprrest function for text search take 2 - Mailing list pgsql-hackers

Previous

Next