Re: gsoc, oprrest function for text search take 2 - Mailing list pgsql-hackers
From | Jan Urbański |
---|---|
Subject | Re: gsoc, oprrest function for text search take 2 |
Date | |
Msg-id | 48A3477E.7090205@students.mimuw.edu.pl Whole thread Raw |
In response to | Re: gsoc, oprrest function for text search take 2 ("Heikki Linnakangas" <heikki@enterprisedb.com>) |
Responses |
Re: gsoc, oprrest function for text search take 2
(Heikki Linnakangas <heikki@enterprisedb.com>)
|
List | pgsql-hackers |
Heikki Linnakangas wrote: > Jan Urbański wrote: >> through it. The only tiny ugliness is that there's one function used >> for qsort() and another for bsearch(), because I'm sorting an array of >> texts (from pg_statistic) and I'm binary searching for a lexeme >> (non-NULL terminated string with length). > > It would be nice to clean that up a bit. I think you could convert the > lexeme to a TextFreq, or make the TextFreq.element a "text *" instead of > Datum (ie., detoast it with PG_DETOAST_DATUM while you build the array > for qsort). Hm, making it a text * won't help, I think, because the problem is: - pg_statistics holds an array of text datums and anarray of float datums, ordered by the latter - a tsquery is a tree of WordEntries (lexemes), ie. non-NULL terminated strings The approach was to: (1) create an array of (text, float) pairs by zipping the two pg_statistics arrays into one (2) sort that array on text values (3) every time a frequency of a WordEntry needs to be determined,look it up using binary search in the sorted array So for (2) I needed a function that compares text with text (I reused bttext_pattern_cmp for that). And to do (3) I needed a function that compares text with WordEntries. I didn't want to convert WordEntries into "text *" values, because that would require a palloc(). Hmm, maybe I could build an array of (lexeme, float) in (1) instead, turning "text *" into a lexeme is very straightforward. Then I'd use the same function in (2) and (3) - cleaner. However, maybe I should just skip all that qsort() nonsense and use a hashtable? Another bold thought is: keep the pg_statistics arrays for tsvectors ordered by text datums, and not frequencies. Would've been awkward, 'cause people expect the most common frequencies array to be sorted and not the most common values/elements one, but it'd help a lot and simplify the code quite a bit. It would induce one extra qsort() in ts_typanalyze(), but would allow only bsearch()es in tssel(). >> My medicore gprof skills got me: >> [... nonsense ...] > I'd like to see a little bit more testing of that. I can't read gprof > myself, so the above doesn't give me much confidence. I use oprofile, > which I find is much simpler to use. > I think the worst case scenario is with statistics_target set to > maximum, with a simplest possible query and simplest possible tsquery. One kernel recompile later... I got oprofile running, here's the setup: $ pgbench -n -f tssel-bench.sql -c 10 -t 1000 postgres And here's tssel-bench.sql: explain select * from manuals where tsvector @@ to_tsquery('foo'); The "manuals" table was rather small (but that's irrelevant I think) and statistic_target for the "tsvector" column were set to 100. Obviously foo() isn't a most common element in my test data, so the bsearch()es always miss. The results are: samples % symbol name 101507 13.4461 internal_text_pattern_compare 91398 12.1070 bttext_pattern_cmp 82753 10.9618 pg_detoast_datum_packed 66434 8.8001 pg_qsort 54005 7.1537 DirectFunctionCall2 48925 6.4808 pglz_decompress 44931 5.9518 compare_two_textfreqs 40178 5.3222 AllocSetAlloc 26763 3.5451 AllocSetCheck 20839 2.7604 AtEOXact_CatCache 16057 2.1270 AllocSetFree 13772 1.8243 swapfunc 10001 1.3248 .plt 7859 1.0410 text_to_cstring 7556 1.0009 datumCopy So, maybe qsorting() every time you plan a query is not that cheap after all. I think hashing would also be an overkill. How do peole feel about storing the statistics sorted on values and not on frequencies? Cheers, Jan PS: I used a simple $ opreport --symbols /path/to/postgres are there any magic switches I need to add? J -- Jan Urbanski GPG key ID: E583D7D2 ouden estin
pgsql-hackers by date: