Thread: gsoc, oprrest function for text search take 2
Hi, I know Commit Fest is in progress, as well as the holiday season. But the Summer of Code ends in about three weeks, so I'd like to request a bit of out-of-order processing :) My previous mail sent to -hackers is here: http://archives.postgresql.org/message-id/4881CCCF.4020905@students.mimuw.edu.pl I had problems applying the patch from that mail (it got mangled somehow, could by my mail agent...), so I'm attaching it again. There are two things that I'm not particularly proud of in the patch. First is palloc()ing and filling in a table simply to user qsort() on it. I guess I could either hack up a version of get_attstatsslot() that returns an array of (element, frequency) pairs or sort the elements by hand instead of using qsort() and keep the order of frequencies in sync while doing the sort. Another thing are cstring_to_text_with_len calls. I'm doing them so I can use bttextcmp in bsearch(). I think I could come up with a dedicated function to return text Datums and WordEntries (read: non-NULL terminated strings with a given length). Are these unsignificant? Or should I do these optimizations? Or, sadly, signs that using binary search is not a good decision? Thanks, Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin diff --git a/src/backend/tsearch/Makefile b/src/backend/tsearch/Makefile index e20a4a2..ba728eb 100644 --- a/src/backend/tsearch/Makefile +++ b/src/backend/tsearch/Makefile @@ -19,7 +19,7 @@ DICTFILES=synonym_sample.syn thesaurus_sample.ths hunspell_sample.affix \ OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \ dict_simple.o dict_synonym.o dict_thesaurus.o \ dict_ispell.o regis.o spell.o \ - to_tsany.o ts_typanalyze.o ts_utils.o + to_tsany.o ts_typanalyze.o ts_selfuncs.o ts_utils.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/tsearch/ts_selfuncs.c b/src/backend/tsearch/ts_selfuncs.c new file mode 100644 index 0000000..4b86072 --- /dev/null +++ b/src/backend/tsearch/ts_selfuncs.c @@ -0,0 +1,299 @@ +/*------------------------------------------------------------------------- + * + * ts_selfuncs.c + * Selectivity functions for text search types. 
+ * + * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group + * + * + * IDENTIFICATION + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "miscadmin.h" /* for check_stack_depth() */ +#include "utils/memutils.h" +#include "utils/builtins.h" +#include "utils/syscache.h" +#include "utils/lsyscache.h" +#include "utils/selfuncs.h" +#include "catalog/pg_type.h" +#include "catalog/pg_statistic.h" +#include "nodes/nodes.h" +#include "tsearch/ts_type.h" + +/* lookup table type for binary searching through MCELEMs */ +typedef struct +{ + Datum element; + float4 frequency; +} TextFreq; + +static int +compare_textfreq(const void *e1, const void *e2); +static Selectivity +tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup, + int length, float4 minfreq); +static Selectivity +mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem, + float4 *numbers, int nnumbers); +static double +tsquerysel(VariableStatData *vardata, Datum constval); + + +/* TSQuery traversal function */ +static Selectivity +tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup, + int length, float4 minfreq) +{ + TextFreq key, + *searchres; + Selectivity s1, s2; + + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + + if (item->type == QI_VAL) + { + QueryOperand *oper = (QueryOperand *) item; + + /* + * Prepare the key for bsearch(). No need to initialize key.frequency, + * because sorting is only on key.element. + */ + key.element = PointerGetDatum( + cstring_to_text_with_len(operand + oper->distance, oper->length)); + + searchres = (TextFreq *) bsearch(&key, lookup, length, + sizeof(TextFreq), compare_textfreq); + if (searchres) + { + /* + * The element is in MCELEM. Return precise selectivity (or at + * least as precise, as ANALYZE could find out). + */ + return (Selectivity) searchres->frequency; + } + else + { + /* + * The element is not in MCELEM. Punt, but assure that the + * selectivity cannot be more than minfreq / 2. + */ + return (Selectivity) Min(DEFAULT_TS_SEL, minfreq / 2); + } + } + + /* Current TSQuery node is an operator */ + switch (item->operator.oper) + { + case OP_NOT: + return 1.0 - tsquery_opr_selec(item + 1, operand, lookup, + length, minfreq); + + case OP_AND: + return + tsquery_opr_selec(item + 1, operand, lookup, length, minfreq) * + tsquery_opr_selec(item + item->operator.left, operand, lookup, + length, minfreq); + + case OP_OR: + s1 = tsquery_opr_selec(item + 1, operand, lookup, length, minfreq); + s2 = tsquery_opr_selec(item + item->operator.left, operand, lookup, + length, minfreq); + return s1 + s2 - s1 * s2; + + default: + elog(ERROR, "unrecognized operator: %d", item->operator.oper); + } + + /* never reached, keep compiler happy */ + return (Selectivity) DEFAULT_TS_SEL; +} + +/* + * Traverse the tsquery preorder, calculating selectivity as: + * + * selec(left_oper) * selec(right_oper) in AND nodes, + * + * selec(left_oper) + selec(right_oper) - + * selec(left_oper) * selec(right_oper) in OR nodes, + * + * 1 - select(oper) in NOT nodes + * + * freq[val] in VAL nodes, if the value is in MCELEM + * min(freq[MCELEM]) / 2 in VAL nodes, if it is not + * + * + * Implementation-wise, we sort the MCELEM array to use binary + * search on it. 
+ */ +static Selectivity +mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem, + float4 *numbers, int nnumbers) +{ + float4 minfreq; + TextFreq *lookup; + Selectivity selec; + MemoryContext opr_context, + old_context; + int i; + + /* Grab the lowest frequency */ + minfreq = numbers[nnumbers - 1]; + + /* + * Compute selectivity in child context, because we will possibly do a lot + * of QueryOperand to text transformations, which palloc() memory + */ + opr_context = AllocSetContextCreate(CurrentMemoryContext, + "Text search restriction", + ALLOCSET_DEFAULT_MINSIZE, + ALLOCSET_DEFAULT_INITSIZE, + ALLOCSET_DEFAULT_MAXSIZE); + old_context = MemoryContextSwitchTo(opr_context); + + + /* Construct the array to be qsort()'ed */ + Assert(nmcelem == nnumbers); + lookup = (TextFreq *) palloc(sizeof(TextFreq) * nmcelem); + for (i = 0; i < nmcelem; i++) + { + lookup[i].element = mcelem[i]; + lookup[i].frequency = numbers[i]; + } + + qsort(lookup, nmcelem, sizeof(TextFreq), compare_textfreq); + + selec = tsquery_opr_selec(GETQUERY(query), GETOPERAND(query), lookup, + nmcelem, minfreq); + + MemoryContextSwitchTo(old_context); + MemoryContextDelete(opr_context); + + return selec; +} + +static int +compare_textfreq(const void *e1, const void *e2) +{ + const TextFreq *t1 = (const TextFreq *) e1; + const TextFreq *t2 = (const TextFreq *) e2; + + return DatumGetInt32(DirectFunctionCall2(bttextcmp, + t1->element, t2->element)); +} + +/* + * tssel -- Selectivity of "@@" + */ +Datum +tssel(PG_FUNCTION_ARGS) +{ + PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0); + /* We skip arg #2, which is the operator OID - it's gonna be "@@" */ + List *args = (List *) PG_GETARG_POINTER(2); + int varRelid = PG_GETARG_INT32(3); + VariableStatData vardata; + Node *other; + bool varonleft; + Selectivity selec; + + /* + * If expression is not variable op something or something op variable, + * then punt and return a default estimate. + */ + if (!get_restriction_variable(root, args, varRelid, + &vardata, &other, &varonleft)) + { + PG_RETURN_FLOAT8(DEFAULT_TS_SEL); + } + + /* + * Can't do anything useful if the something is not a constant, either. + */ + if (!IsA(other, Const)) + { + ReleaseVariableStats(vardata); + PG_RETURN_FLOAT8(DEFAULT_TS_SEL); + } + + /* The "@@" operator is strict, so might cope with NULL right away */ + if (((Const *) other)->constisnull) { + ReleaseVariableStats(vardata); + PG_RETURN_FLOAT8((float8) 0.0); + } + + /* + * OK, there's a Var and a Const we're dealing with here. We need the Var + * to be a TSVector (or else we don't have any useful statistic for it). 
+ */ + + if (vardata.vartype == TSVECTOROID) + { + /* tsvector @@ tsquery or the other way around */ + Assert(((Const *) other)->consttype == TSQUERYOID); + + selec = tsquerysel(&vardata, ((Const *) other)->constvalue); + } + else + { + /* The Var is something we don't have useful statistic for */ + selec = (Selectivity) DEFAULT_TS_SEL; + } + + ReleaseVariableStats(vardata); + + PG_RETURN_FLOAT8((float8) selec); +} + +static Selectivity +tsquerysel(VariableStatData *vardata, Datum constval) +{ + Selectivity selec; + + if (HeapTupleIsValid(vardata->statsTuple)) + { + TSQuery query; + Form_pg_statistic stats; + Datum *values; + int nvalues; + float4 *numbers; + int nnumbers; + + /* The caller made sure the const is a TSQuery, so get it now */ + query = DatumGetTSQuery(constval); + + stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple); + + /* MCELEM will be an array of Text elements for a tsvector column */ + if (get_attstatsslot(vardata->statsTuple, + TEXTOID, -1, + STATISTIC_KIND_MCELEM, InvalidOid, + &values, &nvalues, + &numbers, &nnumbers)) + { + /* + * There is a most-common-elements slot for the tsvector Var, so + * use that. + */ + + selec = mcelem_tsquery_selec(query, values, nvalues, + numbers, nnumbers); + free_attstatsslot(TEXTOID, values, nvalues, numbers, nnumbers); + } + else + { + /* No most-common-elements slot */ + selec = (Selectivity) DEFAULT_TS_SEL; + } + } + else + { + selec = (Selectivity) DEFAULT_TS_SEL; + } + + return selec; +} diff --git a/src/include/catalog/pg_operator.h b/src/include/catalog/pg_operator.h index 1c7f37d..451805d 100644 --- a/src/include/catalog/pg_operator.h +++ b/src/include/catalog/pg_operator.h @@ -915,10 +915,10 @@ DATA(insert OID = 3630 ( "<>" PGNSP PGUID b f f 3614 3614 16 3630 3629 ts DATA(insert OID = 3631 ( ">=" PGNSP PGUID b f f 3614 3614 16 3628 3627 tsvector_ge scalargtsel scalargtjoinsel)); DATA(insert OID = 3632 ( ">" PGNSP PGUID b f f 3614 3614 16 3627 3628 tsvector_gt scalargtsel scalargtjoinsel)); DATA(insert OID = 3633 ( "||" PGNSP PGUID b f f 3614 3614 3614 0 0 tsvector_concat - - )); -DATA(insert OID = 3636 ( "@@" PGNSP PGUID b f f 3614 3615 16 3637 0 ts_match_vq contsel contjoinsel )); -DATA(insert OID = 3637 ( "@@" PGNSP PGUID b f f 3615 3614 16 3636 0 ts_match_qv contsel contjoinsel )); -DATA(insert OID = 3660 ( "@@@" PGNSP PGUID b f f 3614 3615 16 3661 0 ts_match_vq contsel contjoinsel )); -DATA(insert OID = 3661 ( "@@@" PGNSP PGUID b f f 3615 3614 16 3660 0 ts_match_qv contsel contjoinsel )); +DATA(insert OID = 3636 ( "@@" PGNSP PGUID b f f 3614 3615 16 3637 0 ts_match_vq tssel contjoinsel )); +DATA(insert OID = 3637 ( "@@" PGNSP PGUID b f f 3615 3614 16 3636 0 ts_match_qv tssel contjoinsel )); +DATA(insert OID = 3660 ( "@@@" PGNSP PGUID b f f 3614 3615 16 3661 0 ts_match_vq tssel contjoinsel )); +DATA(insert OID = 3661 ( "@@@" PGNSP PGUID b f f 3615 3614 16 3660 0 ts_match_qv tssel contjoinsel )); DATA(insert OID = 3674 ( "<" PGNSP PGUID b f f 3615 3615 16 3679 3678 tsquery_lt scalarltsel scalarltjoinsel)); DATA(insert OID = 3675 ( "<=" PGNSP PGUID b f f 3615 3615 16 3678 3679 tsquery_le scalarltsel scalarltjoinsel)); DATA(insert OID = 3676 ( "=" PGNSP PGUID b t f 3615 3615 16 3676 3677 tsquery_eq eqsel eqjoinsel )); diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index 16ccb55..ef24f40 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -4321,6 +4321,8 @@ DESCR("GiST tsquery support"); DATA(insert OID = 3701 ( gtsquery_consistent PGNSP PGUID 12 1 0 0 f f t f 
i 5 16 "2281 2281 23 26 2281" _null__null_ _null_ gtsquery_consistent _null_ _null_ _null_ )); DESCR("GiST tsquery support"); +DATA(insert OID = 3687 ( tssel PGNSP PGUID 12 1 0 0 f f t f s 4 701 "2281 26 2281 23" _null_ _null_ _null_ tssel _null__null_ _null_ )); +DESCR("restriction selectivity of tsvector @@ tsquery"); DATA(insert OID = 3688 ( ts_typanalyze PGNSP PGUID 12 1 0 0 f f t f s 1 16 "2281" _null_ _null_ _null_ ts_typanalyze_null_ _null_ _null_ )); DESCR("tsvector typanalyze"); diff --git a/src/include/tsearch/ts_type.h b/src/include/tsearch/ts_type.h index 06ca1f9..6208432 100644 --- a/src/include/tsearch/ts_type.h +++ b/src/include/tsearch/ts_type.h @@ -291,4 +291,17 @@ extern Datum tsquery_rewrite_query(PG_FUNCTION_ARGS); extern Datum tsq_mcontains(PG_FUNCTION_ARGS); extern Datum tsq_mcontained(PG_FUNCTION_ARGS); +/* + * The default text search selectivity is chosen to be smll enough to + * encourage indexscans for typical table densities. See selfuncs.h and + * DEFAULT_EQ_SEL for details. + */ +#define DEFAULT_TS_SEL 0.005 + +/* + * operator restriction function for tsvector @@ tsquery and + * tsquery @@ tsvector + */ +extern Datum tssel(PG_FUNCTION_ARGS); + #endif /* _PG_TSTYPE_H_ */
Jan Urbański wrote: > Another thing are cstring_to_text_with_len calls. I'm doing them so I > can use bttextcmp in bsearch(). I think I could come up with a dedicated > function to return text Datums and WordEntries (read: non-NULL > terminated strings with a given length). Just keep them as cstrings and use strcmp. We're only keeping the array sorted so that we can binary search it, so we don't need proper locale-dependent collation. Note that we already assume that two strings ('text's) are equal if and only if their binary representations are equal (texteq() uses strcmp). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
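To illustrate the strcmp() point: a lookup table only needs a consistent total order, so one byte-wise comparator can serve both qsort() and bsearch(). A standalone sketch, not the patch itself; the sample words and frequencies are made up:

/*
 * Byte-wise ordering is enough when the array is only used for lookup;
 * no locale-aware collation is needed. Standalone toy example.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
    const char *lexeme;
    double      frequency;
} Entry;

static int
entry_cmp(const void *a, const void *b)
{
    return strcmp(((const Entry *) a)->lexeme, ((const Entry *) b)->lexeme);
}

int
main(void)
{
    Entry       table[] = {
        {"postgres", 0.30}, {"index", 0.12}, {"tsearch", 0.07}
    };
    Entry       key = {"index", 0.0};
    Entry      *hit;

    qsort(table, 3, sizeof(Entry), entry_cmp);
    hit = bsearch(&key, table, 3, sizeof(Entry), entry_cmp);

    printf("frequency of '%s' = %.2f\n", key.lexeme,
           hit ? hit->frequency : 0.0);
    return 0;
}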
Heikki Linnakangas wrote: > Jan Urbański wrote: >> Another thing are cstring_to_text_with_len calls. I'm doing them so I >> can use bttextcmp in bsearch(). I think I could come up with a >> dedicated function to return text Datums and WordEntries (read: >> non-NULL terminated strings with a given length). > > Just keep them as cstrings and use strcmp. We're only keeping the array > sorted so that we can binary search it, so we don't need proper > locale-dependent collation. Note that we already assume that two strings > ('text's) are equal if and only if their binary representations are > equal (texteq() uses strcmp). OK, I got rid of cstring->text calls and memory contexts as I went through it. The only tiny ugliness is that there's one function used for qsort() and another for bsearch(), because I'm sorting an array of texts (from pg_statistic) and I'm binary searching for a lexeme (non-NULL terminated string with length). My medicore gprof skills got me: 0.00 0.22 5/5 OidFunctionCall4 [37] [38] 28.4 0.00 0.22 5 tssel [38] 0.00 0.17 5/5 get_restriction_variable [40] 0.03 0.01 5/10 pg_qsort [60] 0.00 0.00 5/5 get_attstatsslot [139] Hopefully that says that the qsort() overhead is small compared to munging through the planner Node. Revised version of the patch attached. Cheers, Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin diff --git a/src/backend/tsearch/Makefile b/src/backend/tsearch/Makefile index e20a4a2..ba728eb 100644 *** a/src/backend/tsearch/Makefile --- b/src/backend/tsearch/Makefile *************** *** 19,25 **** OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \ dict_simple.o dict_synonym.o dict_thesaurus.o \ dict_ispell.o regis.o spell.o \ ! to_tsany.o ts_typanalyze.o ts_utils.o include $(top_srcdir)/src/backend/common.mk --- 19,25 ---- OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \ dict_simple.o dict_synonym.o dict_thesaurus.o \ dict_ispell.o regis.o spell.o \ ! to_tsany.o ts_typanalyze.o ts_selfuncs.o ts_utils.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/tsearch/ts_selfuncs.c b/src/backend/tsearch/ts_selfuncs.c new file mode 100644 index 0000000..4deb507 *** /dev/null --- b/src/backend/tsearch/ts_selfuncs.c *************** *** 0,-1 **** --- 1,334 ---- + /*------------------------------------------------------------------------- + * + * ts_selfuncs.c + * Selectivity functions for text search types. 
+ * + * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group + * + * + * IDENTIFICATION + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ + #include "postgres.h" + + #include "miscadmin.h" /* for check_stack_depth() */ + #include "utils/memutils.h" + #include "utils/builtins.h" + #include "utils/syscache.h" + #include "utils/lsyscache.h" + #include "utils/selfuncs.h" + #include "catalog/pg_type.h" + #include "catalog/pg_statistic.h" + #include "nodes/nodes.h" + #include "tsearch/ts_type.h" + + /* lookup table type for binary searching through MCELEMs */ + typedef struct + { + Datum element; + float4 frequency; + } TextFreq; + + /* type of keys for bsearch()ing through an array of TextFreqs */ + typedef struct + { + char *lexeme; + int length; + } LexemeKey; + + static int + compare_two_textfreqs(const void *e1, const void *e2); + static int + compare_lexeme_textfreq(const void *e1, const void *e2); + + static Selectivity + tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup, + int length, float4 minfreq); + static Selectivity + mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem, + float4 *numbers, int nnumbers); + static double + tsquerysel(VariableStatData *vardata, Datum constval); + + + /* TSQuery traversal function */ + static Selectivity + tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup, + int length, float4 minfreq) + { + LexemeKey key; + TextFreq *searchres; + Selectivity s1, s2; + + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + + if (item->type == QI_VAL) + { + QueryOperand *oper = (QueryOperand *) item; + + /* + * Prepare the key for bsearch(). + */ + key.lexeme = operand + oper->distance; + key.length = oper->length; + + searchres = (TextFreq *) bsearch(&key, lookup, length, + sizeof(TextFreq), compare_lexeme_textfreq); + if (searchres) + { + /* + * The element is in MCELEM. Return precise selectivity (or at + * least as precise, as ANALYZE could find out). + */ + return (Selectivity) searchres->frequency; + } + else + { + /* + * The element is not in MCELEM. Punt, but assure that the + * selectivity cannot be more than minfreq / 2. + */ + return (Selectivity) Min(DEFAULT_TS_SEL, minfreq / 2); + } + } + + /* Current TSQuery node is an operator */ + switch (item->operator.oper) + { + case OP_NOT: + return 1.0 - tsquery_opr_selec(item + 1, operand, lookup, + length, minfreq); + + case OP_AND: + return + tsquery_opr_selec(item + 1, operand, lookup, length, minfreq) * + tsquery_opr_selec(item + item->operator.left, operand, lookup, + length, minfreq); + + case OP_OR: + s1 = tsquery_opr_selec(item + 1, operand, lookup, length, minfreq); + s2 = tsquery_opr_selec(item + item->operator.left, operand, lookup, + length, minfreq); + return s1 + s2 - s1 * s2; + + default: + elog(ERROR, "unrecognized operator: %d", item->operator.oper); + } + + /* never reached, keep compiler happy */ + return (Selectivity) DEFAULT_TS_SEL; + } + + /* + * Traverse the tsquery preorder, calculating selectivity as: + * + * selec(left_oper) * selec(right_oper) in AND nodes, + * + * selec(left_oper) + selec(right_oper) - + * selec(left_oper) * selec(right_oper) in OR nodes, + * + * 1 - select(oper) in NOT nodes + * + * freq[val] in VAL nodes, if the value is in MCELEM + * min(freq[MCELEM]) / 2 in VAL nodes, if it is not + * + * + * Implementation-wise, we sort the MCELEM array to use binary + * search on it. 
+ */ + static Selectivity + mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem, + float4 *numbers, int nnumbers) + { + float4 minfreq; + TextFreq *lookup; + Selectivity selec; + int i; + + /* Grab the lowest frequency */ + minfreq = numbers[nnumbers - 1]; + + /* Construct the array to be qsort()'ed */ + Assert(nmcelem == nnumbers); + lookup = (TextFreq *) palloc(sizeof(TextFreq) * nmcelem); + for (i = 0; i < nmcelem; i++) + { + lookup[i].element = mcelem[i]; + lookup[i].frequency = numbers[i]; + } + + qsort(lookup, nmcelem, sizeof(TextFreq), compare_two_textfreqs); + + selec = tsquery_opr_selec(GETQUERY(query), GETOPERAND(query), lookup, + nmcelem, minfreq); + + pfree(lookup); + + return selec; + } + + /* + * qsort() comparator for TextFreqs array. Sorts on the text datum. + * The comparision is byte-by-byte, because we already consider text datums + * equal only if their binary representations are equal and we don't really + * care about the result ordering, as long as it's the same as the one + * assumed by the bsearch() comparator. + */ + static int + compare_two_textfreqs(const void *e1, const void *e2) + { + const TextFreq *t1 = (const TextFreq *) e1; + const TextFreq *t2 = (const TextFreq *) e2; + + return DatumGetInt32(DirectFunctionCall2(bttext_pattern_cmp, + t1->element, t2->element)); + } + + /* + * bsearch() comparator for a lexeme (non-NULL terminated string with length) + * and a TextFreq. Use byte-for-byte comparision, because that's the one we + * qsort()ed on. + */ + static int + compare_lexeme_textfreq(const void *e1, const void *e2) + { + const LexemeKey *key; + const TextFreq *t; + text *element; + int result; + int len1, + len2; + + key = (const LexemeKey *) e1; + t = (const TextFreq *) e2; + element = DatumGetTextP(t->element); + + len1 = key->length; + len2 = VARSIZE_ANY_EXHDR(element); + + result = strncmp(key->lexeme, VARDATA_ANY(element), Min(len1, len2)); + + if (result != 0) + return result; + else if (len1 < len2) + return -1; + else if (len1 > len2) + return 1; + else + return 0; + } + + /* + * tssel -- Selectivity of "@@" + */ + Datum + tssel(PG_FUNCTION_ARGS) + { + PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0); + /* We skip arg #2, which is the operator OID - it's gonna be "@@" */ + List *args = (List *) PG_GETARG_POINTER(2); + int varRelid = PG_GETARG_INT32(3); + VariableStatData vardata; + Node *other; + bool varonleft; + Selectivity selec; + + /* + * If expression is not variable op something or something op variable, + * then punt and return a default estimate. + */ + if (!get_restriction_variable(root, args, varRelid, + &vardata, &other, &varonleft)) + { + PG_RETURN_FLOAT8(DEFAULT_TS_SEL); + } + + /* + * Can't do anything useful if the something is not a constant, either. + */ + if (!IsA(other, Const)) + { + ReleaseVariableStats(vardata); + PG_RETURN_FLOAT8(DEFAULT_TS_SEL); + } + + /* The "@@" operator is strict, so might cope with NULL right away */ + if (((Const *) other)->constisnull) { + ReleaseVariableStats(vardata); + PG_RETURN_FLOAT8((float8) 0.0); + } + + /* + * OK, there's a Var and a Const we're dealing with here. We need the Var + * to be a TSVector (or else we don't have any useful statistic for it). 
+ */ + + if (vardata.vartype == TSVECTOROID) + { + /* tsvector @@ tsquery or the other way around */ + Assert(((Const *) other)->consttype == TSQUERYOID); + + selec = tsquerysel(&vardata, ((Const *) other)->constvalue); + } + else + { + /* The Var is something we don't have useful statistic for */ + selec = DEFAULT_TS_SEL; + } + + ReleaseVariableStats(vardata); + + PG_RETURN_FLOAT8((float8) selec); + } + + static Selectivity + tsquerysel(VariableStatData *vardata, Datum constval) + { + Selectivity selec; + + if (HeapTupleIsValid(vardata->statsTuple)) + { + TSQuery query; + Form_pg_statistic stats; + Datum *values; + int nvalues; + float4 *numbers; + int nnumbers; + + /* The caller made sure the const is a TSQuery, so get it now */ + query = DatumGetTSQuery(constval); + + stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple); + + /* MCELEM will be an array of Text elements for a tsvector column */ + if (get_attstatsslot(vardata->statsTuple, + TEXTOID, -1, + STATISTIC_KIND_MCELEM, InvalidOid, + &values, &nvalues, + &numbers, &nnumbers)) + { + /* + * There is a most-common-elements slot for the tsvector Var, so + * use that. + */ + + selec = mcelem_tsquery_selec(query, values, nvalues, + numbers, nnumbers); + free_attstatsslot(TEXTOID, values, nvalues, numbers, nnumbers); + } + else + { + /* No most-common-elements slot */ + selec = (Selectivity) DEFAULT_TS_SEL; + } + } + else + { + selec = (Selectivity) DEFAULT_TS_SEL; + } + + return selec; + } diff --git a/src/include/catalog/pg_operator.h b/src/include/catalog/pg_operator.h index 1c7f37d..451805d 100644 *** a/src/include/catalog/pg_operator.h --- b/src/include/catalog/pg_operator.h *************** *** 915,924 **** DATA(insert OID = 3631 ( ">=" PGNSP PGUID b f f 3614 3614 16 3628 3627 tsvector_ge scalargtsel scalargtjoinsel)); DATA(insert OID = 3632 ( ">" PGNSP PGUID b f f 3614 3614 16 3627 3628 tsvector_gt scalargtsel scalargtjoinsel)); DATA(insert OID = 3633 ( "||" PGNSP PGUID b f f 3614 3614 3614 0 0 tsvector_concat - - )); ! DATA(insert OID = 3636 ( "@@" PGNSP PGUID b f f 3614 3615 16 3637 0 ts_match_vq contsel contjoinsel )); ! DATA(insert OID = 3637 ( "@@" PGNSP PGUID b f f 3615 3614 16 3636 0 ts_match_qv contsel contjoinsel )); ! DATA(insert OID = 3660 ( "@@@" PGNSP PGUID b f f 3614 3615 16 3661 0 ts_match_vq contsel contjoinsel )); ! DATA(insert OID = 3661 ( "@@@" PGNSP PGUID b f f 3615 3614 16 3660 0 ts_match_qv contsel contjoinsel )); DATA(insert OID = 3674 ( "<" PGNSP PGUID b f f 3615 3615 16 3679 3678 tsquery_lt scalarltsel scalarltjoinsel)); DATA(insert OID = 3675 ( "<=" PGNSP PGUID b f f 3615 3615 16 3678 3679 tsquery_le scalarltsel scalarltjoinsel)); DATA(insert OID = 3676 ( "=" PGNSP PGUID b t f 3615 3615 16 3676 3677 tsquery_eq eqsel eqjoinsel )); --- 915,924 ---- DATA(insert OID = 3631 ( ">=" PGNSP PGUID b f f 3614 3614 16 3628 3627 tsvector_ge scalargtsel scalargtjoinsel)); DATA(insert OID = 3632 ( ">" PGNSP PGUID b f f 3614 3614 16 3627 3628 tsvector_gt scalargtsel scalargtjoinsel)); DATA(insert OID = 3633 ( "||" PGNSP PGUID b f f 3614 3614 3614 0 0 tsvector_concat - - )); ! DATA(insert OID = 3636 ( "@@" PGNSP PGUID b f f 3614 3615 16 3637 0 ts_match_vq tssel contjoinsel )); ! DATA(insert OID = 3637 ( "@@" PGNSP PGUID b f f 3615 3614 16 3636 0 ts_match_qv tssel contjoinsel )); ! DATA(insert OID = 3660 ( "@@@" PGNSP PGUID b f f 3614 3615 16 3661 0 ts_match_vq tssel contjoinsel )); ! 
DATA(insert OID = 3661 ( "@@@" PGNSP PGUID b f f 3615 3614 16 3660 0 ts_match_qv tssel contjoinsel )); DATA(insert OID = 3674 ( "<" PGNSP PGUID b f f 3615 3615 16 3679 3678 tsquery_lt scalarltsel scalarltjoinsel)); DATA(insert OID = 3675 ( "<=" PGNSP PGUID b f f 3615 3615 16 3678 3679 tsquery_le scalarltsel scalarltjoinsel)); DATA(insert OID = 3676 ( "=" PGNSP PGUID b t f 3615 3615 16 3676 3677 tsquery_eq eqsel eqjoinsel )); diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index 16ccb55..ef24f40 100644 *** a/src/include/catalog/pg_proc.h --- b/src/include/catalog/pg_proc.h *************** *** 4321,4326 **** --- 4321,4328 ---- DATA(insert OID = 3701 ( gtsquery_consistent PGNSP PGUID 12 1 0 0 f f t f i 5 16 "2281 2281 23 26 2281" _null__null_ _null_ gtsquery_consistent _null_ _null_ _null_ )); DESCR("GiST tsquery support"); + DATA(insert OID = 3687 ( tssel PGNSP PGUID 12 1 0 0 f f t f s 4 701 "2281 26 2281 23" _null_ _null_ _null_ tssel _null__null_ _null_ )); + DESCR("restriction selectivity of tsvector @@ tsquery"); DATA(insert OID = 3688 ( ts_typanalyze PGNSP PGUID 12 1 0 0 f f t f s 1 16 "2281" _null_ _null_ _null_ ts_typanalyze_null_ _null_ _null_ )); DESCR("tsvector typanalyze"); diff --git a/src/include/tsearch/ts_type.h b/src/include/tsearch/ts_type.h index 06ca1f9..b253327 100644 *** a/src/include/tsearch/ts_type.h --- b/src/include/tsearch/ts_type.h *************** *** 155,160 **** --- 155,162 ---- extern Datum ts_typanalyze(PG_FUNCTION_ARGS); + extern Datum tssel(PG_FUNCTION_ARGS); + /* * TSQuery *************** *** 291,294 **** --- 293,309 ---- extern Datum tsq_mcontains(PG_FUNCTION_ARGS); extern Datum tsq_mcontained(PG_FUNCTION_ARGS); + /* + * The default text search selectivity is chosen to be smll enough to + * encourage indexscans for typical table densities. See selfuncs.h and + * DEFAULT_EQ_SEL for details. + */ + #define DEFAULT_TS_SEL 0.005 + + /* + * operator restriction function for tsvector @@ tsquery and + * tsquery @@ tsvector + */ + extern Datum tssel(PG_FUNCTION_ARGS); + #endif /* _PG_TSTYPE_H_ */
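The two-comparator arrangement described above works because bsearch() lets the probe key have a different type from the array elements, as long as the key comparator induces exactly the ordering the array was sorted with. A standalone sketch of that pattern, with made-up names rather than the patch's types:

/*
 * bsearch() with an asymmetric comparator: the key is a
 * (pointer, length) lexeme, the table stores NUL-terminated words.
 * Standalone illustration only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
    const char *word;           /* NUL-terminated, as stored in the table */
    double      frequency;
} WordFreq;

typedef struct
{
    const char *lexeme;         /* NOT NUL-terminated */
    size_t      length;
} LexemeKey;

/* qsort() comparator: element vs element */
static int
wordfreq_cmp(const void *a, const void *b)
{
    return strcmp(((const WordFreq *) a)->word, ((const WordFreq *) b)->word);
}

/* bsearch() comparator: key vs element, must induce the same ordering */
static int
key_vs_wordfreq_cmp(const void *k, const void *e)
{
    const LexemeKey *key = (const LexemeKey *) k;
    const WordFreq *wf = (const WordFreq *) e;
    size_t      len2 = strlen(wf->word);
    size_t      minlen = key->length < len2 ? key->length : len2;
    int         res = memcmp(key->lexeme, wf->word, minlen);

    if (res != 0)
        return res;
    return (key->length > len2) - (key->length < len2);
}

int
main(void)
{
    WordFreq    table[] = {{"search", 0.08}, {"full", 0.20}, {"text", 0.15}};
    const char *buf = "textual";        /* lexeme is the first 4 bytes */
    LexemeKey   key = {buf, 4};
    WordFreq   *hit;

    qsort(table, 3, sizeof(WordFreq), wordfreq_cmp);
    hit = bsearch(&key, table, 3, sizeof(WordFreq), key_vs_wordfreq_cmp);
    printf("found: %s\n", hit ? hit->word : "(none)");
    return 0;
}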
Jan Urbański wrote: > Heikki Linnakangas wrote: >> Jan Urbański wrote: >>> Another thing are cstring_to_text_with_len calls. I'm doing them so I >>> can use bttextcmp in bsearch(). I think I could come up with a >>> dedicated function to return text Datums and WordEntries (read: >>> non-NULL terminated strings with a given length). >> >> Just keep them as cstrings and use strcmp. We're only keeping the >> array sorted so that we can binary search it, so we don't need proper >> locale-dependent collation. Note that we already assume that two >> strings ('text's) are equal if and only if their binary >> representations are equal (texteq() uses strcmp). > > OK, I got rid of cstring->text calls and memory contexts as I went > through it. The only tiny ugliness is that there's one function used for > qsort() and another for bsearch(), because I'm sorting an array of texts > (from pg_statistic) and I'm binary searching for a lexeme (non-NULL > terminated string with length). It would be nice to clean that up a bit. I think you could convert the lexeme to a TextFreq, or make the TextFreq.element a "text *" instead of Datum (ie., detoast it with PG_DETOAST_DATUM while you build the array for qsort). > My medicore gprof skills got me: > 0.00 0.22 5/5 OidFunctionCall4 [37] > [38] 28.4 0.00 0.22 5 tssel [38] > 0.00 0.17 5/5 get_restriction_variable [40] > 0.03 0.01 5/10 pg_qsort [60] > 0.00 0.00 5/5 get_attstatsslot [139] > > Hopefully that says that the qsort() overhead is small compared to > munging through the planner Node. I'd like to see a little bit more testing of that. I can't read gprof myself, so the above doesn't give me much confidence. I use oprofile, which I find is much simpler to use. I think the worst case scenario is with statistics_target set to maximum, with a simplest possible query and simplest possible tsquery. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Jan Urbański wrote: >> through it. The only tiny ugliness is that there's one function used >> for qsort() and another for bsearch(), because I'm sorting an array of >> texts (from pg_statistic) and I'm binary searching for a lexeme >> (non-NULL terminated string with length). > > It would be nice to clean that up a bit. I think you could convert the > lexeme to a TextFreq, or make the TextFreq.element a "text *" instead of > Datum (ie., detoast it with PG_DETOAST_DATUM while you build the array > for qsort). Hm, making it a text * won't help, I think, because the problem is: - pg_statistics holds an array of text datums and anarray of float datums, ordered by the latter - a tsquery is a tree of WordEntries (lexemes), ie. non-NULL terminated strings The approach was to: (1) create an array of (text, float) pairs by zipping the two pg_statistics arrays into one (2) sort that array on text values (3) every time a frequency of a WordEntry needs to be determined,look it up using binary search in the sorted array So for (2) I needed a function that compares text with text (I reused bttext_pattern_cmp for that). And to do (3) I needed a function that compares text with WordEntries. I didn't want to convert WordEntries into "text *" values, because that would require a palloc(). Hmm, maybe I could build an array of (lexeme, float) in (1) instead, turning "text *" into a lexeme is very straightforward. Then I'd use the same function in (2) and (3) - cleaner. However, maybe I should just skip all that qsort() nonsense and use a hashtable? Another bold thought is: keep the pg_statistics arrays for tsvectors ordered by text datums, and not frequencies. Would've been awkward, 'cause people expect the most common frequencies array to be sorted and not the most common values/elements one, but it'd help a lot and simplify the code quite a bit. It would induce one extra qsort() in ts_typanalyze(), but would allow only bsearch()es in tssel(). >> My medicore gprof skills got me: >> [... nonsense ...] > I'd like to see a little bit more testing of that. I can't read gprof > myself, so the above doesn't give me much confidence. I use oprofile, > which I find is much simpler to use. > I think the worst case scenario is with statistics_target set to > maximum, with a simplest possible query and simplest possible tsquery. One kernel recompile later... I got oprofile running, here's the setup: $ pgbench -n -f tssel-bench.sql -c 10 -t 1000 postgres And here's tssel-bench.sql: explain select * from manuals where tsvector @@ to_tsquery('foo'); The "manuals" table was rather small (but that's irrelevant I think) and statistic_target for the "tsvector" column were set to 100. Obviously foo() isn't a most common element in my test data, so the bsearch()es always miss. The results are: samples % symbol name 101507 13.4461 internal_text_pattern_compare 91398 12.1070 bttext_pattern_cmp 82753 10.9618 pg_detoast_datum_packed 66434 8.8001 pg_qsort 54005 7.1537 DirectFunctionCall2 48925 6.4808 pglz_decompress 44931 5.9518 compare_two_textfreqs 40178 5.3222 AllocSetAlloc 26763 3.5451 AllocSetCheck 20839 2.7604 AtEOXact_CatCache 16057 2.1270 AllocSetFree 13772 1.8243 swapfunc 10001 1.3248 .plt 7859 1.0410 text_to_cstring 7556 1.0009 datumCopy So, maybe qsorting() every time you plan a query is not that cheap after all. I think hashing would also be an overkill. How do peole feel about storing the statistics sorted on values and not on frequencies? 
Cheers, Jan PS: I used a simple "$ opreport --symbols /path/to/postgres". Are there any magic switches I need to add? J -- Jan Urbanski GPG key ID: E583D7D2 ouden estin
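For comparison, the hashtable alternative floated above could look roughly like the standalone toy below: a small chained hash from lexeme to frequency. It is purely illustrative, not proposed patch code, and it leaks its entries for brevity:

/*
 * Toy chained hash from lexeme (pointer + length, no NUL needed) to
 * frequency. Not patch code; hash function and sizes are arbitrary.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

typedef struct HashEntry
{
    const char *lexeme;
    double      frequency;
    struct HashEntry *next;
} HashEntry;

static HashEntry *buckets[NBUCKETS];

static unsigned
hash_lexeme(const char *s, size_t len)
{
    unsigned    h = 5381;

    while (len-- > 0)
        h = h * 33 + (unsigned char) *s++;
    return h % NBUCKETS;
}

static void
insert(const char *lexeme, double freq)
{
    HashEntry  *e = malloc(sizeof(HashEntry));
    unsigned    b = hash_lexeme(lexeme, strlen(lexeme));

    e->lexeme = lexeme;
    e->frequency = freq;
    e->next = buckets[b];
    buckets[b] = e;
}

static double
lookup(const char *lexeme, size_t len)
{
    HashEntry  *e = buckets[hash_lexeme(lexeme, len)];

    for (; e != NULL; e = e->next)
        if (strlen(e->lexeme) == len && memcmp(e->lexeme, lexeme, len) == 0)
            return e->frequency;
    return -1.0;                /* not a most common element */
}

int
main(void)
{
    insert("database", 0.25);
    insert("vector", 0.10);

    printf("%.2f\n", lookup("vector", 6));      /* 0.10 */
    printf("%.2f\n", lookup("missing", 7));     /* -1.00 */
    return 0;
}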
Jan Urbański wrote: > 26763 3.5451 AllocSetCheck Make sure you disable assertions before profiling. I'm actually a bit surprised the overhead isn't more than 3.5% (I've seen much higher overheads on other tests), but it's still skewing the results. - Heikki
Heikki Linnakangas wrote: > Jan Urbański wrote: >> 26763 3.5451 AllocSetCheck > > Make sure you disable assertions before profiling. Awww, darn. OK, here goes another set of results, without casserts this time. === CVS HEAD === number of clients: 10 number of transactions per client: 100000 number of transactions actually processed: 1000000/1000000 tps = 6437.286494 (including connections establishing) tps = 6438.168927 (excluding connections establishing) samples % symbol name 220443 11.6613 AllocSetAlloc 79355 4.1978 base_yyparse 77230 4.0854 SearchCatCache 56011 2.9629 hash_search_with_hash_value 45946 2.4305 MemoryContextAllocZeroAligned 38577 2.0407 hash_any 36414 1.9263 MemoryContextAlloc 33060 1.7489 AllocSetFree 27218 1.4398 ScanKeywordLookup 25793 1.3644 base_yylex 20579 1.0886 hash_uint32 18867 0.9981 hash_seq_search 18293 0.9677 expression_tree_walker 17696 0.9361 copyObject 16979 0.8982 LockAcquire 14292 0.7560 MemoryContextAllocZero 13117 0.6939 SearchSysCache === ts_sel ==== number of clients: 10 number of transactions per client: 100000 number of transactions actually processed: 1000000/1000000 tps = 3216.753677 (including connections establishing) tps = 3216.996592 (excluding connections establishing) 942096 10.9130 internal_text_pattern_compare 809195 9.3735 bttext_pattern_cmp 659545 7.6400 pg_detoast_datum_packed 628114 7.2759 pg_qsort 603998 6.9966 AllocSetAlloc 581880 6.7403 pglz_decompress 467708 5.4178 DirectFunctionCall2 385854 4.4696 compare_two_textfreqs 160578 1.8601 AllocSetFree 128642 1.4902 swapfunc 112885 1.3076 MemoryContextAlloc 103388 1.1976 SearchCatCache 100387 1.1629 text_to_cstring 99004 1.1468 hash_search_with_hash_value 98444 1.1403 .plt 92664 1.0734 base_yyparse 88511 1.0253 errstart Not good... Shall I try sorting pg_statistics arrays on text values instead of frequencies? BTW: I just noticed some text_to_cstring calls, they came from elog(DEBUG1)s that I have in my code. But they couldn't have skewn the results much, could they? Cheers, Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin
Jan Urbański wrote: > Not good... Shall I try sorting pg_statistics arrays on text values > instead of frequencies? Yeah, I'd go with that. If you only do it for the new STATISTIC_KIND_MCV_ELEMENT statistics, you shouldn't need to change any other code. Hmm. There has been discussion on raising default_statistic_target, and one reason why we've been afraid to do so has been that it increases the cost of planning (there's some O(n^2) algorithms in there). Pre-sorting the STATISTIC_KIND_MCV array as well, and replacing the linear searches with binary searches would alleviate that, which would be nice. > BTW: I just noticed some text_to_cstring calls, they came from > elog(DEBUG1)s that I have in my code. But they couldn't have skewn the > results much, could they? Well, text_to_cstring was consuming 1.1% of the CPU time on its own, and presumably some of the AllocSetAlloc overhead is attributable to that as well. And perhaps some of the detoasting as well. Speaking of which, a lot of time seems to be spent on detoasting. I'd like to understand that a better. Where is the detoasting coming from? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
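Heikki's planning-cost point can be made concrete by counting comparisons: a linear scan over an MCV-style list costs O(statistics_target) per probe, while a presorted array costs O(log n). A standalone sketch in which integers stand in for statistics entries; the numbers are arbitrary:

/*
 * Comparison counts: linear scan vs. binary search over a presorted
 * array. Standalone toy, not planner code.
 */
#include <stdio.h>
#include <stdlib.h>

static long ncompare;           /* global comparison counter */

static int
cmp_int(const void *a, const void *b)
{
    ncompare++;
    return (*(const int *) a > *(const int *) b)
         - (*(const int *) a < *(const int *) b);
}

int
main(void)
{
    int         n = 10000;      /* think: a very high statistics target */
    int        *mcv = malloc(n * sizeof(int));
    int         probe = n - 1;  /* worst case for the linear scan */
    int         i;

    for (i = 0; i < n; i++)
        mcv[i] = i;             /* already sorted, as ANALYZE would store it */

    /* linear scan, the way an unsorted list must be searched */
    for (i = 0; i < n; i++)
        if (mcv[i] == probe)
            break;
    printf("linear scan:   %d comparisons\n", i + 1);

    /* binary search over the presorted array */
    ncompare = 0;
    bsearch(&probe, mcv, n, sizeof(int), cmp_int);
    printf("binary search: %ld comparisons\n", ncompare);

    free(mcv);
    return 0;
}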
Heikki Linnakangas wrote: > Jan Urbański wrote: >> Not good... Shall I try sorting pg_statistics arrays on text values >> instead of frequencies? > > Yeah, I'd go with that. If you only do it for the new > STATISTIC_KIND_MCV_ELEMENT statistics, you shouldn't need to change any > other code. OK, will do. >> BTW: I just noticed some text_to_cstring calls, they came from >> elog(DEBUG1)s that I have in my code. But they couldn't have skewn the >> results much, could they? > > Well, text_to_cstring was consuming 1.1% of the CPU time on its own, and > presumably some of the AllocSetAlloc overhead is attributable to that as > well. And perhaps some of the detoasting as well. > > Speaking of which, a lot of time seems to be spent on detoasting. I'd > like to understand that a better. Where is the detoasting coming from? Hmm, maybe bttext_pattern_cmp does some detoasting? It calls PG_GETARG_TEXT_PP(), which in turn calls pg_detoast_datum_packed(). Oh, and also I think that compare_lexeme_textfreq() uses DatumGetTextP() and that also does detoasting. The root of all evil could be keeping a Datum in the TextFreq array, and not a "text *", which is something you pointed out earlier and I apparently didn't understand. So right now the idea is to: (1) pre-sort STATISTIC_KIND_MCELEM values (2) build an array of pointers to detoasted values in tssel() (3) use binary search when looking for MCELEMs during tsquery analysis Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin
Jan Urbański wrote: > So right now the idea is to: > (1) pre-sort STATISTIC_KIND_MCELEM values > (2) build an array of pointers to detoasted values in tssel() > (3) use binary search when looking for MCELEMs during tsquery analysis Sounds like a plan. In (2), it's even better to detoast the values lazily. For a typical one-word tsquery, the binary search will only look at a small portion of the elements. Another thing is, how significant is the time spent in tssel() anyway, compared to actually running the query? You ran pgbench on EXPLAIN, which is good to see where in tssel() the time is spent, but if the time spent in tssel() is say 1% of the total execution time, there's no point optimizing it further. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
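The lazy-detoasting suggestion could be sketched like this: keep the raw form in the lookup array and decode an element only the first time the binary search touches it, caching the result. In the toy below "decoding" is just a strdup() standing in for the expensive step; it is not PostgreSQL's toast machinery, only an illustration of how few elements a typical search touches:

/*
 * Lazy materialization during binary search: decode on first touch,
 * memoize per element, and count how many decodes actually happen.
 * Standalone toy only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int  ndecodes;           /* how many elements we actually decoded */

typedef struct
{
    const char *raw;            /* stands in for a toasted datum */
    char       *decoded;        /* NULL until first use */
    double      frequency;
} LazyEntry;

/* pretend this is expensive, like pglz_decompress() would be */
static const char *
decode(LazyEntry *e)
{
    if (e->decoded == NULL)
    {
        e->decoded = strdup(e->raw);
        ndecodes++;
    }
    return e->decoded;
}

static int
cmp_key_entry(const void *k, const void *v)
{
    /* casting away const is fine here: we only fill in the cache field */
    return strcmp((const char *) k, decode((LazyEntry *) v));
}

int
main(void)
{
    LazyEntry   table[] = {
        {"alpha", NULL, 0.30}, {"delta", NULL, 0.20},
        {"omega", NULL, 0.10}, {"sigma", NULL, 0.05},
        {"theta", NULL, 0.02},
    };
    int         n = sizeof(table) / sizeof(table[0]);
    LazyEntry  *hit;

    /* table is stored presorted, so no qsort() at plan time */
    hit = bsearch("omega", table, n, sizeof(LazyEntry), cmp_key_entry);

    printf("frequency = %.2f, decoded %d of %d elements\n",
           hit ? hit->frequency : 0.0, ndecodes, n);
    return 0;
}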
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes: > Heikki Linnakangas wrote: >> Speaking of which, a lot of time seems to be spent on detoasting. I'd like to >> understand that a better. Where is the detoasting coming from? > > Hmm, maybe bttext_pattern_cmp does some detoasting? It calls > PG_GETARG_TEXT_PP(), which in turn calls pg_detoast_datum_packed(). Oh, and > also I think that compare_lexeme_textfreq() uses DatumGetTextP() and that also > does detoasting. DatumGetTextP() will detoast packed data (ie, 1-byte length headers) whereas DatumGetTextPP will only detoast compressed or externally stored data. I suspect you're seeing the former. > The root of all evil could by keeping a Datum in the TextFreq array, and not > a "text *", which is something you pointed out earlier and I apparently > didn't understand. Well it doesn't really matter which type. If you store Datums which are already detoasted then the DatumGetTextP and DatumGetTextPP will just be noops anyways. If you store packed data (from DatumGetTextPP) then it's probably safer to store it as Datums so if you need to pass it to any functions which don't expect packed data they'll untoast it. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Heikki Linnakangas wrote: > Jan Urbański wrote: >> So right now the idea is to: >> (1) pre-sort STATISTIC_KIND_MCELEM values >> (2) build an array of pointers to detoasted values in tssel() >> (3) use binary search when looking for MCELEMs during tsquery analysis > > Sounds like a plan. In (2), it's even better to detoast the values > lazily. For a typical one-word tsquery, the binary search will only look > at a small portion of the elements. Hm, how can I do that? Toast is still a bit black magic to me... Do you mean I should stick to having Datums in TextFreq? And use DatumGetTextP in bsearch() (assuming I'll get rid of qsort())? I wanted to avoid that, so I won't detoast the same value multiple times, but it's true: a binary search won't touch most elements. > Another thing is, how significant is the time spent in tssel() anyway, > compared to actually running the query? You ran pgbench on EXPLAIN, > which is good to see where in tssel() the time is spent, but if the time > spent in tssel() is say 1% of the total execution time, there's no point > optimizing it further. Changed to the pgbench script to select * from manual where tsvector @@ to_tsquery('foo'); and the parameters to pgbench -n -f tssel-bench.sql -t 1000 postgres and got number of clients: 1 number of transactions per client: 1000 number of transactions actually processed: 1000/1000 tps = 12.238282 (including connections establishing) tps = 12.238606 (excluding connections establishing) samples % symbol name 174731 31.6200 pglz_decompress 88105 15.9438 tsvectorout 17280 3.1271 pg_mblen 13623 2.4653 AllocSetAlloc 13059 2.3632 hash_search_with_hash_value 10845 1.9626 pg_utf_mblen 10335 1.8703 internal_text_pattern_compare 9196 1.6641 index_getnext 9102 1.6471 bttext_pattern_cmp 8075 1.4613 pg_detoast_datum_packed 7437 1.3458 LWLockAcquire 7066 1.2787 hash_any 6811 1.2325 AllocSetFree 6623 1.1985 pg_qsort 6439 1.1652 LWLockRelease 5793 1.0483 DirectFunctionCall2 5322 0.9631 _bt_compare 4664 0.8440 tsCompareString 4636 0.8389 .plt 4539 0.8214 compare_two_textfreqs But I think I'll go with pre-sorting anyway, it feels cleaner and neater. -- Jan Urbanski GPG key ID: E583D7D2 ouden estin
Heikki Linnakangas wrote: > Jan Urbański wrote: >> So right now the idea is to: >> (1) pre-sort STATISTIC_KIND_MCELEM values >> (2) build an array of pointers to detoasted values in tssel() >> (3) use binary search when looking for MCELEMs during tsquery analysis > > Sounds like a plan. In (2), it's even better to detoast the values > lazily. For a typical one-word tsquery, the binary search will only look > at a small portion of the elements. Here's another version. Most common lexemes get sorted before storing in pg_statistic. The ordering is on length first and value second. That way we can avoid strncmp() calls when the lexemes have different lengths (and lexemes know their lengths, so the data is readily available). Also, in the binary search routine during selectivity estimation we can sometimes avoid detoasting (I think) Datums from the pg_statistic MCELEM array. See comments in code. Pre-sorting introduced one problem (see XXX in code): it's not easy anymore to get the minimal frequency of MCELEM values. I was using it to assert that the selectivity of a tsquery node containing a lexeme not in MCELEM is no more that min(MCELEM freqs) / 2. That's only significant when the minimum frequency is less than DEFAULT_TS_SEL * 2, so I'm kind of inclined to ignore it and maybe drop a comment in the code that this may be a potential problem. If nothing is fundamentally broken with this, I'll repeat my profiling tests to see if anything has been gained. Cheers, Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin diff --git a/src/backend/tsearch/Makefile b/src/backend/tsearch/Makefile index e20a4a2..ba728eb 100644 --- a/src/backend/tsearch/Makefile +++ b/src/backend/tsearch/Makefile @@ -19,7 +19,7 @@ DICTFILES=synonym_sample.syn thesaurus_sample.ths hunspell_sample.affix \ OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \ dict_simple.o dict_synonym.o dict_thesaurus.o \ dict_ispell.o regis.o spell.o \ - to_tsany.o ts_typanalyze.o ts_utils.o + to_tsany.o ts_typanalyze.o ts_selfuncs.o ts_utils.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/tsearch/ts_selfuncs.c b/src/backend/tsearch/ts_selfuncs.c new file mode 100644 index 0000000..0fadc60 --- /dev/null +++ b/src/backend/tsearch/ts_selfuncs.c @@ -0,0 +1,320 @@ +/*------------------------------------------------------------------------- + * + * ts_selfuncs.c + * Selectivity functions for text search types. 
+ * + * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group + * + * + * IDENTIFICATION + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "miscadmin.h" /* for check_stack_depth() */ +#include "utils/memutils.h" +#include "utils/builtins.h" +#include "utils/syscache.h" +#include "utils/lsyscache.h" +#include "utils/selfuncs.h" +#include "catalog/pg_type.h" +#include "catalog/pg_statistic.h" +#include "nodes/nodes.h" +#include "tsearch/ts_type.h" + +/* lookup table type for binary searching through MCELEMs */ +typedef struct +{ + Datum element; + float4 frequency; +} TextFreq; + +/* type of keys for bsearch()ing through an array of TextFreqs */ +typedef struct +{ + char *lexeme; + int length; +} LexemeKey; + +static int +compare_lexeme_textfreq(const void *e1, const void *e2); + +static Selectivity +tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup, + int length, float4 minfreq); +static Selectivity +mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem, + float4 *numbers, int nnumbers); +static double +tsquerysel(VariableStatData *vardata, Datum constval); + + +/* TSQuery traversal function */ +static Selectivity +tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup, + int length, float4 minfreq) +{ + LexemeKey key; + TextFreq *searchres; + Selectivity s1, s2; + + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + + if (item->type == QI_VAL) + { + QueryOperand *oper = (QueryOperand *) item; + + /* + * Prepare the key for bsearch(). + */ + key.lexeme = operand + oper->distance; + key.length = oper->length; + + searchres = (TextFreq *) bsearch(&key, lookup, length, + sizeof(TextFreq), compare_lexeme_textfreq); + + if (searchres) + { + /* + * The element is in MCELEM. Return precise selectivity (or at + * least as precise, as ANALYZE could find out). + */ + return (Selectivity) searchres->frequency; + } + else + { + /* + * The element is not in MCELEM. Punt, but assert that the + * selectivity cannot be more than minfreq / 2. + */ + return (Selectivity) Min(DEFAULT_TS_SEL, minfreq / 2); + } + } + + /* Current TSQuery node is an operator */ + switch (item->operator.oper) + { + case OP_NOT: + return 1.0 - tsquery_opr_selec(item + 1, operand, lookup, + length, minfreq); + + case OP_AND: + return + tsquery_opr_selec(item + 1, operand, lookup, length, minfreq) * + tsquery_opr_selec(item + item->operator.left, operand, lookup, + length, minfreq); + + case OP_OR: + s1 = tsquery_opr_selec(item + 1, operand, lookup, length, minfreq); + s2 = tsquery_opr_selec(item + item->operator.left, operand, lookup, + length, minfreq); + return s1 + s2 - s1 * s2; + + default: + elog(ERROR, "unrecognized operator: %d", item->operator.oper); + } + + /* never reached, keep compiler happy */ + return (Selectivity) DEFAULT_TS_SEL; +} + +/* + * Traverse the tsquery preorder, calculating selectivity as: + * + * selec(left_oper) * selec(right_oper) in AND nodes, + * + * selec(left_oper) + selec(right_oper) - + * selec(left_oper) * selec(right_oper) in OR nodes, + * + * 1 - select(oper) in NOT nodes + * + * freq[val] in VAL nodes, if the value is in MCELEM + * min(freq[MCELEM]) / 2 in VAL nodes, if it is not + * + * + * The MCELEM array is already sorted (see ts_typanalyze.c), so we can use + * binary search for determining freq[MCELEM]. 
+ */ +static Selectivity +mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem, + float4 *numbers, int nnumbers) +{ + float4 minfreq; + TextFreq *lookup; + Selectivity selec; + int i; + + /* Grab the lowest frequency */ + + /* + XXX This is no longer easy, need to go through the whole list. Plug it + with 2.0 for the moment (so that minfreq / 2 is 1.0) + + minfreq = numbers[nnumbers - 1]; + */ + minfreq = 2.0; + + /* Construct the array for binary search */ + Assert(nmcelem == nnumbers); + lookup = (TextFreq *) palloc(sizeof(TextFreq) * nmcelem); + for (i = 0; i < nmcelem; i++) + { + lookup[i].element = mcelem[i]; + lookup[i].frequency = numbers[i]; + } + + selec = tsquery_opr_selec(GETQUERY(query), GETOPERAND(query), lookup, + nmcelem, minfreq); + + pfree(lookup); + + return selec; +} + +/* + * bsearch() comparator for a lexeme (non-NULL terminated string with length) + * and a TextFreq. Use length, then byte-for-byte comparision, because that's + * how ANALYZE code sorted data before storing it in a statistic tuple. + * See ts_typanalyze.c for details. + */ +static int +compare_lexeme_textfreq(const void *e1, const void *e2) +{ + const LexemeKey *key; + const TextFreq *t; + text *element; + int len1, + len2; + + key = (const LexemeKey *) e1; + t = (const TextFreq *) e2; + + len1 = key->length; + len2 = VARSIZE_ANY_EXHDR(t->element); + + /* Compare lengths first, possibly avoiding pointless detoasting */ + if (len1 > len2) + return 1; + else if (len1 < len2) + return -1; + + /* Fall back on byte-for-byte comparision, detoasting the Datum first */ + element = DatumGetTextP(t->element); + + return strncmp(key->lexeme, VARDATA_ANY(element), len1); +} + +/* + * tssel -- Selectivity of "@@" + */ +Datum +tssel(PG_FUNCTION_ARGS) +{ + PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0); + /* We skip arg #2, which is the operator OID - it's gonna be "@@" */ + List *args = (List *) PG_GETARG_POINTER(2); + int varRelid = PG_GETARG_INT32(3); + VariableStatData vardata; + Node *other; + bool varonleft; + Selectivity selec; + + /* + * If expression is not variable op something or something op variable, + * then punt and return a default estimate. + */ + if (!get_restriction_variable(root, args, varRelid, + &vardata, &other, &varonleft)) + { + PG_RETURN_FLOAT8(DEFAULT_TS_SEL); + } + + /* + * Can't do anything useful if the something is not a constant, either. + */ + if (!IsA(other, Const)) + { + ReleaseVariableStats(vardata); + PG_RETURN_FLOAT8(DEFAULT_TS_SEL); + } + + /* The "@@" operator is strict, so might cope with NULL right away */ + if (((Const *) other)->constisnull) { + ReleaseVariableStats(vardata); + PG_RETURN_FLOAT8((float8) 0.0); + } + + /* + * OK, there's a Var and a Const we're dealing with here. We need the Var + * to be a TSVector (or else we don't have any useful statistic for it). 
+ */ + + if (vardata.vartype == TSVECTOROID) + { + /* tsvector @@ tsquery or the other way around */ + Assert(((Const *) other)->consttype == TSQUERYOID); + + selec = tsquerysel(&vardata, ((Const *) other)->constvalue); + } + else + { + /* The Var is something we don't have useful statistic for */ + selec = DEFAULT_TS_SEL; + } + + ReleaseVariableStats(vardata); + + PG_RETURN_FLOAT8((float8) selec); +} + +static Selectivity +tsquerysel(VariableStatData *vardata, Datum constval) +{ + Selectivity selec; + + if (HeapTupleIsValid(vardata->statsTuple)) + { + TSQuery query; + Form_pg_statistic stats; + Datum *values; + int nvalues; + float4 *numbers; + int nnumbers; + + /* The caller made sure the const is a TSQuery, so get it now */ + query = DatumGetTSQuery(constval); + + stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple); + + /* MCELEM will be an array of Text elements for a tsvector column */ + if (get_attstatsslot(vardata->statsTuple, + TEXTOID, -1, + STATISTIC_KIND_MCELEM, InvalidOid, + &values, &nvalues, + &numbers, &nnumbers)) + { + /* + * There is a most-common-elements slot for the tsvector Var, so + * use that. + */ + + selec = mcelem_tsquery_selec(query, values, nvalues, + numbers, nnumbers); + free_attstatsslot(TEXTOID, values, nvalues, numbers, nnumbers); + } + else + { + /* No most-common-elements slot */ + selec = (Selectivity) DEFAULT_TS_SEL; + } + } + else + { + selec = (Selectivity) DEFAULT_TS_SEL; + } + + return selec; +} diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c index 965e758..201c4ab 100644 --- a/src/backend/tsearch/ts_typanalyze.c +++ b/src/backend/tsearch/ts_typanalyze.c @@ -43,8 +43,9 @@ static void compute_tsvector_stats(VacAttrStats *stats, static void prune_lexemes_hashtable(HTAB *lexemes_tab, int b_current); static uint32 lexeme_hash(const void *key, Size keysize); static int lexeme_match(const void *key1, const void *key2, Size keysize); -static int trackitem_compare_desc(const void *e1, const void *e2); - +static int lexeme_compare(const void *key1, const void *key2); +static int trackitem_compare_frequencies_desc(const void *e1, const void *e2); +static int trackitem_compare_lexemes(const void *e1, const void *e2); /* * ts_typanalyze -- a custom typanalyze function for tsvector columns @@ -273,7 +274,7 @@ compute_tsvector_stats(VacAttrStats *stats, Assert(i == track_len); qsort(sort_table, track_len, sizeof(TrackItem *), - trackitem_compare_desc); + trackitem_compare_frequencies_desc); /* Suppress any single-occurrence items */ while (track_len > 0) @@ -287,6 +288,22 @@ compute_tsvector_stats(VacAttrStats *stats, if (num_mcelem > track_len) num_mcelem = track_len; + /* + * We want to store statistics sorted on the lexeme value using first + * length, then byte-for-byte comparision. The reason for doing length + * comparision first is that we don't care about the ordering as long + * as it's consistent and comparing lengths first gives us a chance to + * avoid a strncmp() call. + * + * This is different from what we do with scalar statistics -- they get + * sorted on frequencies. The rationale is that we usually search + * through most common elements looking for a specific value, so we can + * grab its frequency. When values are presorted we can employ binary + * search for that. See ts_selfuncs.c for a real usage scenario. 
+ */ + qsort(sort_table, num_mcelem, sizeof(TrackItem *), + trackitem_compare_lexemes); + /* Generate MCELEM slot entry */ if (num_mcelem > 0) { @@ -379,25 +396,49 @@ lexeme_hash(const void *key, Size keysize) static int lexeme_match(const void *key1, const void *key2, Size keysize) { - const LexemeHashKey *d1 = (const LexemeHashKey *) key1; - const LexemeHashKey *d2 = (const LexemeHashKey *) key2; + /* The keysize parameter is superfluous, the keys store their lengths */ + return lexeme_compare(key1, key2); +} - /* The lexemes need to have the same length, and be memcmp-equal */ - if (d1->length == d2->length && - memcmp(d1->lexeme, d2->lexeme, d1->length) == 0) - return 0; - else +/* + * Comparision function for lexemes. + */ +static int +lexeme_compare(const void *key1, const void *key2) +{ + const LexemeHashKey *d1 = (const LexemeHashKey *) key1; + const LexemeHashKey *d2 = (const LexemeHashKey *) key2; + + /* First, compare by length */ + if (d1->length > d2->length) return 1; + else if (d1->length < d2->length) + return -1; + else + /* Lengths are equal, do a byte-for-byte comparision */ + return strncmp(d1->lexeme, d2->lexeme, d1->length); } /* - * qsort() comparator for TrackItems - LC style (descending sort) + * qsort() comparator for sorting TrackItems on frequencies (descending sort) */ static int -trackitem_compare_desc(const void *e1, const void *e2) +trackitem_compare_frequencies_desc(const void *e1, const void *e2) { const TrackItem * const *t1 = (const TrackItem * const *) e1; const TrackItem * const *t2 = (const TrackItem * const *) e2; return (*t2)->frequency - (*t1)->frequency; } + +/* + * qsort() comparator for sorting TrackItem on lexemes + */ +static int +trackitem_compare_lexemes(const void *e1, const void *e2) +{ + const TrackItem * const *t1 = (const TrackItem * const *) e1; + const TrackItem * const *t2 = (const TrackItem * const *) e2; + + return lexeme_compare(&(*t1)->key, &(*t2)->key); +} diff --git a/src/include/catalog/pg_operator.h b/src/include/catalog/pg_operator.h index 1c7f37d..451805d 100644 --- a/src/include/catalog/pg_operator.h +++ b/src/include/catalog/pg_operator.h @@ -915,10 +915,10 @@ DATA(insert OID = 3630 ( "<>" PGNSP PGUID b f f 3614 3614 16 3630 3629 ts DATA(insert OID = 3631 ( ">=" PGNSP PGUID b f f 3614 3614 16 3628 3627 tsvector_ge scalargtsel scalargtjoinsel)); DATA(insert OID = 3632 ( ">" PGNSP PGUID b f f 3614 3614 16 3627 3628 tsvector_gt scalargtsel scalargtjoinsel)); DATA(insert OID = 3633 ( "||" PGNSP PGUID b f f 3614 3614 3614 0 0 tsvector_concat - - )); -DATA(insert OID = 3636 ( "@@" PGNSP PGUID b f f 3614 3615 16 3637 0 ts_match_vq contsel contjoinsel )); -DATA(insert OID = 3637 ( "@@" PGNSP PGUID b f f 3615 3614 16 3636 0 ts_match_qv contsel contjoinsel )); -DATA(insert OID = 3660 ( "@@@" PGNSP PGUID b f f 3614 3615 16 3661 0 ts_match_vq contsel contjoinsel )); -DATA(insert OID = 3661 ( "@@@" PGNSP PGUID b f f 3615 3614 16 3660 0 ts_match_qv contsel contjoinsel )); +DATA(insert OID = 3636 ( "@@" PGNSP PGUID b f f 3614 3615 16 3637 0 ts_match_vq tssel contjoinsel )); +DATA(insert OID = 3637 ( "@@" PGNSP PGUID b f f 3615 3614 16 3636 0 ts_match_qv tssel contjoinsel )); +DATA(insert OID = 3660 ( "@@@" PGNSP PGUID b f f 3614 3615 16 3661 0 ts_match_vq tssel contjoinsel )); +DATA(insert OID = 3661 ( "@@@" PGNSP PGUID b f f 3615 3614 16 3660 0 ts_match_qv tssel contjoinsel )); DATA(insert OID = 3674 ( "<" PGNSP PGUID b f f 3615 3615 16 3679 3678 tsquery_lt scalarltsel scalarltjoinsel)); DATA(insert OID = 3675 ( "<=" PGNSP PGUID b f f 
3615 3615 16 3678 3679 tsquery_le scalarltsel scalarltjoinsel)); DATA(insert OID = 3676 ( "=" PGNSP PGUID b t f 3615 3615 16 3676 3677 tsquery_eq eqsel eqjoinsel )); diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index 16ccb55..ef24f40 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -4321,6 +4321,8 @@ DESCR("GiST tsquery support"); DATA(insert OID = 3701 ( gtsquery_consistent PGNSP PGUID 12 1 0 0 f f t f i 5 16 "2281 2281 23 26 2281" _null__null_ _null_ gtsquery_consistent _null_ _null_ _null_ )); DESCR("GiST tsquery support"); +DATA(insert OID = 3687 ( tssel PGNSP PGUID 12 1 0 0 f f t f s 4 701 "2281 26 2281 23" _null_ _null_ _null_ tssel _null__null_ _null_ )); +DESCR("restriction selectivity of tsvector @@ tsquery"); DATA(insert OID = 3688 ( ts_typanalyze PGNSP PGUID 12 1 0 0 f f t f s 1 16 "2281" _null_ _null_ _null_ ts_typanalyze_null_ _null_ _null_ )); DESCR("tsvector typanalyze"); diff --git a/src/include/tsearch/ts_type.h b/src/include/tsearch/ts_type.h index 06ca1f9..b253327 100644 --- a/src/include/tsearch/ts_type.h +++ b/src/include/tsearch/ts_type.h @@ -155,6 +155,8 @@ extern Datum ts_rankcd_wttf(PG_FUNCTION_ARGS); extern Datum ts_typanalyze(PG_FUNCTION_ARGS); +extern Datum tssel(PG_FUNCTION_ARGS); + /* * TSQuery @@ -291,4 +293,17 @@ extern Datum tsquery_rewrite_query(PG_FUNCTION_ARGS); extern Datum tsq_mcontains(PG_FUNCTION_ARGS); extern Datum tsq_mcontained(PG_FUNCTION_ARGS); +/* + * The default text search selectivity is chosen to be smll enough to + * encourage indexscans for typical table densities. See selfuncs.h and + * DEFAULT_EQ_SEL for details. + */ +#define DEFAULT_TS_SEL 0.005 + +/* + * operator restriction function for tsvector @@ tsquery and + * tsquery @@ tsvector + */ +extern Datum tssel(PG_FUNCTION_ARGS); + #endif /* _PG_TSTYPE_H_ */
Jan Urbański wrote: > Heikki Linnakangas wrote: >> Jan Urbański wrote: >>> So right now the idea is to: >>> (1) pre-sort STATISTIC_KIND_MCELEM values >>> (2) build an array of pointers to detoasted values in tssel() >>> (3) use binary search when looking for MCELEMs during tsquery analysis >> >> Sounds like a plan. In (2), it's even better to detoast the values >> lazily. For a typical one-word tsquery, the binary search will only >> look at a small portion of the elements. > > Here's another version. Context diff this time, always forget to convert them... -- Jan Urbanski GPG key ID: E583D7D2 ouden estin *** a/src/backend/tsearch/Makefile --- b/src/backend/tsearch/Makefile *************** *** 19,25 **** DICTFILES=synonym_sample.syn thesaurus_sample.ths hunspell_sample.affix \ OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \ dict_simple.o dict_synonym.o dict_thesaurus.o \ dict_ispell.o regis.o spell.o \ ! to_tsany.o ts_typanalyze.o ts_utils.o include $(top_srcdir)/src/backend/common.mk --- 19,25 ---- OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \ dict_simple.o dict_synonym.o dict_thesaurus.o \ dict_ispell.o regis.o spell.o \ ! to_tsany.o ts_typanalyze.o ts_selfuncs.o ts_utils.o include $(top_srcdir)/src/backend/common.mk *** /dev/null --- b/src/backend/tsearch/ts_selfuncs.c *************** *** 0 **** --- 1,320 ---- + /*------------------------------------------------------------------------- + * + * ts_selfuncs.c + * Selectivity functions for text search types. + * + * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group + * + * + * IDENTIFICATION + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ + #include "postgres.h" + + #include "miscadmin.h" /* for check_stack_depth() */ + #include "utils/memutils.h" + #include "utils/builtins.h" + #include "utils/syscache.h" + #include "utils/lsyscache.h" + #include "utils/selfuncs.h" + #include "catalog/pg_type.h" + #include "catalog/pg_statistic.h" + #include "nodes/nodes.h" + #include "tsearch/ts_type.h" + + /* lookup table type for binary searching through MCELEMs */ + typedef struct + { + Datum element; + float4 frequency; + } TextFreq; + + /* type of keys for bsearch()ing through an array of TextFreqs */ + typedef struct + { + char *lexeme; + int length; + } LexemeKey; + + static int + compare_lexeme_textfreq(const void *e1, const void *e2); + + static Selectivity + tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup, + int length, float4 minfreq); + static Selectivity + mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem, + float4 *numbers, int nnumbers); + static double + tsquerysel(VariableStatData *vardata, Datum constval); + + + /* TSQuery traversal function */ + static Selectivity + tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup, + int length, float4 minfreq) + { + LexemeKey key; + TextFreq *searchres; + Selectivity s1, s2; + + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + + if (item->type == QI_VAL) + { + QueryOperand *oper = (QueryOperand *) item; + + /* + * Prepare the key for bsearch(). + */ + key.lexeme = operand + oper->distance; + key.length = oper->length; + + searchres = (TextFreq *) bsearch(&key, lookup, length, + sizeof(TextFreq), compare_lexeme_textfreq); + + if (searchres) + { + /* + * The element is in MCELEM. Return precise selectivity (or at + * least as precise, as ANALYZE could find out). 
+ */ + return (Selectivity) searchres->frequency; + } + else + { + /* + * The element is not in MCELEM. Punt, but assert that the + * selectivity cannot be more than minfreq / 2. + */ + return (Selectivity) Min(DEFAULT_TS_SEL, minfreq / 2); + } + } + + /* Current TSQuery node is an operator */ + switch (item->operator.oper) + { + case OP_NOT: + return 1.0 - tsquery_opr_selec(item + 1, operand, lookup, + length, minfreq); + + case OP_AND: + return + tsquery_opr_selec(item + 1, operand, lookup, length, minfreq) * + tsquery_opr_selec(item + item->operator.left, operand, lookup, + length, minfreq); + + case OP_OR: + s1 = tsquery_opr_selec(item + 1, operand, lookup, length, minfreq); + s2 = tsquery_opr_selec(item + item->operator.left, operand, lookup, + length, minfreq); + return s1 + s2 - s1 * s2; + + default: + elog(ERROR, "unrecognized operator: %d", item->operator.oper); + } + + /* never reached, keep compiler happy */ + return (Selectivity) DEFAULT_TS_SEL; + } + + /* + * Traverse the tsquery preorder, calculating selectivity as: + * + * selec(left_oper) * selec(right_oper) in AND nodes, + * + * selec(left_oper) + selec(right_oper) - + * selec(left_oper) * selec(right_oper) in OR nodes, + * + * 1 - select(oper) in NOT nodes + * + * freq[val] in VAL nodes, if the value is in MCELEM + * min(freq[MCELEM]) / 2 in VAL nodes, if it is not + * + * + * The MCELEM array is already sorted (see ts_typanalyze.c), so we can use + * binary search for determining freq[MCELEM]. + */ + static Selectivity + mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem, + float4 *numbers, int nnumbers) + { + float4 minfreq; + TextFreq *lookup; + Selectivity selec; + int i; + + /* Grab the lowest frequency */ + + /* + XXX This is no longer easy, need to go through the whole list. Plug it + with 2.0 for the moment (so that minfreq / 2 is 1.0) + + minfreq = numbers[nnumbers - 1]; + */ + minfreq = 2.0; + + /* Construct the array for binary search */ + Assert(nmcelem == nnumbers); + lookup = (TextFreq *) palloc(sizeof(TextFreq) * nmcelem); + for (i = 0; i < nmcelem; i++) + { + lookup[i].element = mcelem[i]; + lookup[i].frequency = numbers[i]; + } + + selec = tsquery_opr_selec(GETQUERY(query), GETOPERAND(query), lookup, + nmcelem, minfreq); + + pfree(lookup); + + return selec; + } + + /* + * bsearch() comparator for a lexeme (non-NULL terminated string with length) + * and a TextFreq. Use length, then byte-for-byte comparision, because that's + * how ANALYZE code sorted data before storing it in a statistic tuple. + * See ts_typanalyze.c for details. 
+ */ + static int + compare_lexeme_textfreq(const void *e1, const void *e2) + { + const LexemeKey *key; + const TextFreq *t; + text *element; + int len1, + len2; + + key = (const LexemeKey *) e1; + t = (const TextFreq *) e2; + + len1 = key->length; + len2 = VARSIZE_ANY_EXHDR(t->element); + + /* Compare lengths first, possibly avoiding pointless detoasting */ + if (len1 > len2) + return 1; + else if (len1 < len2) + return -1; + + /* Fall back on byte-for-byte comparision, detoasting the Datum first */ + element = DatumGetTextP(t->element); + + return strncmp(key->lexeme, VARDATA_ANY(element), len1); + } + + /* + * tssel -- Selectivity of "@@" + */ + Datum + tssel(PG_FUNCTION_ARGS) + { + PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0); + /* We skip arg #2, which is the operator OID - it's gonna be "@@" */ + List *args = (List *) PG_GETARG_POINTER(2); + int varRelid = PG_GETARG_INT32(3); + VariableStatData vardata; + Node *other; + bool varonleft; + Selectivity selec; + + /* + * If expression is not variable op something or something op variable, + * then punt and return a default estimate. + */ + if (!get_restriction_variable(root, args, varRelid, + &vardata, &other, &varonleft)) + { + PG_RETURN_FLOAT8(DEFAULT_TS_SEL); + } + + /* + * Can't do anything useful if the something is not a constant, either. + */ + if (!IsA(other, Const)) + { + ReleaseVariableStats(vardata); + PG_RETURN_FLOAT8(DEFAULT_TS_SEL); + } + + /* The "@@" operator is strict, so might cope with NULL right away */ + if (((Const *) other)->constisnull) { + ReleaseVariableStats(vardata); + PG_RETURN_FLOAT8((float8) 0.0); + } + + /* + * OK, there's a Var and a Const we're dealing with here. We need the Var + * to be a TSVector (or else we don't have any useful statistic for it). + */ + + if (vardata.vartype == TSVECTOROID) + { + /* tsvector @@ tsquery or the other way around */ + Assert(((Const *) other)->consttype == TSQUERYOID); + + selec = tsquerysel(&vardata, ((Const *) other)->constvalue); + } + else + { + /* The Var is something we don't have useful statistic for */ + selec = DEFAULT_TS_SEL; + } + + ReleaseVariableStats(vardata); + + PG_RETURN_FLOAT8((float8) selec); + } + + static Selectivity + tsquerysel(VariableStatData *vardata, Datum constval) + { + Selectivity selec; + + if (HeapTupleIsValid(vardata->statsTuple)) + { + TSQuery query; + Form_pg_statistic stats; + Datum *values; + int nvalues; + float4 *numbers; + int nnumbers; + + /* The caller made sure the const is a TSQuery, so get it now */ + query = DatumGetTSQuery(constval); + + stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple); + + /* MCELEM will be an array of Text elements for a tsvector column */ + if (get_attstatsslot(vardata->statsTuple, + TEXTOID, -1, + STATISTIC_KIND_MCELEM, InvalidOid, + &values, &nvalues, + &numbers, &nnumbers)) + { + /* + * There is a most-common-elements slot for the tsvector Var, so + * use that. 
+ */ + + selec = mcelem_tsquery_selec(query, values, nvalues, + numbers, nnumbers); + free_attstatsslot(TEXTOID, values, nvalues, numbers, nnumbers); + } + else + { + /* No most-common-elements slot */ + selec = (Selectivity) DEFAULT_TS_SEL; + } + } + else + { + selec = (Selectivity) DEFAULT_TS_SEL; + } + + return selec; + } *** a/src/backend/tsearch/ts_typanalyze.c --- b/src/backend/tsearch/ts_typanalyze.c *************** *** 43,50 **** static void compute_tsvector_stats(VacAttrStats *stats, static void prune_lexemes_hashtable(HTAB *lexemes_tab, int b_current); static uint32 lexeme_hash(const void *key, Size keysize); static int lexeme_match(const void *key1, const void *key2, Size keysize); ! static int trackitem_compare_desc(const void *e1, const void *e2); ! /* * ts_typanalyze -- a custom typanalyze function for tsvector columns --- 43,51 ---- static void prune_lexemes_hashtable(HTAB *lexemes_tab, int b_current); static uint32 lexeme_hash(const void *key, Size keysize); static int lexeme_match(const void *key1, const void *key2, Size keysize); ! static int lexeme_compare(const void *key1, const void *key2); ! static int trackitem_compare_frequencies_desc(const void *e1, const void *e2); ! static int trackitem_compare_lexemes(const void *e1, const void *e2); /* * ts_typanalyze -- a custom typanalyze function for tsvector columns *************** *** 273,279 **** compute_tsvector_stats(VacAttrStats *stats, Assert(i == track_len); qsort(sort_table, track_len, sizeof(TrackItem *), ! trackitem_compare_desc); /* Suppress any single-occurrence items */ while (track_len > 0) --- 274,280 ---- Assert(i == track_len); qsort(sort_table, track_len, sizeof(TrackItem *), ! trackitem_compare_frequencies_desc); /* Suppress any single-occurrence items */ while (track_len > 0) *************** *** 287,292 **** compute_tsvector_stats(VacAttrStats *stats, --- 288,309 ---- if (num_mcelem > track_len) num_mcelem = track_len; + /* + * We want to store statistics sorted on the lexeme value using first + * length, then byte-for-byte comparision. The reason for doing length + * comparision first is that we don't care about the ordering as long + * as it's consistent and comparing lengths first gives us a chance to + * avoid a strncmp() call. + * + * This is different from what we do with scalar statistics -- they get + * sorted on frequencies. The rationale is that we usually search + * through most common elements looking for a specific value, so we can + * grab its frequency. When values are presorted we can employ binary + * search for that. See ts_selfuncs.c for a real usage scenario. + */ + qsort(sort_table, num_mcelem, sizeof(TrackItem *), + trackitem_compare_lexemes); + /* Generate MCELEM slot entry */ if (num_mcelem > 0) { *************** *** 379,403 **** lexeme_hash(const void *key, Size keysize) static int lexeme_match(const void *key1, const void *key2, Size keysize) { ! const LexemeHashKey *d1 = (const LexemeHashKey *) key1; ! const LexemeHashKey *d2 = (const LexemeHashKey *) key2; ! /* The lexemes need to have the same length, and be memcmp-equal */ ! if (d1->length == d2->length && ! memcmp(d1->lexeme, d2->lexeme, d1->length) == 0) ! return 0; ! else return 1; } /* ! * qsort() comparator for TrackItems - LC style (descending sort) */ static int ! 
trackitem_compare_desc(const void *e1, const void *e2) { const TrackItem * const *t1 = (const TrackItem * const *) e1; const TrackItem * const *t2 = (const TrackItem * const *) e2; return (*t2)->frequency - (*t1)->frequency; } --- 396,444 ---- static int lexeme_match(const void *key1, const void *key2, Size keysize) { ! /* The keysize parameter is superfluous, the keys store their lengths */ ! return lexeme_compare(key1, key2); ! } ! /* ! * Comparision function for lexemes. ! */ ! static int ! lexeme_compare(const void *key1, const void *key2) ! { ! const LexemeHashKey *d1 = (const LexemeHashKey *) key1; ! const LexemeHashKey *d2 = (const LexemeHashKey *) key2; ! ! /* First, compare by length */ ! if (d1->length > d2->length) return 1; + else if (d1->length < d2->length) + return -1; + else + /* Lengths are equal, do a byte-for-byte comparision */ + return strncmp(d1->lexeme, d2->lexeme, d1->length); } /* ! * qsort() comparator for sorting TrackItems on frequencies (descending sort) */ static int ! trackitem_compare_frequencies_desc(const void *e1, const void *e2) { const TrackItem * const *t1 = (const TrackItem * const *) e1; const TrackItem * const *t2 = (const TrackItem * const *) e2; return (*t2)->frequency - (*t1)->frequency; } + + /* + * qsort() comparator for sorting TrackItem on lexemes + */ + static int + trackitem_compare_lexemes(const void *e1, const void *e2) + { + const TrackItem * const *t1 = (const TrackItem * const *) e1; + const TrackItem * const *t2 = (const TrackItem * const *) e2; + + return lexeme_compare(&(*t1)->key, &(*t2)->key); + } *** a/src/include/catalog/pg_operator.h --- b/src/include/catalog/pg_operator.h *************** *** 915,924 **** DATA(insert OID = 3630 ( "<>" PGNSP PGUID b f f 3614 3614 16 3630 3629 ts DATA(insert OID = 3631 ( ">=" PGNSP PGUID b f f 3614 3614 16 3628 3627 tsvector_ge scalargtsel scalargtjoinsel)); DATA(insert OID = 3632 ( ">" PGNSP PGUID b f f 3614 3614 16 3627 3628 tsvector_gt scalargtsel scalargtjoinsel)); DATA(insert OID = 3633 ( "||" PGNSP PGUID b f f 3614 3614 3614 0 0 tsvector_concat - - )); ! DATA(insert OID = 3636 ( "@@" PGNSP PGUID b f f 3614 3615 16 3637 0 ts_match_vq contsel contjoinsel )); ! DATA(insert OID = 3637 ( "@@" PGNSP PGUID b f f 3615 3614 16 3636 0 ts_match_qv contsel contjoinsel )); ! DATA(insert OID = 3660 ( "@@@" PGNSP PGUID b f f 3614 3615 16 3661 0 ts_match_vq contsel contjoinsel )); ! DATA(insert OID = 3661 ( "@@@" PGNSP PGUID b f f 3615 3614 16 3660 0 ts_match_qv contsel contjoinsel )); DATA(insert OID = 3674 ( "<" PGNSP PGUID b f f 3615 3615 16 3679 3678 tsquery_lt scalarltsel scalarltjoinsel)); DATA(insert OID = 3675 ( "<=" PGNSP PGUID b f f 3615 3615 16 3678 3679 tsquery_le scalarltsel scalarltjoinsel)); DATA(insert OID = 3676 ( "=" PGNSP PGUID b t f 3615 3615 16 3676 3677 tsquery_eq eqsel eqjoinsel )); --- 915,924 ---- DATA(insert OID = 3631 ( ">=" PGNSP PGUID b f f 3614 3614 16 3628 3627 tsvector_ge scalargtsel scalargtjoinsel)); DATA(insert OID = 3632 ( ">" PGNSP PGUID b f f 3614 3614 16 3627 3628 tsvector_gt scalargtsel scalargtjoinsel)); DATA(insert OID = 3633 ( "||" PGNSP PGUID b f f 3614 3614 3614 0 0 tsvector_concat - - )); ! DATA(insert OID = 3636 ( "@@" PGNSP PGUID b f f 3614 3615 16 3637 0 ts_match_vq tssel contjoinsel )); ! DATA(insert OID = 3637 ( "@@" PGNSP PGUID b f f 3615 3614 16 3636 0 ts_match_qv tssel contjoinsel )); ! DATA(insert OID = 3660 ( "@@@" PGNSP PGUID b f f 3614 3615 16 3661 0 ts_match_vq tssel contjoinsel )); ! 
DATA(insert OID = 3661 ( "@@@" PGNSP PGUID b f f 3615 3614 16 3660 0 ts_match_qv tssel contjoinsel )); DATA(insert OID = 3674 ( "<" PGNSP PGUID b f f 3615 3615 16 3679 3678 tsquery_lt scalarltsel scalarltjoinsel)); DATA(insert OID = 3675 ( "<=" PGNSP PGUID b f f 3615 3615 16 3678 3679 tsquery_le scalarltsel scalarltjoinsel)); DATA(insert OID = 3676 ( "=" PGNSP PGUID b t f 3615 3615 16 3676 3677 tsquery_eq eqsel eqjoinsel )); *** a/src/include/catalog/pg_proc.h --- b/src/include/catalog/pg_proc.h *************** *** 4321,4326 **** DESCR("GiST tsquery support"); --- 4321,4328 ---- DATA(insert OID = 3701 ( gtsquery_consistent PGNSP PGUID 12 1 0 0 f f t f i 5 16 "2281 2281 23 26 2281" _null__null_ _null_ gtsquery_consistent _null_ _null_ _null_ )); DESCR("GiST tsquery support"); + DATA(insert OID = 3687 ( tssel PGNSP PGUID 12 1 0 0 f f t f s 4 701 "2281 26 2281 23" _null_ _null_ _null_ tssel _null__null_ _null_ )); + DESCR("restriction selectivity of tsvector @@ tsquery"); DATA(insert OID = 3688 ( ts_typanalyze PGNSP PGUID 12 1 0 0 f f t f s 1 16 "2281" _null_ _null_ _null_ ts_typanalyze_null_ _null_ _null_ )); DESCR("tsvector typanalyze"); *** a/src/include/tsearch/ts_type.h --- b/src/include/tsearch/ts_type.h *************** *** 155,160 **** extern Datum ts_rankcd_wttf(PG_FUNCTION_ARGS); --- 155,162 ---- extern Datum ts_typanalyze(PG_FUNCTION_ARGS); + extern Datum tssel(PG_FUNCTION_ARGS); + /* * TSQuery *************** *** 291,294 **** extern Datum tsquery_rewrite_query(PG_FUNCTION_ARGS); --- 293,309 ---- extern Datum tsq_mcontains(PG_FUNCTION_ARGS); extern Datum tsq_mcontained(PG_FUNCTION_ARGS); + /* + * The default text search selectivity is chosen to be smll enough to + * encourage indexscans for typical table densities. See selfuncs.h and + * DEFAULT_EQ_SEL for details. + */ + #define DEFAULT_TS_SEL 0.005 + + /* + * operator restriction function for tsvector @@ tsquery and + * tsquery @@ tsvector + */ + extern Datum tssel(PG_FUNCTION_ARGS); + #endif /* _PG_TSTYPE_H_ */
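As a rough worked example of the combination rules in tsquery_opr_selec() above: with hypothetical MCELEM frequencies of 0.10 for 'a', 0.02 for 'b' and 0.30 for 'c' (numbers chosen purely for illustration, not taken from the patch), the query 'a & (b | !c)' would be estimated like this:

    double s_a = 0.10, s_b = 0.02, s_c = 0.30;      /* assumed MCELEM frequencies */
    double s_not_c = 1.0 - s_c;                     /* NOT node: 1 - selec        = 0.70   */
    double s_or    = s_b + s_not_c - s_b * s_not_c; /* OR node:  s1 + s2 - s1*s2  = 0.706  */
    double s_and   = s_a * s_or;                    /* AND node: s1 * s2          = 0.0706 */

The point is only to show how the per-node rules compose; at plan time the frequencies themselves come from the MCELEM statistics, or from the minfreq / 2 fallback when a lexeme is not in the slot.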
Jan Urbański wrote: > Heikki Linnakangas wrote: >> Sounds like a plan. In (2), it's even better to detoast the values >> lazily. For a typical one-word tsquery, the binary search will only >> look at a small portion of the elements. > > Hm, how can I do that? Toast is still a bit black magic to me... Do you > mean I should stick to having Datums in TextFreq? Store both the Datum and the text *. If the latter is NULL, then grab the datum, detoast and store the result in the text *. Next time you need to look at it, it's already detoasted. I don't know the code so I have no idea if this is applicable. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
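A minimal sketch of the lazy-detoasting scheme Alvaro describes, assuming the lookup entries cache a detoasted pointer next to the raw Datum; the LazyTextFreq and lazy_get_text names are illustrative only and do not appear in the patch (and, as the follow-up below notes, the idea ended up being judged not worth it here):

    #include "postgres.h"
    #include "fmgr.h"                       /* DatumGetTextP() */

    typedef struct
    {
        Datum        element;               /* raw value taken from the MCELEM array */
        text        *detoasted;             /* NULL until first accessed */
        float4       frequency;
    } LazyTextFreq;

    static text *
    lazy_get_text(LazyTextFreq *tf)
    {
        /* detoast on first access only; later comparisons reuse the cached copy */
        if (tf->detoasted == NULL)
            tf->detoasted = DatumGetTextP(tf->element);
        return tf->detoasted;
    }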
Alvaro Herrera wrote: > Jan Urbański wrote: >> Heikki Linnakangas wrote: > >>> Sounds like a plan. In (2), it's even better to detoast the values >>> lazily. For a typical one-word tsquery, the binary search will only >>> look at a small portion of the elements. >> Hm, how can I do that? Toast is still a bit black magic to me... Do you >> mean I should stick to having Datums in TextFreq? > > Store both the Datum and the text *. If the latter is NULL, then grab > the datum, detoast and store the result in the text *. Next time you > need to look at it, it's already detoasted. Yeah, I got that idea, but then I thought the chances of touching the same element during binary search twice were very small. Especially now that detoasting occurs only when we hit a text Datum that has the same length as the sought lexeme. Still, I can do it if people feel like it. Cheers, Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin
Jan Urbański wrote: > Yeah, I got that idea, but then I thought the chances of touching the > same element during binary search twice were very small. Especially now > when the detoasting occurs only when we hit a text Datum that has the > same length as the sought lexeme. > Still, I can do it if people feel like it. Actually, in that light it sounds pretty useless. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Thu, 2008-08-14 at 22:27 +0200, Jan Urbański wrote: > Jan Urbański wrote: > + * ts_selfuncs.c Not sure why this is in its own file, but if it must be, could we please put it in a file called selfuncs_ts.c so it is similar to the existing filename? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > On Thu, 2008-08-14 at 22:27 +0200, Jan Urbański wrote: >> Jan Urbański wrote: > >> + * ts_selfuncs.c > > Not sure why this is in its own file I couldn't decide where to put it, so I came up with this. > put it in a file called selfuncs_ts.c so it is similar to the existing > filename? I followed the pattern of ts_parse.c, ts_utils.c and so on. Also, I see geo_selfuncs.c. No big deal, though, I can move it. Cheers, Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes: > Simon Riggs wrote: >> put it in a file called selfuncs_ts.c so it is similar to the existing >> filename? > I followed the pattern of ts_parse.c, ts_utils.c and so on. > Also, I see geo_selfuncs.c. No big deal, though, I can move it. Given the precedent of geo_selfuncs.c, I think you were right the first time. A more interesting question is whether it should just get folded into selfuncs.c ... regards, tom lane
Tom Lane wrote: > Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes: >> Simon Riggs wrote: >>> put it in a file called selfuncs_ts.c so it is similar to the existing >>> filename? > >> I followed the pattern of ts_parse.c, ts_utils.c and so on. >> Also, I see geo_selfuncs.c. No big deal, though, I can move it. > > Given the precedent of geo_selfuncs.c, I think you were right the > first time. A more interesting question is whether it should just > get folded into selfuncs.c ... selfuncs.c is a 5.8k-line beast; I felt a bit intimidated when I first opened it. The code in ts_selfuncs.c relies strongly on what the code in ts_typanalyze.c does, and that was another reason for putting it in its own file next to ts_typanalyze.c. I don't really care, to be honest; might as well stick it into selfuncs.c. -- Jan Urbanski GPG key ID: E583D7D2 ouden estin
Jan Urbański wrote: > Tom Lane wrote: >> Jan Urbański <j.urbanski@students.mimuw.edu.pl> >> writes: >>> Simon Riggs wrote: >>>> put it in a file called selfuncs_ts.c so it is similar to the existing >>>> filename? >> >>> I followed the pattern of ts_parse.c, ts_utils.c and so on. >>> Also, I see geo_selfuncs.c. No big deal, though, I can move it. >> >> Given the precedent of geo_selfuncs.c, I think you were right the >> first time. A more interesting question is whether it should just >> get folded into selfuncs.c ... > > selfuncs.c is a 5.8k lines beast, I felt a bit intimidated when first > opened it. The code in ts_selfuncs.c relies strongly on what the code in > ts_typanalyze.c does and that was another reason for putting in in its > own file next to ts_typanalyze.c. I don't really care to be honest, > might as well stick it into selfuncs.c. I would leave the code in ts_selfuncs.c like you did. The stuff in selfuncs.c is pretty generic, not related to any specific data type. With the exception of the regex and LIKE selectivity functions, but those should rather be moved to a separate file, say pattern_selfuncs.c. There's also plenty of datatype-specific code in convert_to_scalar() and its subroutines, but even the comment there says that it's a hack. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-08-26 at 12:45 +0200, Jan Urbański wrote: > > put it in a file called selfuncs_ts.c so it is similar to the existing > > filename? > > I followed the pattern of ts_parse.c, ts_utils.c and so on. > Also, I see geo_selfuncs.c. No big deal, though, I can move it. No, don't worry. You're right, the horse has already bolted. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes: > Pre-sorting introduced one problem (see XXX in code): it's not easy > anymore to get the minimal frequency of MCELEM values. I was using it to > assert that the selectivity of a tsquery node containing a lexeme not in > MCELEM is no more that min(MCELEM freqs) / 2. That's only significant > when the minimum frequency is less than DEFAULT_TS_SEL * 2, so I'm kind > of inclined to ignore it and maybe drop a comment in the code that this > may be a potential problem. This is easily fixed: there is nothing saying that a pg_statistic slot's contents must contain the same numbers of Values and Numbers. Make the numbers array have one extra element and store the min frequency there. Maybe it'd be worth having 2 extra elements and dropping the max in, as well. I don't immediately have a use for it, but it'll be a lot harder to add it later if we don't put it in now. > If nothing is fundamentally broken with this, I'll repeat my profiling > tests to see if anything has been gained. I don't have much except minor stylistic gripes (like the ordering of the functions in ts_selfuncs.c seeming a bit random). One possibly performance-relevant point is to use DatumGetTextPP for detoasting; you've already paid the costs by using VARDATA_ANY etc, so you might as well get the benefit. Please fix the above and do the performance testing ... regards, tom lane
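A small sketch of what the two-extra-elements suggestion looks like on the reading side, assuming ANALYZE appends the minimum and maximum frequency after the per-element frequencies (the follow-up patch below stores exactly that layout); the helper name and signature are illustrative only:

    /*
     * numbers[] is the stanumbers array returned by get_attstatsslot():
     * nmcelem per-element frequencies followed by two extra cells.
     */
    static void
    get_freq_bounds(const float4 *numbers, int nnumbers, int nmcelem,
                    float4 *minfreq, float4 *maxfreq)
    {
        Assert(nnumbers == nmcelem + 2);
        *minfreq = numbers[nnumbers - 2];
        *maxfreq = numbers[nnumbers - 1];
    }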
I wrote: > ... One possibly > performance-relevant point is to use DatumGetTextPP for detoasting; > you've already paid the costs by using VARDATA_ANY etc, so you might > as well get the benefit. Actually, wait a second. That code doesn't work at all on toasted data, because it's trying to use VARSIZE_ANY_EXHDR() before detoasting. That would give you the physical datum size (eg the size of the toast pointer), not the number you need. However, this is actually not a problem because we know that the data came from an array in pg_statistic, which means the individual members *can't be toasted*. At least they can't be compressed or out-of-line. We'd do that at the array level, it's not sensible to do it on an individual array member. I think that right at the moment the array stuff doesn't permit short headers either, but it would make sense to relax that someday. So I'd recommend that your code allow either regular or short headers, but not worry about compression or out-of-line storage. Which boils down to: keep using VARSIZE_ANY_EXHDR/VARDATA_ANY, but forget the "detoasting" step. Maybe put in Assert(!VARATT_IS_COMPRESSED(datum) && !VARATT_IS_EXTERNAL(datum)) instead. regards, tom lane
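A sketch of how the comparator could look with that recommendation applied, reusing the LexemeKey and TextFreq types from the patch; this only illustrates "no detoast step, just an assertion", and is not the code as eventually committed:

    static int
    compare_lexeme_textfreq(const void *e1, const void *e2)
    {
        const LexemeKey *key = (const LexemeKey *) e1;
        const TextFreq  *t = (const TextFreq *) e2;
        void            *elem = DatumGetPointer(t->element);
        int              len1 = key->length;
        int              len2;

        /* array members coming from pg_statistic are never compressed or external */
        Assert(!VARATT_IS_COMPRESSED(elem) && !VARATT_IS_EXTERNAL(elem));

        len2 = VARSIZE_ANY_EXHDR(elem);

        /* compare lengths first, then bytes; handles regular and short headers */
        if (len1 != len2)
            return (len1 > len2) ? 1 : -1;

        return strncmp(key->lexeme, VARDATA_ANY(elem), len1);
    }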
Quoting Tom Lane <tgl@sss.pgh.pa.us>: > I wrote: >> ... One possibly >> performance-relevant point is to use DatumGetTextPP for detoasting; >> you've already paid the costs by using VARDATA_ANY etc, so you might >> as well get the benefit. > > Actually, wait a second. That code doesn't work at all on toasted data, > because it's trying to use VARSIZE_ANY_EXHDR() before detoasting. > That would give you the physical datum size (eg the size of the toast > pointer), not the number you need. > > However, this is actually not a problem because we know that the data > came from an array in pg_statistic, which means the individual members > *can't be toasted*. At least they can't be compressed or out-of-line. > We'd do that at the array level, it's not sensible to do it on an > individual array member. > > I think that right at the moment the array stuff doesn't permit short > headers either, but it would make sense to relax that someday. So I'd > recommend that your code allow either regular or short headers, but not > worry about compression or out-of-line storage. > > Which boils down to: keep using VARSIZE_ANY_EXHDR/VARDATA_ANY, but > forget the "detoasting" step. Maybe put in > Assert(!VARATT_IS_COMPRESSED(datum) && !VARATT_IS_EXTERNAL(datum)) > instead. Tom, Heikki, everyone, just to let you know - I haven't forgotten about this, but right now I'm porting myself to another country, so things are quite hectic ;) I'll pick up the patch in a couple of weeks, so I'm sure it will be ready for the November CF. Thanks, Jan
ju219721@students.mimuw.edu.pl wrote: > Quoting Tom Lane <tgl@sss.pgh.pa.us>: > >> I wrote: >>> ... One possibly >>> performance-relevant point is to use DatumGetTextPP for detoasting; >>> you've already paid the costs by using VARDATA_ANY etc, so you might >>> as well get the benefit. >> >> Actually, wait a second. That code doesn't work at all on toasted data, >> because it's trying to use VARSIZE_ANY_EXHDR() before detoasting. >> That would give you the physical datum size (eg the size of the toast >> pointer), not the number you need. >> >> However, this is actually not a problem because we know that the data >> came from an array in pg_statistic, which means the individual members >> *can't be toasted*. At least they can't be compressed or out-of-line. >> We'd do that at the array level, it's not sensible to do it on an >> individual array member. >> >> I think that right at the moment the array stuff doesn't permit short >> headers either, but it would make sense to relax that someday. So I'd >> recommend that your code allow either regular or short headers, but not >> worry about compression or out-of-line storage. >> >> Which boils down to: keep using VARSIZE_ANY_EXHDR/VARDATA_ANY, but >> forget the "detoasting" step. Maybe put in >> Assert(!VARATT_IS_COMPRESSED(datum) && !VARATT_IS_EXTERNAL(datum)) >> instead. Well whaddya know. It turned out that my new company has a 'Fridays-are-for-any-opensource-hacking-you-like' policy, so I got a full day to work on the patch. Attached is a version that stores the minimal and maximal frequencies in the Numbers array, has the aforementioned assertion and more nicely ordered functions in ts_selfuncs.c. I tested it with oprofile and pgbench -n -f tssel-bench.sql -t 1000 postgres with tssel-bench.sql containing select * from manuals where tsvector @@ to_tsquery('foo'); "manuals" has ~700 rows and 'foo' does not appear in any of the lexemes. 
The results are: === CVS HEAD === scaling factor: 1 query mode: simple number of clients: 1 number of transactions per client: 1000 number of transactions actually processed: 1000/1000 tps = 13.399584 (including connections establishing) tps = 13.399972 (excluding connections establishing) 74069 34.7779 pglz_decompress 38560 18.1052 tsvectorout 7688 3.6098 pg_mblen 5366 2.5195 hash_search_with_hash_value 4833 2.2693 pg_utf_mblen 4718 2.2153 AllocSetAlloc 4041 1.8974 index_getnext 3100 1.4556 LWLockAcquire 3056 1.4349 hash_any 2843 1.3349 LWLockRelease 2611 1.2260 AllocSetFree 2126 0.9982 tsCompareString 2121 0.9959 _bt_compare 1830 0.8592 LockAcquire 1517 0.7123 toast_fetch_datum 1503 0.7057 .plt 1338 0.6282 _bt_checkkeys 1332 0.6254 FunctionCall2 1233 0.5789 ReadBuffer_common 1185 0.5564 slot_deform_tuple 1157 0.5433 TParserGet 1123 0.5273 LockRelease === PATCH === transaction type: Custom query scaling factor: 1 query mode: simple number of clients: 1 number of transactions per client: 1000 number of transactions actually processed: 1000/1000 tps = 13.309346 (including connections establishing) tps = 13.309761 (excluding connections establishing) 171514 35.0802 pglz_decompress 87231 17.8416 tsvectorout 17107 3.4989 pg_mblen 12514 2.5595 hash_search_with_hash_value 11124 2.2752 pg_utf_mblen 10739 2.1965 AllocSetAlloc 8534 1.7455 index_getnext 7460 1.5258 LWLockAcquire 6876 1.4064 LWLockRelease 6622 1.3544 hash_any 5773 1.1808 AllocSetFree 5210 1.0656 _bt_compare 4849 0.9918 tsCompareString 4043 0.8269 LockAcquire 3535 0.7230 .plt 3246 0.6639 _bt_checkkeys 3170 0.6484 toast_fetch_datum 3057 0.6253 FunctionCall2 2815 0.5758 ReadBuffer_common 2767 0.5659 TParserGet 2605 0.5328 slot_deform_tuple 2567 0.5250 MemoryContextAlloc Cheers, Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin *** a/doc/src/sgml/catalogs.sgml --- b/doc/src/sgml/catalogs.sgml *************** *** 6664,6669 **** --- 6664,6671 ---- A list of the frequencies of the most common values or elements, i.e., number of occurrences of each divided by total number of rows. (NULL when <structfield>most_common_vals</structfield> is.) + For some datatypes such as <type>tsvector</>, it can also store some + additional information, i.e. be longer than the most_common_vals array. </entry> </row> *** a/src/backend/tsearch/Makefile --- b/src/backend/tsearch/Makefile *************** *** 19,25 **** DICTFILES=synonym_sample.syn thesaurus_sample.ths hunspell_sample.affix \ OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \ dict_simple.o dict_synonym.o dict_thesaurus.o \ dict_ispell.o regis.o spell.o \ ! to_tsany.o ts_typanalyze.o ts_utils.o include $(top_srcdir)/src/backend/common.mk --- 19,25 ---- OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \ dict_simple.o dict_synonym.o dict_thesaurus.o \ dict_ispell.o regis.o spell.o \ ! to_tsany.o ts_typanalyze.o ts_selfuncs.o ts_utils.o include $(top_srcdir)/src/backend/common.mk *** /dev/null --- b/src/backend/tsearch/ts_selfuncs.c *************** *** 0 **** --- 1,323 ---- + /*------------------------------------------------------------------------- + * + * ts_selfuncs.c + * Selectivity functions for text search types. 
+ * + * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group + * + * + * IDENTIFICATION + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ + #include "postgres.h" + + #include "miscadmin.h" /* for check_stack_depth() */ + #include "utils/memutils.h" + #include "utils/builtins.h" + #include "utils/syscache.h" + #include "utils/lsyscache.h" + #include "utils/selfuncs.h" + #include "catalog/pg_type.h" + #include "catalog/pg_statistic.h" + #include "nodes/nodes.h" + #include "tsearch/ts_type.h" + + /* lookup table type for binary searching through MCELEMs */ + typedef struct + { + Datum element; + float4 frequency; + } TextFreq; + + /* type of keys for bsearch()ing through an array of TextFreqs */ + typedef struct + { + char *lexeme; + int length; + } LexemeKey; + + static double + tsquerysel(VariableStatData *vardata, Datum constval); + static Selectivity + mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem, + float4 *numbers, int nnumbers); + static Selectivity + tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup, + int length, float4 minfreq); + static int + compare_lexeme_textfreq(const void *e1, const void *e2); + + /* + * tssel -- Selectivity of "@@" + */ + Datum + tssel(PG_FUNCTION_ARGS) + { + PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0); + /* We skip arg #2, which is the operator OID - it's gonna be "@@" */ + List *args = (List *) PG_GETARG_POINTER(2); + int varRelid = PG_GETARG_INT32(3); + VariableStatData vardata; + Node *other; + bool varonleft; + Selectivity selec; + + /* + * If expression is not variable op something or something op variable, + * then punt and return a default estimate. + */ + if (!get_restriction_variable(root, args, varRelid, + &vardata, &other, &varonleft)) + { + PG_RETURN_FLOAT8(DEFAULT_TS_SEL); + } + + /* + * Can't do anything useful if the something is not a constant, either. + */ + if (!IsA(other, Const)) + { + ReleaseVariableStats(vardata); + PG_RETURN_FLOAT8(DEFAULT_TS_SEL); + } + + /* The "@@" operator is strict, so might cope with NULL right away */ + if (((Const *) other)->constisnull) { + ReleaseVariableStats(vardata); + PG_RETURN_FLOAT8((float8) 0.0); + } + + /* + * OK, there's a Var and a Const we're dealing with here. We need the Var + * to be a TSVector (or else we don't have any useful statistic for it). + */ + + if (vardata.vartype == TSVECTOROID) + { + /* tsvector @@ tsquery or the other way around */ + Assert(((Const *) other)->consttype == TSQUERYOID); + + selec = tsquerysel(&vardata, ((Const *) other)->constvalue); + } + else + { + /* The Var is something we don't have useful statistic for */ + selec = DEFAULT_TS_SEL; + } + + ReleaseVariableStats(vardata); + + PG_RETURN_FLOAT8((float8) selec); + } + + static Selectivity + tsquerysel(VariableStatData *vardata, Datum constval) + { + Selectivity selec; + + if (HeapTupleIsValid(vardata->statsTuple)) + { + TSQuery query; + Form_pg_statistic stats; + Datum *values; + int nvalues; + float4 *numbers; + int nnumbers; + + /* The caller made sure the const is a TSQuery, so get it now */ + query = DatumGetTSQuery(constval); + + stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple); + + /* MCELEM will be an array of Text elements for a tsvector column */ + if (get_attstatsslot(vardata->statsTuple, + TEXTOID, -1, + STATISTIC_KIND_MCELEM, InvalidOid, + &values, &nvalues, + &numbers, &nnumbers)) + { + /* + * There is a most-common-elements slot for the tsvector Var, so + * use that. 
+ */ + + selec = mcelem_tsquery_selec(query, values, nvalues, + numbers, nnumbers); + free_attstatsslot(TEXTOID, values, nvalues, numbers, nnumbers); + } + else + { + /* No most-common-elements slot */ + selec = (Selectivity) DEFAULT_TS_SEL; + } + } + else + { + selec = (Selectivity) DEFAULT_TS_SEL; + } + + return selec; + } + + /* + * Traverse the tsquery preorder, calculating selectivity as: + * + * selec(left_oper) * selec(right_oper) in AND nodes, + * + * selec(left_oper) + selec(right_oper) - + * selec(left_oper) * selec(right_oper) in OR nodes, + * + * 1 - select(oper) in NOT nodes + * + * freq[val] in VAL nodes, if the value is in MCELEM + * min(freq[MCELEM]) / 2 in VAL nodes, if it is not + * + * + * The MCELEM array is already sorted (see ts_typanalyze.c), so we can use + * binary search for determining freq[MCELEM]. + */ + static Selectivity + mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem, + float4 *numbers, int nnumbers) + { + float4 minfreq; + TextFreq *lookup; + Selectivity selec; + int i; + + /* + * Grab the lowest frequency. compute_tsvector_stats() stored it for us in + * the one before the last cell of the Numbers array. See ts_typanalyze.c + */ + minfreq = numbers[nnumbers - 2]; + + /* + * Construct the array for binary search. There should be two more Numbers + * than Values, because the last two cells are taken for minimal and + * maximal frequency. + */ + Assert(nmcelem == nnumbers - 2); + lookup = (TextFreq *) palloc(sizeof(TextFreq) * nmcelem); + for (i = 0; i < nmcelem; i++) + { + lookup[i].element = mcelem[i]; + lookup[i].frequency = numbers[i]; + } + + selec = tsquery_opr_selec(GETQUERY(query), GETOPERAND(query), lookup, + nmcelem, minfreq); + + pfree(lookup); + + return selec; + } + + /* TSQuery traversal function */ + static Selectivity + tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup, + int length, float4 minfreq) + { + LexemeKey key; + TextFreq *searchres; + Selectivity s1, s2; + + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + + if (item->type == QI_VAL) + { + QueryOperand *oper = (QueryOperand *) item; + + /* + * Prepare the key for bsearch(). + */ + key.lexeme = operand + oper->distance; + key.length = oper->length; + + searchres = (TextFreq *) bsearch(&key, lookup, length, + sizeof(TextFreq), compare_lexeme_textfreq); + + if (searchres) + { + /* + * The element is in MCELEM. Return precise selectivity (or at + * least as precise, as ANALYZE could find out). + */ + return (Selectivity) searchres->frequency; + } + else + { + /* + * The element is not in MCELEM. Punt, but assert that the + * selectivity cannot be more than minfreq / 2. 
+ */ + return (Selectivity) Min(DEFAULT_TS_SEL, minfreq / 2); + } + } + + /* Current TSQuery node is an operator */ + switch (item->operator.oper) + { + case OP_NOT: + return 1.0 - tsquery_opr_selec(item + 1, operand, lookup, + length, minfreq); + + case OP_AND: + return + tsquery_opr_selec(item + 1, operand, lookup, length, minfreq) * + tsquery_opr_selec(item + item->operator.left, operand, lookup, + length, minfreq); + + case OP_OR: + s1 = tsquery_opr_selec(item + 1, operand, lookup, length, minfreq); + s2 = tsquery_opr_selec(item + item->operator.left, operand, lookup, + length, minfreq); + return s1 + s2 - s1 * s2; + + default: + elog(ERROR, "unrecognized operator: %d", item->operator.oper); + } + + /* never reached, keep compiler happy */ + return (Selectivity) DEFAULT_TS_SEL; + } + + /* + * bsearch() comparator for a lexeme (non-NULL terminated string with length) + * and a TextFreq. Use length, then byte-for-byte comparision, because that's + * how ANALYZE code sorted data before storing it in a statistic tuple. + * See ts_typanalyze.c for details. + */ + static int + compare_lexeme_textfreq(const void *e1, const void *e2) + { + const LexemeKey *key; + const TextFreq *t; + text *element; + int len1, + len2; + + key = (const LexemeKey *) e1; + t = (const TextFreq *) e2; + + /* + * The text Datum came from an array, so it cannot be compressed + * or stored out-of-line -- it's safe to use VARSIZE_ANY*. + */ + Assert(!VARATT_IS_COMPRESSED(t->element) && !VARATT_IS_EXTERNAL(t->element)); + + len1 = key->length; + len2 = VARSIZE_ANY_EXHDR(t->element); + + /* Compare lengths first, possibly avoiding a strncmp call */ + if (len1 > len2) + return 1; + else if (len1 < len2) + return -1; + + /* Fall back on byte-for-byte comparision */ + element = DatumGetTextP(t->element); + return strncmp(key->lexeme, VARDATA_ANY(element), len1); + } *** a/src/backend/tsearch/ts_typanalyze.c --- b/src/backend/tsearch/ts_typanalyze.c *************** *** 43,50 **** static void compute_tsvector_stats(VacAttrStats *stats, static void prune_lexemes_hashtable(HTAB *lexemes_tab, int b_current); static uint32 lexeme_hash(const void *key, Size keysize); static int lexeme_match(const void *key1, const void *key2, Size keysize); ! static int trackitem_compare_desc(const void *e1, const void *e2); ! /* * ts_typanalyze -- a custom typanalyze function for tsvector columns --- 43,51 ---- static void prune_lexemes_hashtable(HTAB *lexemes_tab, int b_current); static uint32 lexeme_hash(const void *key, Size keysize); static int lexeme_match(const void *key1, const void *key2, Size keysize); ! static int lexeme_compare(const void *key1, const void *key2); ! static int trackitem_compare_frequencies_desc(const void *e1, const void *e2); ! static int trackitem_compare_lexemes(const void *e1, const void *e2); /* * ts_typanalyze -- a custom typanalyze function for tsvector columns *************** *** 247,252 **** compute_tsvector_stats(VacAttrStats *stats, --- 248,254 ---- int i; TrackItem **sort_table; int track_len; + int minfreq, maxfreq; stats->stats_valid = true; /* Do the simple null-frac and average width stats */ *************** *** 273,279 **** compute_tsvector_stats(VacAttrStats *stats, Assert(i == track_len); qsort(sort_table, track_len, sizeof(TrackItem *), ! trackitem_compare_desc); /* Suppress any single-occurrence items */ while (track_len > 0) --- 275,281 ---- Assert(i == track_len); qsort(sort_table, track_len, sizeof(TrackItem *), ! 
trackitem_compare_frequencies_desc); /* Suppress any single-occurrence items */ while (track_len > 0) *************** *** 287,292 **** compute_tsvector_stats(VacAttrStats *stats, --- 289,314 ---- if (num_mcelem > track_len) num_mcelem = track_len; + /* Grab the minimal and maximal frequencies that will get stored */ + minfreq = sort_table[num_mcelem]->frequency; + maxfreq = sort_table[0]->frequency; + + /* + * We want to store statistics sorted on the lexeme value using first + * length, then byte-for-byte comparision. The reason for doing length + * comparision first is that we don't care about the ordering as long + * as it's consistent and comparing lengths first gives us a chance to + * avoid a strncmp() call. + * + * This is different from what we do with scalar statistics -- they get + * sorted on frequencies. The rationale is that we usually search + * through most common elements looking for a specific value, so we can + * grab its frequency. When values are presorted we can employ binary + * search for that. See ts_selfuncs.c for a real usage scenario. + */ + qsort(sort_table, num_mcelem, sizeof(TrackItem *), + trackitem_compare_lexemes); + /* Generate MCELEM slot entry */ if (num_mcelem > 0) { *************** *** 296,303 **** compute_tsvector_stats(VacAttrStats *stats, /* Must copy the target values into anl_context */ old_context = MemoryContextSwitchTo(stats->anl_context); mcelem_values = (Datum *) palloc(num_mcelem * sizeof(Datum)); ! mcelem_freqs = (float4 *) palloc(num_mcelem * sizeof(float4)); for (i = 0; i < num_mcelem; i++) { --- 318,332 ---- /* Must copy the target values into anl_context */ old_context = MemoryContextSwitchTo(stats->anl_context); + + /* + * We sorted statistics on the lexeme value, but we want to be + * able to reach the minimal and maximal frequency without goind + * through all the values. We keep those two extra frequencies in + * two extra cells in mcelem_freqs. + */ mcelem_values = (Datum *) palloc(num_mcelem * sizeof(Datum)); ! mcelem_freqs = (float4 *) palloc((num_mcelem + 2) * sizeof(float4)); for (i = 0; i < num_mcelem; i++) { *************** *** 306,319 **** compute_tsvector_stats(VacAttrStats *stats, mcelem_values[i] = PointerGetDatum(cstring_to_text_with_len(item->key.lexeme, item->key.length)); ! mcelem_freqs[i] = (double) item->frequency / (double) nonnull_cnt; } MemoryContextSwitchTo(old_context); stats->stakind[0] = STATISTIC_KIND_MCELEM; stats->staop[0] = TextEqualOperator; stats->stanumbers[0] = mcelem_freqs; ! stats->numnumbers[0] = num_mcelem; stats->stavalues[0] = mcelem_values; stats->numvalues[0] = num_mcelem; /* We are storing text values */ --- 335,353 ---- mcelem_values[i] = PointerGetDatum(cstring_to_text_with_len(item->key.lexeme, item->key.length)); ! mcelem_freqs[i] = ! (double) item->frequency / ! (double) nonnull_cnt; } + mcelem_freqs[i++] = (double) minfreq / (double) nonnull_cnt; + mcelem_freqs[i] = (double) maxfreq / (double) nonnull_cnt; MemoryContextSwitchTo(old_context); stats->stakind[0] = STATISTIC_KIND_MCELEM; stats->staop[0] = TextEqualOperator; stats->stanumbers[0] = mcelem_freqs; ! /* See above comment about two extra frequency fields */ ! stats->numnumbers[0] = num_mcelem + 2; stats->stavalues[0] = mcelem_values; stats->numvalues[0] = num_mcelem; /* We are storing text values */ *************** *** 379,403 **** lexeme_hash(const void *key, Size keysize) static int lexeme_match(const void *key1, const void *key2, Size keysize) { ! const LexemeHashKey *d1 = (const LexemeHashKey *) key1; ! 
const LexemeHashKey *d2 = (const LexemeHashKey *) key2; ! /* The lexemes need to have the same length, and be memcmp-equal */ ! if (d1->length == d2->length && ! memcmp(d1->lexeme, d2->lexeme, d1->length) == 0) ! return 0; ! else return 1; } /* ! * qsort() comparator for TrackItems - LC style (descending sort) */ static int ! trackitem_compare_desc(const void *e1, const void *e2) { const TrackItem * const *t1 = (const TrackItem * const *) e1; const TrackItem * const *t2 = (const TrackItem * const *) e2; return (*t2)->frequency - (*t1)->frequency; } --- 413,461 ---- static int lexeme_match(const void *key1, const void *key2, Size keysize) { ! /* The keysize parameter is superfluous, the keys store their lengths */ ! return lexeme_compare(key1, key2); ! } ! /* ! * Comparision function for lexemes. ! */ ! static int ! lexeme_compare(const void *key1, const void *key2) ! { ! const LexemeHashKey *d1 = (const LexemeHashKey *) key1; ! const LexemeHashKey *d2 = (const LexemeHashKey *) key2; ! ! /* First, compare by length */ ! if (d1->length > d2->length) return 1; + else if (d1->length < d2->length) + return -1; + else + /* Lengths are equal, do a byte-for-byte comparision */ + return strncmp(d1->lexeme, d2->lexeme, d1->length); } /* ! * qsort() comparator for sorting TrackItems on frequencies (descending sort) */ static int ! trackitem_compare_frequencies_desc(const void *e1, const void *e2) { const TrackItem * const *t1 = (const TrackItem * const *) e1; const TrackItem * const *t2 = (const TrackItem * const *) e2; return (*t2)->frequency - (*t1)->frequency; } + + /* + * qsort() comparator for sorting TrackItem on lexemes + */ + static int + trackitem_compare_lexemes(const void *e1, const void *e2) + { + const TrackItem * const *t1 = (const TrackItem * const *) e1; + const TrackItem * const *t2 = (const TrackItem * const *) e2; + + return lexeme_compare(&(*t1)->key, &(*t2)->key); + } *** a/src/include/catalog/pg_operator.h --- b/src/include/catalog/pg_operator.h *************** *** 915,924 **** DATA(insert OID = 3630 ( "<>" PGNSP PGUID b f f 3614 3614 16 3630 3629 ts DATA(insert OID = 3631 ( ">=" PGNSP PGUID b f f 3614 3614 16 3628 3627 tsvector_ge scalargtsel scalargtjoinsel)); DATA(insert OID = 3632 ( ">" PGNSP PGUID b f f 3614 3614 16 3627 3628 tsvector_gt scalargtsel scalargtjoinsel)); DATA(insert OID = 3633 ( "||" PGNSP PGUID b f f 3614 3614 3614 0 0 tsvector_concat - - )); ! DATA(insert OID = 3636 ( "@@" PGNSP PGUID b f f 3614 3615 16 3637 0 ts_match_vq contsel contjoinsel )); ! DATA(insert OID = 3637 ( "@@" PGNSP PGUID b f f 3615 3614 16 3636 0 ts_match_qv contsel contjoinsel )); ! DATA(insert OID = 3660 ( "@@@" PGNSP PGUID b f f 3614 3615 16 3661 0 ts_match_vq contsel contjoinsel )); ! DATA(insert OID = 3661 ( "@@@" PGNSP PGUID b f f 3615 3614 16 3660 0 ts_match_qv contsel contjoinsel )); DATA(insert OID = 3674 ( "<" PGNSP PGUID b f f 3615 3615 16 3679 3678 tsquery_lt scalarltsel scalarltjoinsel)); DATA(insert OID = 3675 ( "<=" PGNSP PGUID b f f 3615 3615 16 3678 3679 tsquery_le scalarltsel scalarltjoinsel)); DATA(insert OID = 3676 ( "=" PGNSP PGUID b t f 3615 3615 16 3676 3677 tsquery_eq eqsel eqjoinsel )); --- 915,924 ---- DATA(insert OID = 3631 ( ">=" PGNSP PGUID b f f 3614 3614 16 3628 3627 tsvector_ge scalargtsel scalargtjoinsel)); DATA(insert OID = 3632 ( ">" PGNSP PGUID b f f 3614 3614 16 3627 3628 tsvector_gt scalargtsel scalargtjoinsel)); DATA(insert OID = 3633 ( "||" PGNSP PGUID b f f 3614 3614 3614 0 0 tsvector_concat - - )); ! 
DATA(insert OID = 3636 ( "@@" PGNSP PGUID b f f 3614 3615 16 3637 0 ts_match_vq tssel contjoinsel )); ! DATA(insert OID = 3637 ( "@@" PGNSP PGUID b f f 3615 3614 16 3636 0 ts_match_qv tssel contjoinsel )); ! DATA(insert OID = 3660 ( "@@@" PGNSP PGUID b f f 3614 3615 16 3661 0 ts_match_vq tssel contjoinsel )); ! DATA(insert OID = 3661 ( "@@@" PGNSP PGUID b f f 3615 3614 16 3660 0 ts_match_qv tssel contjoinsel )); DATA(insert OID = 3674 ( "<" PGNSP PGUID b f f 3615 3615 16 3679 3678 tsquery_lt scalarltsel scalarltjoinsel)); DATA(insert OID = 3675 ( "<=" PGNSP PGUID b f f 3615 3615 16 3678 3679 tsquery_le scalarltsel scalarltjoinsel)); DATA(insert OID = 3676 ( "=" PGNSP PGUID b t f 3615 3615 16 3676 3677 tsquery_eq eqsel eqjoinsel )); *** a/src/include/catalog/pg_proc.h --- b/src/include/catalog/pg_proc.h *************** *** 4434,4439 **** DESCR("GiST tsquery support"); --- 4434,4441 ---- DATA(insert OID = 3701 ( gtsquery_consistent PGNSP PGUID 12 1 0 0 f f t f i 5 16 "2281 2281 23 26 2281" _null__null_ _null_ gtsquery_consistent _null_ _null_ _null_ )); DESCR("GiST tsquery support"); + DATA(insert OID = 3687 ( tssel PGNSP PGUID 12 1 0 0 f f t f s 4 701 "2281 26 2281 23" _null_ _null_ _null_ tssel _null__null_ _null_ )); + DESCR("restriction selectivity of tsvector @@ tsquery"); DATA(insert OID = 3688 ( ts_typanalyze PGNSP PGUID 12 1 0 0 f f t f s 1 16 "2281" _null_ _null_ _null_ ts_typanalyze_null_ _null_ _null_ )); DESCR("tsvector typanalyze"); *** a/src/include/tsearch/ts_type.h --- b/src/include/tsearch/ts_type.h *************** *** 155,160 **** extern Datum ts_rankcd_wttf(PG_FUNCTION_ARGS); --- 155,162 ---- extern Datum ts_typanalyze(PG_FUNCTION_ARGS); + extern Datum tssel(PG_FUNCTION_ARGS); + /* * TSQuery *************** *** 291,294 **** extern Datum tsquery_rewrite_query(PG_FUNCTION_ARGS); --- 293,309 ---- extern Datum tsq_mcontains(PG_FUNCTION_ARGS); extern Datum tsq_mcontained(PG_FUNCTION_ARGS); + /* + * The default text search selectivity is chosen to be smll enough to + * encourage indexscans for typical table densities. See selfuncs.h and + * DEFAULT_EQ_SEL for details. + */ + #define DEFAULT_TS_SEL 0.005 + + /* + * operator restriction function for tsvector @@ tsquery and + * tsquery @@ tsvector + */ + extern Datum tssel(PG_FUNCTION_ARGS); + #endif /* _PG_TSTYPE_H_ */
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes: > ju219721@students.mimuw.edu.pl wrote: > Well whaddya know. It turned out that my new company has a > 'Fridays-are-for-any-opensource-hacking-you-like' policy, so I got a > full day to work on the patch. Hm, does their name start with G? > Attached is a version that stores the minimal and maximal frequencies in > the Numbers array, has the aforementioned assertion and more nicely > ordered functions in ts_selfuncs.c. Excellent, I'll get to work on this version. regards, tom lane
Tom Lane wrote: > Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes: >> ju219721@students.mimuw.edu.pl wrote: >> Well whaddya know. It turned out that my new company has a >> 'Fridays-are-for-any-opensource-hacking-you-like' policy, so I got a >> full day to work on the patch. > > Hm, does their name start with G? No ;) It's called Flumotion (http://www.flumotion.com/eng/). >> Attached is a version that stores the minimal and maximal frequencies in >> the Numbers array, has the aforementioned assertion and more nicely >> ordered functions in ts_selfuncs.c. > > Excellent, I'll get to work on this version. Great, thanks. Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes: > Attached is a version that stores the minimal and maximal frequencies in > the Numbers array, has the aforementioned assertion and more nicely > ordered functions in ts_selfuncs.c. Applied with some small corrections. regards, tom lane