Thread: Function-based index not used in a simple query
Hi all,

I have run into a problem trying to optimize a select from a single table. Due to the nature of the queries our frontends issue most of the time, we created an index that reduces the processing time a lot. The problem (the index not being used) arose when I tried a select with non-constant parameters in the WHERE condition.

The table definition looks like this:

    CREATE TABLE vals (
        timestamp timestamp NOT NULL,
        agent varchar(15) NOT NULL,
        var varchar(64) NOT NULL,
        val text NOT NULL);

The index definition is based on a user-defined function:

    CREATE FUNCTION trunc_to_day(timestamp) RETURNS timestamp AS '
        DECLARE
            tstamp ALIAS FOR $1;
        BEGIN
            RETURN date_trunc(''day'', tstamp);
        END;' LANGUAGE 'plpgsql';

    CREATE INDEX vals_days ON vals (trunc_to_day(timestamp) timestamp_ops);

A typical query looks like this (additional conditions removed from the WHERE condition, as well as additional GROUP BY and ORDER BY clauses):

    SELECT *
    FROM vals
    WHERE trunc_to_day(timestamp) = '28.5.2000';

EXPLAIN on this query produces:

    Index Scan using vals_days on vals  (cost=0.00..8.16 rows=10 width=44)

Now, when I try the same with a slightly changed comparison expression in the WHERE clause, the optimizer decides to sequentially scan the table:

    SELECT *
    FROM vals
    WHERE trunc_to_day(timestamp) = trunc_to_day('28.5.2000');

    Seq Scan on vals  (cost=0.00..27.50 rows=10 width=44)

Actually, the problem first appeared in a stored procedure:

    CREATE FUNCTION detector(timestamp, varchar) RETURNS float AS '
        DECLARE
            check_time ALIAS FOR $1;
            check_agent ALIAS FOR $2;
            from_time timestamp;
            to_time timestamp;
            from_day timestamp;
            to_day timestamp;
            rssi_var vars.var%TYPE;
            avg_rssi float;
        BEGIN
            from_time = check_time;
            from_day = trunc_to_day(from_time);
            to_time = check_time + ''1 day''::interval;
            to_day = trunc_to_day(to_time);
            SELECT INTO rssi_var var || ''%'' FROM vars
                WHERE name = ''brzAvrgRssi'';
            SELECT INTO avg_rssi AVG(val::float) FROM vals
                WHERE trunc_to_day(timestamp) BETWEEN from_day AND to_day
                    AND timestamp BETWEEN from_time AND to_time
                    AND agent = check_agent
                    AND var LIKE rssi_var;
            IF avg_rssi IS NULL THEN
                RAISE EXCEPTION ''There are no values for % and %.'',
                    check_time, check_agent;
            END IF;
            RETURN avg_rssi;
        END;' LANGUAGE 'plpgsql';

Sorry for the lengthy listing, but I didn't want to accidentally omit something important. It seems the optimizer chooses a seq scan whenever there is anything other than a simple string constant in the WHERE condition. After reading the thread ``Index not used in functions in 7.0'' two weeks ago on this list, I experimented extensively with typecasting, trying to add ``::timestamp'' wherever possible in the query, but with no success.

The PostgreSQL version is 7.0.0 on i686-pc-linux-gnu, compiled by gcc egcs-2.91.66.

Thanks in advance for any advice!

Orbis

-- 
Rostislav Opocensky <orbis@pictus.org> <orbis@unreal.cz>  +420 411 825144
Unreal Technology sro., Dobrin 118, 41301 Roudnice n. L.  +420 411 825111
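[For readers following along, the minimal pair that reproduces the two plans above is just this; it condenses the setup in the post, with ISO-style date literals substituted for '28.5.2000' on the assumption that the session's DateStyle parses them to the same day:

    -- Right-hand side is a plain constant: the clause matches the
    -- index_key = constant shape, so the vals_days index is considered.
    EXPLAIN SELECT * FROM vals
        WHERE trunc_to_day(timestamp) = '2000-05-28';

    -- Right-hand side is a function call: the optimizer cannot assume it
    -- folds to a constant, so it falls back to a sequential scan.
    EXPLAIN SELECT * FROM vals
        WHERE trunc_to_day(timestamp) = trunc_to_day('2000-05-28');
]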
Rostislav Opocensky <orbis@pictus.org> writes:
> SELECT *
> FROM vals
> WHERE trunc_to_day(timestamp) = trunc_to_day('28.5.2000');
>
> Seq Scan on vals  (cost=0.00..27.50 rows=10 width=44)

The problem here is that the optimizer will only consider an indexscan for a clause that looks like index_key OP constant. It doesn't think that trunc_to_day('28.5.2000') is a constant, because you haven't told it that it can pre-evaluate that function call --- and for all it knows, the function might generate different values when evaluated at different rows of the table, rendering an indexscan incorrect. (Consider a function like random() ...)

If you mark trunc_to_day() as a "cachable" function, then the system will know it's safe to collapse the righthand side to a constant:

    CREATE FUNCTION trunc_to_day(timestamp) RETURNS timestamp AS '
        DECLARE
            tstamp ALIAS FOR $1;
        BEGIN
            RETURN date_trunc(''day'', tstamp);
        END;' LANGUAGE 'plpgsql' WITH (isCachable);

This is a little bit of a cheat for this function, because what isCachable actually implies is that the function's value depends on the presented arguments *and nothing else*. I believe date_trunc depends on the current TZ setting, so it's not really cachable. You could possibly get away with it for your application, but you might run into problems like this:

1. You write trunc_to_day('28.5.2000') in a table's column default expression, stored rule, or some such. The system decides it can fold that expression down to a simple constant, giving you (say) 2000-05-28 00:00-04 --- ie, midnight in whatever timezone you are using. That value gets stored in the column or rule definition.

2. You then access the database while running with a different TZ setting. You wonder why you don't get the answers you expected.

Come to think of it, your index on trunc_to_day() is fundamentally risky, because different backends running with different TZ settings would generate different index entries for the "same" data entry. For example, if you make an entry dated '28.5.2000' while running in one timezone, and then try to query with "= trunc_to_day('28.5.2000')" while running in another timezone, you won't match that entry if the index is used, because the timestamp generated on-the-fly from trunc_to_day('28.5.2000') won't equal the one stored in the index.

(Hmm ... I wonder if that suggests that we shouldn't allow indexes to be built using functions that are not isCachable?)

You could avoid some of these risks by having trunc_to_day return a 'date' rather than a 'timestamp', thereby removing at least some of the timezone dependency from the index definition.

			regards, tom lane
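[To make the last suggestion concrete: a minimal, untested sketch of a date-returning variant and a rebuilt index. The name trunc_to_date, the cast to date, and the date_ops opclass are assumptions about how one might apply the advice on 7.0, not something given in the thread:

    -- Assumed variant: returns 'date', so no time-of-day component (and
    -- less timezone dependency) ends up in the index entries.
    CREATE FUNCTION trunc_to_date(timestamp) RETURNS date AS '
        DECLARE
            tstamp ALIAS FOR $1;
        BEGIN
            RETURN date_trunc(''day'', tstamp)::date;
        END;' LANGUAGE 'plpgsql' WITH (isCachable);  -- Tom's TZ caveat still applies

    DROP INDEX vals_days;
    CREATE INDEX vals_days ON vals (trunc_to_date(timestamp) date_ops);

Queries would then compare against a date constant, e.g. WHERE trunc_to_date(timestamp) = '2000-05-28'.]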
On Tue, 30 May 2000, Tom Lane wrote:
> The problem here is that the optimizer will only consider an indexscan
> for a clause that looks like index_key OP constant. It doesn't think
> that trunc_to_day('28.5.2000') is a constant, because you haven't told
> it that it can pre-evaluate that function call --- and for all it knows,
[...]

Dear Tom, thank you for the precise explanation of my problem! I'll consider having my index function return a `date'. Still, one thing remains unclear to me: why doesn't the optimizer use an indexscan in the stored procedure I attached to my previous post? The condition looks like WHERE trunc_to_day(timestamp) BETWEEN var1 AND var2. var1 and var2 get their values from calling the `volatile' function trunc_to_day, but from then on, their values can't change during the execution of the query. Is it possible to give the optimizer a hint about this?

Best regards

Orbis

-- 
Rostislav Opocensky <orbis@pictus.org> <orbis@unreal.cz>  +420 411 825144
Unreal Technology sro., Dobrin 118, 41301 Roudnice n. L.  +420 411 825111
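[For reference, the pattern being asked about boils down to this fragment, condensed from the detector() function in the first post (no new logic, just the relevant lines):

    from_day := trunc_to_day(from_time);  -- evaluated once, before the query runs
    to_day   := trunc_to_day(to_time);
    SELECT INTO avg_rssi AVG(val::float) FROM vals
        WHERE trunc_to_day(timestamp) BETWEEN from_day AND to_day;

Inside the executed query, from_day and to_day arrive as Params: values fixed for the life of the query, even though a non-cachable function computed them.]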
Rostislav Opocensky <orbis@pictus.org> writes:
> On Tue, 30 May 2000, Tom Lane wrote:
>> The problem here is that the optimizer will only consider an indexscan
>> for a clause that looks like index_key OP constant. It doesn't think

> I'll consider having my index function return a `date'. Still, one thing
> remains unclear to me: why doesn't the optimizer use an indexscan in the
> stored procedure I attached to my previous post? The condition looks
> like WHERE trunc_to_day(timestamp) BETWEEN var1 AND var2. var1 and var2
> get their values from calling the `volatile' function trunc_to_day, but
> from then on, their values can't change during the execution of the
> query. Is it possible to give the optimizer a hint about this?

Hmm, actually the optimizer should/does regard those as constants within subsequent queries (internally they are Params instead of Consts, but that's supposed to be OK). What I find here is that the optimizer does consider an indexscan for this query, but there's a bug in its selectivity estimation routine that causes it not to recognize the BETWEEN clause as being a range restriction --- and that means it produces a fairly high cost estimate for the indexscan. I still got an indexscan plan for a small test table, but on a larger table you might not get one.

I've applied the attached patch for 7.0.1 --- if you are in a hurry, you may care to apply it to your local copy. It just tweaks the range-query recognizer to accept Param as well as Const nodes.

			regards, tom lane

*** src/backend/optimizer/path/clausesel.c.orig	Tue May 30 00:26:44 2000
--- src/backend/optimizer/path/clausesel.c	Wed May 31 11:38:53 2000
***************
*** 120,129 ****
  		Selectivity s2;
  
  		/*
! 		 * See if it looks like a restriction clause with a constant. (If
! 		 * it's not a constant we can't really trust the selectivity!) NB:
! 		 * for consistency of results, this fragment of code had better
! 		 * match what clause_selectivity() would do.
  		 */
  		if (varRelid != 0 || NumRelids(clause) == 1)
  		{
--- 120,131 ----
  		Selectivity s2;
  
  		/*
! 		 * See if it looks like a restriction clause with a Const or Param
! 		 * on one side.  (Anything more complicated than that might not
! 		 * behave in the simple way we are expecting.)
! 		 *
! 		 * NB: for consistency of results, this fragment of code had better
! 		 * match what clause_selectivity() would do in the cases it handles.
  		 */
  		if (varRelid != 0 || NumRelids(clause) == 1)
  		{
***************
*** 134,174 ****
  			get_relattval(clause, varRelid,
  						  &relidx, &attno, &constval, &flag);
! 			if (relidx != 0 && (flag & SEL_CONSTANT))
  			{
  				/* if get_relattval succeeded, it must be an opclause */
! 				Oid			opno = ((Oper *) ((Expr *) clause)->oper)->opno;
! 				RegProcedure oprrest = get_oprrest(opno);
! 				if (!oprrest)
! 					s2 = (Selectivity) 0.5;
! 				else
! 					s2 = restriction_selectivity(oprrest, opno,
! 												 getrelid(relidx,
! 														  root->rtable),
! 												 attno,
! 												 constval, flag);
! 
! 				/*
! 				 * If we reach here, we have computed the same result that
! 				 * clause_selectivity would, so we can just use s2 if it's
! 				 * the wrong oprrest.  But if it's the right oprrest, add
! 				 * the clause to rqlist for later processing.
! 				 */
! 				switch (oprrest)
  				{
! 					case F_SCALARLTSEL:
! 						addRangeClause(&rqlist, clause, flag, true, s2);
! 						break;
! 					case F_SCALARGTSEL:
! 						addRangeClause(&rqlist, clause, flag, false, s2);
! 						break;
! 					default:
! 						/* Just merge the selectivity in generically */
! 						s1 = s1 * s2;
! 						break;
  				}
- 				continue;		/* drop to loop bottom */
  			}
  		}
  		/* Not the right form, so treat it generically. */
--- 136,183 ----
  			get_relattval(clause, varRelid,
  						  &relidx, &attno, &constval, &flag);
! 			if (relidx != 0)
  			{
  				/* if get_relattval succeeded, it must be an opclause */
! 				Var		   *other;
! 				other = (flag & SEL_RIGHT) ? get_rightop((Expr *) clause) :
! 					get_leftop((Expr *) clause);
! 				if (IsA(other, Const) || IsA(other, Param))
  				{
! 					Oid			opno = ((Oper *) ((Expr *) clause)->oper)->opno;
! 					RegProcedure oprrest = get_oprrest(opno);
! 
! 					if (!oprrest)
! 						s2 = (Selectivity) 0.5;
! 					else
! 						s2 = restriction_selectivity(oprrest, opno,
! 													 getrelid(relidx,
! 															  root->rtable),
! 													 attno,
! 													 constval, flag);
! 
! 					/*
! 					 * If we reach here, we have computed the same result that
! 					 * clause_selectivity would, so we can just use s2 if it's
! 					 * the wrong oprrest.  But if it's the right oprrest, add
! 					 * the clause to rqlist for later processing.
! 					 */
! 					switch (oprrest)
! 					{
! 						case F_SCALARLTSEL:
! 							addRangeClause(&rqlist, clause, flag, true, s2);
! 							break;
! 						case F_SCALARGTSEL:
! 							addRangeClause(&rqlist, clause, flag, false, s2);
! 							break;
! 						default:
! 							/* Just merge the selectivity in generically */
! 							s1 = s1 * s2;
! 							break;
! 					}
! 					continue;	/* drop to loop bottom */
  				}
  			}
  		}
  		/* Not the right form, so treat it generically. */
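[To exercise the fixed path, a small test along these lines could be used. This is a hypothetical sketch, not from the thread: the function name count_day_range is invented, and a backend with the patch (or 7.0.1) plus the vals table and trunc_to_day function from the first post are assumed:

    -- BETWEEN is rewritten to (x >= lo AND x <= hi). With the patch, the
    -- two comparisons are recognized as one range restriction even when
    -- lo and hi reach the planner as Params (plpgsql variables) rather
    -- than literal constants, so the indexscan is costed realistically.
    CREATE FUNCTION count_day_range(timestamp, timestamp) RETURNS int4 AS '
        DECLARE
            lo ALIAS FOR $1;
            hi ALIAS FOR $2;
            lo_day timestamp;
            hi_day timestamp;
            n int4;
        BEGIN
            lo_day := trunc_to_day(lo);
            hi_day := trunc_to_day(hi);
            SELECT INTO n COUNT(*) FROM vals
                WHERE trunc_to_day(timestamp) BETWEEN lo_day AND hi_day;
            RETURN n;
        END;' LANGUAGE 'plpgsql';

    SELECT count_day_range('2000-05-28', '2000-05-29');
]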
hi, we have a weird situation here. we have a table of approx. 10k rows representing accumulated activity by specific customers. as information is gathered, those customers' rows are updated. the number of rows does not increase unless we get a new customer, so that is not a factor. the table is defined as follows:

          Table "account_summary_02"
      Attribute   |    Type     | Modifier
    --------------+-------------+----------
     bill_br_id   | bigint      | not null
     cust_id      | varchar(15) | not null
     btn_id       | varchar(15) | not null
     ln_id        | varchar(15) | not null
     ct_key       | float8      | not null
     as_quantity  | float8      | not null
     as_charges   | float8      | not null
     as_count     | float8      | not null
    Index: account_summary_02_unq_idx

the index is on the first 5 columns. here's the situation: after about 50,000 updates, which fly right along, the process begins to really bog down. we perform a vacuum analyze and it speeds right up again. my question is, is there a way to perform these updates, potentially 500k to 1 million in a day, without having to vacuum so frequently? maybe some setting or parameter to be changed? the update query is doing an index scan.

mikeo
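[For concreteness, the workload being run hundreds of thousands of times a day presumably looks roughly like this; the key values and the increment arithmetic are hypothetical, since the post doesn't show the actual query:

    -- One accumulation step; the 5-column unique index drives the lookup.
    UPDATE account_summary_02
       SET as_quantity = as_quantity + 1.0,
           as_charges  = as_charges + 0.25,
           as_count    = as_count + 1.0
     WHERE bill_br_id = 42
       AND cust_id = 'CUST001'
       AND btn_id  = 'BTN001'
       AND ln_id   = 'LN001'
       AND ct_key  = 1.0;

Each such UPDATE leaves behind a dead version of the row, which is what the answer below addresses.]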
mikeo <mikeo@spectrumtelecorp.com> writes:
> after about 50,000 updates, which fly right along, the process begins
> to really bog down. we perform a vacuum analyze and it speeds right
> up again.

A plain "vacuum" should do the job in a bit less time. The problem is that you need to reclaim the space occupied by deleted versions of rows. No way around that with the current storage manager: vacuum is the only way to get rid of the wasted space.

			regards, tom lane
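[Since only the one table is churning, the vacuum can be limited to it, which keeps the maintenance pause short. A minimal sketch of a periodic job; the hourly schedule and the database name mydb are assumptions, only the table name comes from the post:

    -- Plain VACUUM of just the hot table reclaims dead row versions;
    -- ANALYZE is only needed when the data distribution changes, so it
    -- can be run much less often.
    VACUUM account_summary_02;

    -- e.g. from cron, hourly:
    -- 0 * * * *  psql -c "VACUUM account_summary_02" mydb
]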