Thread: best way to fetch next/prev record based on index
I am in a situation where I have to treat a table as logically ordered based on an index. Right now I'm doing this via queries, and a I need a better way to do it. Cursors do not meet my requirements, because they are always insensitive. Also, my performance requirements are extreme...I need 100% index usage. Currently, I use queries to do this. Unfortunately, the queries can get kind of complex because many if the indexes (keys, really) are over 3 or more columns in a table. So, for a table t with a three part key over columns a,b,c, the query to read the next value from t for given values a1, b1, c1 is select * from t where a >= a1 and (a > a1 or b >= b1) and (a > a1 or b > b1 or c > c1) In about 95% of cases, the planner correctly selects the index t(a,b,c) and uses it. However, the 5% remaining cases usually come at the worst time, when large tables and 3 or 4 part keys are involved. In those cases sometimes the planner applies the filter to a, but not b or c with a large performance hit. Manipulating statistics on the table does not seem to help. Interestingly, it is possible to rewrite the above query by switching and with or and >= with >. However when written that way, the planner almost never gets it right. My problem is deceptively simple: how you read the next record from a table based on a given set of values? In practice, this is difficult to implement. If anybody can suggest a alternative/better way to this, I'm all ears. Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > So, for a table t with a three part key over columns a,b,c, the query to > read the next value from t for given values a1, b1, c1 is > select * from t where > a >= a1 and > (a > a1 or b >= b1) and > (a > a1 or b > b1 or c > c1) > In about 95% of cases, the planner correctly selects the index t(a,b,c) > and uses it. I'm surprised it's that good. Why not do select * from t where a >= a1 and b >= b1 and c >= c1 order by a,b,c limit 1 offset 1; which has a much more obvious translation to an indexscan. regards, tom lane
Greg Stark wrote: > > > do it for multi-column keys. It seems it would be nice if some syntax > > > similar to (a,b,c) > (a1,b1,c1) worked for this. > Hum. It would seem my intuition matches the SQL92 spec and Postgres gets > this > wrong. [...] > Even if Postgres did this right I'm not sure that would solve your index > woes. > I imagine the first thing Postgres would do is rewrite it into regular > scalar > expressions. Ideally the optimizer should be capable of then deducing from > the > scalar expressions that an index scan would be useful. Wow. For once, the standard is my friend. Well, what has to be done? :) Does pg do it the way it does for a reason? From the outside it seems like the planner would have an easier job if it can make a field by field comparison. Would a patch introducing the correct behavior (per the standard) be accepted? It seems pretty complicated (not to mention the planner issues). Merlin
> Interestingly, it is possible to rewrite the above query by switching > and with or and >= with >. However when written that way, the planner > almost never gets it right. Well, note it's still not really getting it right even in your case. It's doing an index scan on a>=a1 but if you have lots of values in your table where a=a1 and b<b1 then it's going to unnecessarily read through all of those. One thing that can help is to add ORDER BY a,b,c LIMIT 1 to your query. That will virtually guarantee that it uses an index scan, which will at least avoid making it scan all the records *after* finding the match. However it still doesn't seem to make Postgres use an Index Cond to allow it to do an instant lookup. I expected WHERE (a,b,c) > (a1,b1,c1) to work however it doesn't. It appears to mean a>a1 AND b>b1 AND c>c1 which isn't at all what you want. I imagine the standard dictates this meaning. > My problem is deceptively simple: how you read the next record from a > table based on a given set of values? In practice, this is difficult to > implement. If anybody can suggest a alternative/better way to this, I'm > all ears. I've done this a million times for simple integer keys, but I've never had to do it for multi-column keys. It seems it would be nice if some syntax similar to (a,b,c) > (a1,b1,c1) worked for this. -- greg
Hi, Merlin, On Tue, 27 Jul 2004 09:07:02 -0400 "Merlin Moncure" <merlin.moncure@rcsonline.com> wrote: > So, for a table t with a three part key over columns a,b,c, the query > to read the next value from t for given values a1, b1, c1 is > > select * from t where > a >= a1 and > (a > a1 or b >= b1) and > (a > a1 or b > b1 or c > c1) You mut not rely on such trickery to get any ordering, as the SQL data model contains no ordering, and a query optimizer is free to deliver you the tuples in any order it feels like. Why don't you add a 'ORDER BY a,b,c ASC' to your query? > Interestingly, it is possible to rewrite the above query by switching > and with or and >= with >. However when written that way, the planner > almost never gets it right. That's the reason why you cannot rely on any implicit ordering, the planner is free to rewrite a query as it likes as long as it delivers the same tuples, but in any order it wants. > My problem is deceptively simple: how you read the next record from a > table based on a given set of values? In practice, this is difficult > to implement. If anybody can suggest a alternative/better way to > this, I'm all ears. So you really want something like 'SELECT * FROM t WHERE a>=a1 AND b>=b1 AND c>=c1 ORDER BY a,b,c ASC LIMIT 1' HTH, Markus -- markus schaber | dipl. informatiker logi-track ag | rennweg 14-16 | ch 8001 zürich phone +41-43-888 62 52 | fax +41-43-888 62 53 mailto:schabios@logi-track.com | www.logi-track.com
You only want one record to be returned? Tack a LIMIT 1 onto the end of the query. > My problem is deceptively simple: how you read the next record from a > table based on a given set of values? In practice, this is difficult to > implement. If anybody can suggest a alternative/better way to this, I'm > all ears.
Hi, Merlin, On Tue, 27 Jul 2004 16:13:25 +0200, I myself wrote: > You mut not Should be "must", not "mut" :-) > > My problem is deceptively simple: how you read the next record from > > a table based on a given set of values? In practice, this is > > difficult to implement. If anybody can suggest a alternative/better > > way to this, I'm all ears. > > So you really want something like > > 'SELECT * FROM t WHERE a>=a1 AND b>=b1 AND c>=c1 ORDER BY a,b,c ASC > LIMIT 1' Sorry, as you want the _next_, and I assume that a1, b1 and c1 are the current row's values, you should rather use something like: 'SELECT * FROM t WHERE a>=a1 AND b>=b1 AND c>=c1 ORDER BY a,b,c ASC LIMIT 1 OFFSET 1' HTH, Markus -- markus schaber | dipl. informatiker logi-track ag | rennweg 14-16 | ch 8001 zürich phone +41-43-888 62 52 | fax +41-43-888 62 53 mailto:schabios@logi-track.com | www.logi-track.com
On Tue, 27 Jul 2004, Merlin Moncure wrote: > Greg Stark wrote: > > > > do it for multi-column keys. It seems it would be nice if some > syntax > > > > similar to (a,b,c) > (a1,b1,c1) worked for this. > > Hum. It would seem my intuition matches the SQL92 spec and Postgres > gets > > this > > wrong. > [...] > > Even if Postgres did this right I'm not sure that would solve your > index > > woes. > > I imagine the first thing Postgres would do is rewrite it into regular > > scalar > > expressions. Ideally the optimizer should be capable of then deducing > from > > the > > scalar expressions that an index scan would be useful. > > Wow. For once, the standard is my friend. Well, what has to be done? > :) Does pg do it the way it does for a reason? From the outside it > seems like the planner would have an easier job if it can make a field > by field comparison. > > Would a patch introducing the correct behavior (per the standard) be > accepted? It seems pretty complicated (not to mention the planner > issues). Given the comment on make_row_op, /* * XXX it's really wrong to generate a simple AND combination for < <= * > >=. We probably need to invent a new runtime node type to handle * those correctly. For the moment, though, keep on doing this ... */ I'd expect it'd be accepted.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > Given the comment on make_row_op, > /* > * XXX it's really wrong to generate a simple AND combination for < <= > * > >=. We probably need to invent a new runtime node type to handle > * those correctly. For the moment, though, keep on doing this ... > */ > I'd expect it'd be accepted. Hm, this code is new. As of version 1.169 2004/04/18 it only accepted "=" and "<>" operators: /* Combining operators other than =/<> is dubious... */ if (row_length != 1 && strcmp(opname, "=") != 0 && strcmp(opname, "<>") != 0) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("row comparison cannot use operator %s", opname))); I think perhaps it's a bad idea to be introducing support for standard syntax until we can support the correct semantics. It will only mislead people and create backwards-compatibility headaches when we fix it to work properly. Removing <,<=,>,>= would be trivial. Patch (untested): --- parse_expr.c.~1.174.~ 2004-07-28 01:01:12.000000000 -0400 +++ parse_expr.c 2004-07-28 01:52:29.000000000 -0400 @@ -1695,11 +1695,7 @@ */ oprname = strVal(llast(opname)); - if ((strcmp(oprname, "=") == 0) || - (strcmp(oprname, "<") == 0) || - (strcmp(oprname, "<=") == 0) || - (strcmp(oprname, ">") == 0) || - (strcmp(oprname, ">=") == 0)) + if (strcmp(oprname, "=") == 0) { boolop = AND_EXPR; } Fixing it to write out complex boolean expressions wouldn't be too hard, but I'm not clear it would be worth it, since I suspect the end result would be as the comment indicates, to introduce a new runtime node. -- greg
Greg Stark <gsstark@MIT.EDU> writes: > Fixing it to write out complex boolean expressions wouldn't be too hard, but > I'm not clear it would be worth it, since I suspect the end result would be as > the comment indicates, to introduce a new runtime node. Just to prove that it isn't really all that hard, I took a stab at doing this. This is basically only my second attempt to write even trivial bits of server backend code so I certainly don't suggest anyone try using this code. In fact it doesn't quite compile, because I have a bit of confusion between the char *oprname and List *opname variables. Now I could clear that up, but I'm missing one piece of the puzzle. To make it work I do need a way to construct a List *opname from ">" or "=" and I don't know how to do that. I think that's all I'm missing, but perhaps in the morning I'll look at this code and wonder "what was I thinking?!" This approach won't get the optimizer to actually use an index for these comparisons, but it will fix the semantics to match the spec. Later we can either improve the optimizer to detect expressions like this (which I think would be cooler since some users may write them by hand and not use the row-expression approach, but I don't see how to do it), or introduce a new run-time node and have the optimizer handle it. But at least we won't have to worry about backwards-compatibility issues with the semantics changing. Oh, I tried to stick to the style, but sometimes I couldn't help myself. I suppose I would have to fix up the style the rest of the way if I got it working and you wanted a patch to apply. /* * Transform a "row op row" construct */ static Node * make_row_op(ParseState *pstate, List *opname, Node *ltree, Node *rtree) { Node *result = NULL; RowExpr *lrow, *rrow; List *largs, *rargs; char *oprname; /* Inputs are untransformed RowExprs */ lrow = (RowExpr *) transformExpr(pstate, ltree); rrow = (RowExpr *) transformExpr(pstate, rtree); Assert(IsA(lrow, RowExpr)); Assert(IsA(rrow, RowExpr)); largs = lrow->args; rargs = rrow->args; if (list_length(largs) != list_length(rargs)) ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), errmsg("unequal number of entries in row expression"))); oprname = strVal(llast(opname)); if (strcmp(oprname, "=") == 0) { result = make_row_op_simple(pstate, "=", largs, rargs); } else if (strcmp(oprname, "<>") == 0) { result = make_row_op_simple(pstate, "<>", largs, rargs); } else if ((strcmp(oprname, "<") == 0) || (strcmp(oprname, ">") == 0)) { result = make_row_op_complex(pstate, oprname, largs, rargs); } /* alternatively these last two could just create negated < and > * expressions. Which is better depends on whether the extra clause * confuses the optimizer more or less than having to push the NOTs down */ else if (strcmp(oprname, ">=") == 0) { Node *branch = make_row_op_simple(pstate, "=", largs, rargs); result = make_row_op_complex(pstate, ">", largs, rargs); result = (Node *) makeBoolExpr(OR_EXPR, list_make2(result, branch)); } else if (strcmp(oprname, "<=") == 0) { Node *branch = make_row_op_simple(pstate, "=", largs, rargs); result = make_row_op_complex(pstate, "<", largs, rargs); result = (Node *) makeBoolExpr(OR_EXPR, list_make2(result, branch)); } else { ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("operator %s is not supported for row expressions", oprname))); } return result; } /* * Handle something like * (A,B,C) = (X,Y,Z) * By constructing * (A=X) AND (B=Y) AND (C=Z) * */ static Node * make_row_op_simple(ParseState *pstate, char *oprname, List *largs, List *rargs) { ListCell *l, *r; BoolExprType boolop; Node *result; boolop = strcmp(oprname, "<>")==0 ? OR_EXPR : AND_EXPR; forboth(l, largs, r, rargs) { Node *larg = (Node *) lfirst(l); Node *rarg = (Node *) lfirst(r); Node *cmp; cmp = (Node *) make_op(pstate, opname, larg, rarg); cmp = coerce_to_boolean(pstate, cmp, "row comparison"); if (result == NULL) result = cmp; else result = (Node *) makeBoolExpr(boolop, list_make2(result, cmp)); } if (result == NULL) { /* zero-length rows? Generate constant TRUE or FALSE */ if (boolop == AND_EXPR) result = makeBoolConst(true, false); else result = makeBoolConst(false, false); } return result; } /* * Handles something like: * (A,B,C) > (X,Y,Z) * * By constructing something like: * ( ( A > X) OR (A=X AND B>Y) OR (A=X AND B=Y AND C>Z) ) * */ static Node * make_row_op_complex(ParseState *pstate, char *oprname, List *largs, List *rargs) { ListCell *l, *outer_l, *r, *outer_r; Node *result; forboth(outer_l, largs, outer_r, rargs) { Node *outer_larg = (Node *) lfirst(outer_l); Node *outer_rarg = (Node *) lfirst(outer_r); Node *branch = NULL; Node *cmp; /* all leading elements have to be equal */ forboth(l, largs, r, rargs) { Node *larg = (Node *) lfirst(l); Node *rarg = (Node *) lfirst(r); Node *cmp; if (larg == outer_larg) { break; } cmp = (Node *) make_op(pstate, "=", larg, rarg); cmp = coerce_to_boolean(pstate, cmp, "row comparison"); if (branch == NULL) branch = cmp; else branch = (Node *) makeBoolExpr(AND_EXPR, list_make2(branch, cmp)); } /* trailing element has to be strictly greater or less than */ cmp = (Node *) make_op(pstate, oprname, outer_larg, outer_rarg); cmp = coerce_to_boolean(pstate, cmp, "row comparison"); branch = branch==NULL ? cmp : (Node *) makeBoolExpr(AND_EXPR, list_make2(branch, cmp)); result = result==NULL ? branch : (Node *) makeBoolExpr(OR_EXPR, list_make2(result, branch)); } if (result == NULL) { /* zero-length rows? Generate constant FALSE */ result = makeBoolConst(true, false); } return result; } -- greg
Greg Stark <gsstark@mit.edu> writes: > Removing <,<=,>,>= would be trivial. ... and not backwards-compatible. If we did that then cases involving unlabeled row expressions would no longer work as they did in prior releases. For example select (1,2,3) < (4,5,6); is accepted by all releases back to 7.0, and probably much further (but 7.0 is the oldest I have handy to test). The only reason the code in parse_expr.c appears new is that the functionality used to be in gram.y. I'd like to see this fixed to comply with the spec, but given the lack of complaints about the existing behavior over so many years, ripping it out meanwhile doesn't seem appropriate. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > The only reason the code in parse_expr.c appears new is that the > functionality used to be in gram.y. Ah, that was what I was missing. Though it's odd since it seems there was code in parse_expr.c to handle the "=" case specially. > I'd like to see this fixed to comply with the spec, but given the lack > of complaints about the existing behavior over so many years, ripping > it out meanwhile doesn't seem appropriate. I tried my hand at this last night and think I did an ok first pass. But I'm missing one piece of the puzzle to get it to compile. What do I need to know to be able to construct a List* suitable for passing as the second arg to make_op() knowing only that I want to create a List* to represent "=" or "<" or so on? I also had another question I didn't ask in the other email. In the midst of a forboth() loop, how would I tell if I'm at the last element of the lists? Would lnext(l)==NULL do it? -- greg
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> The only reason the code in parse_expr.c appears new is that the >> functionality used to be in gram.y. > Ah, that was what I was missing. Though it's odd since it seems there was code > in parse_expr.c to handle the "=" case specially. IIRC, the case involving a subselect, eg ... WHERE (1,2) = ANY (SELECT a, b FROM foo) ... has always been handled in parse_expr.c, but cases involving simple rows were previously expanded in gram.y. One of the reasons I moved the logic over to parse_expr.c was the thought that it would be easier to do it right in parse_expr.c --- gram.y would not be allowed to look up related operators, which seems necessary to handle the construct per spec. > I tried my hand at this last night and think I did an ok first pass. The main issue in my mind is whether to invent a separate node type for row comparisons. This is probably a good idea for a number of reasons, the most obvious being that there's no way to avoid multiple evaluations of the subexpressions if you try to expand it into simple comparisons. Also it seems likely that the planner would find it easier to recognize the relationship to a multicolumn index than if the thing is expanded. (But teaching the planner and the index mechanisms themselves about this is going to be a major project in any case.) One thing I did not like about your first pass is that it makes unsupportable assumptions about there being a semantic relationship between operators named, say, '<' and '<='. Postgres used to have such bogosity in a number of places but we've managed to get rid of most of it. (Offhand I think the only remaining hard-wired assumption about operators of particular names having particular semantics is that the foreign key mechanisms assume '=' must be the right operator to compare keys with. Eventually we need to get rid of that too.) IMHO the right way to do this is to look up a suitable btree operator class and use the appropriate member operators of that class. (In a separate-node-type implementation, we'd probably ignore the operators as such altogether, and just call the btree comparison function of the opclass.) It's not entirely clear to me how to select the opclass when the initially given inputs are of different types, though. In the present code we leave it to oper() to do the right thing, including possibly coercing the inputs to matching types. Possibly we should still apply oper(), but then insist that the selected operator appear as a btree opclass member, comparable to the way we handle sort operators now. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > One thing I did not like about your first pass is that it makes > unsupportable assumptions about there being a semantic relationship > between operators named, say, '<' and '<='. Hm, I think I even had caught that issue on the mailing list previously. In that case though, it seems even the existing code is insufficient. Instead of testing whether the operator with strcmp against "=" and "<>" it should perhaps be looking for an operator class and the strategy number for the operator and its negator. -- greg
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> One thing I did not like about your first pass is that it makes >> unsupportable assumptions about there being a semantic relationship >> between operators named, say, '<' and '<='. > In that case though, it seems even the existing code is insufficient. Well, yeah, we know the existing code is broken ;-) > Instead of testing whether the operator with strcmp against "=" and > "<>" it should perhaps be looking for an operator class and the > strategy number for the operator and its negator. Probably. You can find some relevant code in indxpath.c in the stuff that tries to determine whether partial indexes are relevant. I think that the ideal behavior is that we not look directly at the operator name at all. For example it's not too unreasonable to want to write (a,b) ~<~ (c,d) if you have an index that uses those non-locale-aware operators. We should find the operator that matches the name and input arguments, and then try to make sense of the operator semantics by matching it to btree opclasses. Note that it's possible to find multiple matches, for example if someone has installed a "reverse sort" opclass. I think we would want to prefer a match in the datatype's default opclass, if there is one, but otherwise we can probably follow the lead of the existing code and assume that any match is equally good. regards, tom lane