Thread: best way to fetch next/prev record based on index

best way to fetch next/prev record based on index

From

"Merlin Moncure"

Date:

27 July 2004, 10:07:05

I am in a situation where I have to treat a table as logically ordered
based on an index.  Right now I'm doing this via queries, and a I need a
better way to do it.  Cursors do not meet my requirements, because they
are always insensitive.  Also, my performance requirements are
extreme...I need 100% index usage.

Currently, I use queries to do this.  Unfortunately, the queries can get
kind of complex because many if the indexes (keys, really) are over 3 or
more columns in a table.

So, for a table t with a three part key over columns a,b,c, the query to
read the next value from t for given values a1, b1, c1 is

select * from t where
    a >= a1 and
     (a >  a1 or b >= b1) and
     (a >  a1 or b > b1 or c > c1)

In about 95% of cases, the planner correctly selects the index t(a,b,c)
and uses it.  However, the 5% remaining cases usually come at the worst
time, when large tables and 3 or 4 part keys are involved.  In those
cases sometimes the planner applies the filter to a, but not b or c with
a large performance hit.  Manipulating statistics on the table does not
seem to help.

Interestingly, it is possible to rewrite the above query by switching
and with or and >= with >.  However when written that way, the planner
almost never gets it right.

My problem is deceptively simple: how you read the next record from a
table based on a given set of values?  In practice, this is difficult to
implement.  If anybody can suggest a alternative/better way to this, I'm
all ears.

Merlin

Re: best way to fetch next/prev record based on index

From

Tom Lane

Date:

28 July 2004, 00:22:27

"Merlin Moncure" <merlin.moncure@rcsonline.com> writes:
> So, for a table t with a three part key over columns a,b,c, the query to
> read the next value from t for given values a1, b1, c1 is

> select * from t where
>     a >= a1 and
>      (a >  a1 or b >= b1) and
>      (a >  a1 or b > b1 or c > c1)

> In about 95% of cases, the planner correctly selects the index t(a,b,c)
> and uses it.

I'm surprised it's that good.  Why not do

    select * from t where a >= a1 and b >= b1 and c >= c1
    order by a,b,c
    limit 1 offset 1;

which has a much more obvious translation to an indexscan.

            regards, tom lane

Re: best way to fetch next/prev record based on index

From

"Merlin Moncure"

Date:

28 July 2004, 00:26:23

Greg Stark wrote:
> > > do it for multi-column keys. It seems it would be nice if some
syntax
> > > similar to (a,b,c) > (a1,b1,c1) worked for this.
> Hum. It would seem my intuition matches the SQL92 spec and Postgres
gets
> this
> wrong.
[...]
> Even if Postgres did this right I'm not sure that would solve your
index
> woes.
> I imagine the first thing Postgres would do is rewrite it into regular
> scalar
> expressions. Ideally the optimizer should be capable of then deducing
from
> the
> scalar expressions that an index scan would be useful.

Wow.  For once, the standard is my friend.  Well, what has to be done?
:)  Does pg do it the way it does for a reason?  From the outside it
seems like the planner would have an easier job if it can make a field
by field comparison.

Would a patch introducing the correct behavior (per the standard) be
accepted?  It seems pretty complicated (not to mention the planner
issues).

Merlin

Re: best way to fetch next/prev record based on index

From

Greg Stark

Date:

28 July 2004, 00:30:50


> Interestingly, it is possible to rewrite the above query by switching
> and with or and >= with >.  However when written that way, the planner
> almost never gets it right.

Well, note it's still not really getting it right even in your case. It's
doing an index scan on a>=a1 but if you have lots of values in your table
where a=a1 and b<b1 then it's going to unnecessarily read through all of
those.


One thing that can help is to add ORDER BY a,b,c LIMIT 1 to your query. That
will virtually guarantee that it uses an index scan, which will at least avoid
making it scan all the records *after* finding the match. However it still
doesn't seem to make Postgres use an Index Cond to allow it to do an instant
lookup.

I expected WHERE (a,b,c) > (a1,b1,c1) to work however it doesn't. It appears
to mean a>a1 AND b>b1 AND c>c1 which isn't at all what you want. I imagine the
standard dictates this meaning.

> My problem is deceptively simple: how you read the next record from a
> table based on a given set of values?  In practice, this is difficult to
> implement.  If anybody can suggest a alternative/better way to this, I'm
> all ears.

I've done this a million times for simple integer keys, but I've never had to
do it for multi-column keys. It seems it would be nice if some syntax similar
to (a,b,c) > (a1,b1,c1) worked for this.

--
greg

Re: best way to fetch next/prev record based on index

From

Markus Schaber

Date:

28 July 2004, 00:31:16

Hi, Merlin,

On Tue, 27 Jul 2004 09:07:02 -0400
"Merlin Moncure" <merlin.moncure@rcsonline.com> wrote:

> So, for a table t with a three part key over columns a,b,c, the query
> to read the next value from t for given values a1, b1, c1 is
>
> select * from t where
>     a >= a1 and
>      (a >  a1 or b >= b1) and
>      (a >  a1 or b > b1 or c > c1)

You mut not rely on such trickery to get any ordering, as the SQL data
model contains no ordering, and a query optimizer is free to deliver you
the tuples in any order it feels like.

Why don't you add a 'ORDER BY a,b,c ASC' to your query?

> Interestingly, it is possible to rewrite the above query by switching
> and with or and >= with >.  However when written that way, the planner
> almost never gets it right.

That's the reason why you cannot rely on any implicit ordering, the
planner is free to rewrite a query as it likes as long as it delivers
the same tuples, but in any order it wants.

> My problem is deceptively simple: how you read the next record from a
> table based on a given set of values?  In practice, this is difficult
> to implement.  If anybody can suggest a alternative/better way to
> this, I'm all ears.

So you really want something like

'SELECT * FROM t WHERE a>=a1 AND b>=b1 AND c>=c1 ORDER BY a,b,c ASC LIMIT 1'


HTH,
Markus
--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: best way to fetch next/prev record based on index

From

Rod Taylor

Date:

28 July 2004, 00:34:13

You only want one record to be returned? Tack a LIMIT 1 onto the end of
the query.

> My problem is deceptively simple: how you read the next record from a
> table based on a given set of values?  In practice, this is difficult to
> implement.  If anybody can suggest a alternative/better way to this, I'm
> all ears.

Correction of best way to fetch next/prev record based on index

From

Markus Schaber

Date:

28 July 2004, 00:36:08

Hi, Merlin,

On Tue, 27 Jul 2004 16:13:25 +0200, I myself wrote:


> You mut not

Should be "must", not "mut" :-)

> > My problem is deceptively simple: how you read the next record from
> > a table based on a given set of values?  In practice, this is
> > difficult to implement.  If anybody can suggest a alternative/better
> > way to this, I'm all ears.
>
> So you really want something like
>
> 'SELECT * FROM t WHERE a>=a1 AND b>=b1 AND c>=c1 ORDER BY a,b,c ASC
> LIMIT 1'

Sorry, as you want the _next_, and I assume that a1, b1 and c1 are the
current row's values, you should rather use something like:

'SELECT * FROM t WHERE a>=a1 AND b>=b1 AND c>=c1 ORDER BY a,b,c ASC
LIMIT 1 OFFSET 1'

HTH,
Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: best way to fetch next/prev record based on index

From

Stephan Szabo

Date:

28 July 2004, 01:01:45

On Tue, 27 Jul 2004, Merlin Moncure wrote:

> Greg Stark wrote:
> > > > do it for multi-column keys. It seems it would be nice if some
> syntax
> > > > similar to (a,b,c) > (a1,b1,c1) worked for this.
> > Hum. It would seem my intuition matches the SQL92 spec and Postgres
> gets
> > this
> > wrong.
> [...]
> > Even if Postgres did this right I'm not sure that would solve your
> index
> > woes.
> > I imagine the first thing Postgres would do is rewrite it into regular
> > scalar
> > expressions. Ideally the optimizer should be capable of then deducing
> from
> > the
> > scalar expressions that an index scan would be useful.
>
> Wow.  For once, the standard is my friend.  Well, what has to be done?
> :)  Does pg do it the way it does for a reason?  From the outside it
> seems like the planner would have an easier job if it can make a field
> by field comparison.
>
> Would a patch introducing the correct behavior (per the standard) be
> accepted?  It seems pretty complicated (not to mention the planner
> issues).

Given the comment on make_row_op,
  /*
   * XXX it's really wrong to generate a simple AND combination for < <=
   * > >=.  We probably need to invent a new runtime node type to handle
   * those correctly.  For the moment, though, keep on doing this ...
   */
I'd expect it'd be accepted.

Re: best way to fetch next/prev record based on index

From

Greg Stark

Date:

28 July 2004, 02:53:55

Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

> Given the comment on make_row_op,
>   /*
>    * XXX it's really wrong to generate a simple AND combination for < <=
>    * > >=.  We probably need to invent a new runtime node type to handle
>    * those correctly.  For the moment, though, keep on doing this ...
>    */
> I'd expect it'd be accepted.

Hm, this code is new. As of version 1.169 2004/04/18 it only accepted "=" and
"<>" operators:

    /* Combining operators other than =/<> is dubious... */
    if (row_length != 1 &&
        strcmp(opname, "=") != 0 &&
        strcmp(opname, "<>") != 0)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
          errmsg("row comparison cannot use operator %s",
                 opname)));

I think perhaps it's a bad idea to be introducing support for standard syntax
until we can support the correct semantics. It will only mislead people and
create backwards-compatibility headaches when we fix it to work properly.

Removing <,<=,>,>= would be trivial. Patch (untested):

--- parse_expr.c.~1.174.~    2004-07-28 01:01:12.000000000 -0400
+++ parse_expr.c    2004-07-28 01:52:29.000000000 -0400
@@ -1695,11 +1695,7 @@
      */
     oprname = strVal(llast(opname));

-    if ((strcmp(oprname, "=") == 0) ||
-        (strcmp(oprname, "<") == 0) ||
-        (strcmp(oprname, "<=") == 0) ||
-        (strcmp(oprname, ">") == 0) ||
-        (strcmp(oprname, ">=") == 0))
+    if (strcmp(oprname, "=") == 0)
     {
         boolop = AND_EXPR;
     }

Fixing it to write out complex boolean expressions wouldn't be too hard, but
I'm not clear it would be worth it, since I suspect the end result would be as
the comment indicates, to introduce a new runtime node.

--
greg

Re: best way to fetch next/prev record based on index

From

Greg Stark

Date:

28 July 2004, 04:15:02

Greg Stark <gsstark@MIT.EDU> writes:

> Fixing it to write out complex boolean expressions wouldn't be too hard, but
> I'm not clear it would be worth it, since I suspect the end result would be as
> the comment indicates, to introduce a new runtime node.

Just to prove that it isn't really all that hard, I took a stab at doing this.
This is basically only my second attempt to write even trivial bits of server
backend code so I certainly don't suggest anyone try using this code.

In fact it doesn't quite compile, because I have a bit of confusion between
the char *oprname and List *opname variables. Now I could clear that up, but
I'm missing one piece of the puzzle. To make it work I do need a way to
construct a List *opname from ">" or "=" and I don't know how to do that.

I think that's all I'm missing, but perhaps in the morning I'll look at this
code and wonder "what was I thinking?!"


This approach won't get the optimizer to actually use an index for these
comparisons, but it will fix the semantics to match the spec. Later we can
either improve the optimizer to detect expressions like this (which I think
would be cooler since some users may write them by hand and not use the
row-expression approach, but I don't see how to do it), or introduce a new
run-time node and have the optimizer handle it. But at least we won't have to
worry about backwards-compatibility issues with the semantics changing.

Oh, I tried to stick to the style, but sometimes I couldn't help myself. I
suppose I would have to fix up the style the rest of the way if I got it
working and you wanted a patch to apply.


/*
 * Transform a "row op row" construct
 */
static Node *
make_row_op(ParseState *pstate, List *opname, Node *ltree, Node *rtree)
{
    Node       *result = NULL;
    RowExpr    *lrow,
               *rrow;
    List       *largs,
               *rargs;
    char       *oprname;

    /* Inputs are untransformed RowExprs */
    lrow = (RowExpr *) transformExpr(pstate, ltree);
    rrow = (RowExpr *) transformExpr(pstate, rtree);
    Assert(IsA(lrow, RowExpr));
    Assert(IsA(rrow, RowExpr));
    largs = lrow->args;
    rargs = rrow->args;

    if (list_length(largs) != list_length(rargs))
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("unequal number of entries in row expression")));

    oprname = strVal(llast(opname));

    if (strcmp(oprname, "=") == 0)
    {
        result = make_row_op_simple(pstate, "=", largs, rargs);
    }

    else if (strcmp(oprname, "<>") == 0)
    {
        result = make_row_op_simple(pstate, "<>", largs, rargs);
    }

    else if ((strcmp(oprname, "<") == 0) ||
             (strcmp(oprname, ">") == 0))
    {
        result = make_row_op_complex(pstate, oprname, largs, rargs);
    }

    /* alternatively these last two could just create negated < and >
     * expressions. Which is better depends on whether the extra clause
     * confuses the optimizer more or less than having to push the NOTs down
     */

    else if (strcmp(oprname, ">=") == 0)
    {
        Node *branch = make_row_op_simple(pstate, "=", largs, rargs);
        result = make_row_op_complex(pstate, ">", largs, rargs);
        result = (Node *) makeBoolExpr(OR_EXPR, list_make2(result, branch));
    }

    else if (strcmp(oprname, "<=") == 0)
    {
        Node *branch = make_row_op_simple(pstate, "=", largs, rargs);
        result = make_row_op_complex(pstate, "<", largs, rargs);
        result = (Node *) makeBoolExpr(OR_EXPR, list_make2(result, branch));
    }


    else
    {
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("operator %s is not supported for row expressions",
                        oprname)));
    }

    return result;
}

/*
 * Handle something like
 * (A,B,C) = (X,Y,Z)
 * By constructing
 * (A=X) AND (B=Y) AND (C=Z)
 *
 */

static Node *
make_row_op_simple(ParseState *pstate, char *oprname,
                   List *largs, List *rargs)
{
    ListCell *l, *r;
    BoolExprType boolop;
    Node *result;

    boolop = strcmp(oprname, "<>")==0 ? OR_EXPR : AND_EXPR;

    forboth(l, largs, r, rargs)
    {
        Node    *larg = (Node *) lfirst(l);
        Node    *rarg = (Node *) lfirst(r);
        Node    *cmp;

        cmp = (Node *) make_op(pstate, opname, larg, rarg);
        cmp = coerce_to_boolean(pstate, cmp, "row comparison");
        if (result == NULL)
            result = cmp;
        else
            result = (Node *) makeBoolExpr(boolop,
                                           list_make2(result, cmp));
    }

    if (result == NULL)
    {
        /* zero-length rows?  Generate constant TRUE or FALSE */
        if (boolop == AND_EXPR)
            result = makeBoolConst(true, false);
        else
            result = makeBoolConst(false, false);
    }

    return result;
}


/*
 * Handles something like:
 * (A,B,C) > (X,Y,Z)
 *
 * By constructing something like:
 * ( ( A > X) OR (A=X AND B>Y) OR (A=X AND B=Y AND C>Z) )
 *
 */

static Node *
make_row_op_complex(ParseState *pstate, char *oprname,
                    List *largs, List *rargs)
{
    ListCell   *l, *outer_l,
               *r, *outer_r;
    Node *result;

    forboth(outer_l, largs, outer_r, rargs)
    {
        Node *outer_larg = (Node *) lfirst(outer_l);
        Node *outer_rarg = (Node *) lfirst(outer_r);
        Node *branch = NULL;
        Node *cmp;

        /* all leading elements have to be equal */
        forboth(l, largs, r, rargs)
        {
            Node    *larg = (Node *) lfirst(l);
            Node    *rarg = (Node *) lfirst(r);
            Node    *cmp;

            if (larg == outer_larg) {
                break;
            }

            cmp = (Node *) make_op(pstate, "=", larg, rarg);
            cmp = coerce_to_boolean(pstate, cmp, "row comparison");
            if (branch == NULL)
                branch = cmp;
            else
                branch = (Node *) makeBoolExpr(AND_EXPR,
                                               list_make2(branch, cmp));
        }

        /* trailing element has to be strictly greater or less than */

        cmp = (Node *) make_op(pstate, oprname, outer_larg, outer_rarg);
        cmp = coerce_to_boolean(pstate, cmp, "row comparison");
        branch = branch==NULL ? cmp : (Node *) makeBoolExpr(AND_EXPR, list_make2(branch, cmp));

        result = result==NULL ? branch : (Node *) makeBoolExpr(OR_EXPR, list_make2(result, branch));
    }

    if (result == NULL)
    {
        /* zero-length rows?  Generate constant FALSE */
        result = makeBoolConst(true, false);
    }

    return result;
}

--
greg

Re: best way to fetch next/prev record based on index

From

Tom Lane

Date:

28 July 2004, 11:07:28

Greg Stark <gsstark@mit.edu> writes:
> Removing <,<=,>,>= would be trivial.

... and not backwards-compatible.  If we did that then cases involving
unlabeled row expressions would no longer work as they did in prior
releases.  For example

    select (1,2,3) < (4,5,6);

is accepted by all releases back to 7.0, and probably much further (but
7.0 is the oldest I have handy to test).  The only reason the code in
parse_expr.c appears new is that the functionality used to be in gram.y.

I'd like to see this fixed to comply with the spec, but given the lack
of complaints about the existing behavior over so many years, ripping
it out meanwhile doesn't seem appropriate.

            regards, tom lane

Re: best way to fetch next/prev record based on index

From

Greg Stark

Date:

28 July 2004, 13:01:54

Tom Lane <tgl@sss.pgh.pa.us> writes:

> The only reason the code in parse_expr.c appears new is that the
> functionality used to be in gram.y.

Ah, that was what I was missing. Though it's odd since it seems there was code
in parse_expr.c to handle the "=" case specially.

> I'd like to see this fixed to comply with the spec, but given the lack
> of complaints about the existing behavior over so many years, ripping
> it out meanwhile doesn't seem appropriate.

I tried my hand at this last night and think I did an ok first pass. But I'm
missing one piece of the puzzle to get it to compile.

What do I need to know to be able to construct a List* suitable for passing as
the second arg to make_op() knowing only that I want to create a List* to
represent "=" or "<" or so on?

I also had another question I didn't ask in the other email. In the midst of a
forboth() loop, how would I tell if I'm at the last element of the lists?
Would lnext(l)==NULL do it?

--
greg

Re: best way to fetch next/prev record based on index

From

Tom Lane

Date:

28 July 2004, 13:56:55

Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> The only reason the code in parse_expr.c appears new is that the
>> functionality used to be in gram.y.

> Ah, that was what I was missing. Though it's odd since it seems there was code
> in parse_expr.c to handle the "=" case specially.

IIRC, the case involving a subselect, eg
    ... WHERE (1,2) = ANY (SELECT a, b FROM foo) ...
has always been handled in parse_expr.c, but cases involving simple
rows were previously expanded in gram.y.  One of the reasons I moved
the logic over to parse_expr.c was the thought that it would be easier
to do it right in parse_expr.c --- gram.y would not be allowed to look
up related operators, which seems necessary to handle the construct
per spec.

> I tried my hand at this last night and think I did an ok first pass.

The main issue in my mind is whether to invent a separate node type for
row comparisons.  This is probably a good idea for a number of reasons,
the most obvious being that there's no way to avoid multiple evaluations
of the subexpressions if you try to expand it into simple comparisons.
Also it seems likely that the planner would find it easier to recognize
the relationship to a multicolumn index than if the thing is expanded.
(But teaching the planner and the index mechanisms themselves about this
is going to be a major project in any case.)

One thing I did not like about your first pass is that it makes
unsupportable assumptions about there being a semantic relationship
between operators named, say, '<' and '<='.  Postgres used to have such
bogosity in a number of places but we've managed to get rid of most of
it.  (Offhand I think the only remaining hard-wired assumption about
operators of particular names having particular semantics is that the
foreign key mechanisms assume '=' must be the right operator to compare
keys with.  Eventually we need to get rid of that too.)

IMHO the right way to do this is to look up a suitable btree operator
class and use the appropriate member operators of that class.  (In a
separate-node-type implementation, we'd probably ignore the operators
as such altogether, and just call the btree comparison function of the
opclass.)  It's not entirely clear to me how to select the opclass when
the initially given inputs are of different types, though.  In the
present code we leave it to oper() to do the right thing, including
possibly coercing the inputs to matching types.  Possibly we should
still apply oper(), but then insist that the selected operator appear
as a btree opclass member, comparable to the way we handle sort
operators now.

            regards, tom lane

Re: best way to fetch next/prev record based on index

From

Greg Stark

Date:

28 July 2004, 19:54:50

Tom Lane <tgl@sss.pgh.pa.us> writes:

> One thing I did not like about your first pass is that it makes
> unsupportable assumptions about there being a semantic relationship
> between operators named, say, '<' and '<='.

Hm, I think I even had caught that issue on the mailing list previously.

In that case though, it seems even the existing code is insufficient. Instead
of testing whether the operator with strcmp against "=" and "<>" it should
perhaps be looking for an operator class and the strategy number for the
operator and its negator.

--
greg

Re: best way to fetch next/prev record based on index

From

Tom Lane

Date:

28 July 2004, 20:12:40

Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> One thing I did not like about your first pass is that it makes
>> unsupportable assumptions about there being a semantic relationship
>> between operators named, say, '<' and '<='.

> In that case though, it seems even the existing code is insufficient.

Well, yeah, we know the existing code is broken ;-)

> Instead of testing whether the operator with strcmp against "=" and
> "<>" it should perhaps be looking for an operator class and the
> strategy number for the operator and its negator.

Probably.  You can find some relevant code in indxpath.c in the stuff
that tries to determine whether partial indexes are relevant.

I think that the ideal behavior is that we not look directly at the
operator name at all.  For example it's not too unreasonable to want
to write (a,b) ~<~ (c,d) if you have an index that uses those
non-locale-aware operators.  We should find the operator that matches
the name and input arguments, and then try to make sense of the operator
semantics by matching it to btree opclasses.

Note that it's possible to find multiple matches, for example if someone
has installed a "reverse sort" opclass.  I think we would want to prefer
a match in the datatype's default opclass, if there is one, but
otherwise we can probably follow the lead of the existing code and
assume that any match is equally good.

            regards, tom lane