Thread: join from array or cursor

join from array or cursor

From
John DeSoi
Date:
Suppose I have an integer array (or cursor with one integer column)
which represents primary keys of some table. Is there a simple and
efficient way to return the rows of the table corresponding to the
primary key values and keep them in the same order as the array (or
cursor)? Seems like it should be easy, but I'm not seeing it.

Thanks,


John DeSoi, Ph.D.





Re: join from array or cursor

From
Scott Bailey
Date:
John DeSoi wrote:
> Suppose I have an integer array (or cursor with one integer column)
> which represents primary keys of some table. Is there a simple and
> efficient way to return the rows of the table corresponding to the
> primary key values and keep them in the same order as the array (or
> cursor)? Seems like it should be easy, but I'm not seeing it.
>
> Thanks,
>
>
> John DeSoi, Ph.D.

Matching the rows is easy. Getting them in the same order as the items
in the array will require an ORDER BY. The contrib package _int has an
idx() that you can use for that.

SELECT *
FROM foo
WHERE foo.id = ANY(myPkArray)
ORDER BY idx(myPkArray, id)


Re: join from array or cursor

From
Sam Mason
Date:
On Thu, Aug 20, 2009 at 11:15:12PM -0400, John DeSoi wrote:
> Suppose I have an integer array (or cursor with one integer column)
> which represents primary keys of some table. Is there a simple and
> efficient way to return the rows of the table corresponding to the
> primary key values and keep them in the same order as the array (or
> cursor)?

You could do something like:

  SELECT f.id
  FROM foo f, (
    SELECT i, myPkArray[i] AS elem
    FROM generate_series(array_lower(myPkArray,1),array_upper(myPkArray,1)) i) x
  WHERE f.id = x.elem
  ORDER BY x.i;

It may help to wrap the generate_series call into a function so you
don't have to refer to "myPkArray" so many times.

--
  Sam  http://samason.me.uk/

Re: join from array or cursor

From
John DeSoi
Date:
On Aug 21, 2009, at 7:26 AM, Sam Mason wrote:

> It may help to wrap the generate_series call into a function so you
> don't have to refer to "myPkArray" so many times.

Yes, this is the best I have come up with so far. I have a set
returning function which returns the key and the index number. The
implementation with a cursor looks like this:

SELECT * FROM cursor_pk('c1') c LEFT JOIN foo ON (c.pk = foo.pk) ORDER
BY c.idx;


An array function could return the same values. I need to do some
testing to see what the performance looks likes for each method.
Thanks for the suggestion.



John DeSoi, Ph.D.





Re: join from array or cursor

From
Greg Stark
Date:
On Fri, Aug 21, 2009 at 1:16 PM, John DeSoi<desoi@pgedit.com> wrote:
>
> Yes, this is the best I have come up with so far. I have a set returning
> function which returns the key and the index number. The implementation with
> a cursor looks like this:
>
> SELECT * FROM cursor_pk('c1') c LEFT JOIN foo ON (c.pk = foo.pk) ORDER BY
> c.idx;

This might be a circumstance where it's really best to just force the
planner to use a particular plan. There's no way to tell it "the order
of the array" without forcing an entirely unnecessary sort.

Something like
SET LOCAL enable_mergejoin = 'off';
SET LOCAL enable_hashjoin = 'off';
SELECT * from unnest(arr) as v(id) join tab using (id)
RESET ALL;

--
greg
http://mit.edu/~gsstark/resume.pdf

Re: join from array or cursor

From
Greg Stark
Date:
On Fri, Aug 21, 2009 at 2:16 PM, Greg Stark<gsstark@mit.edu> wrote:
> On Fri, Aug 21, 2009 at 1:16 PM, John DeSoi<desoi@pgedit.com> wrote:
>>
>> Yes, this is the best I have come up with so far. I have a set returning
>> function which returns the key and the index number. The implementation with
>> a cursor looks like this:
>>
>> SELECT * FROM cursor_pk('c1') c LEFT JOIN foo ON (c.pk = foo.pk) ORDER BY
>> c.idx;
>
> This might be a circumstance where it's really best to just force the
> planner to use a particular plan. There's no way to tell it "the order
> of the array" without forcing an entirely unnecessary sort.

Of course immediately upon hitting send I did think of a way:

 SELECT (r).*
  FROM (SELECT (SELECT x FROM x WHERE a=id) AS r
          FROM unnest(array[1,2]) AS arr(id)
       ) AS subq;

--
greg
http://mit.edu/~gsstark/resume.pdf

Re: join from array or cursor

From
Merlin Moncure
Date:
On Fri, Aug 21, 2009 at 9:22 AM, Greg Stark<gsstark@mit.edu> wrote:
> Of course immediately upon hitting send I did think of a way:
>
>  SELECT (r).*
>  FROM (SELECT (SELECT x FROM x WHERE a=id) AS r
>          FROM unnest(array[1,2]) AS arr(id)
>       ) AS subq;

nice use of composite type in select-list subquery.   I use this
approach often...it's a 'wonder trick'.

merlin

Re: join from array or cursor

From
Sam Mason
Date:
On Fri, Aug 21, 2009 at 02:22:54PM +0100, Greg Stark wrote:
>  SELECT (r).*
>   FROM (SELECT (SELECT x FROM x WHERE a=id) AS r
>           FROM unnest(array[1,2]) AS arr(id)
>        ) AS subq;

Shouldn't that second inner SELECT unnecessary?  I'd be tempted to
write:

  SELECT ((SELECT x FROM x WHERE x.a = arr.id)).*
  FROM unnest(array[1,2]) AS arr(id)

but PG throws this out for some reason.  Adding more brackets doesn't
seem to help, the following seems related:

  SELECT ((SELECT (1,2))).*;

The current grammar seems to require two sets of brackets, one for the
sub-select and another for pulling the value out of the record.  Not
quite sure why PG calls it indirection, but I guess that's how it's
implemented.  I can seem to work around it by doing:

  CREATE FUNCTION id(anyelement)
      RETURNS anyelement
      LANGUAGE sql AS $$
    SELECT $1; $$;

  SELECT (id((1,2))).*;

But this seems nasty and bumps up against the annoying "record type has
not been registered" that I hit all to often.  More fiddling gets to:

  CREATE TYPE foo AS ( i int, j int );

  SELECT (id((SELECT (1,2)::foo))).*;

or am I missing something obvious?

--
  Sam  http://samason.me.uk/

Re: join from array or cursor

From
Merlin Moncure
Date:
On Fri, Aug 21, 2009 at 10:17 AM, Sam Mason<sam@samason.me.uk> wrote:
> On Fri, Aug 21, 2009 at 02:22:54PM +0100, Greg Stark wrote:
>>  SELECT (r).*
>>   FROM (SELECT (SELECT x FROM x WHERE a=id) AS r
>>           FROM unnest(array[1,2]) AS arr(id)
>>        ) AS subq;
>
> Shouldn't that second inner SELECT unnecessary?  I'd be tempted to
> write:
>
>  SELECT ((SELECT x FROM x WHERE x.a = arr.id)).*
>  FROM unnest(array[1,2]) AS arr(id)
>
> but PG throws this out for some reason.  Adding more brackets doesn't
> seem to help, the following seems related:
>
>  SELECT ((SELECT (1,2))).*;
>
> The current grammar seems to require two sets of brackets, one for the
> sub-select and another for pulling the value out of the record.  Not
> quite sure why PG calls it indirection, but I guess that's how it's
> implemented.  I can seem to work around it by doing:
>
>  CREATE FUNCTION id(anyelement)
>      RETURNS anyelement
>      LANGUAGE sql AS $$
>    SELECT $1; $$;
>
>  SELECT (id((1,2))).*;
>
> But this seems nasty and bumps up against the annoying "record type has
> not been registered" that I hit all to often.  More fiddling gets to:
>
>  CREATE TYPE foo AS ( i int, j int );
>
>  SELECT (id((SELECT (1,2)::foo))).*;
>
> or am I missing something obvious?

I think that what you are bumping in to is that there is no real
definition of '*' in the query.  've griped about this a few times. If
type 't' has fields a,b,

select (t).* is expanded to select (t).a, (t).b.

This can lead to some weird situations.  If you have an aggregate
function that returns t, for example:

select (agg()).*;  will run the aggregate function twice (this is a
_huge_ gotcha!).  I think that '*' needs to be promoted somehow so
that it isn't expanded during parsing but has special meaning.

merlin

Re: join from array or cursor

From
Sam Mason
Date:
On Fri, Aug 21, 2009 at 10:49:52AM -0400, Merlin Moncure wrote:
> On Fri, Aug 21, 2009 at 10:17 AM, Sam Mason<sam@samason.me.uk> wrote:
> >   CREATE TYPE foo AS ( i int, j int );
> >
> >   SELECT (id((SELECT (1,2)::foo))).*;
> >
> > or am I missing something obvious?
>
> I think that what you are bumping in to is that there is no real
> definition of '*' in the query.  've griped about this a few times. If
> type 't' has fields a,b,
>
> select (t).* is expanded to select (t).a, (t).b.

OK, but that's not because of there being no "real definition" of *.
It seems perfectly well defined in the example above as a tuple of two
integers.  The thing that causes it to go wrong here is that PG doesn't
know that doing this expansion (as you noted above) can cause the query
to become very expensive.  PG should instead arrange that the expression
"t" is run exactly once and reuse the single result for all columns.

--
  Sam  http://samason.me.uk/

Re: join from array or cursor

From
Tom Lane
Date:
Sam Mason <sam@samason.me.uk> writes:
> ... PG should instead arrange that the expression
> "t" is run exactly once and reuse the single result for all columns.

We might be able to do that based on the row-returning-subselect
infrastructure being discussed over here:
http://archives.postgresql.org/message-id/4087.1250867036@sss.pgh.pa.us
Don't hold your breath though ...

            regards, tom lane

Re: join from array or cursor

From
Sam Mason
Date:
On Fri, Aug 21, 2009 at 12:05:51PM -0400, Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
> > ... PG should instead arrange that the expression
> > "t" is run exactly once and reuse the single result for all columns.
>
> We might be able to do that based on the row-returning-subselect
> infrastructure being discussed over here:
> http://archives.postgresql.org/message-id/4087.1250867036@sss.pgh.pa.us

Huh, fun.  Not sure if I'm interpreting this correctly, but what would
the difference in semantics between:

  SELECT (SELECT 1,2);

and

  SELECT (SELECT (1,2));

PG seems to make the distinction somewhat blurry at the moment.  For
example, upthread I did:

  SELECT (id((SELECT (1,2)::foo))).*;

and got back two columns, and yet when I do what I think is equivalent:

  SELECT x.*
  FROM (SELECT (1,2)::foo) x;

I get back my tuple, and not the values inside my tuple.


Should I be posting this to -hackers?

--
  Sam  http://samason.me.uk/

Re: join from array or cursor

From
Merlin Moncure
Date:
On Fri, Aug 21, 2009 at 1:13 PM, Sam Mason<sam@samason.me.uk> wrote:
> On Fri, Aug 21, 2009 at 12:05:51PM -0400, Tom Lane wrote:
>> Sam Mason <sam@samason.me.uk> writes:
>> > ... PG should instead arrange that the expression
>> > "t" is run exactly once and reuse the single result for all columns.
>>
>> We might be able to do that based on the row-returning-subselect
>> infrastructure being discussed over here:
>> http://archives.postgresql.org/message-id/4087.1250867036@sss.pgh.pa.us
>
> Huh, fun.  Not sure if I'm interpreting this correctly, but what would
> the difference in semantics between:
>
>  SELECT (SELECT 1,2);
>
> and
>
>  SELECT (SELECT (1,2));


The first form is not allowed because subqueries used that way must
return only one column.  The second form works because the extra
parens constructs a row type which gets around that restriction.

> PG seems to make the distinction somewhat blurry at the moment.  For
> example, upthread I did:
>
>  SELECT (id((SELECT (1,2)::foo))).*;
>
> and got back two columns, and yet when I do what I think is equivalent:
>
>  SELECT x.*
>  FROM (SELECT (1,2)::foo) x;

That is _not_ equivalent.  You are asking for all the columns from the
table expression 's'...there is one column, a composite type with two
fields.  The equivalent would have been:

SELECT (x).foo.* ...


I think may have missed the point of what I was saying earlier.  In
any query, when you have an expression of (x).*, if x is a row type of
some kind, the expression is rewritten on the fly as:

(x),a, (x).b, ... for each field in x.  That's why (in say, a rule):

select (NEW).* is distinct from (OLD).*
is not future proofed against changes in the row type while:
select NEW is distinct from OLD

is.  in the first form the query is expanded on rule creation and will
not pick up extra columns if  they are added.  think of: .* as a macro
that is rewritten by the query parser.

select ((1,2,nextval('v_seq'))::foo).*; if it was allowed, in current
methods would probably expand into:

select ((1,2,nextval('v_seq'))::foo).a,
((1,2,nextval('v_seq'))::foo).b, ((1,2,nextval('v_seq'))::foo).c;

if foo had fields a,b,c.  Hilarity ensues :-).

What I was trying to say before is that maybe .* should not expand the
query that way, but should just give you the fields of that type and,
(very) controversially, do that in view and rule plans as well.

merlin

Re: join from array or cursor

From
Sam Mason
Date:
On Fri, Aug 21, 2009 at 01:58:34PM -0400, Merlin Moncure wrote:
> On Fri, Aug 21, 2009 at 1:13 PM, Sam Mason<sam@samason.me.uk> wrote:
> > On Fri, Aug 21, 2009 at 12:05:51PM -0400, Tom Lane wrote:
> >> We might be able to do that based on the row-returning-subselect
> >> infrastructure being discussed over here:
> >> http://archives.postgresql.org/message-id/4087.1250867036@sss.pgh.pa.us
> >
> > Not sure if I'm interpreting this correctly, but what would
> > the difference in semantics between:
> >
> >  SELECT (SELECT 1,2);
> >
> > and
> >
> >  SELECT (SELECT (1,2));
>
> The first form is not allowed because subqueries used that way must
> return only one column.  The second form works because the extra
> parens constructs a row type which gets around that restriction.

I was under the impression that the message Tom was pointing to was
about lifting this restriction.  I believe the use case was of being
able to do:

  UPDATE foo SET (a,b,c) = (SELECT 1,2,3);

if I'm reading things correctly.  I was expecting this to be generally
available where any sub-select was supported and hence I was wondering
what its semantics would be.  It seemed like an interesting corner case
and I'd not seen it discussed in the linked threads.

If I'm not reading things correctly then the rest is a mute point.

--
  Sam  http://samason.me.uk/

Re: join from array or cursor

From
John DeSoi
Date:
On Aug 21, 2009, at 9:22 AM, Greg Stark wrote:

> Of course immediately upon hitting send I did think of a way:
>
> SELECT (r).*
>  FROM (SELECT (SELECT x FROM x WHERE a=id) AS r
>          FROM unnest(array[1,2]) AS arr(id)
>       ) AS subq;

Thanks to all for the interesting insights and discussion. Where in
the docs can I learn about writing queries like that :).

While it avoids the sort of my method, it appears to be almost 5 times
slower (about 4000 keys in the cursor, Postgres 8.4.0):

EXPLAIN ANALYZE SELECT (r).*
  FROM (SELECT (SELECT "work" FROM "work" WHERE dbid=id) AS r
          FROM cursor_pk('c1') AS arr(id)
       ) AS subq;


Function Scan on cursor_pk arr  (cost=0.00..116011.72 rows=1000
width=4) (actual time=13.561..249.916 rows=4308 loops=1)
   SubPlan 1
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.003..0.003 rows=1 loops=4308)
           Index Cond: (dbid = $0)
   SubPlan 2
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.002..0.002 rows=1 loops=4308)
           Index Cond: (dbid = $0)
   SubPlan 3
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.002..0.002 rows=1 loops=4308)
           Index Cond: (dbid = $0)
   SubPlan 4
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.002..0.002 rows=1 loops=4308)
           Index Cond: (dbid = $0)
   SubPlan 5
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.002..0.002 rows=1 loops=4308)
           Index Cond: (dbid = $0)
   SubPlan 6
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.002..0.002 rows=1 loops=4308)
           Index Cond: (dbid = $0)
   SubPlan 7
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.002..0.002 rows=1 loops=4308)
           Index Cond: (dbid = $0)
   SubPlan 8
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.002..0.002 rows=1 loops=4308)
           Index Cond: (dbid = $0)
   SubPlan 9
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.002..0.002 rows=1 loops=4308)
           Index Cond: (dbid = $0)
   SubPlan 10
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.002..0.002 rows=1 loops=4308)
           Index Cond: (dbid = $0)
   SubPlan 11
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.002..0.002 rows=1 loops=4308)
           Index Cond: (dbid = $0)
   SubPlan 12
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.002..0.002 rows=1 loops=4308)
           Index Cond: (dbid = $0)
   SubPlan 13
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.002..0.002 rows=1 loops=4308)
           Index Cond: (dbid = $0)
   SubPlan 14
     ->  Index Scan using work_pkey on work  (cost=0.00..8.27 rows=1
width=32) (actual time=0.002..0.002 rows=1 loops=4308)
           Index Cond: (dbid = $0)
Total runtime: 250.739 ms



EXPLAIN ANALYZE SELECT * FROM cursor_pk('c1') c LEFT JOIN "work" ON
(c.pk = "work".dbid) order by c.idx;

Sort  (cost=771.23..773.73 rows=1000 width=375) (actual
time=36.058..38.392 rows=4308 loops=1)
   Sort Key: c.idx
   Sort Method:  external merge  Disk: 1656kB
   ->  Merge Right Join  (cost=309.83..721.40 rows=1000 width=375)
(actual time=15.447..22.293 rows=4308 loops=1)
         Merge Cond: (work.dbid = c.pk)
         ->  Index Scan using work_pkey on work  (cost=0.00..385.80
rows=4308 width=367) (actual time=0.020..2.078 rows=4308 loops=1)
         ->  Sort  (cost=309.83..312.33 rows=1000 width=8) (actual
time=15.420..15.946 rows=4308 loops=1)
               Sort Key: c.pk
               Sort Method:  quicksort  Memory: 297kB
               ->  Function Scan on cursor_pk_order c
(cost=0.00..260.00 rows=1000 width=8) (actual time=12.672..13.073
rows=4308 loops=1)
Total runtime: 51.886 ms


Thanks for any further suggestions.



John DeSoi, Ph.D.





Re: join from array or cursor

From
Greg Stark
Date:
On Sun, Aug 23, 2009 at 1:30 AM, John DeSoi<desoi@pgedit.com> wrote:
> While it avoids the sort of my method, it appears to be almost 5 times
> slower (about 4000 keys in the cursor, Postgres 8.4.0):
>
>
> Function Scan on cursor_pk arr  (cost=0.00..116011.72 rows=1000 width=4)
> (actual time=13.561..249.916 rows=4308 loops=1)
>  SubPlan 1
>  SubPlan 2
>  SubPlan 3
>  ...

Ugh, I guess using a subquery didn't work around the problem of the
(r).* getting expanded into multiple columns. This is starting to be a
more annoying limitation than I realized.

This also means when we do things like

select (x).* from (select bt_page_items(...))

or

select (h).* from (select  heap_page_items(...))

It's actually calling bt_page_items() repeatedly, once for every
column in the output record?  Bleagh.

--
greg
http://mit.edu/~gsstark/resume.pdf