Thread: join from array or cursor
Suppose I have an integer array (or cursor with one integer column) which represents primary keys of some table. Is there a simple and efficient way to return the rows of the table corresponding to the primary key values and keep them in the same order as the array (or cursor)? Seems like it should be easy, but I'm not seeing it. Thanks, John DeSoi, Ph.D.
John DeSoi wrote: > Suppose I have an integer array (or cursor with one integer column) > which represents primary keys of some table. Is there a simple and > efficient way to return the rows of the table corresponding to the > primary key values and keep them in the same order as the array (or > cursor)? Seems like it should be easy, but I'm not seeing it. > > Thanks, > > > John DeSoi, Ph.D. Matching the rows is easy. Getting them in the same order as the items in the array will require an ORDER BY. The contrib package _int has an idx() that you can use for that. SELECT * FROM foo WHERE foo.id = ANY(myPkArray) ORDER BY idx(myPkArray, id)
On Thu, Aug 20, 2009 at 11:15:12PM -0400, John DeSoi wrote: > Suppose I have an integer array (or cursor with one integer column) > which represents primary keys of some table. Is there a simple and > efficient way to return the rows of the table corresponding to the > primary key values and keep them in the same order as the array (or > cursor)? You could do something like: SELECT f.id FROM foo f, ( SELECT i, myPkArray[i] AS elem FROM generate_series(array_lower(myPkArray,1),array_upper(myPkArray,1)) i) x WHERE f.id = x.elem ORDER BY x.i; It may help to wrap the generate_series call into a function so you don't have to refer to "myPkArray" so many times. -- Sam http://samason.me.uk/
On Aug 21, 2009, at 7:26 AM, Sam Mason wrote: > It may help to wrap the generate_series call into a function so you > don't have to refer to "myPkArray" so many times. Yes, this is the best I have come up with so far. I have a set returning function which returns the key and the index number. The implementation with a cursor looks like this: SELECT * FROM cursor_pk('c1') c LEFT JOIN foo ON (c.pk = foo.pk) ORDER BY c.idx; An array function could return the same values. I need to do some testing to see what the performance looks likes for each method. Thanks for the suggestion. John DeSoi, Ph.D.
On Fri, Aug 21, 2009 at 1:16 PM, John DeSoi<desoi@pgedit.com> wrote: > > Yes, this is the best I have come up with so far. I have a set returning > function which returns the key and the index number. The implementation with > a cursor looks like this: > > SELECT * FROM cursor_pk('c1') c LEFT JOIN foo ON (c.pk = foo.pk) ORDER BY > c.idx; This might be a circumstance where it's really best to just force the planner to use a particular plan. There's no way to tell it "the order of the array" without forcing an entirely unnecessary sort. Something like SET LOCAL enable_mergejoin = 'off'; SET LOCAL enable_hashjoin = 'off'; SELECT * from unnest(arr) as v(id) join tab using (id) RESET ALL; -- greg http://mit.edu/~gsstark/resume.pdf
On Fri, Aug 21, 2009 at 2:16 PM, Greg Stark<gsstark@mit.edu> wrote: > On Fri, Aug 21, 2009 at 1:16 PM, John DeSoi<desoi@pgedit.com> wrote: >> >> Yes, this is the best I have come up with so far. I have a set returning >> function which returns the key and the index number. The implementation with >> a cursor looks like this: >> >> SELECT * FROM cursor_pk('c1') c LEFT JOIN foo ON (c.pk = foo.pk) ORDER BY >> c.idx; > > This might be a circumstance where it's really best to just force the > planner to use a particular plan. There's no way to tell it "the order > of the array" without forcing an entirely unnecessary sort. Of course immediately upon hitting send I did think of a way: SELECT (r).* FROM (SELECT (SELECT x FROM x WHERE a=id) AS r FROM unnest(array[1,2]) AS arr(id) ) AS subq; -- greg http://mit.edu/~gsstark/resume.pdf
On Fri, Aug 21, 2009 at 9:22 AM, Greg Stark<gsstark@mit.edu> wrote: > Of course immediately upon hitting send I did think of a way: > > SELECT (r).* > FROM (SELECT (SELECT x FROM x WHERE a=id) AS r > FROM unnest(array[1,2]) AS arr(id) > ) AS subq; nice use of composite type in select-list subquery. I use this approach often...it's a 'wonder trick'. merlin
On Fri, Aug 21, 2009 at 02:22:54PM +0100, Greg Stark wrote: > SELECT (r).* > FROM (SELECT (SELECT x FROM x WHERE a=id) AS r > FROM unnest(array[1,2]) AS arr(id) > ) AS subq; Shouldn't that second inner SELECT unnecessary? I'd be tempted to write: SELECT ((SELECT x FROM x WHERE x.a = arr.id)).* FROM unnest(array[1,2]) AS arr(id) but PG throws this out for some reason. Adding more brackets doesn't seem to help, the following seems related: SELECT ((SELECT (1,2))).*; The current grammar seems to require two sets of brackets, one for the sub-select and another for pulling the value out of the record. Not quite sure why PG calls it indirection, but I guess that's how it's implemented. I can seem to work around it by doing: CREATE FUNCTION id(anyelement) RETURNS anyelement LANGUAGE sql AS $$ SELECT $1; $$; SELECT (id((1,2))).*; But this seems nasty and bumps up against the annoying "record type has not been registered" that I hit all to often. More fiddling gets to: CREATE TYPE foo AS ( i int, j int ); SELECT (id((SELECT (1,2)::foo))).*; or am I missing something obvious? -- Sam http://samason.me.uk/
On Fri, Aug 21, 2009 at 10:17 AM, Sam Mason<sam@samason.me.uk> wrote: > On Fri, Aug 21, 2009 at 02:22:54PM +0100, Greg Stark wrote: >> SELECT (r).* >> FROM (SELECT (SELECT x FROM x WHERE a=id) AS r >> FROM unnest(array[1,2]) AS arr(id) >> ) AS subq; > > Shouldn't that second inner SELECT unnecessary? I'd be tempted to > write: > > SELECT ((SELECT x FROM x WHERE x.a = arr.id)).* > FROM unnest(array[1,2]) AS arr(id) > > but PG throws this out for some reason. Adding more brackets doesn't > seem to help, the following seems related: > > SELECT ((SELECT (1,2))).*; > > The current grammar seems to require two sets of brackets, one for the > sub-select and another for pulling the value out of the record. Not > quite sure why PG calls it indirection, but I guess that's how it's > implemented. I can seem to work around it by doing: > > CREATE FUNCTION id(anyelement) > RETURNS anyelement > LANGUAGE sql AS $$ > SELECT $1; $$; > > SELECT (id((1,2))).*; > > But this seems nasty and bumps up against the annoying "record type has > not been registered" that I hit all to often. More fiddling gets to: > > CREATE TYPE foo AS ( i int, j int ); > > SELECT (id((SELECT (1,2)::foo))).*; > > or am I missing something obvious? I think that what you are bumping in to is that there is no real definition of '*' in the query. 've griped about this a few times. If type 't' has fields a,b, select (t).* is expanded to select (t).a, (t).b. This can lead to some weird situations. If you have an aggregate function that returns t, for example: select (agg()).*; will run the aggregate function twice (this is a _huge_ gotcha!). I think that '*' needs to be promoted somehow so that it isn't expanded during parsing but has special meaning. merlin
On Fri, Aug 21, 2009 at 10:49:52AM -0400, Merlin Moncure wrote: > On Fri, Aug 21, 2009 at 10:17 AM, Sam Mason<sam@samason.me.uk> wrote: > > CREATE TYPE foo AS ( i int, j int ); > > > > SELECT (id((SELECT (1,2)::foo))).*; > > > > or am I missing something obvious? > > I think that what you are bumping in to is that there is no real > definition of '*' in the query. 've griped about this a few times. If > type 't' has fields a,b, > > select (t).* is expanded to select (t).a, (t).b. OK, but that's not because of there being no "real definition" of *. It seems perfectly well defined in the example above as a tuple of two integers. The thing that causes it to go wrong here is that PG doesn't know that doing this expansion (as you noted above) can cause the query to become very expensive. PG should instead arrange that the expression "t" is run exactly once and reuse the single result for all columns. -- Sam http://samason.me.uk/
Sam Mason <sam@samason.me.uk> writes: > ... PG should instead arrange that the expression > "t" is run exactly once and reuse the single result for all columns. We might be able to do that based on the row-returning-subselect infrastructure being discussed over here: http://archives.postgresql.org/message-id/4087.1250867036@sss.pgh.pa.us Don't hold your breath though ... regards, tom lane
On Fri, Aug 21, 2009 at 12:05:51PM -0400, Tom Lane wrote: > Sam Mason <sam@samason.me.uk> writes: > > ... PG should instead arrange that the expression > > "t" is run exactly once and reuse the single result for all columns. > > We might be able to do that based on the row-returning-subselect > infrastructure being discussed over here: > http://archives.postgresql.org/message-id/4087.1250867036@sss.pgh.pa.us Huh, fun. Not sure if I'm interpreting this correctly, but what would the difference in semantics between: SELECT (SELECT 1,2); and SELECT (SELECT (1,2)); PG seems to make the distinction somewhat blurry at the moment. For example, upthread I did: SELECT (id((SELECT (1,2)::foo))).*; and got back two columns, and yet when I do what I think is equivalent: SELECT x.* FROM (SELECT (1,2)::foo) x; I get back my tuple, and not the values inside my tuple. Should I be posting this to -hackers? -- Sam http://samason.me.uk/
On Fri, Aug 21, 2009 at 1:13 PM, Sam Mason<sam@samason.me.uk> wrote: > On Fri, Aug 21, 2009 at 12:05:51PM -0400, Tom Lane wrote: >> Sam Mason <sam@samason.me.uk> writes: >> > ... PG should instead arrange that the expression >> > "t" is run exactly once and reuse the single result for all columns. >> >> We might be able to do that based on the row-returning-subselect >> infrastructure being discussed over here: >> http://archives.postgresql.org/message-id/4087.1250867036@sss.pgh.pa.us > > Huh, fun. Not sure if I'm interpreting this correctly, but what would > the difference in semantics between: > > SELECT (SELECT 1,2); > > and > > SELECT (SELECT (1,2)); The first form is not allowed because subqueries used that way must return only one column. The second form works because the extra parens constructs a row type which gets around that restriction. > PG seems to make the distinction somewhat blurry at the moment. For > example, upthread I did: > > SELECT (id((SELECT (1,2)::foo))).*; > > and got back two columns, and yet when I do what I think is equivalent: > > SELECT x.* > FROM (SELECT (1,2)::foo) x; That is _not_ equivalent. You are asking for all the columns from the table expression 's'...there is one column, a composite type with two fields. The equivalent would have been: SELECT (x).foo.* ... I think may have missed the point of what I was saying earlier. In any query, when you have an expression of (x).*, if x is a row type of some kind, the expression is rewritten on the fly as: (x),a, (x).b, ... for each field in x. That's why (in say, a rule): select (NEW).* is distinct from (OLD).* is not future proofed against changes in the row type while: select NEW is distinct from OLD is. in the first form the query is expanded on rule creation and will not pick up extra columns if they are added. think of: .* as a macro that is rewritten by the query parser. select ((1,2,nextval('v_seq'))::foo).*; if it was allowed, in current methods would probably expand into: select ((1,2,nextval('v_seq'))::foo).a, ((1,2,nextval('v_seq'))::foo).b, ((1,2,nextval('v_seq'))::foo).c; if foo had fields a,b,c. Hilarity ensues :-). What I was trying to say before is that maybe .* should not expand the query that way, but should just give you the fields of that type and, (very) controversially, do that in view and rule plans as well. merlin
On Fri, Aug 21, 2009 at 01:58:34PM -0400, Merlin Moncure wrote: > On Fri, Aug 21, 2009 at 1:13 PM, Sam Mason<sam@samason.me.uk> wrote: > > On Fri, Aug 21, 2009 at 12:05:51PM -0400, Tom Lane wrote: > >> We might be able to do that based on the row-returning-subselect > >> infrastructure being discussed over here: > >> http://archives.postgresql.org/message-id/4087.1250867036@sss.pgh.pa.us > > > > Not sure if I'm interpreting this correctly, but what would > > the difference in semantics between: > > > > SELECT (SELECT 1,2); > > > > and > > > > SELECT (SELECT (1,2)); > > The first form is not allowed because subqueries used that way must > return only one column. The second form works because the extra > parens constructs a row type which gets around that restriction. I was under the impression that the message Tom was pointing to was about lifting this restriction. I believe the use case was of being able to do: UPDATE foo SET (a,b,c) = (SELECT 1,2,3); if I'm reading things correctly. I was expecting this to be generally available where any sub-select was supported and hence I was wondering what its semantics would be. It seemed like an interesting corner case and I'd not seen it discussed in the linked threads. If I'm not reading things correctly then the rest is a mute point. -- Sam http://samason.me.uk/
On Aug 21, 2009, at 9:22 AM, Greg Stark wrote: > Of course immediately upon hitting send I did think of a way: > > SELECT (r).* > FROM (SELECT (SELECT x FROM x WHERE a=id) AS r > FROM unnest(array[1,2]) AS arr(id) > ) AS subq; Thanks to all for the interesting insights and discussion. Where in the docs can I learn about writing queries like that :). While it avoids the sort of my method, it appears to be almost 5 times slower (about 4000 keys in the cursor, Postgres 8.4.0): EXPLAIN ANALYZE SELECT (r).* FROM (SELECT (SELECT "work" FROM "work" WHERE dbid=id) AS r FROM cursor_pk('c1') AS arr(id) ) AS subq; Function Scan on cursor_pk arr (cost=0.00..116011.72 rows=1000 width=4) (actual time=13.561..249.916 rows=4308 loops=1) SubPlan 1 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.003..0.003 rows=1 loops=4308) Index Cond: (dbid = $0) SubPlan 2 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=4308) Index Cond: (dbid = $0) SubPlan 3 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=4308) Index Cond: (dbid = $0) SubPlan 4 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=4308) Index Cond: (dbid = $0) SubPlan 5 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=4308) Index Cond: (dbid = $0) SubPlan 6 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=4308) Index Cond: (dbid = $0) SubPlan 7 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=4308) Index Cond: (dbid = $0) SubPlan 8 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=4308) Index Cond: (dbid = $0) SubPlan 9 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=4308) Index Cond: (dbid = $0) SubPlan 10 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=4308) Index Cond: (dbid = $0) SubPlan 11 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=4308) Index Cond: (dbid = $0) SubPlan 12 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=4308) Index Cond: (dbid = $0) SubPlan 13 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=4308) Index Cond: (dbid = $0) SubPlan 14 -> Index Scan using work_pkey on work (cost=0.00..8.27 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=4308) Index Cond: (dbid = $0) Total runtime: 250.739 ms EXPLAIN ANALYZE SELECT * FROM cursor_pk('c1') c LEFT JOIN "work" ON (c.pk = "work".dbid) order by c.idx; Sort (cost=771.23..773.73 rows=1000 width=375) (actual time=36.058..38.392 rows=4308 loops=1) Sort Key: c.idx Sort Method: external merge Disk: 1656kB -> Merge Right Join (cost=309.83..721.40 rows=1000 width=375) (actual time=15.447..22.293 rows=4308 loops=1) Merge Cond: (work.dbid = c.pk) -> Index Scan using work_pkey on work (cost=0.00..385.80 rows=4308 width=367) (actual time=0.020..2.078 rows=4308 loops=1) -> Sort (cost=309.83..312.33 rows=1000 width=8) (actual time=15.420..15.946 rows=4308 loops=1) Sort Key: c.pk Sort Method: quicksort Memory: 297kB -> Function Scan on cursor_pk_order c (cost=0.00..260.00 rows=1000 width=8) (actual time=12.672..13.073 rows=4308 loops=1) Total runtime: 51.886 ms Thanks for any further suggestions. John DeSoi, Ph.D.
On Sun, Aug 23, 2009 at 1:30 AM, John DeSoi<desoi@pgedit.com> wrote: > While it avoids the sort of my method, it appears to be almost 5 times > slower (about 4000 keys in the cursor, Postgres 8.4.0): > > > Function Scan on cursor_pk arr (cost=0.00..116011.72 rows=1000 width=4) > (actual time=13.561..249.916 rows=4308 loops=1) > SubPlan 1 > SubPlan 2 > SubPlan 3 > ... Ugh, I guess using a subquery didn't work around the problem of the (r).* getting expanded into multiple columns. This is starting to be a more annoying limitation than I realized. This also means when we do things like select (x).* from (select bt_page_items(...)) or select (h).* from (select heap_page_items(...)) It's actually calling bt_page_items() repeatedly, once for every column in the output record? Bleagh. -- greg http://mit.edu/~gsstark/resume.pdf