Thread: Retrieving multiple columns from a subquery

Retrieving multiple columns from a subquery

From
Chris Hanks
Date:
Hello -

I have two tables:

CREATE TABLE users
(
  id serial NOT NULL,
  created_at timestamp with time zone NOT NULL,
  last_seen_at timestamp with time zone NOT NULL,
  -- some other columns...
)

CREATE TABLE emails
(
  user_id integer NOT NULL,
  address text NOT NULL,
  created_at timestamp with time zone NOT NULL,
  confirmed_at timestamp with time zone,
  -- some other columns...
  CONSTRAINT emails_user_id_fkey FOREIGN KEY (user_id)
      REFERENCES users (id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE CASCADE
)

The gist is that a single user can be related to multiple emails, and some email addresses are confirmed (they've clicked a link I've sent there, so I know it's valid) and some aren't.

Routinely, when I'm fetching users from the db I also want to get the best email address for each user. That is, the email address that they've confirmed the most recently, or failing that, the one that they created the most recently. I've been doing this via a subselect:

SELECT *,
  (SELECT "address" FROM "emails" WHERE ("user_id" = "id") ORDER BY "confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS "best_email"
FROM "users"

I like the subquery approach because I can use my ORM to easily append it to whatever query I'm running against the users table (whether I'm looking up one user or many), without having to do an explicit join and trim out the unnecessary rows. Also, in the future I'm planning on adding additional subqueries to get (for example) each user's current subscription status, and I'm afraid that the joins will get ungainly. Besides, I find subqueries much easier to reason about than joins.

My problem is that now I need to get not only the best email's address, but whether it is confirmed (whether confirmed_at is not null). My first attempt was to simply repeat the subquery:

SELECT *,
  (SELECT "address" FROM "emails" WHERE ("user_id" = "id") ORDER BY "confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS "best_email",
  ((SELECT "confirmed_at" FROM "emails" WHERE ("user_id" = "id") ORDER BY "confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) IS NOT NULL) AS "best_email_confirmed"
FROM "users"

I had hoped Postgres would recognize that the two subqueries were identical, but judging from the explain output from my development database it's not, and it's simply running the subquery twice instead:

"Seq Scan on users  (cost=0.00..333.65 rows=13 width=81)"
"  SubPlan 1"
"    ->  Limit  (cost=12.79..12.79 rows=1 width=48)"
"          ->  Sort  (cost=12.79..12.80 rows=5 width=48)"
"                Sort Key: public.emails.confirmed_at, public.emails.created_at"
"                ->  Bitmap Heap Scan on emails  (cost=4.29..12.76 rows=5 width=48)"
"                      Recheck Cond: (user_id = users.id)"
"                      ->  Bitmap Index Scan on emails_pkey  (cost=0.00..4.29 rows=5 width=0)"
"                            Index Cond: (user_id = users.id)"
"  SubPlan 2"
"    ->  Limit  (cost=12.79..12.79 rows=1 width=16)"
"          ->  Sort  (cost=12.79..12.80 rows=5 width=16)"
"                Sort Key: public.emails.confirmed_at, public.emails.created_at"
"                ->  Bitmap Heap Scan on emails  (cost=4.29..12.76 rows=5 width=16)"
"                      Recheck Cond: (user_id = users.id)"
"                      ->  Bitmap Index Scan on emails_pkey  (cost=0.00..4.29 rows=5 width=0)"
"                            Index Cond: (user_id = users.id)"

It would be ideal if I could pull both results from the same subquery, something like:

SELECT *,
  (SELECT "address", "confirmed_at" IS NOT NULL FROM "emails" WHERE ("user_id" = "id") ORDER BY "confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS ("best_email", "best_email_confirmed")
FROM "users"

But this isn't valid syntax. I tried putting the subquery under a FROM clause, but it won't work with my "user_id" = "id" condition, and throws "ERROR:  subquery in FROM cannot refer to other relations of same query level". I think CTEs might be an answer, but I'm stuck on 8.3 for the foreseeable future, which doesn't support them.

Does anyone have any suggestions?

Re: Retrieving multiple columns from a subquery

From
Chris Hanks
Date:
Nothing? Are subqueries just not meant to be used this way?


On Wed, May 9, 2012 at 9:42 AM, Chris Hanks <christopher.m.hanks@gmail.com> wrote:
Hello -

I have two tables:

CREATE TABLE users
(
  id serial NOT NULL,
  created_at timestamp with time zone NOT NULL,
  last_seen_at timestamp with time zone NOT NULL,
  -- some other columns...
)

CREATE TABLE emails
(
  user_id integer NOT NULL,
  address text NOT NULL,
  created_at timestamp with time zone NOT NULL,
  confirmed_at timestamp with time zone,
  -- some other columns...
  CONSTRAINT emails_user_id_fkey FOREIGN KEY (user_id)
      REFERENCES users (id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE CASCADE
)

The gist is that a single user can be related to multiple emails, and some email addresses are confirmed (they've clicked a link I've sent there, so I know it's valid) and some aren't.

Routinely, when I'm fetching users from the db I also want to get the best email address for each user. That is, the email address that they've confirmed the most recently, or failing that, the one that they created the most recently. I've been doing this via a subselect:

SELECT *,
  (SELECT "address" FROM "emails" WHERE ("user_id" = "id") ORDER BY "confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS "best_email"
FROM "users"

I like the subquery approach because I can use my ORM to easily append it to whatever query I'm running against the users table (whether I'm looking up one user or many), without having to do an explicit join and trim out the unnecessary rows. Also, in the future I'm planning on adding additional subqueries to get (for example) each user's current subscription status, and I'm afraid that the joins will get ungainly. Besides, I find subqueries much easier to reason about than joins.

My problem is that now I need to get not only the best email's address, but whether it is confirmed (whether confirmed_at is not null). My first attempt was to simply repeat the subquery:

SELECT *,
  (SELECT "address" FROM "emails" WHERE ("user_id" = "id") ORDER BY "confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS "best_email",
  ((SELECT "confirmed_at" FROM "emails" WHERE ("user_id" = "id") ORDER BY "confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) IS NOT NULL) AS "best_email_confirmed"
FROM "users"

I had hoped Postgres would recognize that the two subqueries were identical, but judging from the explain output from my development database it's not, and it's simply running the subquery twice instead:

"Seq Scan on users  (cost=0.00..333.65 rows=13 width=81)"
"  SubPlan 1"
"    ->  Limit  (cost=12.79..12.79 rows=1 width=48)"
"          ->  Sort  (cost=12.79..12.80 rows=5 width=48)"
"                Sort Key: public.emails.confirmed_at, public.emails.created_at"
"                ->  Bitmap Heap Scan on emails  (cost=4.29..12.76 rows=5 width=48)"
"                      Recheck Cond: (user_id = users.id)"
"                      ->  Bitmap Index Scan on emails_pkey  (cost=0.00..4.29 rows=5 width=0)"
"                            Index Cond: (user_id = users.id)"
"  SubPlan 2"
"    ->  Limit  (cost=12.79..12.79 rows=1 width=16)"
"          ->  Sort  (cost=12.79..12.80 rows=5 width=16)"
"                Sort Key: public.emails.confirmed_at, public.emails.created_at"
"                ->  Bitmap Heap Scan on emails  (cost=4.29..12.76 rows=5 width=16)"
"                      Recheck Cond: (user_id = users.id)"
"                      ->  Bitmap Index Scan on emails_pkey  (cost=0.00..4.29 rows=5 width=0)"
"                            Index Cond: (user_id = users.id)"

It would be ideal if I could pull both results from the same subquery, something like:

SELECT *,
  (SELECT "address", "confirmed_at" IS NOT NULL FROM "emails" WHERE ("user_id" = "id") ORDER BY "confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS ("best_email", "best_email_confirmed")
FROM "users"

But this isn't valid syntax. I tried putting the subquery under a FROM clause, but it won't work with my "user_id" = "id" condition, and throws "ERROR:  subquery in FROM cannot refer to other relations of same query level". I think CTEs might be an answer, but I'm stuck on 8.3 for the foreseeable future, which doesn't support them.

Does anyone have any suggestions?

Re: Retrieving multiple columns from a subquery

From
Merlin Moncure
Date:
On Wed, May 9, 2012 at 11:42 AM, Chris Hanks
<christopher.m.hanks@gmail.com> wrote:
> Hello -
>
> I have two tables:
>
> CREATE TABLE users
> (
>   id serial NOT NULL,
>   created_at timestamp with time zone NOT NULL,
>   last_seen_at timestamp with time zone NOT NULL,
>   -- some other columns...
> )
>
> CREATE TABLE emails
> (
>   user_id integer NOT NULL,
>   address text NOT NULL,
>   created_at timestamp with time zone NOT NULL,
>   confirmed_at timestamp with time zone,
>   -- some other columns...
>   CONSTRAINT emails_user_id_fkey FOREIGN KEY (user_id)
>       REFERENCES users (id) MATCH SIMPLE
>       ON UPDATE CASCADE ON DELETE CASCADE
> )
>
> The gist is that a single user can be related to multiple emails, and some
> email addresses are confirmed (they've clicked a link I've sent there, so I
> know it's valid) and some aren't.
>
> Routinely, when I'm fetching users from the db I also want to get the best
> email address for each user. That is, the email address that they've
> confirmed the most recently, or failing that, the one that they created the
> most recently. I've been doing this via a subselect:
>
> SELECT *,
>   (SELECT "address" FROM "emails" WHERE ("user_id" = "id") ORDER BY
> "confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS "best_email"
> FROM "users"
>
> I like the subquery approach because I can use my ORM to easily append it to
> whatever query I'm running against the users table (whether I'm looking up
> one user or many), without having to do an explicit join and trim out the
> unnecessary rows. Also, in the future I'm planning on adding additional
> subqueries to get (for example) each user's current subscription status, and
> I'm afraid that the joins will get ungainly. Besides, I find subqueries much
> easier to reason about than joins.
>
> My problem is that now I need to get not only the best email's address, but
> whether it is confirmed (whether confirmed_at is not null). My first attempt
> was to simply repeat the subquery:
>
> SELECT *,
>   (SELECT "address" FROM "emails" WHERE ("user_id" = "id") ORDER BY
> "confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS "best_email",
>   ((SELECT "confirmed_at" FROM "emails" WHERE ("user_id" = "id") ORDER BY
> "confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) IS NOT NULL) AS
> "best_email_confirmed"
> FROM "users"
>
> I had hoped Postgres would recognize that the two subqueries were identical,
> but judging from the explain output from my development database it's not,
> and it's simply running the subquery twice instead:
>
> "Seq Scan on users  (cost=0.00..333.65 rows=13 width=81)"
> "  SubPlan 1"
> "    ->  Limit  (cost=12.79..12.79 rows=1 width=48)"
> "          ->  Sort  (cost=12.79..12.80 rows=5 width=48)"
> "                Sort Key: public.emails.confirmed_at,
> public.emails.created_at"
> "                ->  Bitmap Heap Scan on emails  (cost=4.29..12.76 rows=5
> width=48)"
> "                      Recheck Cond: (user_id = users.id)"
> "                      ->  Bitmap Index Scan on emails_pkey
>  (cost=0.00..4.29 rows=5 width=0)"
> "                            Index Cond: (user_id = users.id)"
> "  SubPlan 2"
> "    ->  Limit  (cost=12.79..12.79 rows=1 width=16)"
> "          ->  Sort  (cost=12.79..12.80 rows=5 width=16)"
> "                Sort Key: public.emails.confirmed_at,
> public.emails.created_at"
> "                ->  Bitmap Heap Scan on emails  (cost=4.29..12.76 rows=5
> width=16)"
> "                      Recheck Cond: (user_id = users.id)"
> "                      ->  Bitmap Index Scan on emails_pkey
>  (cost=0.00..4.29 rows=5 width=0)"
> "                            Index Cond: (user_id = users.id)"
>
> It would be ideal if I could pull both results from the same subquery,
> something like:
>
> SELECT *,
>   (SELECT "address", "confirmed_at" IS NOT NULL FROM "emails" WHERE
> ("user_id" = "id") ORDER BY "confirmed_at" DESC NULLS LAST, "created_at"
> DESC LIMIT 1) AS ("best_email", "best_email_confirmed")
> FROM "users"
>
> But this isn't valid syntax. I tried putting the subquery under a FROM
> clause, but it won't work with my "user_id" = "id" condition, and throws
> "ERROR:  subquery in FROM cannot refer to other relations of same query
> level". I think CTEs might be an answer, but I'm stuck on 8.3 for the
> foreseeable future, which doesn't support them.
>
> Does anyone have any suggestions?

try this.  sometimes, but not always, it will avoid the extra subplans:

SELECT u.*, (emails).* FROM
(
   SELECT *,
   (SELECT e FROM "emails" e WHERE ("user_id" = "id") ORDER BY
"confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS emails,
  FROM "users" u
) q

merlin

Re: Retrieving multiple columns from a subquery

From
Tom Lane
Date:
Chris Hanks <christopher.m.hanks@gmail.com> writes:
> Nothing? Are subqueries just not meant to be used this way?

The SQL standard says not ;-).

You could approximate it like this:

    select ..., (select row(x,y,z) from ...), ... from ...;

as long as you don't mind pulling the composite-value output syntax
apart.  This avoids the single-output-column syntactic restriction
by cramming all the values into one column.

[ thinks for a bit... ]  It seems like you ought to be able to get PG
to pull the composite values apart again, with something like

    select ..., (x).*, ... from
    (select ..., (select row(x,y,z) from ...) as x, ...
     from ... offset 0) ss;

but when I try this I get
    ERROR:  record type has not been registered
That's a bug, probably, but dunno how hard to fix.  In the meantime you
could work around it by casting the row() expression to a named
composite type; which might be a good idea anyway since there's no other
obvious way to control the column names that will be exposed by the
(x).* expansion.

            regards, tom lane

Re: Retrieving multiple columns from a subquery

From
Chris Hanks
Date:


On Mon, May 14, 2012 at 8:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Chris Hanks <christopher.m.hanks@gmail.com> writes:
> Nothing? Are subqueries just not meant to be used this way?

The SQL standard says not ;-).

You could approximate it like this:

       select ..., (select row(x,y,z) from ...), ... from ...;

as long as you don't mind pulling the composite-value output syntax
apart.  This avoids the single-output-column syntactic restriction
by cramming all the values into one column.

[ thinks for a bit... ]  It seems like you ought to be able to get PG
to pull the composite values apart again, with something like

       select ..., (x).*, ... from
       (select ..., (select row(x,y,z) from ...) as x, ...
        from ... offset 0) ss;

but when I try this I get
       ERROR:  record type has not been registered
That's a bug, probably, but dunno how hard to fix.  In the meantime you
could work around it by casting the row() expression to a named
composite type; which might be a good idea anyway since there's no other
obvious way to control the column names that will be exposed by the
(x).* expansion.

                       regards, tom lane


Thanks, I tried playing with the row function a bit. It gave me the idea to try:

SELECT *, (SELECT ARRAY[address, (confirmed_at is not null)::text]
  FROM "emails"
  WHERE ("user_id" = "id")
  ORDER BY "confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) as best_email
FROM "users"

Since my ORM already handles Postgres arrays for me, this winds up being a bit easier to handle in my app. It's a bit ugly, but it works. I'll keep the idea of the named composite type around in case I need to revisit this later, though.

Thanks again for the advice!