Thread: Controlling complexity in queries

Controlling complexity in queries

From
Robert James
Date:
I have a very long query.  Due to the planner and good indexing, it
runs quite fast.  But it's so long, it's quite hard to follow.

I'm trying to break it up into pieces, but am running up against
limits of SQL.  Can you help me with any of these problems?

1.
SELECT
<complicated expression A with sub expression B> AS A,
<complicated expression C with sub expression B> AS C,
<complicated expression D with sub expression B> AS D
...

I'd like to be able to extract the common subexpression B and give it
a name (called "output_name" in the docs).  But there's no way then to
reference it from the SELECT clause.  Any workarounds?

2. complicated join and subquery
I'd like to extract subparts of this which are conceptually cohesive
and make them VIEWs.  The problem is that they depend on parameters
(in the ON and WHERE clauses), and VIEWs don't allow parameters.  I
could use set returning functions, but, besides the headache involved,
I've found that these tend to stop the planner from peering inside
them, and hence ruin performance.

Is there a recommended solution?

Re: Controlling complexity in queries

From
David Johnston
Date:
Inlined.

David J.

On Dec 11, 2011, at 19:46, Robert James <srobertjames@gmail.com> wrote:

> I have a very long query.  Due to the planner and good indexing, it
> runs quite fast.  But it's so long, it's quite hard to follow.
>
> I'm trying to break it up into pieces, but am running up against
> limits of SQL.  Can you help me with any of these problems?
>
> 1.
> SELECT
> <complicated expression A with sub expression B> AS A,
> <complicated expression C with sub expression B> AS C,
> <complicated expression D with sub expression B> AS D
> ...
>
> I'd like to be able to extract the common subexpression B and give it
> a name (called "output_name" in the docs).  But there's no way then to
> reference it from the SELECT clause.  Any workarounds?

Use a WITH clause on the SELECT statement.
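
For example, a minimal sketch (table and column names are made up):

WITH t AS (
    SELECT *, (x + y) / 2.0 AS b   -- the common subexpression B, named once
    FROM some_table
)
SELECT
    b * 2  AS a,   -- expression A using B
    b - x  AS c,   -- expression C using B
    ln(b)  AS d    -- expression D using B
FROM t;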

>
> 2. complicated join and subquery
> I'd like to extract subparts of this which are conceptually cohesive
> and make them VIEWs.  The problem is that they depend on parameters
> (in the ON and WHERE clauses), and VIEWs don't allow parameters.  I
> could use set returning functions, but, besides the headache involved,
> I've found that these tend to stop the planner from peering inside
> them, and hence ruin performance.
>
> Is there a recommended solution?

Use the VIEWs and let the planner optimize based upon the calling statement's WHERE clause.  ON clause parameters
are a different story than the WHERE clause, so see if you can avoid them.

In the end, if it is a critical area, some degree of trial-and-error is useful; custom materialized views may also work.

Re: Controlling complexity in queries

From
Craig Ringer
Date:
On 12/12/2011 09:15 AM, David Johnston wrote:
> Use a WITH clause on the SELECT statement.
Note that WITH is an optimisation fence, so if you're relying on Pg
pushing WHERE clauses down into subqueries or anything like that you may
find that your query runs a LOT slower when broken up as WITH expressions.
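
For example, against a hypothetical big_table with an index on id:

-- subquery form: the WHERE is pushed down, so the index on id can be used
SELECT * FROM (SELECT * FROM big_table) AS sub WHERE id = 42;

-- WITH form: the CTE is an optimisation fence, so big_table may be
-- scanned in full before the id = 42 filter is applied
WITH sub AS (SELECT * FROM big_table)
SELECT * FROM sub WHERE id = 42;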

There's been talk of a Pg extension that allows optimisation through
WITH, but it's not currently possible.

Another option is to wrap things up in SQL functions or views.

--
Craig Ringer

Re: Controlling complexity in queries

From
Merlin Moncure
Date:
On Sun, Dec 11, 2011 at 9:10 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:
> On 12/12/2011 09:15 AM, David Johnston wrote:
>>
>> Use a WITH clause on the SELECT statement.
>
> Note that WITH is an optimisation fence, so if you're relying on Pg pushing
> WHERE clauses down into subqueries or anything like that you may find that
> your query runs a LOT slower when broken up as WITH expressions.
>
> There's been talk of a Pg extension that allows optimisation through WITH,
> but it's not currently possible.
>
> Another option is to wrap things up in SQL functions or views.

A note about that:  abstracting via views vs functions is a completely
different approach.  Views will not significantly change the way your
query works now -- they are inlined as macros and the final query is
going to be more or less the same as your hand-rolled one.
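
For example (hypothetical names):

CREATE VIEW active_orders AS
    SELECT * FROM orders WHERE status = 'active';

-- this...
SELECT * FROM active_orders WHERE customer_id = 42;
-- ...plans essentially the same as the hand-rolled version:
SELECT * FROM orders WHERE status = 'active' AND customer_id = 42;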

Breaking your large queries into functions OTOH can make significant
changes to the plan, often for the worse.  This is because functions,
especially complicated plpgsql set returning ones with procedural
logic, are black boxes to the sql optimizer.  The upshot of this is
that functions tend to encourage nestloop style plans because the
function has to be serially executed.

Functions (also WITH) are great in that they can provide very high
levels of abstraction when composing complex queries, but there is a
price in the sense that you are taking away some of the database's
ability to plan and optimize the query.  I prefer views unless there
is a good reason not to use them.

In the end, the performance of your queries is going to be directly
related to how well you map the problem into relational logic...the
database thinks relationally, so you (the OP) should learn to do so as
well.

merlin

Re: Controlling complexity in queries

From
Jay Levitt
Date:
Merlin Moncure wrote:
> Breaking your large queries into functions OTOH can make significant
> changes to the plan, often for the worse.

As an end-user, I think this is an area where PostgreSQL could really stand
out (that and the moon launch).  In Rails-land, you don't have The DBA that
writes queries. You have a developer and an ORM, and when they step outside
that ORM to do the cool things SQL can do, all their DRY habits fall apart,
because it's 1990 again and you can either write clear code or fast code but
not both.

Postgres has made huge inroads with its optimizer.  A few years ago, for web
startups picking their relational DB, the performant answer was "MySQL for
fast, simple, mostly-read queries, PostgreSQL when you grow up."  That's
become "Probably Postgres", and with 9.2's COUNT(*), it's become "Definitely
Postgres".

But having to write one big query for performance feels exactly like having
to write one big C function with unrolled loops.  I'm currently taking a
well-factored, function-based query and turning it INTO what Robert James is
trying to get OUT of: a monolithic query.

> In the end, the performance of your queries is going to be directly
> related to how well you map the problem into relational logic

It's not just that, though; it's quite possible to think relationally and
still fall down. There are plenty of cases where the human eye can see that
a modular function can be inlined, but the optimizer can't.  I have a
pathological case: a query against a database with just a few thousand users
takes 1.5 seconds on fast hardware, because it ends up scanning a cartesian
product to get 16 rows, even before you get to the nested loops. In fact,
most of the time the optimizer does a great job of inlining all my
set-returning functions, once 9.0.6/9.1.2 rolled out.

I've seen at least three equally ominous pieces that would have to happen to
allow DRY, composable SQL:

1. Optional optimization of non-recursive WITH
2. Optional pushdown of WHERE clauses into GROUP BY[1]
3. LATERAL

AFAIK, none of these are on anyone's short-term to-do list, and I'm sure
none are easy.

[1] Since this is my current favorite problem, the pathological case is:

select questions.id
from questions
join (
   select u.id
   from users as u
   group by u.id
) as s
on s.id = questions.user_id
where questions.id = 1;

With users.id as a primary key, it's obvious that this can return only one
row, but it has to scan the users table to get there.  See the "Subjquery in
a JOIN not getting restricted?" thread on pgsql-performance for Tom's
explanation of why that's a hard problem to solve.

Re: Controlling complexity in queries

From
Merlin Moncure
Date:
On Tue, Dec 13, 2011 at 4:27 PM, Jay Levitt <jay.levitt@gmail.com> wrote:
> Merlin Moncure wrote:
>>
>> Breaking your large queries into functions OTOH can make significant
>> changes to the plan, often for the worse.
>
>
> As an end-user, I think this is an area where PostgreSQL could really stand
> out (that and the moon launch).  In Rails-land, you don't have The DBA that
> writes queries. You have a developer and an ORM, and when they step outside
> that ORM to do the cool things SQL can do, all their DRY habits fall apart,
> because it's 1990 again and you can either write clear code or fast code but
> not both.

As far as ORMs are concerned, I'll take 1990 over 2011 all day long.
The heavy emphasis on procedural/OO coding tactics to model business
applications IMNSHO is and always has been a total catastrophe.  The
reasons for that are numerous: insane defect rates, brittle functional
relationships, poor concurrency model, etc.  Enterprises move off SQL
because they hate paying the dollars top SQL talent demands only to
pay the ultimate price when development gets crushed under the weight
of maintaining all that crappy code.  This sad state of affairs has
been encouraged by some of the top software vendors.

> But having to write one big query for performance feels exactly like having
> to write one big C function with unrolled loops.  I'm currently taking a
> well-factored, function-based query and turning it INTO what Robert James is
> trying to get OUT of: a monolithic query.

SQL has a very powerful abstraction feature: it's called a view.  Good
use of views is a key design feature for complex databases.

Functions are generally not a good choice for query abstraction unless:
*) you are working with scalars (string manipulation etc)
*) you need non relational features like plpgsql exception
handling/notice printing, etc
*) this particular operation is not really optimizable anyways and you
want to wrap it (WITH RECURSIVE for example)
*) your function is inline-able (generally, a one-liner SQL function
that is stable or immutable -- see the sketch below)
etc
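
A sketch of the inline-able case (table and column names are made up):

CREATE FUNCTION active_users() RETURNS SETOF users AS
$$
    SELECT * FROM users WHERE active;
$$ LANGUAGE sql STABLE;

-- a single-statement STABLE sql function like this can be expanded in
-- place by the planner, so the outer WHERE stays optimizable:
SELECT * FROM active_users() WHERE id = 42;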

>> In the end, the performance of your queries is going to be directly
>> related to how well you map the problem into relational logic
>
>
> It's not just that, though; it's quite possible to think relationally and
> still fall down. There are plenty of cases where the human eye can see that
> a modular function can be inlined, but the optimizer can't.  I have a
> pathological case: a query against a database with just a few thousand users
> takes 1.5 seconds on fast hardware, because it ends up scanning a cartesian
> product to get 16 rows, even before you get to the nested loops. In fact,
> most of the time the optimizer does a great job of inlining all my
> set-returning functions, once 9.0.6/9.1.2 rolled out.
>
> I've seen at least three equally ominous pieces that would have to happen to
> allow DRY, composable SQL:
>
> 1. Optional optimization of non-recursive WITH
> 2. Optional pushdown of WHERE clauses into GROUP BY[1]
> 3. LATERAL
>
> AFAIK, none of these are on anyone's short-term to-do list, and I'm sure
> none are easy.
>
> [1] Since this is my current favorite problem, the pathological case is:
>
> select questions.id
> from questions
> join (
>  select u.id
>  from users as u
>  group by u.id
> ) as s
> on s.id = questions.user_id
> where questions.id = 1;
>
> With users.id as a primary key, it's obvious that this can return only one
> row, but it has to scan the users table to get there.  See the "Subjquery in
> a JOIN not getting restricted?" thread on pgsql-performance for Tom's
> explanation of why that's a hard problem to solve.

Yeah -- here and there you run into difficult to optimize queries.
(For my part, I'd just have converted that to WHERE EXISTS for the
semi-join).

merlin

Re: Controlling complexity in queries

From
Jay Levitt
Date:
Merlin Moncure wrote:
> SQL has a very powerful abstraction feature: it's called a view.  Good
> use of views is a key design feature for complex databases.
>
> Functions are generally not a good choice for query abstraction unless:
One more:

* You want contextual queries.

(I guess this is a special case of "you need non relational features".)

In my case, I want all queries against content to be filtered by their
relevance to the current user. That can't go into a view, because views
don't have parameters; I need a computed column that may be different every
time I run the query, and depends on a piece of information (the current
user ID) that Postgres can't know.

Relevance itself is a weighted average of a bunch of factors which *also*
need the user ID, like "how similar are you to the user who authored that
content?"

As far as I can tell, the only way to accomplish this is through pure-SQL
functions, or by hand-crafting the ultimate SQL query that accomplishes
those functions.

It's much easier to work with

select content.*, relevance(content.id, current_user.id) as r
from content
order by r

than the underlying query.  I'm not doing fancy OO stuff; I'm
compartmentalizing the knowledge of "what is relevance", "what is
similarity", and "how do I fetch a collection of content".  I'm not even in
1990. I'm in (quick Google) 1946, when subroutines were invented:

   "We also wish to be able to arrange for the splitting up of operations
   into subsidiary operations."

If views had parameters, I'd use views, but I suspect that a parameterized
view would be very much like a pure-SQL set-returning function, ya?
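
In other words, something like this sketch (author_similarity and
freshness_score stand in for the real factor functions):

CREATE FUNCTION relevance(content_id int, viewer_id int)
RETURNS float8 AS
$$
    SELECT 0.7 * author_similarity(c.author_id, $2)   -- hypothetical helper
         + 0.3 * freshness_score(c.created_at)        -- hypothetical helper
    FROM content c
    WHERE c.id = $1;
$$ LANGUAGE sql STABLE;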

I'd love to find a better way to do this.  Having just read Thinking In
Sets, I am sure a *real* SQL programmer would create a view that cross joins
every content row with every user, materialize it, and then restrict your
eventual query with the user id.  But reading that book always leaves me
with the same two questions: "Why is Joe Celko yelling at me? And
what's SNOBOL?"

>> [1] Since this is my current favorite problem, the pathological case is:
>>
>> select questions.id
>> from questions
>> join (
>>   select u.id
>>   from users as u
>>   group by u.id
>> ) as s
>> on s.id = questions.user_id
>> where questions.id = 1;
>>
>> With users.id as a primary key, it's obvious that this can return only one
>> row, but it has to scan the users table to get there.  See the "Subjquery in
>> a JOIN not getting restricted?" thread on pgsql-performance for Tom's
>> explanation of why that's a hard problem to solve.
>
> Yeah -- here and there you run into difficult to optimize queries.
> (For my part, I'd just have converted that to WHERE EXISTS for the
> semi-join).

I think I'm about to learn a very important relational-algebra
equivalence... could you elaborate?

Jay

Re: Controlling complexity in queries

From
Alban Hertroys
Date:
>>> [1] Since this is my current favorite problem, the pathological case is:
>>>
>>> select questions.id
>>> from questions
>>> join (
>>>  select u.id
>>>  from users as u
>>>  group by u.id
>>> ) as s
>>> on s.id = questions.user_id
>>> where questions.id = 1;
>>>
>>> With users.id as a primary key, it's obvious that this can return only one
>>> row, but it has to scan the users table to get there.  See the "Subjquery in
>>> a JOIN not getting restricted?" thread on pgsql-performance for Tom's
>>> explanation of why that's a hard problem to solve.
>>
>> Yeah -- here and there you run into difficult to optimize queries.
>> (For my part, I'd just have converted that to WHERE EXISTS for the
>> semi-join).
>
> I think I'm about to learn a very important relational-algebra equivalence... could you elaborate?


You could write that as:

select q.id
from questions as q
where exists (select 1 from users as u where u.id = q.user_id)
and q.id = 1;

That's basically what you are doing, checking that a user with a given id from the questions table exists in the users
table.  Writing it as WHERE EXISTS is a matter of "phrasing the question" more accurately, which gives the query planner
a hint that for your answer a single hit is sufficient - no need to check whether there are other matches after the
first one.

That said, wouldn't a foreign key constraint help you even better? If questions.user_id is required to refer to an
existing users.id (by an FK constraint), then the check in the query becomes moot.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.


Re: Controlling complexity in queries

From
Jay Levitt
Date:
Alban Hertroys wrote:
>> select questions.id from questions
>> join (
>>  select u.id from users as u
>>  group by u.id
>> ) as s
>> on s.id = questions.user_id
>> where questions.id = 1;

> You could write that as:
>
> select q.id
> from questions as q
> where exists (select 1 from users as u where u.id = q.user_id)
> and q.id = 1;
>
> That's basically what you are doing, checking that a user with a given id
> from the questions table exists in the users table.
>
> That said, wouldn't a foreign key constraint help you even better? If
> questions.user_id is required to refer to an existing users.id (by an FK
> constraint), then the check in the query becomes moot.

Ahh, I see.. yes, this query is just the smallest possible query that
exhibits the same not-using-the-index behavior as the real query, which
needs columns from both questions and users, and thus needs the join.  (And
it has aggregates, and needs the GROUP BY too.) There already is a
constraint, questions.user_id always refers to a real users.id, etc.

This is actually a great case where relational thinking does NOT map well to
functional composability; as Tom Lane pointed out, the solution is just "add
the WHERE clause to the subquery too."  But the subquery is in a function
that doesn't *know* it's being restricted, and (to me) shouldn't have to
know; that's what the optimizer does for a living.
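
That is, something like this, repeating the restriction by hand:

select questions.id
from questions
join (
   select u.id
   from users as u
   where u.id = (select user_id from questions where id = 1)
   group by u.id
) as s
on s.id = questions.user_id
where questions.id = 1;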

FWIW, and this may help the OP, my plan for tackling the "but I want
readability AND performance" issue is to

1. write a monolithic, optimized, incomprehensible version of the query
2. maintain the pretty functions alongside it
3. Write unit tests that confirm that the output of #1 and #2 is identical.

Kinda like how gcc builds gcc and verifies that the output is the same as
gcc building gcc building gcc.
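
For #3, the check can be as simple as diffing the two result sets (with
fast_version and pretty_version as stand-ins for the two queries, wrapped
in views):

(SELECT * FROM fast_version EXCEPT ALL SELECT * FROM pretty_version)
UNION ALL
(SELECT * FROM pretty_version EXCEPT ALL SELECT * FROM fast_version);
-- zero rows back means the outputs are identical (up to ordering)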

Jay

Re: Controlling complexity in queries

From
Harald Fuchs
Date:
Jay Levitt <jay.levitt@gmail.com> writes:

> * You want contextual queries.
>
> (I guess this is a special case of "you need non relational features".)
>
> In my case, I want all queries against content to be filtered by their
> relevance to the current user. That can't go into a view, because
> views don't have parameters; I need a computed column that may be
> different every time I run the query, and depends on a piece of
> information (the current user ID) that Postgres can't know.

How about the following:

  CREATE TABLE test1 (
    id serial NOT NULL,
    username text NOT NULL,
    value text NOT NULL,
    PRIMARY KEY (id)
  );

  COPY test1 (username, value) FROM stdin DELIMITER '|';
  user1|user1_1
  user1|user1_2
  user2|user2_1
  user2|user2_2
  user2|user2_3
  \.

  CREATE VIEW test1v AS
  SELECT id, username, value
  FROM test1
  WHERE username = current_user;

Here the result of "SELECT * FROM test1v" depends on who issued the query.

Re: Controlling complexity in queries

From
Bèrto ëd Sèra
Date:
Hi,
 
> Here the result of "SELECT * FROM test1v" depends on who issued the query.

As a more general case, I sometimes load parameters into a utility table, and use them to dynamically restrict the view's output. Downside: it's a multi-statement operation... however, when wrapping complex queries into functions gives the planner a bad time, this will solve the issue.
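
A sketch of that pattern (content and user_id are hypothetical; one parameter row per backend, keyed by pg_backend_pid()):

CREATE TABLE session_params (
    pid     int PRIMARY KEY,
    user_id int NOT NULL
);

CREATE VIEW relevant_content AS
SELECT c.*
FROM content c
JOIN session_params p ON p.pid = pg_backend_pid()
WHERE c.user_id = p.user_id;

-- the multi-statement part:
DELETE FROM session_params WHERE pid = pg_backend_pid();
INSERT INTO session_params VALUES (pg_backend_pid(), 42);
SELECT * FROM relevant_content;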

Bèrto

--
==============================
If Pac-Man had affected us as kids, we'd all be running around in a darkened room munching pills and listening to repetitive music.