Re: statement_timeout affects query results fetching? - Mailing list pgsql-hackers

From Shay Rojansky
Subject Re: statement_timeout affects query results fetching?
Date
Msg-id CADT4RqBmvNOURiLH0v0rnfsK8Z9EE_o_KiV=+duBTDxsH1FSLw@mail.gmail.com
In response to Re: statement_timeout affects query results fetching?  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: statement_timeout affects query results fetching?  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Thanks for the explanation, Robert; that makes total sense. However, it seems the utility of PG's statement_timeout is much more limited than I thought.

In case you're interested, I dug a little further and it seems that Microsoft's client for SQL Server implements the following timeout (source):

cumulative time-out (for all network packets that are read during the invocation of a method) for all network reads during command execution or processing of the results. A time-out can still occur after the first row is returned, and does not include user processing time, only network read time.

Since it doesn't seem possible to have a clean query-processing-only timeout at the backend, we may be better off doing something similar to the above and enforcing timeouts on the client only. Any further thoughts on this would be appreciated.
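To make the idea concrete, here is a rough sketch of such a cumulative network-read budget, in the spirit of the SqlClient behavior quoted above. This is a hypothetical illustration in Python with a plain socket standing in for the driver's connection; the function name and API are made up, not anything from an actual driver:

```python
import socket
import time

def recv_all_with_budget(sock, nbytes, budget_seconds):
    """Read exactly nbytes from sock, charging only time spent blocked
    on the network against a single cumulative budget (hypothetical
    sketch of a SqlClient-style read timeout)."""
    deadline = time.monotonic() + budget_seconds
    chunks = []
    remaining = nbytes
    while remaining > 0:
        timeout = deadline - time.monotonic()
        if timeout <= 0:
            raise TimeoutError("cumulative network-read budget exhausted")
        sock.settimeout(timeout)  # only the blocked-on-recv time counts
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("connection closed mid-read")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# Demo with a local socket pair standing in for the server connection.
a, b = socket.socketpair()
a.sendall(b"hello")
data = recv_all_with_budget(b, 5, 1.0)
```

Time the application spends processing rows between reads would not count against the budget, matching the "network read time only" semantics in the quoted description.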

On Sun, Aug 9, 2015 at 5:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Aug 8, 2015 at 11:30 AM, Shay Rojansky <roji@roji.org> wrote:
> the entire row in memory (imagine rows with megabyte-sized columns). This
> makes sense to me; Tom, doesn't the libpq behavior you describe of absorbing
> the result set as fast as possible mean that a lot of memory is wasted on
> the client side?

It sure does.

> I can definitely appreciate the complexity of changing this behavior. I
> think that some usage cases (such as mine) would benefit from a timeout on
> the time until the first row is sent; this would allow putting an upper cap
> on stuff like query complexity, for example.

Unfortunately, it would not do any such thing.  It's possible for the
first row to be returned really really fast and then for an arbitrary
amount of time to pass and computation to happen before all the rows
are returned.  A plan can have a startup cost of zero and a run cost
of a billion (or a trillion).  This kind of scenario isn't even
particularly uncommon.  You just need a plan that looks like this:

Nested Loop
-> Nested Loop
  -> Nested Loop
    -> Seq Scan
    -> Index Scan
  -> Index Scan
-> Index Scan

You can just keep pushing more nested loop/index scans on there and
the first row will still pop out quite fast.  But if the seq-scanned
table is large, generating the whole result set can take a long, long
time.
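The shape of the problem can be mimicked outside the database. In this rough Python analogy (the generator is a stand-in for a zero-startup-cost plan, not anything PostgreSQL actually does), time-to-first-row says nothing about total runtime:

```python
import time

def plan_rows(n):
    """Stand-in for a plan with startup cost ~0: the first row is
    produced immediately, but draining the full result set is where
    the real work happens."""
    yield 0  # first row pops out right away
    total = 0
    for i in range(1, n):
        total += i * i  # stand-in for the remaining (expensive) work
        yield total

rows = plan_rows(1_000_000)
t0 = time.perf_counter()
first = next(rows)                    # time-to-first-row: near zero
t_first = time.perf_counter() - t0    # tiny; the drain below is the cost
rest = sum(1 for _ in rows)           # consuming the remaining rows
```

A timeout armed only until the first row arrives would never fire here, even though the total work is unbounded in n.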

Even worse, you could have a case like this:

SELECT somefunc(a) FROM foo;

Now suppose somefunc is normally very quick, but if a = 42 then it
does pg_sleep() in a loop until the world ends.  You're going to have
however many rows of foo have a != 42 pop out near-instantaneously,
and then it will go into the tank and not come out until the meaning
of life, the universe, and everything is finally revealed.

That second case is a little extreme, and a little artificial, but the
point is this: just as you don't want the client to have to buffer the
results in memory, the server doesn't either.  It's not the case that
the server computes the rows and sends them to you.  Each one is sent
to you as it is computed, and in the normal case, at the time the
first row is sent, only a small percentage of the total work of the
query has been performed.
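As a rough Python analogy of that streaming model (hypothetical names, not the actual wire protocol), the difference between a buffering and a streaming consumer looks like this:

```python
def produce_rows(n):
    """Stand-in for the server: each row is handed over (yielded) as it
    is computed; the full result set is never materialized server-side."""
    for i in range(n):
        yield (i, i * i)

# Buffering consumer: holds every row at once, so peak memory grows
# with the row count (the libpq default behavior discussed above).
buffered = list(produce_rows(1000))

# Streaming consumer: handles one row at a time, so memory stays flat.
def consume_streaming(n):
    seen = 0
    for _row in produce_rows(n):
        seen += 1  # process and discard each row as it arrives
    return seen

streamed = consume_streaming(1000)
```

Both consumers see the same rows; only the peak memory footprint differs, which is the point Robert makes about neither side wanting to buffer.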

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
